KinDEL: DNA-Encoded Library Dataset for Kinase Inhibitors
Dataset and benchmark paper for an 81-million-compound small-molecule DNA-encoded library for finding hits in drug discovery
Abstract
Reviews and Discussion
The paper introduces a novel dataset containing DEL screen data for two different proteins, and benchmarks the performance of several supervised ML methods in the context of learning from the provided data.
Questions to the Authors
N/A.
Claims and Evidence
Yes.
Methods and Evaluation Criteria
Yes.
Theoretical Claims
N/A.
Experimental Design and Analysis
Yes.
Supplementary Material
Yes.
Relation to Prior Literature
New dataset.
Missing Important References
No.
Other Strengths and Weaknesses
Strengths of the paper include the importance of the problem (DELs seem to be a useful screening technique in practice and have received relatively little attention from the ML community), the rigor of the conducted experiments (I appreciate, e.g., testing different splitting conditions, and the choice of benchmarked ML methods, which to my knowledge gives fairly good coverage of SotA), and the realistic discussion of the limitations of the data.
In my opinion, the paper has two main weaknesses:
- I suspect it will be difficult to follow for people with an ML background (it definitely was for me at times, and I have some experience with DELs - some things only became clear on a second pass, because they rely on information from later sections). To give a few examples (not exhaustive):
- In the introduction: “Multiple rounds of washing are conducted to remove any weak binders, and the DNA tags of surviving molecules are sequenced as a measure of binding affinity” - it’s not clear at this point what the result of this sequencing is, and how it is a measure of binding affinity (as opposed to a binary binding/non-binding indication). Counting is clarified in Section 2, but this caused confusion until then.
- Figure 2 uses the term “library capture”, which wasn’t immediately obvious to me; similarly, it mentions biotinylation of the protein, which doesn’t seem to be described at all.
- “Biophysical Assay Validation”: it isn’t clear how this subset of data was selected (in the next section the Appendix is eventually referenced, but again the reader is left hanging until then). The number of samples seems very small; given the discussed noisiness of the data, is it common practice to use so few data points for validation?
- “Pre-selection Data”: it isn’t clear what causes the differences in the abundance of molecules. Also, from the ML point of view, it’s not clear how this information is typically used - should it involve count normalization?
- I would personally love to see a high-level overview of the data from the ML perspective, perhaps as a graphical-abstract-type figure, summarizing in a single place: the number of molecules, the distribution of counts (with a note that it correlates with binding affinity), the number of targets, information about estimated noise (baseline correlations?), information about the test set, and perhaps also the splits. There is a lot of detail in the paper separating this information, and finding the (in my mind) important pieces takes some effort.
- Experimental evaluation:
- The description of the held-out test sets is very vague in terms of value range - would it be possible to present the Kd’s in more detail than saying it’s a “range of Kd values” (and doing so only in the appendix)? This is a fairly meaningless statement, and it impacts the interpretation of the results - do the ML methods discriminate binders from non-binders? Is it ranking within binders?
- “all models were trained using the top 1M compounds with the highest counts” - this seems like a very confusing design choice; could the authors elaborate on it? It goes back to the previous point: if the test set contains only binders, it seems like cheating to train only on the data with the highest counts. Perhaps there are good reasons for this (e.g., non-binders being distinguishable by things like docking), but in that case a discussion would be warranted.
- Also going back to the previous points, “We ultimately wish to rank molecules by binding affinity” - again, it isn’t clear to me that this is the case, and that the goal is not discriminating binders from non-binders - a discussion would be great.
Other Comments or Suggestions
Minor remarks:
- Would it be possible to have a split in which different data partitions contain completely separate synthons? Would it make sense to do so?
- One source of DEL data that I think would be worth mentioning is https://www.aircheck.ai/; I am not sure whether it has an associated paper, though.
Thank you for your careful review of our work and your valuable feedback.
Dataset description: We appreciate your specific comments on making our paper more accessible to machine learning researchers. We will revise Figure 2 by removing unnecessary technical details that do not contribute to the understanding of the data format. For instance, the biotinylation of proteins is detailed in the appendix and can be omitted from Figure 2 to avoid potential confusion. Additionally, we will add a new subplot that outlines the high-level data acquisition process and data format. This subplot will clarify the inputs and labels (pre- and post-selection count data) used in our models. We will also make sure to define all new terms that might be confusing for the machine-learning audience.
Biophysical Assay Validation: We will relocate the appendix reference concerning the selection of validation compounds to an earlier section of the text for clarity. The limited number of validation points is due to the high cost associated with these experiments, as each molecule requires resynthesis, either on-DNA or off-DNA, along with separate biophysical assays for each.
Pre-selection Data: During the synthesis of the DNA-encoded library, not all molecules are produced in equal amounts. Variability can arise from differences in synthesis efficiency, coupling yields, or synthesis errors. Additionally, some molecules may precipitate or adhere to storage containers, resulting in loss and lower observed counts. While this data is typically used to normalize counts, there are important caveats. Sequencing inaccuracies can lead to error propagation, especially for low-abundance molecules. DEL-Compose is a model that addresses these issues by incorporating pre-selection counts as inputs, which serve as a multiplier when modeling the zero-inflated Poisson distribution in the loss function. We will add this discussion in the appendix.
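To make the normalization idea concrete, below is a minimal PyTorch sketch of how pre-selection counts might enter a zero-inflated Poisson loss as a multiplier on the predicted enrichment rate. This is an illustrative, assumption-laden sketch rather than the DEL-Compose implementation, and all function and argument names are hypothetical.

```python
import torch

def zip_nll(pred_enrichment, pre_counts, post_counts, pi_logit):
    """Illustrative zero-inflated Poisson negative log-likelihood.

    pred_enrichment : positive model output, interpreted as fold-enrichment
    pre_counts      : pre-selection counts, acting as a multiplier on the rate
    post_counts     : observed post-selection counts (the labels)
    pi_logit        : logit of the probability of a structural zero
    """
    lam = pred_enrichment * pre_counts                       # Poisson rate scaled by library abundance
    pi = torch.sigmoid(pi_logit)
    log_pois = post_counts * torch.log(lam) - lam - torch.lgamma(post_counts + 1)
    log_p_zero = torch.log(pi + (1 - pi) * torch.exp(-lam))  # a zero can come from either mixture component
    log_p_nonzero = torch.log1p(-pi) + log_pois
    log_p = torch.where(post_counts == 0, log_p_zero, log_p_nonzero)
    return -log_p.mean()
```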
Range of Kd values in the held-out test data: We will provide histograms of Kd values in Appendix C. Our approach involves selecting both binders and non-binders with varying Kd values to evaluate our model's ability to accurately rank these molecules using Spearman’s correlation. Although we propose a regression problem where molecules are ranked from most to least promising, the Kd values can also be categorized into two groups for testing classification models, as the dataset includes both binders and non-binders.
Only the top 1M molecules are used for training: We observe that the baseline models struggle with sparse DEL data and require modifications to the loss function or additional inductive biases to handle label sparsity. To simplify learning, we train on the top 1M compounds. This introduces some bias relative to the evaluation set, but since we aim to rank-order the molecules rather than make binary predictions, we believe our models can still effectively rank the molecules in the validation sets. Nonetheless, we provide access to the full dataset so researchers can explore whether more training points yield better results.
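For orientation, the filtering step itself is simple; a hypothetical pandas sketch (the file name and column name below are placeholders, not the released data format) would look like:

```python
import pandas as pd

# del_df: one row per compound, with a hypothetical "post_selection_count" column
del_df = pd.read_parquet("kindel_counts.parquet")  # placeholder file name
train_df = del_df.nlargest(1_000_000, "post_selection_count")
```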
Ranking molecules by binding affinity: All the evaluated models predict continuous values, employing regression, and we use Spearman's correlation as the evaluation metric. Unlike Pearson's correlation, which identifies linear relationships, Spearman’s correlation effectively assesses ranking quality. We will add a brief discussion on the ranking issue and our rationale behind selecting this evaluation metric. In short, these models are often used to score available molecules to maximize their binding affinity. It is challenging to establish a firm threshold to define a binder because priorities change depending on the target and stage of drug development. Measuring model performance via ranking ensures that good models retrieve the best options for any given target and screening pool combination.
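As a concrete illustration of the metric (hypothetical numbers, not taken from the paper): lower Kd means tighter binding, so Kd is negated before correlating with the predicted affinity score.

```python
import numpy as np
from scipy.stats import spearmanr

predicted_score = np.array([2.1, 0.3, 1.7, 0.9])       # higher = predicted tighter binder
measured_kd_nM = np.array([15.0, 900.0, 40.0, 300.0])  # lower = tighter binder

rho, _ = spearmanr(predicted_score, -measured_kd_nM)
print(f"Spearman rho = {rho:.2f}")  # 1.00 here: the predicted ranking matches the measured one
```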
Other data splits: Regarding the splitting of KinDEL, we explored several methods and determined that the disynthon split is optimal. Although we considered splitting datasets by single synthon positions, this approach presented two challenges. Firstly, only one part of each molecule is unique in the testing data, while the model encounters all possible disynthons at the remaining two positions. Secondly, some synthons may feature motifs essential for binding, such as kinase hinge-binding motifs, which could end up entirely in the testing data. We believe that the disynthon approach strikes a sensible balance. Please see Figure 4, where the visible line patterns relate to enriched disynthons, and the absence of clear planes indicates the lack of enriched (mono)synthons.
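For readers who want to experiment with this kind of split, the sketch below shows one simplified way to construct a disynthon-style holdout for a trisynthon library: every molecule containing a held-out (synthon_a, synthon_b) pair goes to the test set. The column names are hypothetical, and this is not necessarily the exact procedure used to build the KinDEL splits.

```python
import pandas as pd

def disynthon_split(df, frac_test=0.1, seed=0):
    """Hold out every molecule containing one of a randomly chosen set of
    (synthon_a, synthon_b) pairs, so held-out disynthons never appear in training."""
    pairs = df[["synthon_a", "synthon_b"]].drop_duplicates()
    held_out = set(map(tuple, pairs.sample(frac=frac_test, random_state=seed).to_numpy()))
    is_test = pd.Series(
        [(a, b) in held_out for a, b in zip(df["synthon_a"], df["synthon_b"])],
        index=df.index,
    )
    return df[~is_test], df[is_test]
```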
AIRCHECK datasets: Thank you very much for this suggestion. We will reference this resource in the related work.
Thank you once again for your invaluable review and suggestions. Please let us know if you have any further questions.
I feel that my comments were addressed appropriately, and I decided to increase my review score.
The KinDEL paper provides a new dataset of laboratory measurements related to DNA-encoded libraries (DELs), together with baseline models that analyze these data. The experiments used a library of 81 million small molecules interacting with two kinases: MAPK14 and DDR1. The authors further extended the dataset with additional on-DNA repeat measurements of the top hits, binding affinity measurements for selected off-DNA molecules, and predicted 3D structures of the interactions of these off-DNA molecules with the targets. This dataset opens up an opportunity for important work related to denoising DEL experiments and for incorporating DEL experimental results into downstream applications.
Questions to the Authors
No questions to the authors.
Claims and Evidence
The lack of public datasets related to DEL experiments is a known issue that has plagued the public development of denoising algorithms for DEL datasets. The huge quantity of data generated from these experiments is ideally suited for downstream applications that incorporate machine learning. This is the first such dataset to link the raw DEL measurements with on-DNA repeats and proposed 3D structures. These additional off-DNA measurements are invaluable for testing the generality of the small molecule models. The chosen targets are well-studied proteins that have additional public datasets available, so one could extend this work in the future to test the generality and the limitations of DEL datasets. I found no problematic claims in the submission, though I haven't inspected the dataset itself carefully beyond the analyses presented herein.
Methods and Evaluation Criteria
Everything in this paper makes sense. The authors are certainly very familiar with the problem domain and provide strong baselines and well-documented datasets. The division of the training data into 3 different splits and the documentation of baseline models on all categories follow the best current ideas in this domain.
A totally minor, mostly aesthetic point: the numbers of molecules for each of the on-DNA, off-DNA, and extended on-DNA datasets are only mentioned at the top of the tables (presumably written as n = 30, 33, 41 in Table 1, and similarly in Table 2). The authors might consider listing these numbers more prominently or pointing to the table entries from somewhere in the text.
Theoretical Claims
There were no major or novel theoretical claims in this work. The ideas around possible combinations of 3D structures with DEL datasets have been previously published. The choice of baseline models makes sense, and the most advanced of them was also previously published last year (Chen et al., 2024). The observation of improved behavior for the reasonable baselines on off-DNA data compared to the count-based analyses is also meaningful and is the main reason for developing non-trivial analysis methods for datasets from DEL selection experiments.
Experimental Design and Analysis
The experiments were sound and valid. Depending on the library composition, the use of Avi tags and streptavidin beads might be suboptimal for immobilization during a DEL selection experiment. However, in this specific case, the additional measurements without targets, the fluorescence measurements (where streptavidin binding is an optimal choice), and the validation of the baseline models outside the DEL setup strongly suggest that the dataset is clean.
Supplementary Material
Yes, I did review sections A, B, and C.
Relation to Prior Literature
This work adds to a rather limited set of public datasets of DEL experiments. The authors did a good job of reviewing them in section 5.1.
Missing Important References
A detailed understanding of this paper requires familiarity with the broader scientific literature on DNA-encoded library selection experiments, as well as a reasonable understanding of the industrial process of drug discovery, some of which is unfortunately not well documented and differs across companies.
The use of such data by academics and non-experts in the future is of course possible, which makes this particular contribution important enough to warrant consideration in this conference.
Other Strengths and Weaknesses
The paper is clear and concise. The dataset is valuable. It would have been useful to test the generalization of the models on other known molecules for these targets, as well as to evaluate the ability/suitability of these models to score decoys or discriminate molecules for these specific kinases vs other kinases. However, such extensions are best left as follow-up work to this contribution and would have needlessly extended the length and scope of this work.
Other Comments or Suggestions
No additional suggestions.
Thank you for dedicating your time to review our paper and for your positive feedback and valuable suggestions. We will include the number of data points for the testing sets in the main text as recommended. Furthermore, we agree that evaluating the generalization of the models on external datasets, including compounds with activity against other kinases and decoys, is a compelling research direction. We see this as a promising avenue for future work. Once again, we appreciate your thorough review and insightful feedback! Please let us know if you have any further questions.
Thank you. I have no additional questions or comments.
This paper introduces a dataset of DNA-encoded libraries for two kinases, MAPK14 and DDR1, with 81 million compounds. This dataset will be of use in drug discovery applications and in modeling related biological processes.
Questions to the Authors
None
Claims and Evidence
The claims of the paper are well supported by the tables shown in the article.
Methods and Evaluation Criteria
The evaluation and benchmark of the data in terms of Kd predictions is sensible and the results look reasonable.
Theoretical Claims
There are no proofs.
Experimental Design and Analysis
I read through the experimental design and analysis and see no obvious issues.
Supplementary Material
I reviewed the appendices and they make sense.
Relation to Prior Literature
This article introduces a useful dataset that contains high chemical diversity, and will be useful as a reference for the development of new methods.
Missing Important References
I don't have any references to add.
Other Strengths and Weaknesses
This paper does not really fit within the scope of the main conference track. It would be better suited to a workshop, or to a journal better positioned to evaluate the experimental assays included in this study. There are no model or algorithm developments of significance.
Other Comments or Suggestions
None
We appreciate the Reviewer's time and effort in evaluating our paper and are pleased with the overall positive feedback. The only concern raised pertains to our choice of venue for publication. We would like to address this by outlining our reasons for selecting this machine-learning conference over a journal.
Firstly, we believe that presenting our dataset and benchmark at such a conference maximizes its impact by directly reaching its target audience—machine learning researchers. The lack of high-quality DEL datasets is a bottleneck for developing advanced machine learning models. By contributing this work, we aim to advance research in analyzing the outputs of innovative compound screening technologies.
Furthermore, this year ICML has specifically called for application-driven submissions. The Call For Papers seeks “Application-Driven Machine Learning (innovative techniques, problems, and datasets that are of interest to the machine learning community and driven by the needs of end-users in applications such as healthcare, physical sciences, biosciences)”. We believe our paper aligns well with this focus, as it provides a valuable new dataset for an important area of science that currently suffers from a lack of high-quality data.
Thank you once more for your feedback. Please let us know if you have any further questions.
I appreciate the authors reply to my review.
I do not really see why a dataset would need to be published at the premier conference for machine learning practitioners in order to reach this audience. ChEMBL, PDB, UniRef, or more recently PLINDER, OAS, and PDBind are all datasets used by ML researchers interested in applications to chemistry and biology, and none of these datasets themselves were published as main-track ML conference papers.
Nonetheless, it is clear this is a valuable contribution to the community. Given the enthusiasm of some of the other reviewers as well, I have bumped the score to 3.
This paper presents KinDEL, a DNA-encoded library (DEL) dataset for two kinases (MAPK14 and DDR1), comprising 81 million compound-target interactions with associated on-DNA repeats, off-DNA binding affinities, and predicted 3D structures. It includes baseline models and multiple evaluation splits.
All reviewers acknowledge the dataset's value. Reviewer wZnX strongly supports acceptance, noting the dataset's novelty and utility for modeling DEL experiments. Reviewer Yxwq highlights the rigorous evaluation and the importance of DELs as a largely untapped area for ML, raising minor concerns about clarity, which the authors addressed well in the rebuttal. Reviewer 2jpD initially questioned venue fit but ultimately recognized the contribution’s merit and raised their score to a weak accept.
The authors make a convincing case for ICML as the appropriate venue, citing the conference’s call for application-driven ML in biosciences. The rebuttal directly addresses reviewer concerns and proposes clear manuscript improvements.
This is a well-executed, impactful contribution that introduces a high-quality benchmark dataset with clear relevance to the ML community.