PaperHub
Overall rating: 6.3/10 · Poster · 4 reviewers (min 6, max 7, std 0.4)
Individual ratings: 7, 6, 6, 6
Confidence: 3.5 · Correctness: 3.3 · Contribution: 3.0 · Presentation: 2.8
NeurIPS 2024

S-MolSearch: 3D Semi-supervised Contrastive Learning for Bioactive Molecule Search

OpenReview · PDF
Submitted: 2024-05-15 · Updated: 2024-11-06
TL;DR

S-MolSearch is a semi-supervised framework for ligand-based virtual screening that leverages molecular 3D information. It efficiently processes labeled and unlabeled data with inverse optimal transport, achieving SOTA on LIT-PCBA and DUD-E.

Abstract

Keywords
semi-supervised learning; 3D molecule search; contrastive learning

Reviews and Discussion

Official Review
Rating: 7

The paper introduces S-MolSearch, a framework for ligand-based virtual screening in drug discovery that addresses the challenges of limited and noisy binding affinity data. By utilizing molecular 3D information and semi-supervised contrastive learning, S-MolSearch processes both labeled and unlabeled data to train molecular structural encoders and generate soft labels for the unlabeled data, drawing on inverse optimal transport principles. The framework outperforms existing structure-based and ligand-based virtual screening methods, as evidenced by its superior performance on the LIT-PCBA and DUD-E benchmark datasets.

Strengths

  • Well-written
  • Well-organized experimental settings and comparison methods

Weaknesses

  • There is a lack of discussion on the reasons behind the performance differences and improvements, with only numerical comparisons of the experimental results.
  • There is insufficient experimentation and consideration regarding the time required for virtual screening.
  • There are no results for experimental metrics such as AUROC or BEDROC, which were used in previous studies.

Questions

  • Both S-MolSearch and existing methods experience a decline in EF as the top-x% threshold increases. Additional discussion on the reasons for this phenomenon is needed.
  • Why do soft labels based on inverse optimal transport seem to have a significant impact on the DUD-E dataset but a lesser impact on the LIT-PCBA dataset?
  • What aspects of the semi-supervised approach in Table 4 do the authors think primarily contributed to the performance improvement compared to fine-tuning?
  • Is it possible to extend this method from a zero-shot setting to a few-shot setting? If so, how do the authors think its performance would compare to existing methods in that case?
  • In virtual screening, not only performance but also processing time is important. How does the screening time compare to that in existing studies?
  • How does the performance compare to existing models when using measurements like AUROC or BEDROC instead of EF x%?

Limitations

The paper addresses limitations and potential societal impacts in the Appendix.

Author Response

We appreciate your thoughtful questions and feedback. We have carefully considered your queries and provide detailed responses below.

Consideration of Screening Time

We have conducted additional experiments to measure the screening time of S-MolSearch compared to traditional methods. The results, shown in the table below, indicate that S-MolSearch holds a significant speed advantage over traditional molecular search methods, even when the molecular embeddings are not precomputed.

Notably, if the molecule database is fixed and all molecules are pre-encoded and stored in a vector database such as FAISS, S-MolSearch is expected to achieve a search speed of tens of millions of molecules per second (a minimal retrieval sketch follows the table below).

| Method | Molecules/sec on DUD-E |
|---|---|
| Shape-it [1] | 70 |
| ROSHAMBO [2] | 440 |
| S-MolSearch | 1316 |
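To make the pre-encoding setup concrete, here is a minimal retrieval sketch with FAISS. The embedding dimension, cosine-similarity scoring, and the random arrays standing in for encoder outputs are all assumptions for illustration, not details from the paper.

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 512                                    # assumed embedding dimension
n_db, n_query, k = 100_000, 10, 100

# Random arrays stand in for embeddings produced by the trained encoder.
db = np.random.rand(n_db, d).astype("float32")
queries = np.random.rand(n_query, d).astype("float32")

# L2-normalize so that inner product equals cosine similarity.
faiss.normalize_L2(db)
faiss.normalize_L2(queries)

index = faiss.IndexFlatIP(d)               # exact inner-product search
index.add(db)                              # encode and index the library once
scores, ids = index.search(queries, k)     # top-k candidates per query
```

With a fixed library, `index.add` runs once offline, and each query reduces to a single matrix product, which is what makes throughput in the millions of molecules per second plausible.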

Lack of Results for Experimental Metrics Such as AUROC or BEDROC

We have expanded our experimental evaluation to include AUROC and BEDROC metrics. These additional results, found in Tables 1 and 2 of the additional PDF, provide a more comprehensive comparison of S-MolSearch with existing models and highlight its robust performance across different evaluation criteria. The results show that S-MolSearch demonstrates superior performance on AUROC and BEDROC.

On DUD-E, S-MolSearch trained on data with a 0.4 similarity achieves AUROC and BEDROC results of 84.61% and 54.22%, respectively, surpassing all baselines. S-MolSearch trained on data with a 0.9 similarity achieves AUROC and BEDROC results of 92.56% and 75.37%, respectively, exceeding the best baseline by 50% in BEDROC.

On LIT-PCBA, S-MolSearch_0.4 achieves AUROC and BEDROC results of 57.34% and 7.58%, respectively, surpassing all baselines in BEDROC and being comparable to the best baseline in AUROC. S-MolSearch_0.9 achieves AUROC and BEDROC results of 61.78% and 8.48%, respectively, achieving the best results in both AUROC and BEDROC.

Decline in Performance as the EF Percentage Increases

We provide the specific calculation formula for EF in Appendix Section B. The upper limit of EF decreases as the top x% increases: the theoretical maximum value of EF is 100 divided by x. For example, the maximum value of EF 1% is 100, and the maximum value of EF 5% is 20.
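For reference, this is the standard enrichment-factor definition (the authors' exact formula is in Appendix B; the notation here is ours, with n counting active molecules and N counting all molecules):

```latex
\mathrm{EF}_{x\%}
  = \frac{n_{\text{active, top }x\%} \,/\, N_{\text{top }x\%}}{n_{\text{active}} \,/\, N}
  = \frac{n_{\text{active, top }x\%}}{n_{\text{active}}} \cdot \frac{100}{x}
```

Since n_active,top x% can be at most n_active, EF_x% is bounded by 100/x, which gives the maxima quoted above (100 for EF 1%, 20 for EF 5%).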

From the perspective of virtual screening tasks, this can be attributed to the increased difficulty in distinguishing between active and inactive molecules as more molecules are considered, which tends to dilute the enrichment factor. Increasing the screening percentage implies that a more diverse array of active molecules needs to be identified, which is often more challenging.

Impact of Soft Labels on Different Datasets

We think this is caused by differences in the benchmarks. The molecules in DUD-E have more analogues and decoy biases, making it crucial for the model to use soft labels to effectively distinguish between closely related molecules. The LIT-PCBA dataset ensures diversity in data representation, offers a broad distribution across chemical space, and effectively minimizes inherent biases, so the impact of soft labels is less pronounced.

Contributions of the Semi-Supervised Approach in Table 4 Compared to Fine-Tuning

We would like to clarify once again that the fine-tuning we refer to in Table 4 involves initial pre-training with contrastive learning on an unsupervised dataset, followed by contrastive learning-based fine-tuning using active supervised data. Compared to S-MolSearch, we believe that the fine-tuning approach does not optimally integrate the information from both the unsupervised and supervised datasets. The performance improvements observed in Table 4 can be primarily attributed to the integration of unlabeled data through our semi-supervised approach, which enhances the model's ability to generalize beyond the limitations of fine-tuning.

Extending the Method from Zero-Shot to Few-Shot Setting

Extending S-MolSearch from a zero-shot to a few-shot setting is feasible. We explored two few-shot settings. For data division, we randomly selected 70% of the data from each target in DUD-E as the training set for few-shot learning. The remaining 30% of active molecules and all inactive molecules were used as test data. With the training dataset, we employed two settings. The first setting (random) treats the query molecule as part of the active molecule set bound to the target, then combines it with data bound to the same target as positive pairs and data bound to different targets as negative pairs. The second setting (fix query) fixes one molecule in each positive pair as the query molecule and selects another molecule from the active molecules to form positive pairs. The construction method for negative pairs is the same as in the first setting. In the table below, ZS represents zero-shot and FS represents few-shot. The results are all obtained by S-MolSearch trained on data with a 0.4 similarity; a schematic sketch of the two pairing settings follows the table.

Because there is no universal setup for few-shot in this scenario, we do not compare it with other methods.

| Configuration | AUROC (%) | EF 0.5% | EF 1% | EF 5% |
|---|---|---|---|---|
| 0.4, ZS, fix query | 84.87 | 79.07 | 46.44 | 11.70 |
| 0.4, FS, fix query | 98.32 | 165.09 | 19.06 | 9.66 |
| 0.4, ZS, random | 85.38 | 79.08 | 47.11 | 11.82 |
| 0.4, FS, random | 97.21 | 154.90 | 86.00 | 18.48 |
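As referenced above, a minimal sketch of the two pair-construction settings; the function and data layout (mappings from targets to their active molecules and designated query molecules) are hypothetical, for illustration only.

```python
from itertools import combinations

def make_pairs(actives_by_target, query_by_target, setting="random"):
    """Sketch of the two few-shot pair-construction settings.

    actives_by_target: dict target_id -> list of active molecule ids
    query_by_target:   dict target_id -> designated query molecule id
    """
    positives, negatives = [], []
    targets = list(actives_by_target)
    for t in targets:
        mols = actives_by_target[t]
        if setting == "fix_query":
            # One side of every positive pair is fixed to the query molecule.
            q = query_by_target[t]
            positives += [(q, m) for m in mols if m != q]
        else:
            # "random": the query is treated as just another active molecule.
            positives += list(combinations(mols, 2))
    # Negative pairs: actives bound to different targets (same in both settings).
    for t1, t2 in combinations(targets, 2):
        negatives += [(a, b) for a in actives_by_target[t1]
                      for b in actives_by_target[t2]]
    return positives, negatives
```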

Thank you very much for reading. We hope that our responses adequately address your concerns.

References:

[1] Taminau J, Thijs G, De Winter H. Pharao: pharmacophore alignment and optimization.

[2] Atwi R, Wang Y, Sciabola S, et al. ROSHAMBO: Open-Source Molecular Alignment and 3D Similarity Scoring.

Comment

I appreciate the authors' time and effort. Their rebuttal has addressed all of my concerns. As a result, I would like to raise my score to 7: Accept.

Comment

We appreciate your insights and support in improving the quality of this work. Thank you for your valuable feedback and for raising the score.

Official Review
Rating: 6

The paper introduces "S-MolSearch," a semi-supervised contrastive learning framework designed for ligand-based virtual screening in drug discovery. This framework uniquely leverages labeled binding affinity information to produce soft labels for unlabeled molecules, integrating 3D molecular structures and binding affinity data. The paper also proposes a novel semi-supervised learning paradigm that combines contrastive learning with Inverse Optimal Transport (IOT).

Strengths

  1. The supervision idea is novel and useful, and the target application is very impactful with broad implications.
  2. The paper is well-written and the experiments are comprehensive.

Weaknesses

  1. Memory Consumption Concerns: The model employs a parallel architecture with two f_θ encoders and one g_φ encoder, based on the Uni-Mol framework. Although utilizing pretrained models has shown significant performance benefits, the paper should address potential memory management strategies, especially for future applications involving molecules with a greater number of atoms.
  2. Utilization of 3D Structures: The paper promotes a novel semi-supervised contrastive learning paradigm, yet the core contribution does not seem to revolve around the innovative use of 3D structures, as this capability primarily stems from the Uni-Mol architecture. It would be beneficial if the authors could clarify any specific enhancements made to ensure the effective preservation and utilization of geometric information within the model. Absent such enhancements, clearer distinctions should be made regarding the role of 3D structures to prevent misconceptions about the paper presenting a new geometric deep learning technique.
  3. Clarity in Section 3.4: The explanation of how Γ, which approximates the distribution of C under constraints from U(p,q), relates to the continuous optimal transport problem is not clear. Moreover, the motivation and necessity of soft labels, beyond experimental justifications, need further elaboration. The section would benefit from additional visual aids or high-level descriptions, akin to the clarifications provided in Sections 3.3 and 3.5, to aid in comprehension.
  4. Component Efficacy in Table 3: There appears to be a discrepancy in the impact of model components across different benchmarks—soft labels are pivotal for DUD-E, whereas pretraining is more crucial for LIT-PCBA, with soft labels showing minimal importance. Insights into this inconsistency would be valuable. Furthermore, an evaluation of how the Uni-Mol encoder alone performs on these tasks would provide additional context on the effectiveness of the proposed enhancements.

Minor points and typos: L153-154 is not clear. L162: It would be beneficial to include illustrations of M_sup in the figures for clarity. Formula 2 and L168: It is better to give intuitive explanations of 1_N. L184: Inconsistent notation: g(ψ) or g_ψ? L281: Misplaced comma.

Questions

Same as the weaknesses.

Limitations

The focus of the paper is predominantly on molecule binding affinity, heavily relying on a pretrained encoder. This reliance could limit the model's applicability across a broader spectrum of bioinformatics data. A more detailed discussion on the dependency on pretrained models and potential strategies to mitigate this limitation would enhance the paper's breadth and applicability.

Author Response

We appreciate your thorough review and constructive feedback. Below, we address each of your comments and questions, aiming to clarify and enhance the understanding of our work.

Memory Consumption Concerns:

We measure memory consumption under different scenarios, as shown in the table below. Here, supervised learning corresponds to using one encoder, and S-MolSearch corresponds to using two encoders. The table shows that increasing the number of encoders leads to a linear increase in memory consumption. Since Uni-Mol is relatively lightweight, this does not significantly impact regular use.

Considering potential future applications involving large-scale data or large molecules, memory consumption poses a challenge. In future work, we plan to explore update strategies, such as the momentum update of [1], to decrease memory consumption (sketched after the table below).

| Model | Memory used |
|---|---|
| Supervised (single encoder) | 9.5 G |
| S-MolSearch (two encoders) | 22.4 G |
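To make the cited direction concrete: the momentum update of MoCo [1] maintains the second encoder as an exponential moving average of the first instead of training it directly, so its gradients and optimizer state need not be kept in memory. Schematically (our notation, not the manuscript's):

```latex
\theta_{k} \,\leftarrow\, m\,\theta_{k} + (1 - m)\,\theta_{q}, \qquad m \approx 0.999
```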

Utilization of 3D Structures:

We believe that utilizing the 3D information of molecules is advantageous in ligand-based virtual screening scenarios. While we do not introduce a new geometric deep learning technique, we ensure the preservation of geometric information by leveraging Uni-Mol. We want to point out that one of the primary contributions of this work is the combined use of both 3D molecular information and binding affinity information for virtual screening, as described in lines 77-79, rather than providing a new backbone for molecule pretraining.

Clarity in Section 3.4:

We realize that our current phrasing may lead to some misunderstandings. In fact, the optimal transport form we introduce is derived from [2], which presents a smooth and sparse form of optimal transport. This form not only effectively manages the computational cost but also maintains consistency with traditional optimal transport problems.

In S-MolSearch, Γ can be interpreted as a joint probability distribution under given marginal probabilities. Through the design of the cost matrix C, we ensure that Γ_{i,j} is positively correlated with x_i x_j. The additional L2-norm regularization ensures the sparsity of Γ. This implies that pseudo-labels derived from low-confidence outputs of the supervised model are heavily penalized. Overall, the introduced smooth optimal transport form guarantees that signals from the supervised model are transferred more effectively to the unsupervised model, while high uncertainty is handled with appropriate regularization. We have revised the wording in the manuscript and added relevant citations accordingly.
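For concreteness, the squared-L2-regularized ("smooth and sparse") optimal transport objective of [2] has the following form, where λ > 0 is the regularization weight; this is a sketch in the rebuttal's notation, not necessarily the manuscript's exact formulation:

```latex
\Gamma^{*} \;=\; \operatorname*{arg\,min}_{\Gamma \,\in\, U(p,q)} \; \langle \Gamma, C \rangle \;+\; \lambda \, \lVert \Gamma \rVert_{2}^{2}
```

Unlike entropic regularization, the squared-L2 penalty admits exactly sparse transport plans, which is what lets low-confidence pseudo-labels be driven to zero.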

Component Efficacy in Table 3:

We conduct an intuitive analysis to provide insights into why soft labels are pivotal for DUD-E, while pretraining is more crucial for LIT-PCBA. We believe this is related to the inherent biases of each dataset.

Regarding the pivotal role of soft labels for DUD-E [3], the molecular distribution in DUD-E has more analogues and decoy biases, making it crucial for the model to use soft labels to effectively distinguish between closely related molecules. The LIT-PCBA dataset ensures diversity in data representation, offers a broad distribution across chemical space, and effectively minimizes inherent biases, so the impact of soft labels is less pronounced. The importance of pretraining for LIT-PCBA [4] arises because the broader molecular distribution seen during pretraining allows the model to capture a wider variety of molecular features, thereby enhancing performance on this dataset.

As for the performance of the original Uni-Mol, the results in the tables below demonstrate that the addition of our proposed components significantly enhances performance. The improvements underscore the value of integrating soft labels and pretraining into our framework, particularly in achieving better results across different benchmarks.

| DUD-E | EF 0.5% | EF 1% | EF 5% |
|---|---|---|---|
| Uni-Mol | 9.82 | 7.97 | 4.22 |
| S-MolSearch_0.4 | 40.85 | 34.60 | 11.44 |

| LIT-PCBA | EF 0.5% | EF 1% | EF 5% |
|---|---|---|---|
| Uni-Mol | 3.22 | 1.94 | 1.40 |
| S-MolSearch_0.4 | 10.93 | 6.28 | 2.47 |

Minor Points and Typos:

Thank you for pointing these out. L162: We now emphasize M_sup in Figure 2 to help readers understand our work more clearly. Formula 2 and L168: 1_N is an N-dimensional vector of all ones; we add this explanation in line 170. We have also corrected the other points and typos in the manuscript.

Limitations:

Pre-trained models are versatile and can be utilized for a variety of tasks, and Uni-Mol demonstrates strong performance across several applications. In virtual screening tasks, the model benefits from exposure to a broader molecular space, which we believe makes the use of pre-trained models reasonable and advantageous.

We also conducted ablation studies on pre-training, as shown in Table 3, which demonstrate that pre-training provides improvements.

Thank you very much for reading. We hope that our responses address your concerns and demonstrate the enhancements made to our work.

References:

[1] He K, Fan H, Wu Y, et al. Momentum contrast for unsupervised visual representation learning.

[2] Blondel M, Seguy V, Rolet A. Smooth and sparse optimal transport[C]//International conference on artificial intelligence and statistics. PMLR, 2018: 880-889.

[3] Mysinger M M, Carchia M, Irwin J J, et al. Directory of useful decoys, enhanced (DUD-E): better ligands and decoys for better benchmarking.

[4] Tran-Nguyen V K, Jacquemard C, Rognan D. LIT-PCBA: an unbiased data set for machine learning and virtual screening.

Comment

Thanks for your response. I maintain my score.

Official Review
Rating: 6

This paper proposes a ligand-based virtual screening method, S-MolSearch, which leverages molecular 3D information and affinity information in semi-supervised contrastive learning.

Strengths

  1. The method is able to leverage both labeled and unlabeled data simultaneously and achieves excellent performance on the DUD-E and LIT-PCBA benchmarks.
  2. The approach of using the principles of inverse optimal transport for semi-supervised learning is quite innovative and worth adopting.
  3. The ablation experiments are sufficient, and the experimental section is quite robust.

Weaknesses

  1. In the method section, it is unclear to me whether during inference only the encoder g_ψ is used, or whether both g_ψ and f_θ are used simultaneously.
  2. If the application scenario involves a newly provided protein without reference molecules, how should ligand-based virtual screening methods handle this situation?

Questions

Refer to the weaknesses.

Limitations

S-MolSearch predominantly focuses on the molecular affinity data, omitting broader biochemical interactions, which suggests a potential area for improvement.

Author Response

Thank you very much for supporting our work and for your careful review! We have considered each of your questions and provide detailed responses below.

Inference Process with Encoders

During inference, only the encoder g_ψ is used to generate the molecular embeddings. The encoder f_θ is primarily used during the training phase for generating pseudo-labels. We have clarified this in the revised manuscript in Section 3.1, line 131.

Handling New Proteins Without Reference Molecules

Our current work is based on known query molecules. If the query molecules are unknown, some existing techniques might be helpful for this situation. One possible approach is to construct a pseudo-ligand based on the shape of the protein pocket, as demonstrated in [1]. Another feasible approach is to combine S-MolSearch with structure-based methods, such as [2]. We believe this can serve as an excellent direction for further enhancing our future work.

Thank you for your helpful suggestions. We hope that our responses adequately address your concerns.

References:

[1] Gao B, Jia Y, Mo Y, et al. Self-supervised pocket pretraining via protein fragment-surroundings alignment.

[2] Zhang X, Gao H, Wang H, et al. PLANET: A multi-objective graph neural network model for protein-ligand binding affinity prediction.

Comment

Thanks for your response. I maintain my score.

Official Review
Rating: 6

The paper introduces a new method for ligand-based virtual screening based on contrastive learning and inverse optimal transport. Two molecule encoders are trained. The first encoder is trained using a contrastive loss function on the ChEMBL data by pairing compounds that are active toward the same protein, and compounds active toward different targets are treated as negative pairs. Next, the second encoder is trained by using the pseudo-labels produced by the first model. The proposed model is tested on two benchmark datasets, DUD-E and LIT-PCBA. Additionally, an ablation study is conducted, and the impact of the labeled data scale is visualized.

Strengths

Originality:

  • The approach seems to be novel. I have not found any similar papers that use optimal transport for the ligand-based virtual screening task.

Quality:

  • The theory described in the paper is formally proven in the Appendix.
  • The proposed method obtains excellent results in both tested benchmarks.
  • The quality of the learned representation is demonstrated in Figure 2.

Clarity:

  • The paper is written clearly and is easy to follow.
  • Figure 1 shows the idea of the model very clearly.

Significance:

  • The presented method is an interesting and effective way to utilize all the available public data to build a strong model for ligand-based virtual screening.

Weaknesses

Quality:

  • It would be interesting to see some qualitative examples of molecules that were found to be similar to the active compounds in the virtual screening process. Do the trained similarities correlate with the Tanimoto similarity?

Clarity:

  • Does the “sup” subscript in Section 3.4 correspond to the “label” subscript in Proposition 1? What is the difference between these two sets?

Minor comments:

  • A typo in line 151, “we employs InfoNCE.”
  • In line 183, something is missing before “1”.

Questions

  1. How do you solve the cases in contrastive learning where one molecule binds to multiple targets? In this example, you need to be careful not to create a negative pair, where one molecule is the one binding to both targets and the other molecule binds to only one of them.
  2. How do you avoid treating two molecules as a negative pair if both can bind to the same target? For example, they are inhibitors of two similar proteins, which increases the chance of them binding to both at the same time.

Limitations

The limitations have been described.

Author Response

Thank you very much for supporting our work and for your careful review! We appreciate the detailed comments and constructive feedback, and we address each of your points below, aiming to clarify and enhance the understanding of our work.

Qualitative Examples and Similarities

We agree that providing qualitative examples would enhance the understanding of our model's capabilities. We have added two examples of molecules identified as similar to query molecules in DUD-E. These molecules are all active molecules. The results indicate that molecules with high Tanimoto similarity tend to have high embedding similarity. These qualitative examples can be found in Figure 1 in the attached PDF.
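For readers who want to reproduce the comparison, a minimal Tanimoto-similarity sketch with RDKit follows. The Morgan fingerprint parameters (radius 2, 2048 bits) are common defaults and an assumption here, not necessarily the settings the authors used.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def tanimoto(smiles_a: str, smiles_b: str) -> float:
    """Morgan-fingerprint Tanimoto similarity between two molecules."""
    fps = [
        AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2, nBits=2048)
        for s in (smiles_a, smiles_b)
    ]
    return DataStructs.TanimotoSimilarity(fps[0], fps[1])

# Compare against the embedding cosine similarity for the same molecule pair.
print(tanimoto("CCO", "CCN"))
```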

Clarification on Subscripts in Section 3.4

The “sup” subscript in Section 3.4 indeed corresponds to the “label” subscript in Proposition 1. The two subscripts have the same meaning, representing supervised learning on labeled data. For clarity, we have standardized the subscript to "sup" throughout the manuscript.

Minor Comments

Thank you for pointing out the typographical errors. We have corrected the typo in line 151 and added the missing context in line 183 (now it's in line 186). These corrections are reflected in the revised manuscript.

About Negative Pairs

In our view, both of your questions pertain to the potential occurrence of false negative pairs. False negatives may arise in scenarios where a single ligand binds to multiple targets or multiple ligands bind to the same target.

The way we construct the training data, as described in lines 60-64 of the manuscript, considers molecules binding to the same target as positive pairs, while molecules binding to different targets are considered negative pairs. This approach can mitigate the occurrence of false negatives to some extent.

In the raw ChEMBL data, approximately 21% of the ligands can bind to two or more targets. To further analyze the situation of false negatives during training, we count the false negative pairs that appear in each batch of the sampled training data over one epoch and find that false negative pairs account for an average of 0.76% of all negative pairs. This is a relatively small number, and considering the robustness of contrastive learning during training, we believe these false negative pairs will not significantly impact the current results [1].
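A minimal sketch of the batch-level false-negative count described above, assuming each ligand is annotated with the set of ChEMBL targets it binds; the function name and data layout are hypothetical.

```python
from itertools import combinations

def false_negative_rate(batch, targets_of):
    """Fraction of a batch's negative pairs that are actually false negatives.

    batch:      list of (ligand_id, sampled_target_id) tuples
    targets_of: dict ligand_id -> set of all targets the ligand binds
    """
    negatives = false_negatives = 0
    for (lig_a, tgt_a), (lig_b, tgt_b) in combinations(batch, 2):
        if tgt_a == tgt_b:
            continue  # same target: treated as a positive pair in training
        negatives += 1
        # False negative: the two ligands share at least one common target.
        if targets_of[lig_a] & targets_of[lig_b]:
            false_negatives += 1
    return false_negatives / max(negatives, 1)
```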

Nonetheless, the presence of false negatives remains a concern. In the future, we plan to address this issue by designing more robust contrastive learning objectives and constructing more refined datasets to minimize the occurrence of false negatives.

We believe our response effectively addresses your concerns. If you have additional questions or need further clarification, we are open to further discussion.

References:

[1] Wu J, Chen J, Wu J, et al. Understanding contrastive learning via distributionally robust optimization.

Comment

Thank you for providing qualitative examples and further clarifications. This addresses my concerns. I will keep my positive score.

Author Response

We sincerely appreciate the valuable feedback and insightful comments provided by each of you. Your input has been instrumental in refining our work and enhancing the clarity and depth of our manuscript.

In response to your suggestions, we have prepared an additional PDF document. It includes qualitative examples of molecular embedding similarity and Tanimoto similarity, as well as tables of experimental results with AUROC and BEDROC.

Final Decision

All reviewers argued for accepting this paper, with one raising their score from weak accept after the rebuttal. The authors presented a number of clarifying comments, as well as additional results and metrics, during the rebuttal period. The authors should please make sure to incorporate these elements of the discussion into the final revision of the paper.