Piloting Structure-Based Drug Design via Modality-Specific Optimal Schedule
Twisted probability path induced by noise schedule with the optimal VLB
Abstract
Reviews and Discussion
This paper focuses on modeling the twisted probability path of multiple modalities in structure-based drug design. It analyzes the theoretical link between noise schedules and the VLB in multimodal probabilistic modeling and proposes a VLB-optimal scheduling strategy to address this bottleneck.
By integrating the VLB-optimal scheduling with the BFN framework, the proposed MolPilot achieves state-of-the-art performance in de novo design and competitive performance in local docking, demonstrating the effect of the proposed VLB-optimal scheduling.
Questions for Authors
- Why not integrate the proposed VOS with more frameworks? It is necessary to demonstrate its practical applicability across various frameworks.
- Is the ID CrossDock test set identical to those used in previous studies?
Claims and Evidence
The claims made in the submission are supported by clear and convincing evidence.
Methods and Evaluation Criteria
This paper proposes a VLB-optimal scheduling strategy to model the twisted probability path of multiple modalities, which makes sense for structure-based drug design.
Furthermore, it introduces an OOD subset of PoseBusters to serve as a more reliable held-out test, which also makes sense for this problem.
Theoretical Claims
I checked the correctness of the theoretical analysis linking noise schedules and the VLB in multimodal probabilistic modeling.
Experimental Design and Analyses
I checked the soundness/validity of the experimental designs and analyses. I think it's necessary to demonstrate the effect of the proposed VOS by integrating it with more frameworks.
Supplementary Material
I reviewed the supplementary material.
Relation to Broader Scientific Literature
This paper aims to overcome the twisted probability path bottleneck of multiple modalities and is thus related to various tasks involving multimodal generative modeling.
Essential References Not Discussed
Many essential references are not discussed in this paper, such as IPDiff [1], DecompOpt [2], IRDiff [3], DecompDPO [4], and BindDM [5].
[1] Huang Z, Yang L, Zhou X, et al. Protein-ligand interaction prior for binding-aware 3d molecule diffusion models[C]//The Twelfth International Conference on Learning Representations. 2024.
[2] Zhou X, Cheng X, Yang Y, et al. DecompOpt: Controllable and Decomposed Diffusion Models for Structure-based Molecular Optimization[C]//The Twelfth International Conference on Learning Representations. 2024.
[3] Huang Z, Yang L, Zhou X, et al. Interaction-based retrieval-augmented diffusion models for protein-specific 3d molecule generation[C]//Forty-first International Conference on Machine Learning. 2024.
[4] Cheng X, Zhou X, Yang Y, et al. Decomposed direct preference optimization for structure-based drug design[J]. arXiv preprint arXiv:2407.13981, 2024.
[5] Huang Z, Yang L, Zhang Z, et al. Binding-adaptive diffusion models for structure-based drug design[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2024, 38(11): 12671-12679.
Other Strengths and Weaknesses
Strengths:
- This paper provides a theoretical analysis of the link between noise schedules and the VLB in multimodal probabilistic modeling, showing the importance of designing optimal noise schedules for this twisted probability path.
- This paper introduces an OOD subset of PoseBusters to serve as a more reliable held-out test, addressing potential problems with CrossDock evaluation.
Weaknesses:
- The description of the proposed MolPilot is not clear, lacking many implementation details.
- This paper ignores many essential references, limiting its impact and connection to existing research.
Other Comments or Suggestions
It's optional for authors to decide whether to provide their code. However, providing an anonymous repository link but with an empty codebase is not respectful.
Ethical Review Concerns
This paper has modified the template to delete the template content in the bottom left corner of the first page. Considering the deleted content, this paper should exceed the 8-page limit by over two lines.
According to ICML 2025 Peer Review FAQ (https://icml.cc/Conferences/2025/PeerReviewFAQ), it's not allowed to modify the template to gain an unfair "space" advantage relative to other authors.
Therefore, please check if such an operation is a violation of the ICML formatting requirements.
We sincerely thank the reviewer for the careful reading and insightful feedback. Below, we address each question and concern to improve the clarity and completeness of our work, as well as demonstrate its generality.
Questions
Q1: Integration with More Frameworks
Why not integrate the proposed VOS with more frameworks? It's necessary to demonstrate its practical applicability within various frameworks.
Thank you for this valuable suggestion. To demonstrate VOS's broader applicability, we integrated it with the diffusion-based framework TargetDiff. Specifically, we reimplemented TargetDiff with our generalized training objective and trained it for 140k steps following the default training configuration (TargetDiff*, with code provided), and then derived the test-time optimal schedule, which resembles the shape in Figure 4B.
The results on CrossDock show that VOS successfully enhances conformation quality for diffusion models too, with generated poses achieving Vina Scores closer to Vina Min values, indicating near-optimal realistic poses.
| Method | PB-Valid | Vina Score Avg | Vina Score Med | Vina Min Avg | Vina Min Med | Vina Dock Avg | Vina Dock Med | scRMSD<2 | Energy Passed | Connected | QED | SA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TargetDiff | 50.5% | -5.47 | -6.30 | -6.64 | -6.83 | -7.80 | -7.91 | 37.1% | 69.8% | 90.4% | 0.48 | 0.58 |
| TargetDiff* | 53.7% | -6.27 | -6.31 | -6.82 | -6.78 | -7.87 | -7.90 | 36.8% | 70.3% | 89.2% | 0.50 | 0.62 |
| TargetDiff* + VOS | 58.1% | -6.46 | -6.53 | -7.04 | -7.09 | -8.04 | -8.12 | 40.2% | 73.2% | 93.4% | 0.49 | 0.59 |
Q2: CrossDock Test Set Consistency
Is the ID CrossDock test set identical to those used in previous studies?
Yes, we confirm that we used the identical CrossDock test set as proposed in previous studies, ensuring fair and consistent comparison with existing methods.
Weaknesses
W1: Clarity of Implementation Details
The description of the proposed MolPilot is not clear, lacking many implementation details.
We appreciate this feedback and will improve the clarity of our description by adding more detail to Section 4, Algorithm 1, and Appendix A. We will also move key implementation details from the Appendix to the main text where appropriate. Thanks again for the advice; we will revise our manuscript accordingly.
W2: Essential References
This paper ignores many essential references, limiting its impact and connection to existing research. For example, many essential references are not discussed in this paper, such as IPDiff [1], DecompOpt [2], IRDiff [3], DecompDPO [4], and BindDM [5].
Thank you for highlighting these important references. We agree that IPDiff, DecompOpt, IRDiff, DecompDPO, and BindDM represent meaningful advancements in controllability and practicality by incorporating guidance signals (e.g. binding affinity, interactions) to enhance the molecular properties of generated ligands.
We will expand our related work section to include a thorough discussion of these approaches and their contributions. Additionally, we report the results of IPDiff and BindDM on CrossDock, where we calculated PB-Valid and other metrics on samples obtained with the official code. The results show that these methods indeed enhance affinities through binding signals from pretrained predictors, but also suffer from limited conformation plausibility.
| Method | PB-Valid | Vina Score Avg | Vina Score Med | Vina Min Avg | Vina Min Med | Vina Dock Avg | Vina Dock Med | scRMSD<2 | Energy Passed | Connected | QED | SA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| IPDiff | 37.1% | -6.41 | -7.01 | -7.45 | -7.48 | -8.57 | -8.51 | 33.7% | 51.1% | 90.4% | 0.52 | 0.59 |
| BindDM | 31.3% | -5.92 | -6.81 | -7.29 | -7.34 | -8.41 | -8.37 | 38.6% | 42.3% | 91.5% | 0.51 | 0.58 |
W3: Code Repository
We sincerely apologize for the oversight regarding our code repository. We had uploaded our code to GitHub before the reviewing process began, but failed to verify the auto-update mechanism for the anonymous repository. This unintentional mistake has been rectified now, and we have force-updated the repository with all necessary code to ensure full reproducibility, including the newly added experiment with TargetDiff.
W4: Ethical Review Concerns
We appreciate the attention to formatting requirements. The content was missing because we had commented out \printAffiliationsAndNotice{}; this was not intended to create extra space. As noted in the ICML 2025 Peer Review FAQ (March 18, 2025 update), the Program Chairs have acknowledged that papers with this specific issue will not be desk-rejected "due to confusion in the LaTeX template." We will restore this element in our revision to ensure full compliance with formatting requirements.
We thank the reviewer again for the constructive feedback that helps us improve our manuscript, and we welcome further discussion.
This paper presents a novel method for finding optimal noise scheduling in structure-based generative models. Typically, deep generative networks used for structure-based drug design generate molecules directly within the binding pocket of a protein, allowing the problem to be defined as 3D molecular graph generation conditioned on the protein binding pocket. Diffusion models that have been applied to this problem generally diffuse continuous atom positions and discrete graph topology separately. The authors of this paper contend that designing a joint noise scheduler may enhance generation results for multi-modal data, where the variational lower bound is not solely dependent on the endpoints, as is the case in unimodal scenarios. A dynamic programming algorithm is proposed to find the optimal scheduler on the discretized grid. The experiments demonstrate the advantages of the proposed model in conditional molecule generation and docking scenarios.
update after rebuttal
The Authors addressed all my comments. I decided to maintain my positive score.
Questions for Authors
- Are there any architectural changes that have improved the performance of MolPilot compared to other models like MolCRAFT and DecompDiff? The difference is quite significant, especially regarding the docking experiment. If there are any modifications in the architecture mentioned in Section A.3 compared to previous models, it would be advisable to include this model, trained similarly to the other models, as a baseline.
- Have you tested how sensitive the results are to the choice of the optimal path, e.g. by interpolating between linear and optimal time-rescaling functions?
Claims and Evidence
The claims in the paper are supported by the experimental results.
Methods and Evaluation Criteria
The methods make sense for the problem at hand, but I noticed two problems with the method description and reproducibility:
- The linked repository is empty, making this work irreproducible.
- Some potential abuses of notation can complicate the understanding of the methods section. For example, in Equation 2 is constant, and later it is a function of time. is not introduced near that equation, only earlier and are introduced. is not defined.
The benchmarks are carefully selected to showcase the effectiveness of the methods.
Theoretical Claims
I read the proofs but did not check them thoroughly for correctness.
Experimental Design and Analyses
The experimental design and analyses are sound. However, the molecular docking study should include more baselines because MolPilot was trained on more data (CrossDocked), and models such as DiffDock were trained on PDBbind. Additionally, examining the novelty of the generated compounds would be valuable, such as through measuring their distance from the training set.
Supplementary Material
I read the entire supplementary material. There’s a possibility I might have missed something during my initial reading.
Relation to Broader Scientific Literature
The results presented in this paper demonstrate that careful optimization of the noise schedule enhances the outcomes of structure-based molecule generation. The proposed theory could have a wider impact on other multi-modal generation scenarios in chemistry and beyond.
Essential References Not Discussed
The key references have been described.
Other Strengths and Weaknesses
The strengths and weaknesses have been described in the other sections.
Other Comments or Suggestions
N/A
We sincerely thank the reviewer for the thorough evaluation and valuable feedback. Below, we address each point raised to improve the clarity, reproducibility, and depth of our work.
Questions
Q1: Architectural Changes
The backbone of MolPilot follows DecompDiff; the only change is that we replaced the MLP in the q, k, v calculation with LinearNoBias to reduce GPU memory consumption. This decreases memory usage from 42177MB to 34699MB for batch size 8 in the first 500 steps, enabling more efficient training without compromising performance. We report this baseline, trained with the default loss, in Figure 6.
Q2: Sensitivity to Optimal Path Choice
Thank you for this insightful question. We conducted additional experiments interpolating between the linear schedule (coefficient c = 0) and our optimal time-rescaling functions (c = 1), with intermediate values of c giving time functions in between. Our findings show a clear trend of improving performance as we move toward the optimal schedule on both datasets.
For CrossDock:
| c | Vina Score Avg | Vina Score Med | Vina Min Avg | Vina Min Med | QED | SA | Connected | PB-Valid |
|---|---|---|---|---|---|---|---|---|
| 0 (Linear) | -6.63 | -6.94 | -7.06 | -7.06 | 0.55 | 0.76 | 0.95 | 0.95 |
| 0.25 | -6.66 | -6.88 | -7.02 | -7.01 | 0.55 | 0.76 | 0.94 | 0.96 |
| 0.5 | -6.70 | -6.82 | -7.04 | -6.94 | 0.55 | 0.76 | 0.95 | 0.97 |
| 0.75 | -6.87 | -6.97 | -7.21 | -7.11 | 0.55 | 0.76 | 0.96 | 0.96 |
| 1 (Optimal) | -6.92 | -7.02 | -7.23 | -7.18 | 0.55 | 0.75 | 0.96 | 0.95 |
For PoseBusters:
| c | Vina Score Avg | Vina Score Med | Vina Min Avg | Vina Min Med | QED | SA | Connected | PB-Valid |
|---|---|---|---|---|---|---|---|---|
| 0 (Linear) | -7.35 | -7.38 | -7.64 | -7.48 | 0.48 | 0.73 | 0.94 | 0.79 |
| 0.25 | -7.42 | -7.37 | -7.66 | -7.50 | 0.48 | 0.73 | 0.94 | 0.79 |
| 0.50 | -7.20 | -7.36 | -7.56 | -7.51 | 0.48 | 0.73 | 0.95 | 0.78 |
| 0.75 | -7.44 | -7.43 | -7.75 | -7.54 | 0.48 | 0.73 | 0.95 | 0.79 |
| 1 (Optimal) | -7.52 | -7.52 | -7.79 | -7.65 | 0.49 | 0.73 | 0.95 | 0.80 |
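The interpolation in this experiment can be sketched as follows, assuming a simple convex combination of the linear schedule and the optimal time-rescaling function. The combination rule, the function names, and the toy `toy_optimal` curve are hypothetical and chosen only for illustration; the exact form used in the experiments may differ.

```python
from typing import Callable

TimeFn = Callable[[float], float]


def interpolate_schedule(optimal: TimeFn, c: float) -> TimeFn:
    """Blend the identity (linear) schedule with `optimal` using weight c.

    c = 0 gives the linear schedule and c = 1 gives `optimal`; the
    convex combination itself is an assumed form.
    """
    if not 0.0 <= c <= 1.0:
        raise ValueError("c must lie in [0, 1]")

    def schedule(t: float) -> float:
        return (1.0 - c) * t + c * optimal(t)

    return schedule


def toy_optimal(t: float) -> float:
    """Placeholder standing in for a learned optimal time-rescaling."""
    return t ** 2


if __name__ == "__main__":
    for c in (0.0, 0.25, 0.5, 0.75, 1.0):
        f = interpolate_schedule(toy_optimal, c)
        print(c, [round(f(t), 3) for t in (0.0, 0.5, 1.0)])
```

Because both end points map 0 to 0 and 1 to 1 in this toy setup, every interpolated schedule keeps the same start and end times and only changes the pacing in between.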
Additional Clarifications
Q3: Repository and Reproducibility
We sincerely apologize for the oversight regarding the code repository. We had uploaded our code to GitHub before the reviewing process, but failed to verify the auto-update mechanism for the anonymous repository. This unintentional mistake has been rectified, and we have now force-updated the repository with necessary code to ensure reproducibility.
Q4: Notation Clarifications
We appreciate your careful reading that identified these notation inconsistencies. in Equation 2 is actually the discretized for time step in the sequence, derived from time-dependent function , not a constant. is defined as the number of classes for one-hot encoding (with and denoting atom and bond types, respectively). is kronecker function, i.e. the projection from a class index to a one-hot vector with the -th value equal to 1. We will revise the manuscript to improve clarity as requested.
Q5: Molecular Docking Study and Baselines
We thank the reviewer for this important point about training data differences. Following this suggestion, we trained a version of our model on PDBbind from scratch (120 epochs over 1.5 days) to enable a more direct comparison.
| Metric | DiffDock (850 epochs) | Ours (120 epochs) |
|---|---|---|
| RMSD < 2Å (%) | 38.0 | 17.1 |
| RMSD < 2Å & PB-Valid (%) | 12.7 | 15.4 |
Although the comparison remains imperfect, the results are still informative: our primary goal was to demonstrate that ours is the first SBDD model with genuine docking capability, i.e. one that can be repurposed for the docking task. This capability stems from our decoupled training strategy rather than from architectural changes. We will add this clarification to the manuscript.
Q6: Novelty of Generated Compounds
Thank you for this valuable suggestion. We have calculated comprehensive metrics to assess the novelty. As shown in the table below, MolPilot achieves the highest combined metric (0.864) among all methods, demonstrating its ability to generate novel, valid, and unique compounds.
| Method | Uniq | Valid | Novel | Combined (U×V×N) |
|---|---|---|---|---|
| AR | 0.924 | 0.385 | 1 | 0.356 |
| Pocket2Mol | 0.910 | 0.677 | 1 | 0.616 |
| TargetDiff | 1 | 0.812 | 1 | 0.812 |
| DecompDiff | 0.899 | 0.829 | 1 | 0.745 |
| MolCRAFT | 0.841 | 0.995 | 0.983 | 0.823 |
| Ours | 0.898 | 0.974 | 0.987 | 0.864 |
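The Combined column is the product of the three rates. The short check below recomputes it from the rounded table entries; the dictionary simply copies the table, and the variable and function names are illustrative. Values recomputed this way can differ from the reported ones in the last digit, possibly because the reported products were formed before rounding.

```python
# Uniq, Valid, Novel, and the reported Combined value, copied from the table.
rows = {
    "AR": (0.924, 0.385, 1.0, 0.356),
    "Pocket2Mol": (0.910, 0.677, 1.0, 0.616),
    "TargetDiff": (1.0, 0.812, 1.0, 0.812),
    "DecompDiff": (0.899, 0.829, 1.0, 0.745),
    "MolCRAFT": (0.841, 0.995, 0.983, 0.823),
    "Ours": (0.898, 0.974, 0.987, 0.864),
}


def combined(uniq: float, valid: float, novel: float) -> float:
    """Product of uniqueness, validity, and novelty rates."""
    return uniq * valid * novel


for name, (u, v, n, reported) in rows.items():
    value = combined(u, v, n)
    print(f"{name:<11} recomputed={value:.3f} reported={reported:.3f}")
```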
In addition to these metrics, we have calculated the Fréchet ChemNet Distance (FCD), which measures distribution similarity between generated molecules and training distribution (similar to FID for images, lower is better). MolPilot achieves competitive FCD scores that demonstrate its ability to generate novel compounds while faithfully capturing the training distribution.
| Method | AR | Pocket2Mol | TargetDiff | DecompDiff | MolCRAFT | Ours |
|---|---|---|---|---|---|---|
| FCD | 23.5 | 23.0 | 21.0 | 18.1 | 14.2 | 12.9 |
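For reference, FCD uses the same Fréchet distance between two Gaussians that underlies FID, computed on ChemNet activations of the two molecule sets. The sketch below implements only that distance on precomputed means and covariances with NumPy; extracting ChemNet features is out of scope here, and the function name is illustrative rather than taken from any particular package.

```python
import numpy as np


def frechet_distance(mu1, sigma1, mu2, sigma2) -> float:
    """Squared Fréchet distance between N(mu1, sigma1) and N(mu2, sigma2).

    d^2 = ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 (sigma1 sigma2)^(1/2))
    """
    mu1 = np.asarray(mu1, dtype=float)
    mu2 = np.asarray(mu2, dtype=float)
    sigma1 = np.asarray(sigma1, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)

    diff = mu1 - mu2
    # For positive semi-definite covariances, the eigenvalues of
    # sigma1 @ sigma2 are real and non-negative, so the trace of the
    # matrix square root is the sum of their square roots.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * trace_sqrt)
```

In practice the means and covariances would be estimated from feature vectors of generated and reference molecules; identical statistics give a distance of zero.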
Thank you for your responses and clarifications. I have no further questions.
We are glad to know that we have successfully addressed the concerns raised in your careful review, and we will incorporate these responses and clarifications into our revised manuscript. Thank you for your time and helpful comments, which have contributed to improving our paper.
The present paper is concerned with structure-based drug design (SBDD) using a Bayesian Flow Network (BFN). One of the issues when applying BFN to SBDD is that the model has to generate a molecular graph, which is a discrete object, as well as its 3D structure, which is a continuous object. A numerical example in Section 2 demonstrates that many of the related models favor 3D structure over 2D structure, which leads to an unrealistic 3D structure that is incompatible with the 2D structure. Theoretically, when the data has both continuous and discrete parts, the variational lower bound depends on the entire noise schedule, which motivates the authors to find the best noise schedule. The authors have successfully derived a dynamic-programming-based algorithm to obtain the best noise schedule, which is the main theoretical contribution of the paper.
The authors investigate the effectiveness of the proposed method by numerical experiments. One of the highlights is that the proposed method achieves the best PB-Valid score, which suggests that the generated 2D and 3D structures are consistent.
Questions for Authors
N/A
Claims and Evidence
Most of the claims in this paper are clear and well supported by both empirical and theoretical statements.
Methods and Evaluation Criteria
This paper uses a standard benchmark for deep learning-based SBDD, which makes sense to me.
Theoretical Claims
I have followed the theoretical statements, but have not followed their proofs.
Experimental Design and Analyses
The experiments are well designed and are very informative for readers. I have no addition to the experimental protocol.
Supplementary Material
No.
Relation to Broader Scientific Literature
This paper is related to drug discovery, and in this sense, is connected to the broader scientific literature.
Essential References Not Discussed
No
Other Strengths and Weaknesses
N/A
Other Comments or Suggestions
N/A
We sincerely thank the reviewer for the thorough reading of our manuscript and the positive evaluation. We are pleased that the reviewer recognized the key challenges and contributions of our work on deriving optimal schedules for structure-based drug design, and we certainly welcome further discussion.
Regarding Theoretical Contributions
We appreciate the reviewer's acknowledgment of one of our main contributions - the dynamic programming algorithm for VLB-optimal noise scheduling when dealing with the twisted discrete-continuous probability path. We believe the theoretical foundation is sound and provides an important advancement for handling the challenging joint generation of molecular graphs (discrete) and their 3D structures (continuous).
Regarding Experimental Results
We are particularly encouraged by the reviewer's recognition of our experimental design and results, especially highlighting our improvement in PB-Valid. As the reviewer correctly noted, this metric demonstrates our method's ability to generate consistent 2D and 3D molecular structures - addressing a key limitation in previous approaches that often favored 3D structure at the expense of 2D topology.
We thank the reviewer again for the supportive evaluation. We remain committed to further developing and refining our method to contribute to the field of structure-based drug design.
The paper introduces a novel approach, VLB-Optimal Scheduling (VOS), for improving generative modeling in structure-based drug design.
The theoretical contribution is well-recognized, particularly in how noise schedules affect variational lower bounds in multimodal generative modeling.
The experimental results demonstrate strong performance, with a remarkable PoseBusters passing rate of 95.9% on CrossDock.
Concerns
- Initially, the linked repository lacked code, raising concerns about reproducibility.
- Reviewers identified inconsistencies in notation that could affect readability.
- One reviewer suggested that more baselines should be evaluated, particularly since the model was trained on different datasets than its competitors.
- One reviewer questioned why VOS was not tested across different frameworks to demonstrate broader applicability.
Rebuttal & Discussions
- The authors provided missing code in the repository to ensure reproducibility.
- Clarified notation inconsistencies by explaining definitions and revising documentation for readability.
- Conducted additional experiments to validate the sensitivity of their approach to optimal noise scheduling.
- Expanded comparisons to include other frameworks (e.g., integrating VOS into TargetDiff).
This is a borderline paper. I recommend accepting it if there is available space.