Equivariant Scalar Fields for Molecular Docking with Fast Fourier Transforms
Abstract
Reviews and Discussion
The paper discusses a machine learning approach to speed up molecular docking in virtual screening. The authors introduce a scoring function using fast Fourier transforms and cross-correlation of ligand and protein scalar fields, proving more efficient on benchmark tasks than prior methods. It performs quicker on crystal structures and more reliably on computationally predicted structures.
Strengths
- The speedup mainly comes from factorization and amortization. Factorizing the interaction into two scalar fields is a novel and effective idea. By computing cross-correlation, the cost of pre-computations can be amortized over a group of operators.
- The manuscript is well organized, clearly written and easy to follow.
- The experimental results are impressive.
Weaknesses
- The selection of hyper-parameters seems opaque in this document. There's no clear indication of most hyper-parameters being fine-tuned, or it's not showcased, leaving readers unsure about which hyper-parameter has the most substantial influence on the final outcome. For instance, how were decisions made regarding the number of channels or the quantity of spherical harmonics?
Questions
- Could you please clarify what type of information is captured by the different channels?
- Is the method predominantly dependent on geometric data, such as shape complementarity? Is the chemical data thoroughly utilized?
- Ellen (cryoDRGN2) lays out a coarse-to-fine approach that subdivides the grid into finer segments. Would it be feasible to incorporate such a technique into this project?
Thank you for the review and the assessment of the strengths of our work. To address your questions and concerns:
The selection of hyper-parameters seems opaque
We have written a new section (Appendix E; pages 22–23) on hyperparameters, which hopefully clarifies the selection of both training-time and inference-time hyperparameters.
- Our present settings for training-time hyperparameters provide a good balance between performance and runtime, and all the results in the main text were obtained with these settings. We also present new results from a more thorough hyperparameter search and find that in certain cases the performance can be further improved.
- Inference-time hyperparameters can indeed influence the final outcome, and we provide an organized, visual analysis of this influence, based on the existing full results from Appendix G.2.
Questions
"Could you please clarify what type of information is captured by the different channels?"
- As with most deep learning models, it is difficult to interpret the semantic content of learned representations. We note that all the channels are treated identically, so the segmentation of different types of information is governed solely by training dynamics. We have provided visualizations in Appendix C (pages 19–20) which illustrate that the channels appear to differ from each other, but are hesitant to speculate about specific interpretations.
"Is the method predominantly dependent on geometric data, such as shape complementarity? Is the chemical data thoroughly utilized?"
- While it is not clear how to rigorously assess if a model is "thoroughly utilizing" chemical data, we have conducted two ablation experiments to try to probe this question.
- We run the default ESF-N model without providing chemical identities of any atoms. We call this test-time chemical ablation. The drawback of this approach is that the training and testing distributions are misaligned.
- We have trained from scratch a new model which has never seen any chemical identities. We call this train-time chemical ablation. The drawback of this approach is that the chemical identity of many atoms can be very easily deduced (for example, the identity of almost all sidechain atoms) and used implicitly.
- We evaluate these two approaches on the conformer docking benchmark with PDBBind crystal structures, and compare them to the baseline. We can ascertain that the chemical information is contributing to the model performance, although the extent of this contribution is likely between the two extremes.
| Model | TF % <2 Å RMSD | TF Med. RMSD (Å) | RF % <2 Å RMSD | RF Med. RMSD (Å) |
|---|---|---|---|---|
| Baseline (ESF-N) | 72 | 1.10 | 73 | 1.00 |
| Test-time ablation | 30 | 5.44 | 29 | 5.43 |
| Train-time ablation | 65 | 1.21 | 65 | 1.20 |
"Ellen (cryoDRGN2) lays out a coarse-to-fine approach that subdivides the grid into finer segments. Would it be feasible to incorporate such a technique into this project?"
- This is an excellent suggestion, and one of many modifications or refined search procedures that could be incorporated into the method. This progressive subdivision would make the most sense as an improvement to the grid search in the TF procedure.
Please let us know if this resolves your concerns; if not, we would welcome the opportunity to address them further.
Thanks for your response to resolve my concerns.
This paper proposes learning scalar fields for proteins and ligands using equivariant neural networks. The fields are cross-correlated to define a scoring function that enables rapid optimization over rigid-body degrees of freedom with fast Fourier transforms (FFTs).
Strengths
- The idea of learning cross-correlation based scoring functions to enable FFT optimization is interesting.
- The experiments demonstrate the speed advantage over traditional tools.
Weaknesses
- Too many model settings to choose from in practice.
- The method is only evaluated on simplified tasks, not a complete docking benchmark. Performance in a full virtual screening workflow remains to be tested.
- Lack of baselines.
Questions
- Are there any ablation experiments on FFT?
- Why is there no significant improvement in the scoring of crystal structures?
Thank you for the review and the assessment of the strengths of our work. To address your questions and concerns:
Too many model settings to choose from in practice
At a high level, we agree that there is a balance between the customizability of a method and its ease of use. In our case, however, our method is actually very straightforward to use—the workflow will dictate the model settings:
- If the user has a set of candidate poses generated by some other software, we use the Scoring procedures, and the Optimization procedures are not relevant.
- If the user does not have a set of candidate poses and wishes to generate them by optimizing our ESF scoring function, we use the Optimization procedures, and the Scoring procedures are not relevant.
The decision between Rotational and Translational modes is also simpler than it may appear. In the case of PDBBind, where we have a unique protein and ligand per complex, it is true that either mode could be used. However, in reality one almost always has an imbalance between the number of proteins and ligands. In this case, the runtime amortization considerations will additionally dictate the mode as the table below displays:
| | Scoring | Optimization |
|---|---|---|
| More proteins than ligands | TS | TF |
| More ligands than proteins | RS | RF |
For example, the virtual screening-like PDE10A setting corresponds to the lower right corner, and correspondingly we see in Table 3 (page 9) that the RF mode is preferable. To illustrate this more systematically, we have included a new section (Appendix F; pages 24–25) discussing how runtime amortization guides the selection of inference procedure. We hope this discussion makes the selection of model settings clear.
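The selection rule described above can be sketched in a few lines of Python. This is only an illustration: the function name `select_procedure` and its arguments are hypothetical, and the tie-breaking choice for equal counts is an assumption, since the discussion only states that either mode could be used in that case.

```python
def select_procedure(has_candidate_poses: bool, n_proteins: int, n_ligands: int) -> str:
    """Return "TS", "TF", "RS", or "RF" following the selection table."""
    # The workflow decides Scoring (S) vs. Optimization (F):
    # existing candidate poses are scored, otherwise poses are optimized.
    task = "S" if has_candidate_poses else "F"
    # Runtime amortization decides Translational (T) vs. Rotational (R).
    # Assumption: a tie falls back to "T"; the source leaves this case open.
    mode = "T" if n_proteins >= n_ligands else "R"
    return mode + task


# A virtual-screening-like setting: one pocket, many ligands, no poses yet.
print(select_procedure(has_candidate_poses=False, n_proteins=1, n_ligands=1000))
```

For a PDE10A-like setting with a single pocket and many ligands, this returns the RF procedure, matching the lower right corner of the table.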
The method is only evaluated on simplified tasks
We have focused on these simplified tasks because they evaluate the quality of the scoring function and its optimization over SE(3), which is the focus of our contribution. Integration with the various design choices possible for other aspects of the virtual screening workflow can be quite subtle and would significantly expand the technical scope of the paper; we think it is best left to future work. With the present experiments, we believe our work is ready to serve as an upstream technical idea that can spur new directions of research in ML for docking.
Lack of baselines
We provide new experimental results (Table 2; page 8) comparing our method against recent deep learning baselines DiffDock and TANKBind. We adapt these methods to be used in scoring function modes:
- For DiffDock, we use the confidence model to score all poses.
- For TANKBind, we compute the matrix of predicted interatomic distances with the given pocket, and use the so-called “generation loss” of TANKBind to score all poses.
We do not evaluate these methods on pocket-level conformer docking as they cannot be easily adapted for this task.
The new results are excerpted below (Table 2; page 8):
- Relative to DiffDock, we obtain similar or better performance with much faster runtime (3–5 orders of magnitude)
- Relative to TANKBind, we obtain much better performance and similar per-pose runtime, although TANKBind is faster per-complex due to the lack of preprocessing steps (in the limit of many protein-complex pairs, this advantage vanishes).
| Method | Crystal Top RMSD (Å) | Crystal % <2 Å | ESMFold Top RMSD (Å) | ESMFold % <2 Å | Runtime per-pose | Runtime per-complex |
|---|---|---|---|---|---|---|
| TANKBind | 4.01 | 10 | 4.22 | 9 | 1.1 μs | 62 ms |
| DiffDock | 0.66 | 87 | 2.01 | 50 | 62 ms | 2041 s |
| Ours (ESF-TS) | 0.59 | 87 | 1.38 | 57 | 1.0 μs | 3.2 s |
Are there any ablation experiments on FFT?
While we do not report results using direct computation of the cross correlation in lieu of FFT, the significant asymptotic and practical runtime benefits of FFT are well-known. A brief conceptual analysis can illustrate this point:
- In Table 1 (page 7), we see that the cost of a direct evaluation of the translational cross-correlation at a single point is 1.0 μs. With FFTs, we evaluate roughly 64,000 such cross-correlations in 160 μs, whereas a direct computation would take 64 ms. Hence, the FFT procedure gives a 400x speedup.
- Similarly, the cost of a direct evaluation of the rotational cross-correlation at a single orientation is 8.2 μs. With FFTs, we evaluate 4608 such cross-correlations in 650 μs, whereas a direct computation would take 38 ms. Hence, the FFT procedure gives a 58x speedup.
Recall that in both the TF and RF procedures, we evaluate on the order of 1000 FFTs per complex, so substituting these FFTs with direct computation would result in docking runtimes on the order of 1 minute per complex, even in the limit with runtime amortization. Hence, without the FFT, our method compares unfavorably to existing runtimes and the advantage of learning a scalar-field based scoring function is definitively lost.
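The arithmetic behind these estimates can be checked with a short script. It assumes the per-evaluation and per-FFT timings are in microseconds, which is the only reading consistent with the stated 400x and 58x speedups. The translational grid size is not quoted directly and is inferred here from the timings; all variable names are illustrative.

```python
# Timings quoted in the reply (from Table 1), converted to seconds.
t_direct_point = 1.0e-6    # one direct translational evaluation (assumed microseconds)
t_fft_trans = 160e-6       # one translational FFT (assumed microseconds)
t_direct_trans = 64e-3     # direct computation over the same translational grid

# Inferred, not quoted: number of grid points covered by one translational FFT.
n_trans = t_direct_trans / t_direct_point
speedup_trans = t_direct_trans / t_fft_trans

t_direct_orient = 8.2e-6   # one direct rotational evaluation (assumed microseconds)
n_rot = 4608               # orientations covered by one rotational FFT
t_fft_rot = 650e-6         # one rotational FFT (assumed microseconds)
t_direct_rot = n_rot * t_direct_orient
speedup_rot = t_direct_rot / t_fft_rot

# About 1000 FFTs per complex, each replaced by a direct translational computation.
n_ffts = 1000
docking_without_fft = n_ffts * t_direct_trans  # seconds per complex

print(f"translational: {n_trans:.0f} points, {speedup_trans:.0f}x speedup")
print(f"rotational: {t_direct_rot * 1e3:.0f} ms direct, {speedup_rot:.0f}x speedup")
print(f"docking without FFT: about {docking_without_fft:.0f} s per complex")
```

Under these assumptions the script reproduces the 400x and 58x figures and gives roughly one minute per complex without FFTs.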
Why is there no significant improvement in the scoring of crystal structures?
Traditional pairwise scoring functions aim to accurately model physical interactions; thus, they are expected to perform very well on crystal structures in which the protein is perfectly positioned to interact with the ligand. Indeed, the performance on scoring crystal structures (Table 2; page 8) is nearly perfect and offers little room for improvement. However, this does not reflect the use case in practice, where the ground-truth protein coordinates are unknown and must be approximated. Data-driven methods have the most to offer in the latter setting, represented by our strong improvements on ESMFold structures.
On the other hand, while the traditional scoring functions are accurate on crystal structures, they remain expensive to optimize. Our aim on crystal structures is to accelerate this optimization, rather than increase the accuracy of the scoring function itself. We accomplish this objective, learning a scoring function of comparable accuracy to existing pairwise scoring functions (Table 2; page 8) but allowing for significantly faster optimization (Table 3; page 9).
Please let us know if this resolves your concerns; if not, we would welcome the opportunity to address them further.
As the discussion period nears its end, we would love the opportunity to learn if these changes resolve your concerns or if they can be addressed further. Thank you!
This paper explores the use of machine learning to accelerate the process of molecular docking, specifically by learning a scoring function that allows for more rapid optimization over rigid-body degrees of freedom using fast Fourier transforms. The scoring function is defined as the cross-correlation of multi-channel ligand and protein scalar fields, and is parameterized by equivariant graph neural networks. The method is benchmarked on two simplified docking-related tasks: decoy pose scoring and rigid conformer docking. The authors demonstrate that their method achieves similar performance to existing scoring functions but with faster runtime. They also highlight the advantages of runtime amortization in virtual screening settings with a common binding pocket.
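To make the central idea concrete, the following is a minimal NumPy sketch of a translational cross-correlation score computed both directly and with FFTs. It is a toy on small periodic voxel grids, not the paper's parameterization: the multi-channel arrays, the cyclic boundary conditions, and the function names are all assumptions chosen for illustration.

```python
import numpy as np


def cross_correlation_direct(protein, ligand):
    """Score each cyclic translation t as sum_c sum_x protein[c, x + t] * ligand[c, x]."""
    n = protein.shape[1]
    scores = np.zeros((n, n, n))
    for tx in range(n):
        for ty in range(n):
            for tz in range(n):
                # np.roll with a negative shift gives shifted[c, x] = protein[c, x + t].
                shifted = np.roll(protein, shift=(-tx, -ty, -tz), axis=(1, 2, 3))
                scores[tx, ty, tz] = np.sum(shifted * ligand)
    return scores


def cross_correlation_fft(protein, ligand):
    """Same scores for all translations at once, via the correlation theorem."""
    axes = (1, 2, 3)
    protein_hat = np.fft.fftn(protein, axes=axes)
    ligand_hat = np.fft.fftn(ligand, axes=axes)
    # For real fields, correlation in real space is a product with the
    # complex conjugate in Fourier space; channels are summed at the end.
    per_channel = np.fft.ifftn(protein_hat * np.conj(ligand_hat), axes=axes).real
    return per_channel.sum(axis=0)


rng = np.random.default_rng(0)
protein_field = rng.normal(size=(3, 6, 6, 6))  # (channels, x, y, z), random stand-in
ligand_field = rng.normal(size=(3, 6, 6, 6))

direct = cross_correlation_direct(protein_field, ligand_field)
fast = cross_correlation_fft(protein_field, ligand_field)
assert np.allclose(direct, fast)
```

The direct version costs one full grid sum per translation, while the FFT version evaluates every translation with a handful of transforms, which is the source of the speedup discussed in this thread. Non-periodic fields would need zero-padding before the transforms.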
Strengths
- The paper proposes a novel approach for learning protein-ligand scoring functions based on cross-correlations of scalar fields. This approach enables the use of Fast Fourier Transforms (FFTs) for rapid search and optimization, which can significantly accelerate pose optimization in molecular docking.
- The paper introduces runtime amortization techniques that lead to significant speed improvements in the scoring function. By precomputing certain coefficients and leveraging the common pocket structure in docking, the total runtime of the method is accelerated, making it more efficient for large-scale docking tasks.
- The proposed scoring function shows comparable performance to existing scoring functions on two simplified docking-related tasks: decoy pose scoring and rigid conformer docking. In some cases, the proposed method even outperforms existing scoring functions, particularly on ESMFold structures.
Weaknesses
- Lack of Detailed Hyperparameter Analysis: The paper mentions that hyperparameters are detailed in Appendix D, but the specific hyperparameters used in the experiments are not provided in the extracted content. Without this information, it is difficult to understand the impact of different hyperparameter settings on the performance of the proposed method.
- Limited Discussion on Runtime Amortization: The paper mentions the benefits of runtime amortization with the proposed method, but there is limited discussion on the practical implications and potential limitations of this approach. Further analysis and discussion on the tradeoff between runtime improvements and accuracy would provide a more comprehensive understanding of the method's performance.
Questions
- The evaluation of the proposed method is limited to simplified docking-related tasks, such as decoy pose scoring and rigid conformer docking. While the results on these tasks are promising, it is unclear how the method would perform in more complex and realistic docking scenarios.
- The paper compares against Vina and Gnina scoring functions in experiments. I would like to know how this method compares with other SOTA methods/scoring functions [1, 2, 3, 4].
[1] Corso, Gabriele, et al. "Diffdock: Diffusion steps, twists, and turns for molecular docking." arXiv preprint arXiv:2210.01776 (2022).
[2] Stärk, Hannes, et al. "Equibind: Geometric deep learning for drug binding structure prediction." International conference on machine learning. PMLR, 2022.
[3] Koes, David Ryan, Matthew P. Baumgartner, and Carlos J. Camacho. "Lessons learned in empirical scoring with smina from the CSAR 2011 benchmarking exercise." Journal of chemical information and modeling 53.8 (2013): 1893-1904.
[4] Lu, Wei, et al. "Tankbind: Trigonometry-aware neural networks for drug-protein binding structure prediction." Advances in neural information processing systems 35 (2022): 7236-7249.
Thank you for the review. We are currently preparing a more detailed reply, but want to ask for clarification about:
Lack of Detailed Hyperparameter Analysis: … the specific hyperparameters used in the experiments are not provided in the extracted content.
We describe hyperparameters used in the experiments in Appendix D.2 and provide additional results for varying hyperparameters in Appendix E. Could you please let us know if this is what you meant, or would you like to see something else?
Yes, this is largely what I meant, but not all of it. I would like to see this section better organized, e.g., a table listing all tunable hyperparameters with their ranges and best values, and a set of 3D surface plots showing more directly how the hyperparameters affect the metric.
Thank you for the review and the assessment of the strengths of our work. To address your questions and concerns:
Lack of Detailed Hyperparameter Analysis
We have written a new section (Appendix E; pages 22–23) on hyperparameter analysis (both inference-time and training-time hyperparameters).
- We provide a more organized analysis of the impact of various inference time hyperparameters with supporting tables and figures similar to the format requested. Thanks for the suggestion!
- Our present settings for training-time hyperparameters provide a good balance between performance and runtime, and all the results in the main text were obtained with these settings. We also present new results from a more thorough hyperparameter search and find that in certain cases the performance can be further improved.
Limited Discussion on Runtime Amortization
We have written a new section (Appendix F; pages 24–25) discussing the implications of runtime amortization. We would also like to clarify a few points:
- The amortization is merely the re-use of quantities that would otherwise be identically computed multiple times. Hence, there is no associated runtime-performance tradeoff (performance is unaffected).
- A more general discussion of the runtime-performance tradeoff of our method relative to other methods is provided in Appendix G.3.
- We illustrate the practical implications of runtime amortization via the PDE10A test set, which was collected specifically for this purpose. While it may appear limited, this is actually the largest possible illustration because PDB structures with different druglike molecules binding to otherwise identical protein structures are rare (to our knowledge PDE10A is the largest such set).
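The point that amortization only re-uses identical computations can be illustrated with a small caching sketch. Everything here is a stand-in: `protein_field` and `ligand_field` are hypothetical placeholders for the expensive per-structure computations, and the returned numbers carry no chemical meaning.

```python
from functools import lru_cache

calls = {"protein": 0, "ligand": 0}


@lru_cache(maxsize=None)
def protein_field(protein_id: str) -> tuple:
    """Placeholder for an expensive per-protein computation (hypothetical)."""
    calls["protein"] += 1
    return tuple(float(ord(ch)) for ch in protein_id)


@lru_cache(maxsize=None)
def ligand_field(ligand_id: str) -> tuple:
    """Placeholder for an expensive per-ligand computation (hypothetical)."""
    calls["ligand"] += 1
    return tuple(float(ord(ch)) for ch in ligand_id)


def score(protein_id: str, ligand_id: str) -> float:
    # The cached fields are exactly what would be recomputed otherwise,
    # so caching changes the runtime but not the score.
    p = protein_field(protein_id)
    l = ligand_field(ligand_id)
    return sum(a * b for a, b in zip(p, l))


ligands = ["lig_a", "lig_b", "lig_c"]
scores = [score("PDE10A", lig) for lig in ligands]
print(calls)
```

With one shared pocket and three ligands, the protein-side computation runs once and the ligand-side computation runs three times, while each score is identical to what an uncached run would produce.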
Evaluation on simplified docking-related tasks
We have focused on these simplified tasks because they evaluate the quality of the scoring function and its optimization over SE(3), which is the focus of our contribution. Integration with the various design choices possible for other aspects of the virtual screening workflow can be quite subtle and would significantly expand the technical scope of the paper; we think it is best left to future work. With the present experiments, we believe our work is ready to serve as an upstream technical idea that can spur new directions of research in ML for docking.
Comparisons with other SOTA/scoring functions
We provide new experimental results (Table 2; page 8) comparing our method against DiffDock and TANKBind. We adapt these methods to be used in scoring function modes:
- For DiffDock, we use the confidence model to score all poses.
- For TANKBind, we compute the matrix of predicted interatomic distances with the given pocket, and use the so-called “generation loss” of TANKBind to score all poses.
EquiBind cannot be configured to either score a given pose or to dock a rigid conformer into a given pocket, making a direct comparison impossible.
Re: SMINA: we make a terminology distinction between the “scoring function” and the “docking program.” The Smina docking program is the parent program of the Gnina docking program which we use to run baselines (the two are identical except the latter also supports the CNN-based Gnina scoring function). The older Smina docking program by default optimizes the Vina scoring function: hence, the results from our Vina rows are equivalent to those reported as “Smina” elsewhere.
The new results are excerpted below (Table 2; page 8):
- Relative to DiffDock, we obtain similar or better performance with much faster runtime (3–5 orders of magnitude)
- Relative to TANKBind, we obtain much better performance and similar per-pose runtime, although TANKBind is faster per-complex due to the lack of preprocessing steps (in the limit of many protein-complex pairs, this advantage vanishes).
| Method | Crystal Top RMSD (Å) | Crystal % <2 Å | ESMFold Top RMSD (Å) | ESMFold % <2 Å | Runtime per-pose | Runtime per-complex |
|---|---|---|---|---|---|---|
| TANKBind | 4.01 | 10 | 4.22 | 9 | 1.1 μs | 62 ms |
| DiffDock | 0.66 | 87 | 2.01 | 50 | 62 ms | 2041 s |
| Ours (ESF-TS) | 0.59 | 87 | 1.38 | 57 | 1.0 μs | 3.2 s |
Please let us know if this resolves your concerns; if not, we would welcome the opportunity to address them further.
As the discussion period nears its end, we would love the opportunity to learn if these changes resolve your concerns or if they can be addressed further. Thank you!
- Claims and findings:
The paper introduces an approach to speed up molecular docking in virtual screening. The authors introduce a scoring function using fast Fourier transforms and cross-correlation of ligand and protein scalar fields, proving more efficient on benchmark tasks than prior methods. The method is benchmarked on two simplified docking-related tasks: decoy pose scoring and rigid conformer docking. The authors demonstrate that their method achieves similar performance to existing scoring functions but with faster runtime.
- Strengths:
Reviewers have highlighted that the submission introduces a novel approach for learning protein-ligand scoring functions based on cross-correlations of scalar fields. This approach enables the use of Fast Fourier Transforms (FFTs) for rapid search and optimization, which can significantly accelerate pose optimization in molecular docking. In addition, reviewers note that the speedup mainly comes from factorization and amortization and that the experimental results are impressive.
- Weaknesses:
Reviewers have noted that the method is only evaluated on simplified tasks, not a complete docking benchmark. Performance in a full virtual screening workflow remains to be tested. In addition, there appears to be a large number of hyper-parameters to tune for customizability, and finding good values for them requires careful experimentation.
- Missing in submission:
Experiments in a full virtual screening workflow would make this paper an outstanding contribution.
Why not a higher score
End-to-end evaluation of the proposed approach in a full virtual screening workflow has not been tested.
Why not a lower score
Reviewers collectively agree that this paper is borderline, leaning toward acceptance. I believe the authors have carefully addressed the concerns raised by reviewer BpA5; thus I recommend the paper for acceptance.
Accept (poster)