PaperHub · ICLR 2024 · Poster
Overall rating: 6.3/10 (3 reviewers; individual ratings 8, 5, 6; min 5, max 8, std 1.2)
Average confidence: 4.0

Learning Over Molecular Conformer Ensembles: Datasets and Benchmarks

Submitted: 2023-09-22 · Updated: 2024-03-06
TL;DR

We present the MoleculAR Conformer Ensemble Learning (MARCEL) benchmark that comprehensively evaluates the potential of learning on conformer ensembles across a diverse set of molecules, datasets, and models.

Abstract

Keywords
conformer ensembles, geometric learning

Reviews and Discussion

Official Review
Rating: 8

This work presents MARCEL, a novel dataset and benchmark for studying molecular conformer ensemble learning. MARCEL curates multiple datasets in which every molecule has many molecular conformers, and benchmarks several baseline methods for predicting molecular properties from multiple molecular conformers.

Strengths

Originality: This work curates novel datasets and benchmarks for an under-explored problem of molecular conformer ensemble learning.
Quality: Detailed information about dataset curation, baseline experiment settings, and results is clearly elaborated.
Clarity: The writing of this paper is excellent and well-organized.
Significance: The presented MARCEL benchmark will be useful and impactful for researchers to develop novel molecule representation learning methods on multiple molecular conformers.

Weaknesses

(1) For 3D models, it is recommended to add at least one 3D graph transformer model as a baseline, such as Equiformer [1].
(2) It is recommended to add a discussion of [2], as [2] proposes a molecular conformer ensemble learning module named ConfDSS. It is also recommended to add it as a baseline if it can be applied to the tasks in MARCEL.

[1] Equiformer: Equivariant Graph Attention Transformer for 3D Atomistic Graphs. ICLR 2023.
[2] Fast Quantum Property Prediction via Deeper 2D and 3D Graph Networks. arXiv:2106.08551.

Questions

In Table 2, which molecular conformers are used as inputs to 3D graph neural network models?

Comment

Thank you very much for your constructive feedback and positive comments! For your main concerns, we would like to make the following clarifications.

W1. For 3D models, it is recommended to add at least one 3D graph transformer model as a baseline, such as Equiformer [1].

A: Thank you for pointing out this baseline model. We have conducted experiments with Equiformer on the Kraken dataset; the results are shown in the table below. In most cases, our ensemble learning strategies improve the performance of the vanilla Equiformer. Given the time constraints of the rebuttal, we may not be able to extend these tests to all 12 tasks. However, we commit to including a comprehensive analysis and discussion of Equiformer in the camera-ready version of our paper.

| Model | Sterimol B5 | Sterimol burB5 | Sterimol L | Sterimol burL |
|---|---|---|---|---|
| Equiformer | 0.2363 | 0.1775 | 0.3468 | 0.1249 |
| Equiformer + Sampling | 0.2154 | 0.1528 | 0.3345 | 0.1170 |
| Equiformer + Mean | 0.1900 | 0.1627 | 0.2840 | 0.1103 |
| Equiformer + DeepSets | 0.2040 | 0.1548 | 0.3105 | 0.1185 |
| Equiformer + Attention | 0.2440 | 0.1702 | 0.3358 | 0.1339 |
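
For concreteness, a minimal sketch of the kind of pooling the Mean, DeepSets, and Attention strategies perform on top of per-conformer embeddings (module names and sizes below are illustrative assumptions, not our exact implementation):

```python
import torch
import torch.nn as nn

class ConformerEnsemblePooling(nn.Module):
    """Pools per-conformer embeddings (e.g., from a 3D encoder such as Equiformer)
    into one molecule-level embedding. Illustrative sketch, not the exact MARCEL code."""
    def __init__(self, dim: int, strategy: str = "mean"):
        super().__init__()
        self.strategy = strategy
        if strategy == "deepsets":
            self.phi = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
            self.rho = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        elif strategy == "attention":
            self.score = nn.Linear(dim, 1)  # one attention logit per conformer

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (num_conformers, dim) embeddings of one molecule's conformer ensemble
        if self.strategy == "mean":
            return h.mean(dim=0)
        if self.strategy == "deepsets":
            return self.rho(self.phi(h).sum(dim=0))
        if self.strategy == "attention":
            w = torch.softmax(self.score(h), dim=0)  # (num_conformers, 1) weights
            return (w * h).sum(dim=0)
        raise ValueError(f"unknown strategy: {self.strategy}")

# Usage with 10 hypothetical conformer embeddings of width 128:
pool = ConformerEnsemblePooling(dim=128, strategy="deepsets")
mol_embedding = pool(torch.randn(10, 128))  # -> tensor of shape (128,)
```

All three strategies are permutation-invariant over the conformer axis, which is why they can be applied to ensembles of arbitrary size.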

W2. It is recommended to add a discussion of [2], as [2] proposes a molecular conformer ensemble learning module named ConfDSS. It is also recommended to add it as a baseline if it can be applied to the tasks in MARCEL.

A: Thank you for bringing this paper to our attention! ConfDSS integrates 2D and 3D GNN models for learning from conformer ensembles. This approach is indeed more complex compared to our current ensemble learning models (Mean, DeepSets, and Attention). Acknowledging the significance of ConfDSS in the context of our work, we will include a detailed discussion about it in the final version of our paper.

Q1. In Table 2, which molecular conformers are used as inputs to 3D graph neural network models?

A: For both training and inference on the Drugs-75K, Kraken, and EE datasets, all single-conformer 3D models operate on the lowest-energy conformer of each conformer ensemble, which has the largest Boltzmann weight. Since imprecise conformers from Open Babel are encoded for the BDE task, we use a fixed, randomly sampled conformer for each unbound and bound catalyst during training and inference. It is worth noting that using the lowest-energy conformer during the test stage is actually somewhat unrealistic: without extensive geometric optimization across the entire conformer space, it is often not possible to precisely determine the lowest-energy conformer, particularly for the large, flexible molecules relevant to the drug discovery and computational chemistry communities.
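
As an illustrative aside (hypothetical numbers, not part of our pipeline), the relationship between relative conformer energies, Boltzmann weights, and the choice of input conformer can be sketched as:

```python
import numpy as np

def boltzmann_weights(rel_energies_kcal_mol, temperature_k=298.15):
    """Normalized Boltzmann weights of an ensemble from relative conformer energies."""
    e = np.asarray(rel_energies_kcal_mol, dtype=float)
    kT = 0.0019872043 * temperature_k   # gas constant in kcal/(mol*K)
    w = np.exp(-(e - e.min()) / kT)     # shift by the minimum for numerical stability
    return w / w.sum()

energies = [0.0, 0.4, 1.3]              # hypothetical relative energies (kcal/mol)
weights = boltzmann_weights(energies)
input_idx = int(np.argmax(weights))     # lowest energy <=> largest Boltzmann weight
```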

Comment

I appreciate the authors' hard work in the rebuttal. All my concerns and questions have been well addressed. I will keep my rating.

Comment

I am happy with the revisions made. Thanks to the authors for their efforts.

Official Review
Rating: 5

This paper introduces a new benchmark named MARCEL, which consists of four tasks: Drugs-75K, Kraken, EE, and BDE. The goal is to evaluate learning with multiple conformers for each molecule. In traditional evaluations of molecular machine learning, the dynamic nature of molecules taking on various possible conformers has been somewhat overlooked. MARCEL addresses this by providing a set of possible conformers for each molecule and defining the task of predicting the Boltzmann average of molecular properties over that set of conformers, i.e., the conformer ensemble. Using this benchmark data, the paper also provides comprehensive empirical evaluations of widely used 1D, 2D, and 3D GNNs, and examines two ensemble learning strategies for situations where a set of conformations is available.

Strengths

In molecular machine learning, considering the dynamic structural transitions of molecules is an extremely important point. While there are studies predicting molecular dynamics simulations through machine learning, and existing research examining the impact and significance of conformers on machine learning predictions, the data and tasks are extremely limited. Thus, objectively comparing multiple methods on the same foundation is challenging. In this context, the benchmark proposed in this paper is extremely intriguing. Moreover, even if one wishes to consider multiple conformers for each molecule in machine learning evaluations, preparing such data can be difficult without specialized knowledge. Considering these points, establishing such a benchmark and sharing it within the molecular machine learning community has the potential to enable more constructive methodological research and analysis.

In this paper, not only is a dataset provided, but comprehensive baseline evaluations are also given that would be useful for researchers looking to enter this field of study. In particular, comprehensive evaluation is conducted using multiple popular GNN models in 1D, 2D, and 3D. These results offer insight into how machine learning methods at each representation level are affected by actual conformation changes, providing very valuable knowledge.

Weaknesses

In the four tasks developed in this study, the objective is defined as predicting the Boltzmann average of various properties over multiple conformers. Under this goal setting, it seems intuitive that using information from multiple conformers would naturally improve prediction accuracy. Therefore, it has not been proven that 'considering multiple conformers contributes to machine learning predictions of real data (e.g., actual experimental measurements of molecules rather than computed values).' In this sense, the utility of this benchmark remains a bit artificial, and its practical value is unclear.

It is possible that machine learning predictions based solely on the ground-state structure, as traditionally done, are already practically useful. Regarding how considering multiple conformers contributes to molecular machine learning, the contribution of this study might be limited.

Additionally, since all four datasets prepared are secondary data from referenced primary data, it's unclear how challenging it would be for researchers to prepare them on their own. While the paper mentions quality control and the removal of redundancies, it's unclear whether there is any original information added to this study.

Questions

Q1. While Drugs-75K (a subset of GEOM-Drugs [27]), Kraken [33], EE [34], and BDE [36] all have associated citations, indicating they are secondary data, it was unclear whether the presented datasets are simply curated versions of these primary data, or whether any original information was generated in this study. If there is any original information, it would be very helpful for this to be clarified.

Q2. Regarding 'Dataset preparation' in the Supplementary Material, why are different methods used to generate conformers depending on the data? (Auto3D for Drugs-75K, Q2MM for EE, Open Babel with DFT?) Is this point not problematic for benchmarking several methods?

Q3. While I understand that the 'two conformer ensemble learning strategies' are useful for predicting the 'Boltzmann-averaged value of each property across the conformer ensemble,' can they be said to be generally useful for predicting molecular properties? Could you provide any supported evidence for this claim?

Q4. Is the BDE task also about predicting the Boltzmann-averaged value?

Comment

Thank you for the informative and detailed responses! They helped a lot in understanding this work better.

As I emphasized in "Strengths", I agree with the point that we should consider conformer ensembles to understand the essentially dynamic nature of molecules. I also understood that the prediction of Boltzmann-averaged properties can still be useful in computational chemistry research, and I agree with the rationale for why the paper uses simulated (noiseless) data; it was also good to confirm that developing MARCEL involved considerable computational effort in full DFT and the other required simulations.

However, the answer also told us "that historically it is not yet shown that conformer ensembles could improve performance on real-world molecular tasks." The presented example of Gómez-Bombarelli [1] was thought-provoking. This also leaves me ambivalent about the contribution of this work, which aims to present a new benchmark environment.

So the paper's standpoint is: considering multiple conformers has not been proven to be useful in real-world predictive tasks, but we should have a benchmark setup independent of that potential practical usefulness, just to quantitatively evaluate ensembling methods when we can explicitly access multiple conformers?

Since this paper presents a benchmarking environment, it is still a bit unclear to me what we can claim if we develop a good method that performs well on these proposed benchmarks (if the goal is separated from practical usefulness).

It'll be helpful if you can provide any further implications on this point.

Comment

Q2. Regarding ‘Dataset preparation’ in the Supplementary Material, why are different methods used to generate conformers depending on the data? (Auto3D for Drugs-75K, Q2MM for EE, Open Babel with DFT?) Is this point not problematic for benchmarking several methods?

A: Thank you for your careful reviews! In our benchmark, the choice of method for generating conformers was carefully considered to best suit the specific characteristics of each dataset. For example, Auto3D was chosen for Drugs-75K because of its efficiency in generating high-quality conformers. Similarly, Q2MM was employed for the EE dataset due to its ability to accurately generate Transition State Force Fields (TSFFs). It is important to note that these tools are not only well-established but have also been extensively validated within the chemistry community. It is also important to note that many real-world tasks in the chemistry community employ different conformer generation workflows, and hence our benchmark properly reflects this diversity of conformer generation strategies.

Also, all baseline models are evaluated using the same set of datasets. This ensures a fair and direct comparison across different methods. The diversity in conformer generation does not detract from the comparability of results but rather enriches the benchmark by covering a broader spectrum of realistic scenarios in molecular machine learning.

Q3. While I understand that the ‘two conformer ensemble learning strategies’ are useful for predicting the ‘Boltzmann-averaged value of each property across the conformer ensemble,’ can they be said to be generally useful for predicting molecular properties? Could you provide any supported evidence for this claim?

A: Please refer to our response to W1.

Q4. Is the BDE task also about predicting the Boltzmann-averaged value?

A: Thank you for your valuable feedback regarding our paper. You raised an important question about whether the Binding Energy Difference (BDE) task in our study also involves predicting the Boltzmann-averaged value. Indeed, a Boltzmann-averaged BDE would more holistically capture physical reality, as it would account for the full distribution of thermodynamically-accessible conformers under experimental conditions. However, in our case, the BDE dataset was computed with a simpler approach to manage the computational cost of DFT calculations. In the BDE dataset, binding energy difference is computed from single-point energy calculations (at a high level of DFT) between the two lowest-energy conformations of the unbound/bound complexes. These lowest-energy structures (one for each of the unbound and bound structures) were identified following a conformer search with OpenBabel. The lowest energy conformers then underwent further geometry optimization with DFT at the B3LYP/3-21G and B3LYP/def2-SVP levels, followed by a single-point energy calculation at the more expensive def2-TZVP level. The regression label is the (single-point) energy difference between the unbound and bound structures. The input conformers encoded by our models, however, are the original OpenBabel conformers prior to conformer selection and geometry optimization. Please note that the BDE is still a function (albeit indirect) of this conformer ensemble, and thus nicely fits within the MARCEL benchmark. The use of OpenBabel conformers as input to the models more realistically represents the real-world scenario during inference in which expensive DFT calculations are not practical.

We will revise the relevant sections to ensure this distinction is explicitly stated and understood, thus preventing any potential confusion.

Comment

W2. It is possible that machine learning predictions based solely on the ground-state structure, as traditionally done, are already practically useful. Regarding how considering multiple conformers contributes to molecular machine learning, the contribution of this study might be limited.

A: Thank you for your valuable comments. We agree that traditional machine learning predictions based on the ground-state structure have been useful, especially for benchmarking 3D graph neural networks (e.g., with the QM9 dataset). However, we emphasize that few experimentally observable properties are exactly dependent on a single static conformer: most experimental measurements are implicitly a thermodynamic average over the accessible conformer distribution. Moreover, it is often not practical to precisely determine the global minimum energy conformer of relatively complex molecules in high-throughput property prediction scenarios, since determining the lowest-energy conformer of all but the tiniest molecules requires extensive geometry optimizations across a large conformational space. In contrast, averaging properties over a representative conformer ensemble can be more robust (in addition to being more accurate) than relying on the lowest-energy conformer, as missing the global-minimum conformer will have less of an influence on the Boltzmann-average. Therefore, our approach can better align with real-world use-cases, where predicting properties based on conformer ensembles more holistically represents physical phenomena and better represents the feasibility of real-world studies.
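
To make the robustness point concrete, here is a minimal numerical sketch (hypothetical property values and relative energies in kcal/mol, not data from our benchmark) comparing how the Boltzmann average and a single-conformer prediction shift when the conformer search misses the global minimum:

```python
import numpy as np

def boltzmann_average(y, rel_energies, temperature_k=298.15):
    """Boltzmann-weighted average of per-conformer property values y."""
    e = np.asarray(rel_energies, dtype=float)
    kT = 0.0019872043 * temperature_k            # kcal/(mol*K)
    w = np.exp(-(e - e.min()) / kT)
    w /= w.sum()
    return float(np.dot(w, np.asarray(y, dtype=float)))

y        = [1.10, 1.18, 1.40, 1.55]              # hypothetical per-conformer property values
energies = [0.00, 0.15, 1.20, 2.00]              # conformer 0 is the true global minimum

full    = boltzmann_average(y, energies)          # ensemble prediction with all conformers
missing = boltzmann_average(y[1:], energies[1:])  # conformer search missed the global minimum
single  = y[1]                                    # "lowest-energy conformer" prediction after the miss
# The ensemble average shifts by less than the single-conformer prediction,
# which jumps by the full y[1] - y[0] gap once the global minimum is missed.
```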

Q1. Additionally, since all four datasets prepared are secondary data from referenced primary data, it’s unclear how challenging it would be for researchers to prepare them on their own. While the paper mentions quality control and the removal of redundancies, it’s unclear whether there is any original information added to this study.

While Drugs-75K (a subset of GEOM-Drugs [27]), Kraken [33], EE [34], and BDE [36] all have associated citations, indicating they are secondary data, it was unclear whether the presented datasets are simply curated versions of these primary data, or whether any original information was generated in this study. If there is any original information, it would be very helpful for this to be clarified.

A: Thank you for your constructive reviews. First of all, we would like to clarify that historically it is not yet shown that conformer ensembles could improve performance on real-world molecular tasks. Therefore, our benchmark should consist of well-defined, simulated tasks without additional confounding variables to ensure a fair comparison with existing baseline models, and to assist in the development of new models that can more effectively make use of extra structural information contained in conformer ensembles.

Secondly, regarding the information added to the datasets, the Drugs-75K dataset required significant computational effort. The original GEOM-Drugs dataset was constructed using semi-empirical Density Functional Theory (DFT) methods, which are less accurate than full DFT. To curate the Drugs-75K subset, we generated the conformer ensembles with Auto3D and computed their corresponding energies using AIMNET-NSE. This process took approximately 600 CPU hours on an AMD EPYC 7763 server.

Lastly, the datasets compiled in MARCEL have been reorganized and standardized to form a unified benchmark, which also required substantial effort. We have developed a Python interface for easily loading these datasets into PyTorch, facilitating their use and extension by the research community.

Comment

Thanks for your detailed and constructive reviews! We would like to make the following clarifications regarding your main concerns. Due to the length constraint on the comment box, we split our response into several parts.

W1. In the four tasks developed in this study, the objective is defined as predicting the Boltzmann average of various properties over multiple conformers. Under this goal setting, it seems intuitive that using information from multiple conformers would naturally improve prediction accuracy. Therefore, it has not been proven that ‘considering multiple conformers contributes to machine learning predictions of real data (e.g., actual experimental measurements of molecules rather than computed values).’ In this sense, the utility of this benchmark remains a bit artificial, and practical values would be unclear.

A: Regarding the practical relevance of considering multiple conformers, we note that to the best of our knowledge, the utility of encoding conformer ensembles in deep learning models for real-world molecular property prediction tasks has not yet been convincingly demonstrated. For instance, Axelrod and Gómez-Bombarelli [1] trained multiple models that encode conformer ensembles from the GEOM-DRUGS dataset in order to predict experimental protein-ligand biological activity. However, their ensemble models did not improve significantly upon models that encode a single 3D conformer. This emphasizes the need for a machine learning benchmark that is explicitly designed to help conformer ensemble models reach their full potential, so that they can eventually be deployed in real-world tasks relevant to drug discovery, computational chemistry, etc.

Although our benchmark includes simulated data, we view this as necessary to ensure the quality of the benchmark so that model analysis is not obfuscated by the noise in real-world tasks involving heterogeneous experimental data. Our benchmark is thus crafted to curate a set of well-defined, carefully controlled tasks to enable rigorous model benchmarking.

Crucially, the failure of prior deep learning models to effectively exploit the extra structural information contained in conformer ensembles does not mean that conformer ensembles are not useful for real-world tasks. We would like to emphasize that even simulated Boltzmann-averaged properties, while not typically included in GNN benchmarks (e.g., MoleculeNet), are still of substantial utility for the drug discovery and computational chemistry communities. For instance, simulated Boltzmann-averaged properties are often employed in computational chemistry to predict chemical reactivity [2-3], to aid in the discovery of new catalysts [4], and to computationally approximate protein-ligand binding affinities [5].

[1] Molecular Machine Learning with Conformer Ensembles. https://iopscience.iop.org/article/10.1088/2632-2153/acefa7/meta

[2] AARON: An Automated Reaction Optimizer for New Catalysts. https://pubs.acs.org/doi/10.1021/acs.jctc.8b00578

[3] Multi-Instance Learning Approach to the Modeling of Enantioselectivity of Conformationally Flexible Organic Catalysts. https://pubs.acs.org/doi/10.1021/acs.jcim.3c00393

[4] Conformational Effects on Physical-Organic Descriptors: The Case of Sterimol Steric Parameters. https://pubs.acs.org/doi/pdf/10.1021/acscatal.8b04043

[5] End-Point Binding Free Energy Calculation with MM/PBSA and MM/GBSA: Strategies and Applications in Drug Design. https://pubs.acs.org/doi/10.1021/acs.chemrev.9b00055

Comment

We thank the reviewer for their quick response and for prompting this important discussion. We hope our answer below provides greater clarity on our previous comments and the goals of our benchmark, MARCEL.

We believe that the reviewer may have somewhat misinterpreted our previous comments and analysis on [1] (Axelrod and Gómez-Bombarelli, Molecular Machine Learning with Conformer Ensembles, 2023). We do not claim that "it is not yet been shown that conformer ensembles can improve performance on real-world molecular tasks," as suggested by the reviewer. In fact, conformer ensembles have been successfully used across various problems in computational chemistry and drug discovery to improve in silico predictions of experimental phenomena. For instance, we reiterate that averaging over conformer ensembles, or Boltzmann-averaging over thermodynamic microstates more generally (e.g., as in molecular dynamics), has been used to improve in silico estimations of chemical reaction enantioselectivity ([2], [4]) and of protein-ligand binding affinity ([5]), amongst other high-impact applications in (bio)chemistry. Moreover, crafting structural descriptors of conformer ensembles has proved to be beneficial even in very simple statistical learning models for chemical property prediction (see [3]).

We do claim that, despite the well-established utility of considering conformer ensembles in computational chemistry/biology, deep learning approaches (particularly those employing 3D graph neural networks) have not yet been designed to effectively capitalize on the extra structural information contained in conformer ensembles. [1] is an example of one such deep learning approach that illustrates the current limitations of deep learning models when encoding conformer ensembles, especially compared to traditional 3D GNNs that only encode a single conformer.

Because Boltzmann-averaging over conformer ensembles has been demonstrated to improve in silico predictions across scientific problems, our central thesis is that deep learning models should be able to exploit the structural information contained in conformer ensembles in order to improve their learned molecular representations. Because existing deep learning models have not convincingly shown this ability, we have designed a benchmark (MARCEL) to explicitly help enable the development of new models that can effectively learn from conformer ensembles.

Our benchmark is not just a way to evaluate the merits of existing deep learning models; it is intended to ease and encourage the development of fundamentally novel modeling approaches for learning representations of conformer ensembles. From this perspective, our benchmark has enormous practical usefulness, because it can enable the design of new models that are more adept at solving representation learning tasks relevant to real-world scientific problems, like those highlighted previously. In our view, performing well on our well-curated benchmark is a necessary first step in the development of new conformer-ensemble models that can push the frontier of molecular representation learning for real-world tasks.

Comment

Thank you for the quick response! I appreciate it.

But this response was very confusing to me. That was not what I said, but what you claimed in your response: "the utility of encoding conformer ensembles in deep learning models for real-world molecular property prediction tasks has not yet been convincingly demonstrated. (for W1)" and "we would like to clarify that historically it is not yet shown that conformer ensembles could improve performance on real-world molecular tasks. (for Q1)".

In the special situation where the goal is to predict the Boltzmann average AND a set of conformers is also given for all target molecules, the presented results are totally understandable to me.

What I was asking about is the value as a new benchmark, because the paper is proposing a new benchmark. The reason my rating is 5 is that I still felt the task goal was too narrow and too artificial.

My original question can be rephrased into the following three points.

  1. If we develop an algorithm based on this benchmark, will it be useful for a wide range of tasks (like drug discovery as the paper claimed) other than the artificial task of predicting the Boltzmann mean? Is it a sufficient goal to focus on the prediction of this single average value?

  2. Considering the time it actually takes to compute the conformer set, is this prediction task of such value that it should be used as a basis for method development?

  3. As for the task of molecular deep learning, it appears to be a simple task of (adaptive) global pooling of the embeddings of conformers. Is it not sufficient to just add any existing permutation-invariant pooling layer such as a set pooling layer or more general attention-based layers? Can this benchmark trigger further new method development beyond these existing design patterns...?

This will likely be the last comment from my side, because the author-reviewer discussion phase seems to be ending. But I would appreciate further clarification on these points. I will definitely consider your comments in the reviewer discussions.

Comment

Thank you for your continued engagement. Please find our responses below.

Q1. If we develop an algorithm based on this benchmark, will it be useful for a wide range of tasks (like drug discovery as the paper claimed) other than the artificial task of predicting the Boltzmann mean? Is it a sufficient goal to focus on the prediction of this single average value?

A: Yes. Creating models adept at capturing and interpreting conformational ensembles, and thus the Gibbs free energy of molecules, is the physically correct way to represent molecular behavior. Representing the underlying physics through an accurate treatment of the conformational flexibility of molecules is crucial in tasks that depend on the Gibbs free energy, including the four tasks discussed in our paper. Such models are more likely to provide better insights and more precise predictions in these fields.

For example, it is well understood that the prediction of binding affinities between a protein and a ligand in drug discovery requires the treatment of conformational flexibility, an area where significant improvements are needed [1]. Likewise, in modeling stereoselective catalysis, the use of conformer-based embedding is essential for more accurately depicting molecular interactions. This approach, which considers the crucial role of multiple conformers instead of a single-state representation, has proven successful in numerous studies and serves as a promising tool for enhancing the efficiency of high-throughput experimentation in real-world applications. Conversely, it is well-known that single-state representations lead to a systematic overestimation of the stereoselectivity of reactions [2,3]. Finally, a recent study of a reactivity threshold for ligands in Pd-catalyzed reactions based on their %Vbur(min) values demonstrates that reactions are influenced by an ensemble of conformations rather than a single rigid structure and that the conformation-dependent steric bulk changes the oligomerization states of the catalyst and thus its activity [4].

There are a lot of examples that demonstrate that predicting the Boltzmann mean is not an artificial task but a reflection of real-world experimental observations: essentially all experimentally observable properties are Boltzmann averages over the conformational space; the “singular average value” is oftentimes what is actually measured in the experiments.

Q2. Considering the time it actually takes to compute the conformer set, is this prediction task of such value that it should be used as a basis for method development?

A: As outlined above, the prediction task is of extremely high value, so we believe that it should indeed be used as a basis for method development. It is also important to note that creating a conformer ensemble is cost-effective and usually much easier than conducting experimental measurements of molecular properties. To address this computational challenge, a wide range of highly efficient algorithms have been developed, such as Monte Carlo sampling, accelerated MD, or CREST for conformer generation, and methods such as the AIMNET ML potential for the accurate evaluation of their energies. The key consideration here is the level of accuracy required in generating these conformers, which varies depending on the application scenario. These models, by offering quick and efficient solutions, are actively driving progress in the field [5,6].

Furthermore, in this context it is noteworthy that our benchmark does provide the BDE dataset with imprecise conformers initially generated with Open Babel followed by further geometry optimization, offering a balance between computational demand and practicality. This setup mirrors real-world scenarios where conformer ensembles during training and inference may not be precisely known. The results show that our approach is able to handle the inherent noise present in Open Babel conformers.

References:

[1] Calculation of Protein-Ligand Binding Affinities https://www.annualreviews.org/doi/10.1146/annurev.biophys.36.040306.132550

[2] Rapid virtual screening of enantioselective catalysts using CatVS https://www.nature.com/articles/s41929-018-0193-3

[3] Prediction of Stereochemistry using Q2MM https://pubs.acs.org/doi/10.1021/acs.accounts.6b00037

[4] Univariate classification of phosphine ligation state and reactivity in cross-coupling catalysis https://www.science.org/doi/10.1126/science.abj4213

[5] Equivariant Graph Neural Networks for Toxicity Prediction https://pubs.acs.org/doi/full/10.1021/acs.chemrestox.3c00032

[6] Modeling of spin–spin distance distributions for nitroxide labeled biomacromolecules https://pubs.rsc.org/en/content/articlelanding/2020/CP/D0CP04920D

Comment

Q3. As for the task of molecular deep learning, it appears to be a simple task of (adaptive) global pooling of the embeddings of conformers. Is it not sufficient to just add any existing permutation-invariant pooling layer such as a set pooling layer or more general attention-based layers? Can this benchmark trigger further new method development beyond these existing design patterns...?

A: Although we cannot exclude the possibility that future methods with permutation-invariant pooling layers could achieve promising performance, the currently available evidence suggests that this is likely not the case, because the task goes well beyond simple adaptive global pooling of conformer embeddings. Firstly, conformers differ in their geometric structures, necessitating advanced encoders that capture ensemble-level geometric symmetries beyond simple permutation invariance. Secondly, learning to pool instance-level embeddings of individual conformers is not the only way to leverage conformer ensembles in a deep learning model. There are certainly other possible approaches one could take that go beyond this simple paradigm. For example, one can develop frameworks that model the dynamic transitions between conformer states, moving beyond our conformer ensemble models that encode sets of independently-generated static conformers. Our benchmark, by providing both conformer ensembles and regression labels, enables the exploration of other approaches as well.

Finally, as a concluding remark, we wish to clarify a key point in our previous responses: the machine learning community has historically not invested in model architectures designed to leverage conformer ensembles, due to the absence of curated benchmarks and datasets devoted to exploring these capabilities. MARCEL is designed to fill this gap. By introducing datasets and benchmarks, we aim to encourage the development and refinement of models that can collectively utilize conformer ensembles.

Official Review
Rating: 6

This paper discusses the use of graph neural networks for ensemble-based learning of molecular representations. Specifically, the paper introduces a molecular conformer ensemble learning benchmark, with the aim of evaluating the potential of learning on conformer ensembles. The idea behind casting this problem as an ensemble-based learning problem is that it could help take into account the dynamic aspects of molecules. Generally, the paper is well-written and contains a thorough comparison to state-of-the-art results in the field.

Strengths

The paper's main strength, in this reviewer's view, is that it thoroughly compares its approach to the state of the art. Its main value is likely that it can serve as a benchmarking basis for various approaches in the field. The main original aspect is the use of an ensemble-based approach, which makes it possible to incorporate the dynamical aspects of molecules. The paper is also well written and meticulous in comparing to the state of the art.

Weaknesses

None

Questions

None

Comment

We greatly appreciate your thoughtful assessment of our paper and your recognition of its contributions to the field of molecular machine learning. We are delighted that you recognize the value of our ensemble-based approach and its potential as a benchmark in molecular machine learning. We are committed to further developing this benchmark to remain relevant and useful for the community; we would welcome any short-term or long-term suggestions that you may have so that we can further add to and improve upon our benchmark going forward.

AC Meta-Review

This paper introduces datasets of molecular conformations for small molecules, derived from a number of known molecule datasets, as well as a benchmark on a number of (Boltzmann-weighted) observables. A number of machine learning approaches, mostly GNN-based, are investigated. The motivation for the paper is to formalise the use of conformational ensembles for these kinds of tasks.

The paper is well-executed and the aim is well-founded. Perhaps the weakest point of the paper is that it does not consider any alternative ensembling strategies. As also noted by the referees, it is not surprising that when the targets are based on Boltzmann-averaged quantities, an ensemble method will be the best. What the paper does not investigate is alternative, potentially more computationally efficient, ways to generate and use ensembles. That being said, the paper can still serve as a useful starting point for future research along those lines.

Why not a higher score

This is a dataset and benchmark paper in an important but still specialised subfield.

Why not a lower score

Could be rejected due to missing consideration of other ensemble approaches.

Final Decision

Accept (poster)