MUBen: Benchmarking the Uncertainty of Molecular Representation Models
Abstract
Reviews and Discussion
The paper serves as a systematic benchmark of different uncertainty quantification (UQ) methods on various deep learning models for molecular machine learning tasks. With thorough and carefully designed metrics and training protocols, the authors compared UQ methods for their effect on models' regression/classification accuracy. The paper also provides some analysis of UQ methods from other angles, such as the extra computational cost. Overall, I believe that this manuscript is a high-quality, well-executed benchmark paper of UQ methods for the molecular machine learning field. However, the paper can be improved by comparing UQ methods in areas besides their effect on model accuracy (detailed in the weaknesses section).
Strengths
- The training and evaluation protocols are well-designed. The experiments are conducted consistently to ensure fair comparison.
- Benchmarks are thorough as models that take different molecular representations are included. Datasets are carefully selected such that various tasks are included.
- The manuscript is written with high clarity, offering significant value to readers who are interested in this field.
Weaknesses
- Uncertainty quantification of machine learning or deep learning models has been an important topic as it aids the explainability of those models. Some influential works are missing from the references of this paper. One example is (pubs.acs.org/doi/abs/10.1021/acs.jcim.0c00502), which should be discussed and included in the references.
- Besides the fact that UQ methods can improve the accuracy of models, UQ is also valuable in the domain of molecular machine learning because it can aid many other machine learning applications, such as [1] identifying activity cliffs and adding explainability to models (e.g. pubs.acs.org/doi/full/10.1021/acs.jcim.2c01073), and [2] active-learning virtual screening of large compound libraries (pubs.rsc.org/en/content/articlelanding/2021/sc/d0sc06805e). The uncertainty value (e.g. the variance predicted by a model or obtained from an ensemble) can provide explainability to deep learning methods, especially in areas where explainability is important (such as drug discovery). The authors did not mention those advantages of UQ in the molecular machine learning field.
- Some of the models benchmarked are pretrained on larger datasets (ChemBERTa, Uni-Mol). Some other models in this work bear the potential of being pretrained (e.g. GIN can be trained using contrastive learning on molecular graphs, and TorchMD-NET can be pre-trained in a denoising manner). It would be very interesting if we could understand the better performance of pretrained models from the UQ perspective. Unfortunately, the paper did not discuss how pretraining can affect the uncertainty of models.
Questions
- If I am understanding it correctly, the "Deterministic" method actually predicts a mean and a variance like mean-variance estimation and is trained using a Gaussian negative log-likelihood loss. If so, it is confusing to call it "deterministic" because it can be misunderstood as a non-UQ baseline. Can the authors rename it to avoid confusion? Also, I do not find any non-UQ baseline in this work. Can the authors add a non-UQ baseline to show the improvements brought by UQ methods?
- For the deep ensemble methods, how does the number of models in the ensemble affect the ensemble variance on different benchmarks?
- MoLFormer (www.nature.com/articles/s42256-022-00580-7) is a very strong model that takes SMILES as input, based on my experience in the drug discovery industry. Understanding the uncertainty associated with MoLFormer should be very interesting to researchers in the industry. What is the motivation of the authors to choose ChemBERTa over MoLFormer for inclusion in the benchmark of this work?
Dear reviewer,
We highly appreciate your recognition of our work and the detailed feedback. We will first address your questions and then provide some discussion of the weaknesses.
Regarding your questions, our explanations are as follows:
- Terminology confusion: For classification, the deterministic method directly predicts the probabilities, so it can be considered a non-UQ baseline. For regression, your understanding is correct. However, we find that the mean trained in this way performs almost identically to a non-UQ model with a single output head trained with MSE loss. Therefore, we can directly consider the RMSE and MAE metrics of this method as the performance of a non-UQ model. Regarding the method name, we refer to it as "deterministic" in contrast to the "distributed" outputs of BNNs. Specifically, "deterministic" methods predict the mean and variance as single values, whereas for BNNs, both mean and variance are sampled from distributions parameterized by the trained network. We acknowledge that this name might be misleading, but we did not find a precise name for such a method in previous works. We would highly appreciate suggestions on this topic. (A minimal sketch of this mean-variance setup is given after this list.)
- How n_ensembles affects variance: In theory, we anticipate that the variance associated with ensemble models initially increases with the number of models, especially when this number is relatively small. This increase tends to stabilize once the number of ensemble models reaches a certain threshold. This threshold is not a fixed value but varies depending on the specific datasets and models under consideration. Based on our experience, this threshold typically ranges from 7 to 15.
However, we did not explicitly conduct experiments to quantify this threshold in our current work. The primary reason for this is our methodological approach, wherein we calculate the average of logits before applying the output activation functions, i.e., Sigmoid for classification tasks and SoftPlus for regression variance estimation. This approach does not directly yield a measure of ensemble variance.
- Why we chose ChemBERTa over MoLFormer: In our research, we considered incorporating MoLFormer. However, during our testing and adaptation phase using the provided script, we encountered challenges similar to the documented issue of a missing model checkpoint. This led us to believe that the pre-trained model checkpoint was not made available by the authors, and we subsequently ceased further investigation into this model.
As an alternative, we identified ChemBERTa as a widely recognized model with SMILES input, which also offered ease of implementation. This practical consideration influenced our decision to opt for ChemBERTa over MoLFormer.
Upon double-checking, we noticed the updates in the MoLFormer GitHub repository that address the checkpoint issue. We plan to include MoLFormer in our future benchmarking efforts and discussions. In fact, while we were investigating the pre-trained models, we noticed that many works did not open-source their models, and we assumed the same situation for MoLFormer.
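As mentioned in the first bullet above, here is a minimal sketch of the mean-variance regression setup we call "deterministic": a two-output head whose variance is kept positive with SoftPlus, trained with the Gaussian negative log-likelihood. The layer sizes, names, and dummy data are illustrative assumptions, not the exact MUBen implementation; PyTorch's built-in `GaussianNLLLoss` is used here as a stand-in for the Gaussian NLL objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MeanVarianceHead(nn.Module):
    """Illustrative two-output regression head that predicts a mean and a variance."""
    def __init__(self, hidden_dim: int = 256):
        super().__init__()
        self.mean_head = nn.Linear(hidden_dim, 1)
        self.var_head = nn.Linear(hidden_dim, 1)

    def forward(self, h: torch.Tensor):
        mean = self.mean_head(h).squeeze(-1)
        var = F.softplus(self.var_head(h)).squeeze(-1)  # SoftPlus keeps the variance positive
        return mean, var

# Gaussian negative log-likelihood loss (available in recent PyTorch versions).
criterion = nn.GaussianNLLLoss()

features = torch.randn(8, 256)  # dummy backbone features
targets = torch.randn(8)        # dummy regression labels
mean, var = MeanVarianceHead()(features)
loss = criterion(mean, targets, var)
loss.backward()
```

A single-head model trained with plain MSE would be the corresponding non-UQ baseline; as noted above, its RMSE/MAE is nearly identical to the mean predicted by this two-head setup.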
Regarding the weakness, we find the materials you provided are very helpful in strengthening our paper, and we will include them and related discussions in an updated draft, which will be posted a few days later.
For investigating the effect of pre-training on UQ, we find it could be an interesting future direction, but may also be a bit tricky in practice. The main concern is the model size. To incorporate as much knowledge as possible, pre-trained models are generally large, containing millions of parameters (Table 5). However, such a huge model can severely overfit the training data, even with the aid of UQ methods, when their parameters are initialized randomly instead of from the pre-training, which is exactly what happened in our experiments. But for smaller models, pre-training does not function well: the knowledge gained from pre-training is soon forgotten after a few steps of fine-tuning. This is also the reason why few papers directly compare the performance of pre-trained and non-pre-trained models with the same size on prediction tasks. Nonetheless, we will think about it more carefully and see if we can come up with a solution.
We hope our responses address your concerns and resolve your questions regarding our work.
Best regards,
MUBen authors
We did some experiments on how the number of ensembles affects the variance of the predicted values. The columns are the variances of the predicted means and variances (only for regression) on different datasets and backbone models, and the rows are the number of ensembles. The table largely aligns with our discussion on this topic in the previous comment. We hope you find the result helpful!
| # Ensembles (M) | DNN-rdkit-lipo-mean | DNN-rdkit-lipo-var | Uni-Mol-lipo-mean | Uni-Mol-lipo-var | DNN-rdkit-tox21-mean | Uni-Mol-tox21-mean |
|---|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0.04634 | 0.00051 | 0.01311 | 0.00027 | 0.00287 | 0.00224 |
| 3 | 0.06198 | 0.0008 | 0.0163 | 0.00031 | 0.00345 | 0.00279 |
| 4 | 0.06801 | 0.00124 | 0.01741 | 0.00026 | 0.00357 | 0.00374 |
| 5 | 0.06867 | 0.00255 | 0.02351 | 0.00284 | 0.00356 | 0.00421 |
| 6 | 0.07439 | 0.00352 | 0.02559 | 0.00281 | 0.00401 | 0.00521 |
| 7 | 0.07661 | 0.00319 | 0.02545 | 0.00244 | 0.00394 | 0.00533 |
| 8 | 0.07548 | 0.00472 | 0.02526 | 0.00218 | 0.004 | 0.00542 |
| 9 | 0.07658 | 0.00638 | 0.0253 | 0.00214 | 0.00393 | 0.00565 |
| 10 | 0.07713 | 0.00628 | 0.02555 | 0.00194 | 0.00384 | 0.00561 |
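For reference, the quantities in the table above were obtained roughly as sketched below: for each test molecule we take the variance of the per-member outputs across the first M ensemble members, then average over the test set (with M=1 the spread is trivially zero, matching the first row). The array shapes, dummy data, and function name are illustrative assumptions rather than the exact benchmark code.

```python
import numpy as np

def ensemble_spread(member_outputs: np.ndarray) -> float:
    """member_outputs: array of shape (n_members, n_molecules) holding the
    per-member predictions (e.g. predicted means) on the test set.
    Returns the across-member variance averaged over molecules."""
    per_molecule_var = member_outputs.var(axis=0)  # variance over ensemble members
    return float(per_molecule_var.mean())

# Example: spread as a function of the number of members M.
rng = np.random.default_rng(0)
outputs = rng.normal(size=(10, 500))  # 10 members, 500 test molecules (dummy)
for m in range(1, 11):
    print(m, ensemble_spread(outputs[:m]))
```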
Dear Reviewer,
We're appreciative of your constructive feedback on our work. As the rebuttal session draws to a close, we would like to kindly remind you to check out our responses and updates of the draft. We would greatly value your insights on these updates, and welcome any further suggestions or concerns you might have.
Thank you for the effort in the revision and for clearing up my questions. I will keep my rating, as I rated the manuscript for acceptance in my first review.
We are pleased to learn that your queries have been satisfactorily resolved. Thank you once again for your support and for affirming the value of our work!
Best regards,
MUBen Authors
The authors describe MUBen, a benchmark for assessing the performance of uncertainty estimation methods on molecular prediction models. MUBen employs four pretrained "backbone" models with near SOTA performance, one for each of four different input molecular representations: ChemBERTa for SMILES, GROVER for 2D graphs, Uni-Mol for 3D conformations, and fully-connected NNs for RDKit features. Two other pretrained models are included to provide additional insights: TorchMD-NET, a transformer pretrained on QM properties, and GIN, a powerful GNN. The authors fine-tune these models on MoleculeNet datasets, providing a mixture of classification and regression tasks across a variety of physiological, biophysical, physical chemical, and quantum properties. In the paper, the authors benchmark a number of uncertainty estimation methods: focal loss, temperature scaling, deep ensembles, Bayes by Backprop (BBP), Stochastic Gradient Langevin Dynamics (SGLD), MC Dropout, and Stochastic Weight Averaging-Gaussian (SWAG). For model performance, the authors report AUC for classification tasks, and RMSE and MAE for regression tasks. For uncertainty estimation, the authors report the negative log likelihood (NLL), Brier score, and expected calibration error for classification tasks, and Gaussian NLL and regression calibration error for regression tasks. By computing the performance and uncertainty scores for all models across all tasks, the authors qualitatively conclude that Deep Ensembles perform best overall, though they are computationally expensive, while temperature scaling and MC Dropout are good choices for classification tasks, and BBP and SGLD are good for regression tasks. They also conclude that Uni-Mol, trained on 3D conformations, performs best among backbones.
Strengths
- The paper is well written and easy to follow.
- The authors use a comprehensive selection of tasks and datasets.
- The appendix is an incredible resource, being a combination of textbook and comprehensive results.
Weaknesses
- While interesting, MUBen is a straightforward product of models X datasets/tasks X uncertainty estimation methods, and does not rise to the level of a significant contribution suited for the main track of the conference
- The benchmark is really many benchmarks, and decisions about which methods or backbones are best are made by qualitatively assessing large tables of numbers
- The benchmark includes only one architecture per molecular input type, it would likely be more informative to train many architectures per input type and report distributions/aggregations of results instead.
- The authors train models with 3 different random seeds and report the average result, but it would be informative to report some notion of the spread amongst the three runs as well.
- When reviewing the benchmark results, the authors often provide hypotheses about why certain model-uncertainty estimator combinations perform the way they do, without further evidence. It would be very valuable to test some of these hypotheses (e.g. via ablations) and report on the findings.
- Some of the stated conclusions do not appear justified by the data, e.g. "Figure 3 shows that deterministic prediction tends to be over-confident, and Temperature Scaling mitigates this issue"; however, it is not at all obvious from Fig 3 that TS does mitigate it. Also, "As presented in the tables and Figure 5, TorchMD-NET's performance is on par with Uni-Mol when predicting quantum mechanical properties but falls short in others"; however, TorchMD-NET seems to be on par for biophysical properties too. This is a result of the lack of statistics from which to draw these conclusions.
- MoleculeNet is dated, and has been replaced by other benchmark suites like TDC.
Questions
- Instead of making only qualitative assessments, compute summary statistics of the results and attempt to determine significance.
- Substantiate some/all of the hypotheses made about why the results are the way they are.
Dear reviewer,
We highly appreciate you spending time reviewing our work and providing detailed feedback. We address your concerns below.
- Limited Contribution: In our introduction, we provided a discussion of the limitations of previous works, the motivation of this research, and the key points of our contribution. We believe that our work goes beyond simply reporting the results of different combinations; it also provides a detailed discussion of the behaviors of different backbone/UQ methods under different situations. Therefore, we consider our work a fair contribution to the community that meets the requirements of ICLR's benchmark track.
- Including more backbones per molecular descriptor: We acknowledge the potential insights gained from testing various backbone models with identical molecular descriptor inputs. However, conducting such expansive experiments is not feasible within the scope of our current research, primarily due to the significant increase in required resources and efforts. Our experimental scale already surpasses that of prior studies [1, 2] by several magnitudes. Incorporating multiple backbone models for each descriptor would necessitate an exponential increase in these resources, making it impractical. Furthermore, it is important to note that while comparing backbone models holds value, our study's primary objective is to explore and evaluate Uncertainty Quantification (UQ) methods and their performances. We believe that the scale and depth of our current experiments are sufficient to substantiate the claims made in our paper. While we are open to integrating additional backbones like MoLFormer, as suggested by other reviewers, a comprehensive investigation of molecular descriptors is beyond our current project's scope. We encourage the scientific community to further this line of inquiry in future research endeavors.
- Variance among random runs: We presented the variances of multiple runs in the Appendix, from Table 9 to Table 38. We did not include them in the main article due to space limitations.
- Unsupported hypotheses: We think our discussions are well-grounded and supported by the figures and tables in both the main article and the Appendix. It would be helpful if you could point out the unsupported hypotheses so that we have a chance to make them clearer.
- Conclusion unjustified: In Figure 3, the line indicating Temperature Scaling is closer to the line y = x, which indicates better calibration and mitigated overconfidence (a brief sketch of how Temperature Scaling works is given after this list). In Figure 5, there is a gap between TorchMD-NET's and Uni-Mol's ranks for biophysical properties on ROC-AUC, which is the metric evaluating property prediction performance.
- Should use TDC: We acknowledge that TDC might be newer, but MoleculeNet is so far the most popular benchmark for molecular representation learning methods and the first choice of many recent works, such as [3]. We can expand MUBen to include some TDC datasets in future works, but we still consider MoleculeNet our primary testbed to present our studies and discoveries.
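Following up on the Temperature Scaling point above, here is a minimal sketch of post-hoc Temperature Scaling for a binary classification head: a single learned temperature divides the logits, and it is fitted on held-out validation logits by minimizing the NLL. The single-temperature formulation, BCE-based fitting, and dummy data below are common choices and our own illustrative assumptions, not a verbatim excerpt of the MUBen code.

```python
import torch
import torch.nn as nn

class TemperatureScaler(nn.Module):
    """Post-hoc calibration: divide logits by a single learned temperature T."""
    def __init__(self):
        super().__init__()
        self.log_t = nn.Parameter(torch.zeros(1))  # T = exp(log_t) > 0

    def forward(self, logits: torch.Tensor) -> torch.Tensor:
        return logits / self.log_t.exp()

def fit_temperature(scaler, logits, labels, max_iter=200, lr=0.01):
    """Fit T on held-out validation logits by minimizing the NLL (BCE here)."""
    optimizer = torch.optim.LBFGS([scaler.log_t], lr=lr, max_iter=max_iter)
    loss_fn = nn.BCEWithLogitsLoss()

    def closure():
        optimizer.zero_grad()
        loss = loss_fn(scaler(logits), labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return scaler

# Dummy validation logits/labels for a binary task.
val_logits = torch.randn(100)
val_labels = (torch.rand(100) > 0.5).float()
scaler = fit_temperature(TemperatureScaler(), val_logits, val_labels)
calibrated_probs = torch.sigmoid(scaler(val_logits))
```

Because the temperature only rescales logits, it changes calibration (and hence ECE/NLL) without affecting the ranking metrics such as ROC-AUC.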
We hope these clarifications address your concerns effectively. We would be grateful if you could reconsider your evaluation of our manuscript in light of these responses.
Best regards, MUBen authors
[1] Wollschläger, Tom, et al. "Uncertainty Estimation for Molecules: Desiderata and Methods." (2023).
[2] Varivoda, Daniel, et al. "Materials property prediction with uncertainty quantification: A benchmark study." Applied Physics Reviews 10.2 (2023).
[3] Qian, Chen, et al. "Can large language models empower molecular property prediction?." arXiv preprint arXiv:2307.07443 (2023).
Dear Reviewer,
We're appreciative of your constructive feedback on our work. As the rebuttal session draws to a close, we would like to kindly remind you to check out our responses and updates of the draft. We would greatly value your insights on these updates, and welcome any further suggestions or concerns you might have.
I want to thank the authors for their replies, and I ask them to forgive my slow response. The new Figure 1 is well done and adds to this paper, as do the updated citations. Responses to your replies below:
- Limited contribution: It is true that there is a datasets and benchmarks track. I have typically thought of this track as being for new datasets and the benchmarks that accompany them, but I could very well be wrong. I would look for guidance from the AC on this point.
- More backbones: Adding an additional backbone per method would roughly double the required compute, not increase it exponentially. Moreover, it would help the reader understand whether the conclusions can be generalized to networks and tasks of a particular type, or if they are one-off results for the particular architectures chosen. While I understand that even twice as much compute can be prohibitive, this is an important question that shouldn't be brushed aside.
- Variance: Thank you for pointing out the tables in the appendix. In looking carefully through them, it is clear that at least some of the results reported as best are within one standard deviation of others (e.g. Tox21 ECE). These kinds of things should be noted when reporting benchmarks in the main paper as they signal equivalence between the methods that is otherwise hard to detect from scanning many large tables.
- Unsupported hypothesis: E.g. page 6 " The superior molecular representation capability of Uni-Mol is attributed to the large network size, the various pre-training data and tasks, and the integration of results from different conformations of the same molecule." Is it? There are a number of possible confounders here given the very different architectures being compared. One would need at least a systematic ablation study of the architecture to make these claims. Again on page 9, comparing to TorchMD-Net "In contrast, Uni-Mol stands out as a versatile model, benefiting from diverse pre-training objectives that ensure superiority across various tasks." Does it? Again there are a number of possible confounding factors that are not controlled for in the comparison's chosen.
- Conclusion unjustified: In Figure 3, only the higher-probability region of Fig 3c seems to be clearly closer, and the opposite could be said for Fig 3d. In Fig 3a the numbers are close and may not be significant. Fig 3b is mixed at best. In Figure 5, the NLL gap for quantum properties is similar to the ROC gap for biophysical properties; calling one similar and the other not is surprising. This is symptomatic of a larger issue: there is simply a lot of eyeballing of comparisons rather than a more rigorous approach.
- TDC resolves some issues that exist in MoleculeNet with regard to systematic splitting of the data, but this is a minor issue so I will drop it.
The authors did not address:
- my point about this being a lot of benchmarks, and that it is unclear how one would systematically choose a particular combo of representation, task and UQ from amongst all such possibilities explored.
- my point about there being a lack of statistics in assessing the differences between methods or combining the scores into something comprehensive.
While this paper offers some interesting insight into how UQ changes with differences in the underlying representation, task, and UQ method chosen, I maintain my rating, since even as a benchmark paper it lacks enough rigor to offer a practical path for representation, task, and UQ method selection.
Dear Reviewer,
Thank you for your reply! We would like to provide further explanations for some points.
- More backbones: Yes, you are right about the computational complexity. We apologize for our mistake on that. Still, our study's focus is not on how molecular descriptors affect model performance, so we had to be selective when choosing backbone models to keep the experiments within a manageable scale. Currently, we have GIN and TorchMD-NET as second backbones for 2D and 3D graphs, and we plan to add MoLFormer for SMILES. We have to seek contributions from the community for more backbone models.
- Variance: Your observation is accurate, and we did notice such cases. However, as our discussions are mostly based on the ranks averaged over backbones, UQ methods, or datasets, we would assume that the 1-σ overlap issue you mentioned is largely mitigated and does not have a significant impact on our conclusions.
- Unsupported hypotheses: Thanks for the examples; we can address these concerns as follows. 1) The page 6 statement is a summarization of the contributions claimed by the Uni-Mol paper [1]. Such claims are scattered across their Introduction, Methods, Conclusion, and Appendix, so forgive us for not being able to give a precise reference. 2) The page 9 statement compares the performance of Uni-Mol and TorchMD-NET---both trained on 3D graphs but on different data. We believe Figure 5 shows that Uni-Mol largely outperforms TorchMD-NET in property prediction (not uncertainty estimation) for all properties except Quantum Mechanics (QM) and performs similarly for QM. Combining the results from Figure 2 and Tables 9--22, we believe that our statement is reasonable.
- Conclusion unjustified: 1) Figure 3. The Temperature Scaling (TS) plots are indeed closer to the line for all 4 subfigures, which is supported by the ECE numbers in the tables. As mentioned in Appendix D1, Paragraph ECE, ECE calculates the average gap between the calibration plot (Figure 3) and the line y = x (a short sketch of this computation is given after this list). We observe smaller TS ECE values for the models on the presented datasets, as shown in Table 12 and Table 14. Even though Figure 3 may not make it extremely obvious, the conclusion stands. 2) NLL, as mentioned in Section 3, serves as a UQ metric rather than an evaluation of the prediction, which is only represented by ROC-AUC for classification and RMSE & MAE for regression. We are uncertain what the sentence "the NLL gap for quantum properties is similar to the ROC gap for biophysical properties" implies. We do not entirely understand how Uni-Mol's better performance in UQ undermines our statement about Uni-Mol's performance being much better than TorchMD-NET in biophysical property prediction. It would be helpful to clarify the statement. 3) The assertion that "there is simply a lot of eyeballing of comparisons rather than a more rigorous approach" challenges our commitment to academic integrity. We kindly ask that you furnish more substantial evidence to support this claim.
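As referenced in the ECE point above, the sketch below shows one common way to compute a binned calibration error of the kind plotted in Figure 3. The number of bins, the binning of positive-class probabilities, and the dummy data are illustrative assumptions and may differ in detail from the exact formulation in Appendix D1.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """Binned calibration error: bin-weighted average |empirical accuracy - mean confidence|.
    probs: predicted probabilities for the positive class; labels: 0/1 array."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (probs > lo) & (probs <= hi)
        if mask.any():
            confidence = probs[mask].mean()   # average predicted probability in the bin
            accuracy = labels[mask].mean()    # empirical positive rate in the bin
            ece += mask.mean() * abs(accuracy - confidence)
    return ece

# Dummy example: predictions that are well calibrated by construction.
rng = np.random.default_rng(0)
p = rng.uniform(size=1000)
y = (rng.uniform(size=1000) < p).astype(float)
print(expected_calibration_error(p, y))
```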
About the unaddressed points, we think they are discussed in Section 5 Experiment, Section 6 Conclusion, and Appendix E.
Let us know if you have further questions or concerns.
Best regards,
MUBen Authors
[1] Zhou, Gengmo, et al. "Uni-Mol: a universal 3D molecular representation learning framework." (2023).
This paper proposes a benchmarking platform for evaluating Uncertainty Quantification (UQ) methods in fine-tuning pretrained models for downstream tasks in molecular property prediction. The platform provides datasets and baselines in an open-source environment. For the dataset, it includes classification and regression tasks from MoleculeNet (widely used for assessing predictive performance in downstream tasks of pretrained models), namely BBBP, ClinTox, Tox21, ToxCast, SIDER, BACE, HIV, and MUV for classification; ESOL, FreeSolv, Lipophilicity, QM7, QM8, and QM9 for regression. These datasets are provided with scaffold splitting to evaluate the out-of-distribution (OOD) characteristics of UQ. The predictive performance in each downstream task is provided with ROC-AUC for classification, and RMSE and MAE for regression. Furthermore, the evaluation metrics for UQ include ECE, NLL, Brier score, and CE for classification; and Gaussian NLL and CE for regression.
In addition, the paper comprehensively reports baseline performances on fine-tuning six pretrained models combined with UQ: ChemBERTa, GROVER, Uni-Mol, Fully-connected Neural Network with RDKit features, TorchMD-NET, and GIN. The UQ methods examined, namely Focal Loss, BBP, SGLD, MC Dropout, SWAG, Temperature Scaling, and Deep Ensembles, show that the combination of Uni-Mol model and Deep Ensemble performs exceptionally well.
Strengths
Molecular representations pre-trained primarily through self-supervised learning on vast amounts of data have shown success in predicting molecular properties. However, when fine-tuning these models for downstream tasks, there's a risk of overfitting. They particularly tend to make overly confident predictions on test data that deviates from the training distribution. Consequently, the quantification of uncertainty (UQ) is recognized as an extremely crucial issue. Yet, there is a lack of standard benchmark platform to comprehensively and systematically investigate such methods. The benchmark proposed in this paper provides researchers with a practical testing platform to exhaustively assess both predictive and UQ performance across numerous downstream tasks. Moreover, the paper reports comprehensive benchmark results using several widely-used pre-trained molecular models. It provides empirically valuable insights into how the combination of pre-trained models and UQ methods can potentially impact both predictive and UQ performances.
Weaknesses
The reported superior performance of the Uni-Mol model combined with Deep Ensembles raises several questions about the merits of conducting validation research based on this benchmark. Firstly, while Deep Ensembles are fundamentally distinct from other UQ methods being an intrinsic ensemble learning approach and it's intriguing that they perform well even with M=3, it can be intuitively expected that they would likely be the most stable. Although Deep Ensembles necessitate the training of multiple models, which can be resource and time-intensive, this aspect has not been thoroughly explored in this paper. Furthermore, the standout performance of Uni-Mol, which correlates highly with predictive capability, may suggest a straightforward interpretation that later models simply yield better results in UQ. This can be attributed to the fact that the MoleculeNet tasks utilized here have been widely employed for validating molecular pre-training. The original Uni-Mol paper already demonstrated its superior predictive performance compared to traditional methods, and it might be practically sufficient to have stable UQ by using deep ensembles of Uni-Mol. Therefore, if the UQ evaluation highly correlates with this, the implications provided by the validation using this benchmark might appear limited. If the focus is on UQ validation during fine-tuning of pre-trained molecular models, it might also suggest the need for a research environment that can conduct evaluations more broadly beyond just MoleculeNet.
Questions
Q1. Methods like MC Dropout and SWAG are ultimately designed to efficiently extract UQ information. If one has the time and resources to implement Deep Ensembles, it is clear that it would be a preferred option. The paper concludes that Deep Ensembles come "with significant computation cost," but are there any actual comparative results on computation cost presented?
Q2. When using MoleculeNet as the downstream task based on this benchmark, how much of a difference is there in terms of computation cost? How long does it take to fine-tune? (I mean when comparing deep ensembles vs others)
Q3. I understand that scaffold splitting for train/test division is beneficial for evaluating out-of-distribution (OOD) characteristics. However, scaffold splitting has been widely used in evaluating the performance of existing pre-trained molecular representations for downstream tasks. Hence, the pre-trained models adopted as baselines in this benchmark have already been confirmed to perform well, assuming scaffold splitting (with MolecularNet data as the downstream tasks). In this regard, I feel it might be more appropriate, both for a realistic evaluation of UQ and in consideration of OOD characteristics, to base the benchmark on data other than MoleculeNet. Do you have any additional comments on this?
Dear reviewer,
We highly appreciate you spending time reviewing our work and providing detailed feedback. We noticed an overlap in your discussion on the weaknesses and questions, and we address them collectively below:
- Q1&Q2 Computation cost: We indeed included a theoretical analysis of the computation cost of each UQ method. Due to the page limitation of the main article, we presented the results in Table 6 in the Appendix and provided a short discussion in subsection C.2. In addition, if you are also interested in the efficiency of backbone models, we also provided some numbers in Table 5 in the Appendix.
- Q3 Datasets other than MoleculeNet: From our perspective, we do not think that adopting pre-trained models that have been proven effective on MoleculeNet would have a significant impact on the fairness of our evaluation. The reasons are that 1) these models are not pre-trained on the MoleculeNet data, and 2) MoleculeNet is still a popular choice for benchmarking molecular representation learning methods. We think that good performance on MoleculeNet is not a result of dataset over-fitting or arduous hyper-parameter tuning, but rather a fair indicator of the models' general abilities, at least within the data distributions represented by MoleculeNet, which are comprehensive enough to cover most aspects of molecular properties. Therefore, we do not think that testing UQ methods with backbones that already perform well on MoleculeNet would bias the results.
In addition, we also provided test results on the randomly split datasets, which deviates from the setting of the previous works and indeed triggered some interesting discoveries, as discussed in Appendix E.3. Therefore, we believe that we have provided a fair and comprehensive analysis of both the backbones and UQ methods in our work. However, we also made it convenient to conduct experiments on any new datasets, as indicated in our Code README. We will continuously improve our benchmark to make it more inclusive.
We hope these clarifications address your concerns effectively. We would be grateful if you could reconsider your evaluation of our manuscript in light of these responses.
Best regards,
MUBen authors
Thank you for the feedback! I appreciate that the computational cost aspects are clarified in Figure 6, which is very informative.
On the other hand, I still felt that if deep ensembles can work well for the UQ purpose even with M=3, then they sound manageable in practice, which might make the need for other (efficient but compromised versions of) UQ methods unclear.
For the second point, the paper raises the problem of "over-confident predictions on test data that fall outside of the training distribution," which is emphasized in the abstract as the motivation for this benchmarking. So, I felt we need some datasets that include examples that "fall outside of the training distribution".
But most existing pretrained models already show nice performance on the MoleculeNet datasets, so I am wondering whether the use of this dataset really fits that purpose. In other words, if we can develop some UQ methods that work well on this benchmark, can we then regard them as reliable UQ methods for reducing the risk of any "over-confident predictions on test data that fall outside of the training distribution", as intended initially...?
Dear Reviewer,
Thank you for your reply! We would like to address your concerns on data distribution first in the following discussion and then talk about the impact of ensemble numbers in another comment due to character limitation.
Regarding the data distribution, we briefly introduced it in Section 4.3, and we apologize if we did not make it very clear. In our experiments, all primary results come from the setup of "out-of-distribution (OOD) test data". Specifically, scaffold splitting first groups the molecules according to their scaffolds and splits the dataset such that the training, validation, and test data points are as dissimilar as possible in their molecular scaffolds. Therefore, we may consider that the training, validation, and test data points are not from the same distribution, i.e., the test data are OOD.
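As a rough illustration of scaffold splitting (not the exact MUBen implementation; the split fractions, Bemis-Murcko grouping, and greedy assignment below are our simplifying assumptions), molecules can be grouped by scaffold and whole scaffold groups assigned to a single partition:

```python
from collections import defaultdict
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_split(smiles_list, frac_train=0.8, frac_valid=0.1):
    """Group molecules by Bemis-Murcko scaffold and assign whole groups to splits,
    largest groups first, so the splits share as few scaffolds as possible."""
    groups = defaultdict(list)
    for idx, smi in enumerate(smiles_list):
        mol = Chem.MolFromSmiles(smi)
        scaffold = MurckoScaffold.MurckoScaffoldSmiles(mol=mol) if mol else smi
        groups[scaffold].append(idx)

    train, valid, test = [], [], []
    n_train = frac_train * len(smiles_list)
    n_valid = frac_valid * len(smiles_list)
    for group in sorted(groups.values(), key=len, reverse=True):
        if len(train) + len(group) <= n_train:
            train.extend(group)
        elif len(valid) + len(group) <= n_valid:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test
```

Because every scaffold group lands entirely in one partition, the scaffolds seen at test time are (mostly) absent from training, which is why we treat the scaffold-split test set as OOD.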
Under this condition, one major exploration of this work, as stated in the abstract and the last paragraph of the Introduction, is investigating whether UQ methods would help the well-performing pre-trained backbones better quantify the uncertainty of their predictions. Previous benchmarks of such pre-trained models mainly focus on their prediction metrics (e.g., ROC-AUC or RMSE), neglecting a proper evaluation of their UQ performance. We hence propose MUBen to fill this gap. As mentioned in our previous response, none of the backbone models is pre-trained using the MoleculeNet data. So we believe that the test data are OOD for all backbone models, which forms a good and reliable testbed for our purpose.
In addition, we also included two simple models that are not pre-trained: DNN and GIN. As their performance when combined with different UQ methods largely correlates with the pre-trained models, we may reach the conclusion that our assessment of the UQ methods is reliable.
Dear Reviewer,
Thank you for your clarification, and we apologize for our previous misinterpretation of your statement. We would like to address your concern about the pre-training data from several perspectives:
- To our knowledge, there is no strict evaluation of how much the MoleculeNet data overlap with the pre-training dataset of each pre-trained model. Therefore, we cannot know concretely whether their good performance comes from an overlap between pre-training and test data. Likewise, we cannot directly establish the causal relation "they perform well on this dataset, so their pre-training data probably already cover the test data distribution". There are many factors that can affect the performance of molecular representation models, such as the network structure, input data format, pre-training objective, etc. Even if we include new datasets in MUBen, there is no guarantee that the new datasets fall outside of the distribution of the pre-training data.
- In fact, many backbones are not originally evaluated on the entire MoleculeNet. For example, ChemBERTa is evaluated only on BACE, Lipo, BBBP, and ClinTox [1], and GROVER on BBBP, SIDER, ClinTox, BACE, Tox21, and ToxCast [2]. Under this circumstance, even if we do assume the aforementioned causal relationship is correct, we suppose our evaluation of these models on the Quantum Mechanics properties is unbiased. We can also make a similar claim that TorchMD-NET, which is only pre-trained on Quantum Mechanics data, has an unbiased evaluation on the other Biophysics, Physiology, and Physical Chemistry properties. In addition, as mentioned in the previous response, we also include DNN and GIN, which are not pre-trained on any data. Therefore, the test data are guaranteed to be OOD for them, and their evaluations are unbiased. As we observe a similar trend in the performance of UQ methods as with the pre-trained backbones, we may conclude that our evaluation is largely accurate.
- Comparing the results from scaffold-split datasets (Table 9 – Table 22) and randomly split datasets (in-distribution test data, Table 23 – Table 30), we notice that there are performance gaps between the pre-trained models when they are tested on OOD versus in-distribution data. This indicates that the models are less familiar with the former setting, and UQ methods can help improve their evaluation in such situations.
- We developed this benchmark and codebase for practical purposes, and we have applied MUBen to our own data (we apologize for not being able to reveal these data; they are still confidential). Our experiments demonstrate the same backbone-wise and UQ-wise performance. Therefore, we can claim with some confidence that our evaluation on MoleculeNet is faithful and able to guide practical backbone and UQ selection.
We hope our response addresses your concerns. We would be grateful if you could raise your evaluation of our manuscript in light of these responses. Let us know if you have further questions or suggestions!
Best regards,
MUBen authors
[1] Ahmad, Walid, et al. "Chemberta-2: Towards chemical foundation models." arXiv preprint arXiv:2209.01712 (2022).
[2] Rong, Yu, et al. "Self-supervised graph transformer on large-scale molecular data." Advances in Neural Information Processing Systems 33 (2020): 12559-12571.
[3] Zaidi, Sheheryar, et al. "Pre-training via denoising for molecular property prediction." arXiv preprint arXiv:2206.00133 (2022).
Thank you for the prompt response! Your answers are well taken, and I will consider them in the reviewers' discussions.
Just for a remark: I just felt that if our goal is to evaluate methods for UQ (uncertainty quantification), we might need both "clearly certain" query points as well as "clearly uncertain (provably out of training distribution)" points, and then we should make sure how each method can detect certain points as certain and uncertain points as uncertain. This would be what we expect for UQ methods. However, guaranteeing this would be a very hard technical problem, like we need UQ for developing UQ benchmarks. Furthermore, as the authors pointed out, it is unclear that MoleculeNet test sets by scaffold splitting have some overlaps with the training sets in the latent space of pretrained model. Also, I understand that this problem cannot be avoided just by using a dataset different from MoleculeNet. Anyway, thank you for promptly providing the detailed discussions.
Dear Reviewer,
Thanks for your feedback! Regarding your remark, we would like to bring to your attention that it might be addressed by the 2nd point in our previous response---for TorchMD-NET, the Biophysics, Physiology, and Physical Chemistry properties (Table 4) are "clearly uncertain" query points, since TorchMD-NET is only pre-trained on Quantum Mechanics datasets. The same holds for the two non-pre-trained models: DNN and GIN. Checking their performance in Tables 1, 2, and 9--22 might help address your concerns.
Best regards,
MUBen Authors
Regarding the number M of deep ensembles, we use M=10 on most datasets and only M=3 on three very large ones (HIV, MUV, QM9) to reduce the computational cost, as mentioned in Appendix C1, Paragraph "Deep Ensembles" under Equation 10. Using a smaller M does improve the model performance over the "deterministic" baseline, but it does not necessarily surpass other UQ methods that have a much cheaper cost. In addition, using a small M may cause inadequate exploration of the latent space, resulting in less stable performance gains (or losses). A larger M gives a more consistent improvement. We did some experiments to investigate the impact of M. The results are shown below:
Table 1, DNN on Lipo dataset
| Metric \ M | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| rmse | 0.785 | 0.751 | 0.742 | 0.739 | 0.73 | 0.728 | 0.727 | 0.727 | 0.726 | 0.726 |
| mae | 0.608 | 0.577 | 0.574 | 0.572 | 0.565 | 0.565 | 0.562 | 0.561 | 0.56 | 0.56 |
| nll | 1.013 | 0.826 | 0.731 | 0.643 | 0.518 | 0.459 | 0.456 | 0.408 | 0.363 | 0.374 |
| ce | 0.032 | 0.032 | 0.031 | 0.03 | 0.029 | 0.028 | 0.028 | 0.026 | 0.025 | 0.026 |
Table 2, DNN on Tox21
| Metric \ M | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| roc-auc | 0.734 | 0.741 | 0.738 | 0.743 | 0.742 | 0.746 | 0.746 | 0.746 | 0.747 | 0.748 |
| ece | 0.043 | 0.037 | 0.038 | 0.037 | 0.038 | 0.039 | 0.036 | 0.037 | 0.036 | 0.035 |
| nll | 0.279 | 0.271 | 0.271 | 0.269 | 0.269 | 0.267 | 0.267 | 0.267 | 0.266 | 0.266 |
| brier | 0.079 | 0.076 | 0.076 | 0.076 | 0.076 | 0.075 | 0.075 | 0.075 | 0.075 | 0.075 |
Table 3. Uni-Mol on Lipo
| Metric \ M | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| rmse | 0.572 | 0.569 | 0.568 | 0.567 | 0.565 | 0.57 | 0.569 | 0.57 | 0.571 | 0.569 |
| mae | 0.423 | 0.416 | 0.415 | 0.413 | 0.414 | 0.419 | 0.417 | 0.419 | 0.419 | 0.418 |
| nll | 1.825 | 1.156 | 0.966 | 0.919 | 0.432 | 0.346 | 0.32 | 0.314 | 0.361 | 0.353 |
| ce | 0.051 | 0.048 | 0.047 | 0.047 | 0.043 | 0.042 | 0.042 | 0.041 | 0.042 | 0.042 |
Table 4. Uni-Mol on Tox21
| Metric \ M | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| roc-auc | 0.785 | 0.797 | 0.8 | 0.799 | 0.798 | 0.803 | 0.804 | 0.807 | 0.807 | 0.808 |
| ece | 0.052 | 0.042 | 0.042 | 0.038 | 0.038 | 0.037 | 0.035 | 0.034 | 0.034 | 0.034 |
| nll | 0.269 | 0.251 | 0.251 | 0.248 | 0.246 | 0.243 | 0.243 | 0.242 | 0.242 | 0.241 |
| brier | 0.072 | 0.069 | 0.069 | 0.069 | 0.068 | 0.068 | 0.067 | 0.067 | 0.067 | 0.067 |
We hope our response addresses your concerns. We would be grateful if you could raise your evaluation of our manuscript in light of these responses. Let us know if you have further questions or suggestions!
Best regards,
MUBen authors
Thank you for the clarification! Seeing the prediction scores and the UQ scores along with M was informative.
As for the data distribution, what I meant was on top of the assumption that the MUBen benchmark is based on scaffold splitting. Existing pretrained models such as Uni-Mol and GROVER were already evaluated using scaffold splitting in their original papers (for MoleculeNet downstream tasks).
Given that these pretrained models predict well on the test data obtained by scaffold splitting of MoleculeNet, these datasets might not be a good choice for simulating the case of "out of training distribution". It might be OOD in the input space (scaffold-wise), but it will overlap well with the training data in the latent space, i.e., the space of representations learned by large-scale pretraining. That would be the central idea of representation learning by large-scale pretraining, and that would be why the prediction scores of these pretrained models were reportedly good on MoleculeNet downstream tasks, even when scaffold splitting was used.
This paper proposes an evaluation method called MUBen to benchmark various pre-trained molecular representation models. The authors fine-tuned different models with a series of molecular descriptors and provided assessments and insights for model selection.
Strengths
This paper comprehensively investigated the uncertainty quantification for molecular representation models, including various pre-trained backbones covering string-based, graph-based, 3D-structure-based and hand-crafted models. Also, various methods of uncertainty quantification were involved such as Bayes by Backprop.
Weaknesses
The novelty is limited. As a benchmark article, the final conclusion is pretty general: ensemble methods seem to perform better for evaluating molecular uncertainty.
As a benchmark paper, it is necessary to illustrate the best configuration of backbone choice, UQ method and data splitting strategy for classification and regression tasks.
The arrangement of this paper could be improved to make it better. Some Tables provide redundant information, such as Table 1 and Table 2.
Questions
- This paper is not easy for general readers to read and follow. Since the comparison of UQ methods is the key point of this paper, I highly recommend that the authors make simple schematic diagrams for each category of UQ methods, instead of the text-only Figure 1.
- For the first category of pre-trained molecular representation models, why do you choose ChemBERTa? According to the MoLFormer paper, MoLFormer has better representation ability than ChemBERTa.
- In the regression results of Table 2, SGLD also seems to have good performance. Please analyze the reasons.
- By comparing Uni-Mol, ChemBERTa, and DNN in Figure 4, it is concluded that larger models such as Uni-Mol are more confident in their results. However, the performance of only two models is shown for each dataset. Are the figures for 'ChemBERTa on FreeSolv' and 'Uni-Mol on Lipo' missing?
- Do the Table 3 results only belong to Uni-Mol? If so, you should emphasize Uni-Mol in the Table 3 title. And regarding the discussion of Table 3, the text mainly focuses on whether the backbone is frozen or not. What about the splitting methods? Why does random splitting perform better on classification tasks but worse on regression tasks than scaffold splitting?
- It would be better to illustrate the best configuration of backbone choice, UQ method, and data-splitting strategy for classification and regression tasks.
Dear Reviewer,
Thank you for your valuable feedback on our manuscript. We address your concerns as follows:
- UQ Illustration Update: We acknowledge the need to update the illustration and will include it in the revised version that will be posted in a few days.
- MoLFormer Comparison: Our initial attempt to test and adapt MoLFormer using this script encountered the error of a missing model checkpoint, similar to this issue. However, we now recognize that these issues have been resolved in the updated GitHub repository. We plan to incorporate MoLFormer into our benchmark in future work. In fact, while we were investigating the pre-trained models, we noticed that many works did not open-source their models, and we assumed the same situation for MoLFormer.
- Discussion on SGLD: We have provided a detailed discussion on this topic, as indicated in the second paragraph below Figure 4, starting with "Although limited in classification efficacy…"
- Figure 4 Subfigure Selection: We chose these subfigures to compare ChemBERTa with DNN and Uni-Mol with DNN, respectively, on two datasets to show a general trend. We selected DNN as an anchor since it has the best regression calibration, as shown in Figure 2 (b). As the space for the main paper body is limited, we only selected the most representative figures to support our argument and left the other results in the Appendix. We have checked all figures during the experiments, and they show a coherent pattern as displayed and described in our paper.
- Table 3 Content and Discussion: The results in Table 3 represent a macro average across datasets, backbones, and UQ methods and are not exclusive to Uni-Mol, as the caption indicates. We spend the second half of the paragraph "Frozen Backbone and Randomly Split Datasets" in Section 5 and the whole Appendix subsection E3 discussing the behavior of the models on randomly split datasets, so we think the statement that the discussion mainly focuses on the frozen backbone is not fully justified. Random splitting also performs better on regression tasks: for regression metrics, lower scores are better, as indicated by the arrows beside each metric.
- Best Configuration of Method Choices: By "configuration", if it refers to what backbone/UQ method we should choose under a certain scenario, then we provided some insights in our Conclusion section. If it refers to the model hyper-parameters, we have set them as the default parameters in our code. For the data-splitting strategy, it depends on the problem setup (whether it is in-distribution or out-of-distribution), and there is no "best configuration" for it.
We respectfully disagree with your point that "novelty is limited"; we would like to argue that we did provide insights on model and UQ selection in different scenarios in our experiments (Section 5) and conclusion (Section 6). Although the experimental results suggest that the ensemble method is a superior choice, we would also like to highlight that the contribution of a benchmark paper does not only come from the conclusion and suggestion of which method is best, but also includes:
- Setting up a fair comparison and evaluation pipeline and encouraging more researchers to study better UQ methods (or specific adaptations or inspirations from the molecular domain);
- Providing ready-to-use implementations of the UQ methods for researchers and practitioners.
We hope these clarifications address your concerns effectively. We would be grateful if you could reconsider your evaluation of our manuscript in light of these responses.
Sincerely,
MUBen Authors
Dear Reviewer,
We're appreciative of your constructive feedback on our work. As the rebuttal session draws to a close, we would like to kindly remind you to check out our responses and updates of the draft. We would greatly value your insights on these updates, and welcome any further suggestions or concerns you might have.
Dear Reviewers,
Thank you for your advice on our work! We have updated our draft and uploaded a new version to the system. The main adjustments include:
- Figure 1 updated: We redrew Figure 1 to make it more informative with respect to molecular descriptors and uncertainty quantification methods, as suggested by Reviewer KD9p;
- Updated citations and discussion: We updated citations and discussions in Section 1 Introduction and Section 2 Related Works according to the suggestions from [Reviewer MCGM].
In addition, we made some minor updates to fix typos and incorrect citation formats across the article.
Please let us know if you have further questions and concerns, so that we can continuously update our draft to make it better and more solid!
Best regards, MUBen Authors
Three reviewers have raised many critical remarks about the conceptual novelty of this work, problems concerning the interpretation of experimental results, questions regarding the experimental setup, etc. It seemed that none of these reviewers had the impression that the rebuttal could address these weaknesses in a convincing way (and I come to similar conclusions). On the other hand, there was one more positive review, but during the discussion phase, I had the impression that this positive vote was not unconditionally positive, since several of the points of criticism were also shared by this reviewer. In summary, I think that for this paper the weaknesses outweigh the strengths, and therefore, I recommend rejection.
Why not a higher score
Limited conceptual novelty and several open questions about the experimental setup.
Why not a lower score
N/A
Reject