PaperHub
Overall rating: 6.0 / 10
NeurIPS 2024 · Poster · 4 reviewers
Scores: 8, 5, 5, 6 (min 5, max 8, std. dev. 1.2)
Confidence: 3.5 · Correctness: 3.3 · Contribution: 2.8 · Presentation: 2.5

Exploring Molecular Pretraining Model at Scale

Submitted: 2024-05-15 · Updated: 2024-11-06
TL;DR

We systematically investigate the scaling laws of molecular pretraining and scale the model to 1.1 billion parameters through pretraining on 0.8 billion conformations, making it the largest molecular pretraining model to date.

Abstract

Keywords
Molecular Pretraining · Scaling Law · Molecular Property Prediction

Reviews and Discussion

Official Review (Rating: 8)

Summary

This paper presents Uni-Mol2, a molecular pretraining model, and systematically investigates the scaling law within molecular pretraining models.

Contributions

  1. The largest dataset for molecular pretraining: Curated a dataset of approximately 884 million 3D conformations for pretraining.
  2. Scaling law: This work is the first to demonstrate the scaling law of molecular pretraining and its impact on downstream task performance.
  3. Significant improvement: Uni-Mol2 is the SOTA model and demonstrates consistent improvement in downstream task performance with increasing model parameters.

Strengths

  1. Presents Uni-Mol2, a novel molecular pretraining model that leverages a two-track transformer to integrate features at multiple levels.
  2. Conducts the first exploration of the scaling law in molecular pretraining models.
  3. Curates a large dataset of approximately 884 million 3D conformations, providing a solid foundation for training large-scale molecular models.

Weaknesses

Limited details of pretraining hyper-parameters: The paper primarily focuses on analyzing the power-law relationships, but does not give details of the pretraining hyper-parameters.

Questions

  1. What is the computational cost and time complexity of training the Uni-Mol2 model with different parameters on the large dataset?
  2. Could you please provide more details on how the temperature-based sampling method affects the balance of the training dataset and the performance of the model?

Limitations

Please see the weaknesses and questions above.

Author Response

We extend our sincere thanks to Reviewer ER2p for the positive evaluation and the thoughtful time invested in reviewing our manuscript. Your encouraging feedback is greatly appreciated, and we have carefully addressed each of your comments in the detailed responses provided below.

Response to Weaknesses

  1. Limited Details of Pretraining Hyper-Parameters
    Thank you for pointing out the lack of detailed information regarding the pretraining hyper-parameters in the current manuscript. The pre-training hyper-parameters are outlined in Section 3.3 of the paper, which covers the learning rates, batch sizes, number of epochs, optimizer settings, and regularization techniques. If there are specific details or additional aspects you would like us to clarify, please let us know, and we would be happy to discuss them further.

Response to Questions

  1. Computational Cost and Time Complexity
    The time complexity of training the Uni-Mol2 model primarily depends on the number of parameters, dataset size, and computational resources. We provide the details of the training process, including hardware specifications, training duration, and computational resources used in the general rebuttal. We will subsequently integrate this information into the manuscript.

  2. Impact of Temperature-Based Sampling
    In our current study, we faced a significant imbalance in the scaffold clustering results of the data. Specifically, the first two categories had a disproportionately high proportion of molecules, while the subsequent categories had much lower proportions. Our dataset contains 73 million scaffolds, with the top two scaffolds alone accounting for 2.78% of the total number of molecules; in contrast, the last 25 million scaffolds together represent only 2.8% of the molecules. To address this issue and ensure that rare scaffolds can be sampled, we referred to sampling techniques [1][2] used in language models and adopted a temperature-based sampling approach that reduces the probability of selecting the top two scaffolds by a factor of 19,000, which in turn increases the chances of sampling rarer scaffolds (a simplified sketch of this kind of sampling is shown below).

    However, due to current computational resource constraints, we have not been able to conduct extensive ablation studies on different sampling strategies. Nevertheless, based on recent advancements in Large Language Models (LLMs) [3][4][5], it is evident that various data sampling methods and the mixture of data types present a promising direction for future research. We appreciate your suggestion and consider it a valuable area for further investigation.
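As a concrete illustration of the idea, a minimal sketch of temperature-based scaffold sampling follows; the temperature value, scaffold counts, and function name are assumptions for the sketch, not the exact settings used for Uni-Mol2.

```python
import numpy as np

def temperature_sampling_probs(scaffold_counts, temperature=2.0):
    """Flatten a skewed scaffold-frequency distribution for sampling.

    scaffold_counts: molecule counts per scaffold cluster.
    temperature > 1 down-weights dominant scaffolds and up-weights rare ones.
    """
    freqs = np.asarray(scaffold_counts, dtype=np.float64)
    freqs = freqs / freqs.sum()              # empirical sampling probabilities
    tempered = freqs ** (1.0 / temperature)  # soften the distribution
    return tempered / tempered.sum()         # renormalize to a valid distribution

# Illustrative example: two dominant scaffolds and many rare ones.
counts = np.array([5_000_000, 4_000_000] + [10] * 100)
probs = temperature_sampling_probs(counts, temperature=2.0)
rng = np.random.default_rng(0)
sampled = rng.choice(len(counts), size=10_000, p=probs)
# The two dominant scaffolds are now sampled far below their raw frequency,
# so rare scaffolds have a realistic chance of appearing during training.
```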

References

[1] Wang X, Tsvetkov Y, Neubig G. Balancing training for multilingual neural machine translation[J]. arXiv preprint arXiv:2004.06748, 2020.
[2] Fan A, Bhosale S, Schwenk H, et al. Beyond english-centric multilingual machine translation[J]. Journal of Machine Learning Research, 2021, 22(107): 1-48.
[3] Shao Y, Li L, Fei Z, et al. Balanced Data Sampling for Language Model Training with Clustering[J]. arXiv preprint arXiv:2402.14526, 2024.
[4] Ye J, Liu P, Sun T, et al. Data mixing laws: Optimizing data mixtures by predicting language modeling performance[J]. arXiv preprint arXiv:2403.16952, 2024.
[5] Gu J, Yang Z, Ding C, et al. CMR Scaling Law: Predicting Critical Mixture Ratios for Continual Pre-training of Language Models[J]. arXiv preprint arXiv:2407.17467, 2024.

Comment

I really like your rebuttal and the paper. You've added some training details, and I'm curious how you ultimately chose the learning rate scheduler. For example, I've noticed some interesting recent papers on training dynamics, such as:

MiniCPM: https://arxiv.org/abs/2404.06395

Qwen technical report: https://arxiv.org/abs/2309.16609

Could you discuss or possibly explore the impact of adopting these new learning rate schedulers and include this part in your paper?

Comment

Thank you for your kind words and your interest in our work. We appreciate your attention to the training details and your insightful question regarding the choice of the learning rate scheduler. Based on our observations, selecting an appropriate learning rate scheduler often lacks a straightforward or universally applicable solution; the process relies largely on experiential knowledge. We chose the polynomial decay scheduler based on prior work [1], as we found that, with certain hyperparameter adjustments, it effectively optimized Uni-Mol2 to convergence. We also noticed that the cosine scheduler has been widely adopted in the training of many large language models [2, 3, 4]. In fact, we conducted preliminary experiments using Uni-Mol2 84M to compare the cosine scheduler with the polynomial decay scheduler employed in this paper. These preliminary results indicate that, on the pretraining task, the performance of the two schedulers is consistent. We will add further results on this point in a future revision. Thank you again for your valuable suggestions.
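For reference, a minimal sketch of the two schedule shapes discussed above is given below; the warmup length, peak learning rate, total steps, and decay power are placeholder assumptions, not the settings used for Uni-Mol2.

```python
import math

def polynomial_decay_lr(step, peak_lr, warmup_steps, total_steps, end_lr=0.0, power=1.0):
    """Linear warmup followed by polynomial decay (the scheduler family discussed above)."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return end_lr + (peak_lr - end_lr) * (1.0 - progress) ** power

def cosine_lr(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup followed by cosine decay, common in LLM training."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Placeholder hyperparameters, for illustration only.
for step in (0, 10_000, 250_000, 500_000, 1_000_000):
    print(step,
          round(polynomial_decay_lr(step, 1e-4, 10_000, 1_000_000), 8),
          round(cosine_lr(step, 1e-4, 10_000, 1_000_000), 8))
```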

References

[1] Zhou G, Gao Z, Ding Q, et al. Uni-mol: A universal 3d molecular representation learning framework[J]. 2023.
[2] Bai J, Bai S, Chu Y, et al. Qwen technical report[J]. arXiv preprint arXiv:2309.16609, 2023.
[3] Dubey A, Jauhri A, Pandey A, et al. The Llama 3 Herd of Models[J]. arXiv preprint arXiv:2407.21783, 2024.
[4] Kaplan J, McCandlish S, Henighan T, et al. Scaling laws for neural language models[J]. arXiv preprint arXiv:2001.08361, 2020.

Comment

Yeah, good luck with your paper. Let me know how I can help.

Comment

Thank you for your encouragement and support! I truly appreciate your willingness to help and your confidence in my work.

Official Review (Rating: 5)

In this work, the authors propose Uni-Mol2, a molecular pretraining model that leverages a two-track transformer to integrate features at the atomic level, graph level, and geometry structure level. The authors also investigate the scaling law within molecular pretraining models, characterizing the power-law correlations between validation loss and model size, dataset size, and computational resources.

Strengths

The work investigates an important problem in chemistry and machine learning, and proposes a useful LLM model. The figures are very informative, especially Fig. 2 on the architectural pipeline.

Weaknesses

In the Related Work, it would also be relevant if the authors can discuss graph neural network (GNN) models that have recently been used to aid in improved representation learning for molecular graphs e.g., YieldGNN. Furthermore, GNNs are significantly more computationally efficient as compared to LLM-based models.

The following paper provides a survey on molecular representation learning and seems to be a relevant recent reference:

Zhichun Guo, Kehan Guo, Bozhao Nan, Yijun Tian, Roshni G. Iyer, Yihong Ma, Olaf Wiest, Xiangliang Zhang, Wei Wang, Chuxu Zhang, and Nitesh V. Chawla. 2023. Graph-based molecular representation learning. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence (IJCAI '23). Article 744, 6638–6646. https://doi.org/10.24963/ijcai.2023/744

Questions

Can the authors more clearly describe (1) the contributions of their work, and (2) their design choice behind considering an LLM model as opposed to GNN model. Why not also consider a combination of both?

What is the time complexity of training the LLM model? Also, how many GPU resources were used?

Limitations

The work sufficiently describes the limitations.

Author Response

We would like to thank Reviewer XhCf for the detailed and constructive review. Below, we respond to all your comments.

We appreciate the opportunity to clarify a key point at the outset: typically, when referring to LLM models, the term denotes large language models that primarily aim at next-token prediction within the natural language processing domain [1][2]. However, our manuscript builds upon a pre-trained model specifically tailored for the small molecule domain. Fundamentally, our model is a two-track transformer aimed at masked token prediction and coordinate denoising. Our research focuses on examining the scaling laws relevant to this category of molecular pre-training models. Hence, we believe that a more appropriate comparison would be between transformer-based models and GNN-based models within the small molecule domain.
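For readers unfamiliar with this setup, a minimal sketch of such a combined objective (masked atom-type prediction plus coordinate denoising) is shown below; the function name, tensor shapes, and the smooth-L1 choice for the coordinate term are illustrative assumptions, not the exact Uni-Mol2 implementation.

```python
import torch
import torch.nn.functional as F

def pretraining_loss(atom_logits, atom_targets, mask,
                     pred_coords, true_coords,
                     token_weight=1.0, coord_weight=1.0):
    """Combined masked-atom-type prediction and coordinate-denoising loss.

    atom_logits:  (n_atoms, vocab_size) predicted atom-type logits
    atom_targets: (n_atoms,) ground-truth atom type ids
    mask:         (n_atoms,) boolean, True where the atom type was masked
    pred_coords:  (n_atoms, 3) coordinates predicted from noised inputs
    true_coords:  (n_atoms, 3) original conformer coordinates
    """
    token_loss = F.cross_entropy(atom_logits[mask], atom_targets[mask])
    coord_loss = F.smooth_l1_loss(pred_coords, true_coords)
    return token_weight * token_loss + coord_weight * coord_loss

# Tiny illustrative call with random tensors and a fixed mask.
n_atoms, vocab_size = 16, 30
mask = torch.zeros(n_atoms, dtype=torch.bool)
mask[:3] = True  # pretend the first three atom types were masked
loss = pretraining_loss(
    torch.randn(n_atoms, vocab_size),
    torch.randint(0, vocab_size, (n_atoms,)),
    mask,
    torch.randn(n_atoms, 3),
    torch.randn(n_atoms, 3),
)
```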

Response to Questions

  1. Contributions, Design Choice of Uni-Mol2, and Comparison with GNNs
    Our research's main contribution is to investigate the scaling laws relevant to this category of molecular pre-training models. Our results reveal power-law relationships between validation loss and model size, dataset size, and computational resources. Additionally, we observe consistent improvements in downstream tasks as the model size increases.

    We recognize the advantages of GNNs, particularly their efficiency in processing molecular graph structures and their strong capability to capture local relationships within a molecule. However, the locally connected graph fails to adequately represent long-range interactions between atoms. These long-range interactions are crucial in molecular representation learning (MRL). In contrast, transformer-based models have shown exceptional performance in various tasks within the molecular domain, demonstrating remarkable representation capabilities. Some researchers even view transformers as a form of fully-connected GNN [3]. Additionally, recent advancements in transformer engineering optimization have further enhanced their effectiveness [7].

  2. Time Complexity and GPU Resources
    The time complexity of training the Uni-Mol2 model primarily depends on the number of parameters, dataset size, and computational resources. We provide the details of the training process, including hardware specifications, training duration, and computational resources used in the general rebuttal. We will subsequently integrate this information into the manuscript.

Response to Weaknesses

  1. Discussion of Graph Neural Network (GNN) Models
    We appreciate the suggestion to include a discussion on Graph Neural Network (GNN) models, such as YieldGNN [9], in the Related Work section. Specifically, we will reference the survey on graph-based molecular representation learning by Guo et al. (2023) [8] and highlight the advantages of GNNs, particularly in terms of representation power and computational efficiency. We believe that including this related work will significantly enhance the completeness and context of our manuscript.

References

[1] Achiam J, Adler S, Agarwal S, et al. Gpt-4 technical report[J]. arXiv preprint arXiv:2303.08774, 2023.
[2] Reid M, Savinov N, Teplyashin D, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context[J]. arXiv preprint arXiv:2403.05530, 2024.
[3] Min E, Chen R, Bian Y, et al. Transformer for graphs: An overview from architecture perspective[J]. arXiv preprint arXiv:2202.08455, 2022.
[4] Zhou, Gengmo, et al. "Uni-mol: A universal 3d molecular representation learning framework." (2023).
[5] Yu, Q., Zhang, Y., Ni, Y., Feng, S., Lan, Y., Zhou, H., and Liu, J. Unified molecular modeling via modality blending. arXiv preprint arXiv:2307.06235, 2023.
[6] Luo S, Chen T, Xu Y, et al. One transformer can understand both 2d & 3d molecular data[C]//The Eleventh International Conference on Learning Representations. 2022.
[7] Dao T, Fu D, Ermon S, et al. Flashattention: Fast and memory-efficient exact attention with io-awareness[J]. Advances in Neural Information Processing Systems, 2022, 35: 16344-16359.
[8] Zhichun Guo, Kehan Guo, Bozhao Nan, Yijun Tian, Roshni G. Iyer, Yihong Ma, Olaf Wiest, Xiangliang Zhang, Wei Wang, Chuxu Zhang, and Nitesh V. Chawla. 2023. Graph-based molecular representation learning. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence (IJCAI '23). Article 744, 6638–6646. https://doi.org/10.24963/ijcai.2023/744
[9] Shi R, Yu G, Huo X, et al. Prediction of chemical reaction yields with large-scale multi-view pre-training[J]. Journal of Cheminformatics, 2024, 16(1): 22.

Comment

Thanks for your response and clarifications. I also look forward to seeing the new revisions in the future version of the paper, which will improve the quality of this work. I will increase my score by one point.

Comment

We sincerely appreciate your feedback and will incorporate the revisions in the next version of the manuscript. Thank you for your consideration and for raising the score; your support is greatly valued.

Official Review (Rating: 5)

This paper studies large-scale pretraining for molecules. The authors compose the largest molecular pretraining dataset and train the largest molecular pretraining model, Uni-Mol2, with 1.1B parameters. This paper also fits a scaling law in this domain that can accurately predict losses on a validation set. Downstream performance on the QM9 and COMPAS-1D datasets demonstrates that the proposed model can outperform previous approaches.

Strengths

  1. This paper creates the largest molecular pretraining dataset and the authors indicate the intent to open-source the dataset in the checklist.
  2. The authors pretrain various scales of molecular models up to 1.1B parameters on the new dataset, a scale that has not been studied previously.
  3. The results on downstream tasks surpass previous works.

Weaknesses

  1. On the downstream numbers reported in the paper, the scores do not scale well with model size for most of them. For example, on most of the properties Uni-Mol2 1.1B is not apparently better than much smaller models like Uni-Mol2 310M, and on COMPAS-1D Uni-Mol2 1.1B is even comparable to Uni-Mol2 84M in some cases. It is likely that I am not familiar with these datasets and do not perceive the score difference well, yet I am wondering whether a gain like 0.0001 is really meaningful and how we can know it is statistically significant (a sketch of one possible check appears after this list). Is this because the tasks are too simple, or because the large 1.1B model is not trained properly? If the reason is the former, I think more difficult tasks should be included to demonstrate the benefit of larger models; otherwise the impact of this paper on scaling up model sizes is limited.
  2. The paper’s presentation should be improved. For example, in the results tables it is better to indicate the parameter size of the baselines (like Uni-Mol) to understand the effect of model sizes on the results. There are also some grammar errors such as Line 39 “Most” -> “most”. Line 65 “To” -> “to”.
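For reference, one way such a significance question could be probed (not something reported in the paper or rebuttal) is a paired test over repeated fine-tuning runs with different seeds on identical splits; a minimal sketch with hypothetical MAE values follows.

```python
import numpy as np
from scipy import stats

# Hypothetical per-seed test MAEs for two model sizes on the same splits.
mae_310m = np.array([0.0102, 0.0105, 0.0101, 0.0104, 0.0103])
mae_1_1b = np.array([0.0100, 0.0101, 0.0099, 0.0102, 0.0100])

# Paired t-test: is the per-seed difference consistently in favor of the larger model?
t_stat, p_value = stats.ttest_rel(mae_1_1b, mae_310m)
print(f"mean gain = {np.mean(mae_310m - mae_1_1b):.5f}, p = {p_value:.4f}")
```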

Questions

Related questions have been asked above.

Limitations

The authors have included a limitation section in Appendix C, on the prediction accuracy of the scaling law. However, I would like to see a discussion on scaling up models with the proposed approach when the experimental results have shown saturation with only 1.1B parameters – which may indicate that the proposed training losses cannot be used to train powerful larger and larger models.

Author Response

We deeply appreciate Reviewer buoE's careful review and thoughtful feedback. Your suggestions have greatly contributed to improving our manuscript. We respond to your comments and questions in detail below.

Response to Weaknesses

1. Performance Improvements

We acknowledge your concern regarding the insufficient improvement of Uni-Mol2 in certain downstream tasks. We have provided additional clarification for our downstream experimental results and added some new downstream experiments. Overall, Uni-Mol2 1.1B demonstrates a considerably greater performance improvement compared to Uni-Mol2 84M and Uni-Mol2 310M in downstream tasks.

For the aEA and aIP tasks in the COMPAS-1D dataset, Uni-Mol2 1.1B with graph features demonstrates a more noticeable MAE reduction compared to Uni-Mol2 84M (0.96% -> 10.58% for aEA and 1.23% -> 3.90% for aIP). On the 10 downstream tasks of the QM9 dataset, Uni-Mol2 1.1B achieved an average improvement of 8.472% compared to Uni-Mol2 310M. Detailed explanations and results supporting these clarifications can be found in the general rebuttal section.

2. Presentation Improvements

We appreciate the suggestions for improving the presentation of our manuscripts. We will revise the results tables to include the parameter sizes of baseline models, such as Uni-Mol, to better illustrate the impact of model size on performance. Additionally, we will correct the identified grammar errors and thoroughly review the manuscript to ensure clarity and accuracy.

Response to Limitations

We appreciate the reviewer's observation regarding the limitations of our proposed approach, particularly in relation to the observed saturation of experimental results with models containing 1.1 billion parameters. We agree that this finding suggests a potential limitation in the scalability of the proposed training losses.

In response, we would like to clarify that the observed saturation may be due to the specific experimental setup and dataset limitations. While some of the current results show a plateau at 1.1B parameters, this does not necessarily imply an inherent ceiling of the proposed method. We have added new experiments in the supplementary appendix; the improvement of Uni-Mol2 1.1B over the 310M model remains significant and does not indicate that the results have reached saturation.

Returning to the starting point of this paper, we primarily investigate how the validation loss changes as the model, data, and computation size increase, which is consistent with scaling laws. In the field of molecular pretraining, establishing appropriate pretraining losses that preserve the scaling behavior of the model is indeed a compelling direction [1,2,3]. Given our current results, we have ensured stable training, and this has led to consistently superior results in downstream tasks. We will discuss these considerations in the manuscript to provide a more comprehensive understanding of the limitations and to further explore this direction in future work.

References

[1] Yang J, Zheng K, Long S, et al. MOL-AE: Auto-Encoder Based Molecular Representation Learning With 3D Cloze Test Objective[J]. bioRxiv, 2024: 2024.04.13.589331.
[2] Ni Y, Feng S, Ma W Y, et al. Sliced Denoising: A Physics-Informed Molecular Pre-Training Method[J]. arXiv preprint arXiv:2311.02124, 2023.
[3] Feng, Shikun, et al. "UniCorn: A Unified Contrastive Learning Approach for Multi-view Molecular Representation Learning." arXiv preprint arXiv:2405.10343 (2024).

Comment

Thanks for your response! I appreciate the authors for the added results, and I encourage including them in the next revision. These new results mitigate my concern about the improvement, though in many cases the improvement still seems small and such scaling does not seem to be universally successful. Thus I would like to raise my score a bit to 5.

Comment

We sincerely appreciate your positive feedback and the increase in the score. We are glad that the additional results we provided have helped to mitigate some of your concerns regarding the improvements in our approach. We fully concur with your suggestion to incorporate these results in the next revision.

Official Review (Rating: 6)

This paper studies the pretraining task in the molecular domain. The main contributions include extending the size of the pretraining dataset, scaling the model to 1.1B parameters, and investigating pretraining scaling-law behavior. The evaluation of the pretrained model is conducted on several molecular property prediction tasks.

Strengths

  1. Extends the pretraining dataset, covering strings, graphs, and 3D conformations.
  2. While the model architecture is not new and is based on existing work, the effort to scale it up to a 1B model is great for the current literature.
  3. The scaling-law behavior is interesting, and it differs from that of language models.
  4. While the downstream tasks are limited, consistent performance improvement is shown with increasing model size.

Weaknesses

  1. The evaluation of the pretrained model is limited; it would be great if the authors could show more diverse and complicated chemistry tasks, ideally including some domain-intensive tasks.
  2. The model training time is not reported; the authors should include enough detail to reproduce the experiments.
  3. The pretraining tasks/losses have several hyperparameters; the authors should include ablation study results for these choices.

Questions

  1. For the downstream evaluation tasks, the performance improvement from the small model to the large model is not that high (for example, the numbers in Table 5 and Table 6); can the authors provide some justification for these numbers? The 1.1B model is significantly more costly than the 84M model, so we would expect a larger improvement.
  2. Can the authors include more downstream tasks?

Limitations

NA

Author Response

We are grateful to Reviewer V7dw for the thorough review and insightful comments. Below, we have carefully considered your feedback and provided detailed responses to each point.

Response to Weaknesses:

  1. Limited Evaluation of the Pretrained Model
    We begin by highlighting the promising results achieved in the fields of small molecules and photoelectrics, while also acknowledging the limitations of our evaluation tasks. Due to constraints on computational resources, we focused on a specific set of molecular property prediction tasks. We agree that including more diverse and complex chemistry tasks could provide a broader evaluation of the model's capabilities. To address this, we have added new Biogen ADME benchmark results to demonstrate the model's capabilities. In future work, we plan to incorporate additional domain-intensive tasks to better showcase the utility of our pre-trained model.

  2. Model Training Time and Reproducibility
    Thank you for pointing out the omission of detailed information regarding the model training time. We have outlined the details of the pre-training hyperparameters in Section 3.3 of the paper. Additionally, we provide comprehensive details about the training process, including hardware specifications, training duration, and computational resources used in the general rebuttal. We will subsequently integrate this information into the manuscript. If there are specific details or additional aspects you would like us to clarify, please let us know, and we would be happy to discuss them further.

  3. Ablation Study on Pretraining Tasks/Losses Hyperparameters
    Thank you for your valuable suggestion. We have indeed adjusted the hyperparameters of the loss function during model training to ensure the stability of the training process. In the end, we set the weight for each loss component to 1 across all model sizes. However, systematically conducting ablation studies on the loss function at expanded model and dataset scales would require substantial computational resources, which are currently beyond our capacity for comprehensive exploration in the short term. Nevertheless, we recognize the importance of this research direction and will consider it a significant area for future investigation.

Response to Additional Concerns:

  1. Justification for Performance Improvement
    We fully understand your concern regarding the relatively modest performance improvement observed when transitioning from the small model (310M) to the large model (1.1B) in the downstream evaluation tasks, as illustrated in Table 5 and Table 6. To address this, we have provided further clarification of the results and supplemented our study with additional experiments. The detailed findings and explanations have been included in the general rebuttal section.

  2. Inclusion of More Downstream Tasks
    Yes, expanding the range of tasks will allow for a more comprehensive evaluation of the model's performance across different aspects. We will explore more applications in this area.

Comment

Thank you for answering my questions. I'm looking forward to seeing your model achieve substantial influence on downstream tasks in the future. I will keep my score.

Comment

We sincerely appreciate your thoughtful questions and your engagement with our work. We are encouraged by your interest in seeing our model have a substantial impact on downstream tasks in the future. Thank you for your valuable feedback and for taking the time to review our paper.

Author Response

General Rebuttal

(R IDs: R1=V7dw, R2=buoE, R3=XhCf, R4=ER2p)

We thank the reviewers for the detailed and helpful reviews. Below, we address the main concerns raised by the reviewers.

  1. Time Complexity and GPU Resources

We utilized a computational cluster comprising 64 NVIDIA A100 GPUs, each equipped with 80GB of HBM2 memory. The GPUs were interconnected via a high-speed NVIDIA InfiniBand fabric offering 400 Gbps of bandwidth for inter-GPU communication. The details for each model size are listed below.

| Params | Compute Resource (GPUs) | Training Time (GPU hours) |
|--------|-------------------------|---------------------------|
| 84M    | 32                      | 2585.6                    |
| 164M   | 32                      | 5120                      |
| 310M   | 32                      | 7680                      |
| 510M   | 64                      | 13824                     |
| 1.1B   | 64                      | 30720                     |
  2. Performance Improvement Concern

Similar to other works on scaling laws [1][2], our paper explores the power-law relationship between the validation loss of pre-trained models in the molecular domain and the scales of data, model, and computation. This represents the first validation at the scale of billions of data points and pre-trained model parameters.
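To make the power-law fitting procedure concrete, a minimal sketch is given below; the loss values and the extrapolation target are placeholder assumptions, not the measurements or fitted coefficients reported in the paper.

```python
import numpy as np

# Placeholder (model size, validation loss) pairs -- illustrative values only.
sizes = np.array([84e6, 164e6, 310e6, 510e6, 1.1e9])
losses = np.array([0.52, 0.48, 0.45, 0.43, 0.40])

# Fit L(N) = (Nc / N)^alpha, i.e. log L = alpha * (log Nc - log N), linear in log-log space.
slope, intercept = np.polyfit(np.log(sizes), np.log(losses), deg=1)
alpha = -slope
Nc = np.exp(intercept / alpha)

predicted = (Nc / 3e9) ** alpha  # extrapolate to a hypothetical 3B-parameter model
print(f"alpha = {alpha:.3f}, predicted loss at 3B params = {predicted:.3f}")
```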

On the other hand, while our proposed Uni-Mol2 consistently outperforms other baseline models across various downstream tasks, the performance on diverse downstream tasks is influenced by several factors such as data partitioning, data quality, and label noise levels [3][4][5]. Consequently, in some downstream tasks there isn't always a substantial performance improvement as the model scale increases, because data quality and quantity may exert a more significant impact on the final results than model scale. A typical example is the aEA and aIP tasks in the COMPAS-1D dataset, where adding features to Uni-Mol2 1.1B leads to significant performance improvements, achieving an average 14% improvement compared with Uni-Mol.

Regarding the differences in model performance on the QM9 dataset that you mentioned, we have supplemented four additional property prediction tasks (U0, U, G, and H) in the supplementary material. Overall, when comparing Uni-Mol2 310M with Uni-Mol2 1.1B on the QM9 dataset, eight of the ten tasks showed improvements exceeding 2.7% with Uni-Mol2 1.1B, while two tasks achieved improvements above 26%. It is worth noting that in the fine-tuning process for QM9, we only used atomic features and conformational features. We believe this is one of the potential reasons for the observed convergence in model performance when scaling up Uni-Mol2.

In summary, our experiments affirm that scaling models is effective and leads to noticeable improvements. The results clearly demonstrate that increasing model size yields significant performance gains in downstream tasks.

References

[1] Kaplan J, McCandlish S, Henighan T, et al. Scaling laws for neural language models[J]. arXiv preprint arXiv:2001.08361, 2020.
[2] Hoffmann J, Borgeaud S, Mensch A, et al. Training compute-optimal large language models[J]. arXiv preprint arXiv:2203.15556, 2022.
[3] Sultan A, Sieg J, Mathea M, et al. Transformers for molecular property prediction: Lessons learned from the past five years[J]. arXiv preprint arXiv:2404.03969, 2024.
[4] Deng J, Yang Z, Wang H, et al. Unraveling Key Elements Underlying Molecular Property Prediction: A Systematic Study[J]. arXiv preprint arXiv:2209.13492, 2022.
[5] Martinez-Mayorga K, Rosas-Jiménez J G, Gonzalez-Ponce K, et al. The pursuit of accurate predictive models of the bioactivity of small molecules[J]. Chemical Science, 2024, 15(6): 1938-1952.

Final Decision

This paper seeks to evaluate the effectiveness of scaling up pretraining data and parameter size in generative models of molecular structure and to establish initial scaling laws analogous to those already established for generative models of natural language. The paper trains the largest generative molecular model to date and, in experiments, demonstrates improved performance on downstream tasks as pretraining data and parameter scale are increased.

The majority of reviewers found the contributions of this paper to be valuable, specifically: (1) the sheer scale of these experiments, including the additional pretraining data collected, (2) the establishing of new scaling laws that are interesting in how they differ from natural language scaling laws, and (3) the consistent performance improvements on downstream tasks.

Concerns were raised about (1) the breadth and depth of downstream tasks for evaluation, (2) missing training details like training time and effect of hyperparameters, and (3) missing comparisons with GNN-based modeling approaches. These concerns were all addressed adequately in rebuttal.