PaperHub
Overall score: 6.4/10 · Poster · 4 reviewers
Ratings: 4, 4, 4, 4 (min 4, max 4, std 0.0)
Confidence: 3.5 · Novelty: 2.5 · Quality: 2.5 · Clarity: 3.0 · Significance: 2.3
NeurIPS 2025

Mol-LLaMA: Towards General Understanding of Molecules in Large Molecular Language Model

OpenReview · PDF
Submitted: 2025-05-06 · Updated: 2025-10-29
TL;DR

We propose a large molecular language model with general knowledge and reasoning capabilities for molecular reasoning, demonstrating its utility as a general-purpose assistant for molecular analysis.

Abstract

Keywords
Large Molecular Language Model · General-purpose Assistant for Molecular Analysis · Molecular Comprehension

Reviews and Discussion

Review (Rating: 4)

This paper presents Mol-LLaMA, a general-purpose molecular assistant designed to perform molecular analysis with enhanced reasoning and explainability. The primary contribution is the creation of Mol-LLaMA-Instruct, a large instruction-following dataset generated via GPT-4o, incorporating hierarchical reasoning across structural, chemical, and biological levels. The dataset construction leverages annotated IUPAC names and curated PubChem descriptions, using GPT-4o to generate multi-level reasoning samples.

The model architecture integrates 2D and 3D molecular encoders, a 2D–3D blending module, and a Q-Former to project molecular features into a large language model space. The authors adopt a two-stage training procedure involving molecular representation learning followed by multi-modal instruction tuning. Evaluation is conducted using GPT-4o as an automated judge across multiple criteria such as helpfulness, relevance, and accuracy.

Strengths and Weaknesses

Strengths

  • The hierarchical design of the instruction dataset is well thought out and supports improved reasoning capabilities.
  • The integration of 2D and 3D molecular representations via a blending module effectively captures both topological and spatial features of molecules. This design choice is supported by ablation studies.

Weaknesses

  • The paper primarily reuses existing tools such as GPT-4o for data generation and filtering, pre-trained molecular encoders, and Q-Former. The main novelty lies more in the engineering effort and the construction of the dataset rather than in introducing new models or learning algorithms.

  • The paper uses GPT-4o both to generate molecular reasoning data and to evaluate the factual accuracy of the generated samples. While this is a practical approach given GPT-4o’s strong domain knowledge, I am concerned about potential risks of hallucination. If GPT-4o produces factually incorrect or misleading descriptions during generation, it may not always be able to reliably identify these errors during the evaluation phase, since both tasks rely on the same model’s internal knowledge. I suggest the authors discuss this limitation explicitly and, if possible, consider additional validation strategies, such as incorporating human expert review.

  • The instruction dataset is a central contribution of the paper. However, the details of how this dataset is constructed are largely placed in the appendix. I recommend moving the dataset construction process into the main text and expanding the discussion to make the contribution clearer and more accessible to readers.

  • Minors: The paper refers to the "string version of the molecule" and clarifies that this corresponds to the IUPAC name. Since IUPAC names are less commonly used as string representations in computational chemistry compared to formats like SMILES, SELFIES, or InChI, it would be clearer and more precise for the authors to explicitly state "IUPAC name" instead of the more ambiguous "string version." This would help avoid confusion for readers familiar with standard molecular string formats.

Questions

How well justified is the choice of molecular encoders (MoleculeSTM for 2D, UniMol for 3D)? Were alternative encoders considered or compared?

Limitations

I did not observe any major formatting issues during my review of the paper.

Justification for Final Rating

While I still believe the model-side method lacks novelty, I acknowledge that the newly introduced dataset is valuable. The inclusion of preliminary human evaluation results with a fairly high R2 is promising. Given this, I am willing to raise my rating, as I believe this work could make a useful contribution to the research community.

Formatting Issues

I did not observe any major formatting issues during my review of the paper.

Author Response

We sincerely thank you for your constructive and helpful comments. We appreciate the positive comments.

  • The hierarchical design of the dataset is well thought out and supports improved reasoning capabilities.
  • The integration of 2D and 3D molecular representations via a blending module effectively captures both topological and spatial features.

We initially address your concerns below.


Comment 1: The paper primarily reuses existing tools such as GPT-4o for data generation and filtering, pre-trained molecular encoders, and Q-Former. The main novelty lies more in the engineering effort and the construction of the dataset rather than in introducing new models or learning algorithms.

Response 1: We propose a novel framework for building molecular LLMs, making four notable contributions: 1) a novel dataset construction pipeline for molecules, 2) the model architecture, 3) the resulting foundational model, Mol-LLaMA, and 4) further utilization.

Dataset Construction Pipeline

  • Motivation-level: We first identify that existing databases and dataset construction pipelines are not well suited to building molecular LLMs, highlighting two challenges: 1) molecular LLMs must learn wide-ranging knowledge encompassing chemical and biological features, as this knowledge is domain-specific, and 2) unlike other modalities, molecular features are not directly apparent from structures, requiring complex reasoning and interpretability [lines 31-52, 112-125, and Table 2].
  • Methodology-level: To address these limitations, we design novel data types that have not been explored in the literature, based on our observation of the hierarchical relationships between molecular structures and their features.

Model Architecture

  • Motivation-level: We address the limitation that current molecular LLMs rely on a single type of molecular representation and often fail to correctly predict molecular properties [lines 52-54].
  • Methodology-level: We observe that 2D and 3D molecular encoders have distinct advantages. Thus, we opt to utilize both molecular representations [lines 167-172] and propose an effective model architecture, the blending module, that integrates their advantages [lines 173-178], which previous works have not covered (see the sketch below).
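For illustration, a minimal PyTorch sketch of how such a blending block can be realized; all dimensions, layer counts, and residual placements here are our own illustrative assumptions rather than the paper's actual configuration:

```python
import torch
from torch import nn

class BlendingBlock(nn.Module):
    """Illustrative 2D-3D blending block: self-attention within each stream,
    then cross-attention so each stream can attend to the other."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.self_2d = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_3d = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_2d = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_3d = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, h2d: torch.Tensor, h3d: torch.Tensor) -> torch.Tensor:
        # Refine each stream independently (residual self-attention).
        h2d = h2d + self.self_2d(h2d, h2d, h2d)[0]
        h3d = h3d + self.self_3d(h3d, h3d, h3d)[0]
        # Exchange information across streams: 2D queries attend to 3D
        # tokens and vice versa, integrating topological and spatial cues.
        b2d = h2d + self.cross_2d(h2d, h3d, h3d)[0]
        b3d = h3d + self.cross_3d(h3d, h2d, h2d)[0]
        return torch.cat([b2d, b3d], dim=1)  # blended token sequence

# Toy usage: one molecule with 20 2D node tokens and 20 3D atom tokens.
blended = BlendingBlock()(torch.randn(1, 20, 256), torch.randn(1, 20, 256))
```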

Resulting Foundational Model: Mol-LLaMA

Our resulting model, Mol-LLaMA, exhibits several key strengths as follows:

  • General understanding of molecular features: Mol-LLaMA predicts general molecular features more accurately than other molecular LLMs, as shown in Tables 3 and 4, showcasing its potential as a general-purpose assistant for molecular analysis.
  • Explainability and reasoning capabilities: Unlike other molecular LLMs, Mol-LLaMA provides interpretable responses with informative reasoning processes, as shown in Tables 1, 9, 10, 11, and 17, enhancing reliability and thus opening up its use in practical fields.

Further Utilization

We would like to emphasize that our dataset construction pipeline and model architecture can be further utilized to build other LLMs specialized in understanding biomolecules such as proteins, RNAs, and their complexes, a notable contribution to the AI4Science field [lines 306-309]. Please note that we will publicly release all materials, including code, data, and model weights, as a state-of-the-art foundational model for the molecular field.


Comment 2: While this is a practical approach given GPT-4o’s strong domain knowledge, I am concerned about potential risks of hallucination.

Response 2: We appreciate your valuable comment. First, we would like to clarify that our instruction dataset generation is grounded in the annotated molecular features and IUPAC names provided by PubChem, which helps mitigate the risk of hallucination.

Additionally, we adopt GPT-4o due to its demonstrated capabilities in scientific domains, as evidenced by prior studies [1], which support its reliability in this context. In contrast, other LLMs have not been rigorously analyzed in the molecular science domain, making them less reliable for generating and evaluating data.

To show the effect of other LLMs, we filtered data using Qwen1.5-14B, Gemma-1.5-12B-IT, Llama-3.1-8B-Instruct and GPT-4o, selecting samples rated 4 by all. As shown in Table R1, this quadruple-filtered version (Mol-LLaMA^*) yielded no significant gains, suggesting these LLMs may wrongly discard correct data and limit model scope. We will include the results in the final revision.

Although current LLMs are still limited, we would like to emphasize that our dataset construction pipeline offers a notable contribution to building molecular LLMs, as it can be readily adapted to future LLMs with more extensive knowledge of scientific domains.

| Models | Help. | Relev. | Acc. | Details | Overall | Help. | Relev. | Acc. | Details | Overall | Help. | Relev. | Acc. | Details | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mol-LLaMA | 1.126 | 1.145 | 1.154 | 1.090 | 1.125 | 1.224 | 1.266 | 1.302 | 1.211 | 1.251 | 1.578 | 1.840 | 2.030 | 1.528 | 1.744 |
| Mol-LLaMA^* | 1.004 | 1.024 | 1.035 | 0.941 | 1.006 | 1.143 | 1.193 | 1.255 | 1.113 | 1.179 | 1.605 | 1.858 | 2.097 | 1.569 | 1.801 |

Table R1: Quantitative evaluation on structural (left 5 criteria), chemical (middle 5 criteria), and biological (right 5 criteria) understanding.

[1] AI4Science, Microsoft Research, and Microsoft Azure Quantum. "The impact of large language models on scientific discovery: a preliminary study using gpt-4." arXiv preprint arXiv:2311.07361 (2023).


Comment 3: If possible, consider additional validation strategies, such as incorporating human expert review.

Response 3: We thank you for your valuable suggestion. However, gathering a sufficient number of qualified experts for accurately evaluating data quality is highly challenging. First, assessing the generated molecular descriptions requires interdisciplinary expertise across chemistry, biochemistry, biology, and pharmacology. Additionally, some properties may not be annotated in public databases like PubChem, meaning factually correct descriptions could appear unsupported. The inherent ambiguity in certain molecular properties or descriptions may also introduce evaluator bias, as experts from different backgrounds might interpret the same description differently.


Comment 4: I recommend moving the dataset construction process into the main text and expanding the discussion to make the contribution clearer and more accessible to readers.

Response 4: Due to the page limit, we explained the core components of our dataset construction pipeline, including context types, data types, and the filtering process, deferring details such as prompts to the appendix. We will move detailed explanations of the dataset construction pipeline into the main paper in the final revision.


Comment 5: It would be clearer and more precise for the authors to explicitly state "IUPAC name" instead of the more ambiguous "string version."

Response 5: We will revise the term “string representation” to its specific type, such as SMILES or IUPAC name, for clarity.


Comment 6: How well justified is the choice of molecular encoders (MoleculeSTM for 2D, UniMol for 3D)? Were alternative encoders considered or compared?

Response 6: We adopt MoleculeSTM as it inherently captures molecular semantics through contrastive learning between 2D molecular structures and textual descriptions. Prior studies [2] have shown that contrastive learning with textual descriptions shows better performance than reconstruction-based training for building multimodal LLMs.

UniMol is a state-of-the-art molecular encoder across both 2D and 3D representations. Although it is trained with a reconstruction loss, its use of transformer architectures enables strong representational expressiveness.

[2] McKinzie, Brandon, et al. "Mm1: methods, analysis and insights from multimodal llm pre-training." European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2024.


We thank the reviewer for your time and feedback and hope that our responses have addressed your concerns. If you have any further questions, we are happy to address them. If your concerns are appropriately addressed, we hope you kindly consider updating the rating accordingly.

Comment

We sincerely thank the reviewer for the thoughtful feedback throughout the review process. We appreciate your careful consideration of our responses and your support for our work. Your constructive comments are truly helpful in improving the quality and clarity of our paper.

Comment
  1. Thank you for your detailed response. I acknowledge and appreciate the contribution of your dataset, which I had already mentioned in my initial comment as a key contribution of the paper.
  • The model architecture mostly combines existing 2D and 3D encoders (MoleculeSTM and UniMol) rather than introducing a fundamentally new design. This kind of integration, while practical, isn’t a strong novelty on its own. I’m not saying it’s bad to use others’ encoders, but I want to make it clear that there’s no novelty in this point. The Blending Module contains self-attention and cross-attention layers, and the Q-Former are also not new.
  • Regarding the foundational model as a novelty, Mol-LLaMA’s strengths in molecular understanding and reasoning seem to come mainly from the training data (data-driven) rather than from new modeling or algorithms. Introducing a new model that performs well is common but doesn’t automatically imply a method advance.
  • The further utilization you mention also depends on these data-driven capabilities. Overall, the main contribution appears to be the dataset and its construction pipeline, not the modeling methods.
  2. Thank you for sharing insights on the performance of other LLMs; that's very helpful. From my own experience, GPT-4o’s ability to generate molecular descriptions is hallucination-prone and performs reliably only on common molecules. For instance, a single molecule can be represented by multiple SMILES strings, and depending on which SMILES is used as input, the model’s output can vary substantially. Additionally, GPT-4o may mistakenly consider two distinct molecules similar due to the one-to-many mapping between molecules and their SMILES representations. This is my prompt for ChatGPT (GPT-4o-based):

Given 2 molecules: CCCCOC(=O)CCC(C(=O)OCCCC)NC(=O)C1=CC=C(C=C1)N(C)CC2=CN=C3C(=N2)C(=NC(=N3)N)N and CCCCOC(=O)CCC(NC(=O)c1ccc(N(C)Cc2cnc3(N)nc(N)c3n2)cc1)C(=O)OCCCC. Are they represent the same molecule? Only answer Yes or No.

The model answers “Yes”, even though the second molecule is not a valid structure.

Given 2 molecules: C(C1C(C(C(C(=O)O1)O)O)O)OP(=O)(O)O and O=C1OC(COP(=O)(O)OCCC)C(O)C(O)C1O. Are they represent the same molecule? Only answer Yes or No.

The model again answers “Yes”, even though they are different molecules.
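Such identity checks become deterministic once both SMILES strings are canonicalized; a minimal RDKit sketch of this check, added purely as an illustration of the point using the second pair above:

```python
from rdkit import Chem

def same_molecule(smiles_a: str, smiles_b: str):
    """Compare two SMILES strings by RDKit canonicalization.

    Returns None if either string fails to parse (invalid structure),
    otherwise True/False for chemical identity."""
    mol_a = Chem.MolFromSmiles(smiles_a)
    mol_b = Chem.MolFromSmiles(smiles_b)
    if mol_a is None or mol_b is None:
        return None  # at least one input is not a valid molecule
    return Chem.MolToSmiles(mol_a) == Chem.MolToSmiles(mol_b)

# The second pair above: two distinct sugar-phosphate-like molecules.
print(same_molecule(
    "C(C1C(C(C(C(=O)O1)O)O)O)OP(=O)(O)O",
    "O=C1OC(COP(=O)(O)OCCC)C(O)C(O)C1O",
))  # expected: False (the propyl phosphate ester differs)
```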

Regarding the 2D and 3D encoders you employed, both are transformer-based architectures rather than graph neural networks (GNNs). This means they still rely heavily on sequentialized molecular representations, which can inherently vary for the same molecule. That is why I asked about the choice of molecular encoders: a GNN encoder may alleviate this problem because the molecule-to-graph mapping is one-to-one. I also noticed there was no mention of training on canonical SMILES, which may contribute to the unreliability of GPT-4o’s generated results.

Given that your model is trained on multi-level descriptions (structure, function), could you test whether queries using two different SMILES strings for the same molecule produce consistent results for a given function (e.g., BBBP prediction)? This would be particularly important since you are not using human evaluation.

  3. Regarding my comment on additional validation strategies, what I am suggesting is human validation of the LLM outputs on a random subset of your dataset, not manual annotation of the entire dataset. You argue that factually correct (generated) descriptions could appear unsupported; that case is much less likely than the model generating incorrect descriptions. That is why I recommend human evaluation: to better understand how reliable the generated descriptions actually are. While it can be challenging, it is certainly not impossible. I believe experts can perform this validation to some extent, and doing so would only strengthen your dataset. I understand that this validation might not be feasible at this stage; it is just a suggestion and does not affect my overall assessment of your paper.
Comment

We sincerely appreciate the reviewer’s valuable comments and engaging in the discussion. We further address your concerns below.


Comment 7: Regarding the novelty of the model architecture.

Response 7: We thank the reviewer for the insightful feedback. While our model is built upon existing components, our originality lies in a straightforward yet effective model architecture that leverages and integrates multiple molecular representations to address key limitations of molecular LLMs. In this sense, we respectfully believe that our approach offers a novel and well-reasoned combination with practical relevance and empirical gains, in line with a broader perspective on originality.


Comment 8: Mol-LLaMA’s strengths in molecular understanding and reasoning seem to come mainly from the training data (data-driven) rather than from new modeling or algorithms.

Response 8: Recent work increasingly underscores the central role of instruction datasets in endowing LLMs with new skills [3]. In this spirit, we present a novel algorithmic pipeline for constructing instruction datasets, especially suited to scientific problems that require domain-specific knowledge and complex reasoning capabilities. Additionally, our experimental comparisons and ablation study provide valuable insights into the role of instruction datasets and the types of knowledge necessary for developing advanced LLMs tailored to scientific problem solving.

[3] Yue, Yang, et al. "Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?." arXiv preprint arXiv:2504.13837 (2025).


Comment 9: Regarding the foundational model as a novelty.

Response 9: We respectfully believe that proposing a foundational model tailored for molecular reasoning is a significant contribution to advancing the AI4Science field. Foundational models have the potential to generalize across multiple molecular tasks, enable transfer learning for downstream scientific applications, and inspire future work, such as training-free molecular reasoning enhancement or an end-to-end framework that combines LLM capabilities with external scientific resources. We believe these opportunities further highlight the importance of our proposed foundational model.


Comment 10: GPT-4o’s hallucinations on SMILES when generating data.

Response 10: We would like to clarify that we use annotated IUPAC names to generate the data, which alleviates the hallucination problem. This design choice is motivated by analyses showing that IUPAC names are the most interpretable string representations [4]. Additionally, at the time of our study, GPT-4o was the most suitable model for automating the dataset construction pipeline, as it is the only model whose capabilities in scientific problem solving have been extensively analyzed. While GPT-4o is not without limitations, our dataset construction pipeline leverages its strengths to incorporate comprehensive molecular knowledge and reasoning capabilities, as you mentioned.

[4] Microsoft Research AI4Science, and Microsoft Azure Quantum. "The impact of large language models on scientific discovery: a preliminary study using gpt-4." arXiv preprint arXiv:2311.07361 (2023).


Comment 11: Sequentialized molecular representations of encoders might introduce error.

Response 11: Our model architecture ensures permutation invariance by avoiding reliance on sequential molecular representations. Specifically, MoleculeSTM is built on a GNN architecture (i.e., GIN), and UniMol utilizes interatomic distances as features without the positional embeddings of the original transformer architecture, thereby maintaining permutation equivariance. Additionally, the Q-Former ensures permutation invariance through its cross-attention mechanisms, as discussed in lines 182-183. Please note that we use molecular structures rather than SMILES representations as inputs, mitigating the issue that a single molecular structure can have multiple SMILES forms.
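This invariance is easy to verify numerically: cross-attention pools the key/value set by a weighted sum, so reordering molecule tokens leaves the output unchanged as long as no positional encodings are added. A minimal sketch with arbitrary sizes:

```python
import torch
from torch import nn

torch.manual_seed(0)
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
queries = torch.randn(1, 8, 64)   # Q-Former-style learnable query tokens
atoms = torch.randn(1, 20, 64)    # molecule tokens, no positional encoding

out, _ = attn(queries, atoms, atoms)
perm = torch.randperm(20)
out_perm, _ = attn(queries, atoms[:, perm], atoms[:, perm])

# Softmax attention sums the same terms in a different order, so the
# pooled output is identical up to floating-point error.
print(torch.allclose(out, out_perm, atol=1e-6))  # True
```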


Comment 12: Human evaluation on a subset of our instruction dataset.

Response 12: Thank you for your valuable suggestion. We are currently conducting the human evaluation and will share the results as soon as they are finalized.


We sincerely thank you again for your time and effort in reviewing our paper. If you have further questions, we are happy to address them.

Comment

Dear Reviewer voqH,

As the human evaluation is now finalized, we would like to politely draw your attention to the results.


Comment 12: Human evaluation on a subset of our instruction dataset.

Response 12: We conducted a human evaluation on fifteen samples of our detailed structural descriptions, involving nine experts specializing in chemistry and biology. The experts were asked to rate the accuracy of each structural description on a 4-point scale. In Table R2, we report the Pearson and Spearman correlation coefficients between the average human evaluation scores and the scores given by GPT-4o. Both coefficients are close to 1 with low p-values, indicating a strong and statistically significant agreement between human evaluations and GPT-4o's assessments. Notably, the average human score for the subset of data that received a score of 4 from GPT-4o is 3.71, further demonstrating the high reliability of our instruction dataset.

| Pearson | p-value | Spearman | p-value |
|---|---|---|---|
| 0.96 | 2.68×10⁻⁸ | 0.91 | 3.43×10⁻⁶ |

Table R2: Pearson and Spearman correlation coefficients between human scores and GPT-4o scores.
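As a sketch of how such agreement statistics are computed (the scores below are hypothetical toy values, not the actual ratings):

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical stand-ins for the 15 per-sample scores: mean of the nine
# expert ratings vs. the GPT-4o rating, both on a 4-point scale.
human = [3.7, 3.2, 2.8, 3.9, 3.1, 2.5, 3.6, 3.8, 2.9, 3.4, 3.0, 3.7, 2.6, 3.5, 3.3]
gpt4o = [4, 3, 3, 4, 3, 2, 4, 4, 3, 3, 3, 4, 2, 4, 3]

r, p_r = pearsonr(human, gpt4o)
rho, p_rho = spearmanr(human, gpt4o)
print(f"Pearson r={r:.2f} (p={p_r:.1e}), Spearman rho={rho:.2f} (p={p_rho:.1e})")
```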


We sincerely appreciate your thoughtful feedback and hope that the human evaluation results have addressed your concerns. If you have further feedback and questions, we would be happy to respond. If your concerns are appropriately resolved, we kindly ask you to consider updating your ratings.

Comment

Thank you for your response and for addressing my concerns, especially regarding the human evaluation.

While I still believe the model-side method lacks novelty, I acknowledge that the newly introduced dataset is valuable. The inclusion of preliminary human evaluation results with a fairly high R2 is promising. Given this, I am willing to raise my rating, as I believe this work could make a useful contribution to the research community.

Review (Rating: 4)

This paper introduces Mol-LLaMA, a large molecular language model that captures general molecular knowledge and exhibits explainability and reasoning ability. To develop the model, a novel, large-scale instruction dataset (Mol-LLaMA-Instruct) is constructed with GPT-4o, featuring three data types designed to teach foundational molecular knowledge and causal relationships. To improve performance, the authors design a "blending module" that integrates information from both 2D and 3D molecular encoders to create a more comprehensive structural representation. The paper demonstrates through extensive experiments that Mol-LLaMA outperforms existing models, including GPT-4o, in tasks such as molecular understanding and property prediction.

Strengths and Weaknesses

Strengths

  1. Comprehensive Instruction Dataset: A core strength is the systematic design of the Mol-LLaMA-Instruct dataset. Instead of relying on existing task-specific datasets, the authors created three new data types: "Detailed Structural Description," "Structure-to-Feature Relationship Explanation," and "Comprehensive Conversation." This approach explicitly aims to build a foundational understanding of molecular principles and enhance the model's reasoning capabilities.
  2. Innovative Model Architecture: The proposed 2D-3D blending module is an architectural contribution. Recognizing that 2D and 3D representations offer complementary information (e.g., bond connectivity vs. spatial arrangement), the model uses a cross-attention mechanism to integrate them, which is shown to reduce errors and improve the understanding of molecular structures compared to using a single encoder or simple concatenation.

Weaknesses

  1. Reliance on GPT-4o for Data and Evaluation: The instruction dataset was generated using GPT-4o, and the quality of the generated responses was also evaluated using GPT-4o as a judge. This heavy reliance on a proprietary, closed-source model introduces potential issues: Inherited Biases: Any factual errors, biases, or limitations present in GPT-4o could be propagated into the training data and, consequently, into Mol-LLaMA. Evaluation Circularity: Using the same family of models for both data generation and evaluation can lead to results that favor the style and knowledge base of that model, potentially inflating performance scores.
  2. Lack of Statistical Significance Testing: The authors state that due to high computational costs, they did not train models multiple times. While understandable, this makes it harder to assess the robustness and generalizability of the performance gains. At least, the authors should use temperature sampling and test the results multiple times.

Questions

  1. How effective are the molecular representations learned by the proposed model for direct application in downstream tasks? Furthermore, what are its specific advantages over previous unified representation models, such as in [1]?
  2. Regarding the evaluation methodology using GPT-4o as a judge: The paper describes a pairwise comparison to derive a relative score. Was one of the two assistants consistently the GPT-4o model itself, acting as a reference baseline? The prompt instructs the judge to avoid order effects, but were procedural safeguards, such as randomizing the response order, implemented to mitigate positional bias? Finally, considering these factors, how robust is this relative scoring method compared to an absolute grading system where each response would be evaluated individually against a fixed rubric?
  3. Prompt Specificity and Bias: The model was trained and tested on a specific set of instruction formats. To what extent does this in-distribution testing create an advantage for Mol-LLaMA over other models that were not trained on these specific prompts?
  4. Zero-Shot vs. Fine-Tuning: On the MoleculeQA benchmark, all models were fine-tuned. Would a zero-shot evaluation on this task be possible, and what might it reveal about the model's intrinsic generalization capabilities without task-specific training?
  5. Property Prediction Benchmarking: For the molecular property prediction tasks, the comparison could be expanded since Mol-LLaMA is finetuned on the dataset. Have you considered: Including more datasets from the same class (e.g., Tox21, HIV)? Benchmarking against smaller, state-of-the-art models that are specialized for these tasks such as [2] ?
  6. Using the Area Under the ROC Curve (AUROC) metric in addition to accuracy, as it can be more informative for classification tasks?
  7. Evaluation of Baselines: In the property prediction results (Table 4), the "Fidelity" and "Helpfulness" scores are not reported for the GPT-4o baseline. Could you clarify why these metrics were omitted for the baseline model?
  8. Does the model support inputs involving multiple molecules at once? The general LLMs such as GPT-4o can process SMILES of multiple molecules.
  9. Influence of Molecular Conformation: Given the use of a 3D encoder, how sensitive is the model's performance to different input molecular conformations? How effectively does it handle molecules with high conformational flexibility?
  10. How well does the model follow instructions and generalize to tasks beyond those it was explicitly trained on?
  11. Practical Utility: As a general-purpose tool, what are some concrete examples of how Mol-LLaMA could assist chemists or biologists in their day-to-day research workflows?
  12. Comparison with Other Models: The current benchmarks include models of a similar scale. How would Mol-LLaMA be expected to perform against significantly larger foundation models such as GPT-4.5? Could a comparison be made against other existing general-purpose (multi-modal) chemical language models that were not included in this study, such as [3, 4]?
  13. Open-Source Plans: Are there plans to release the model weights, training code, and the Mol-LLaMA-Instruct dataset to the public to support reproducibility and further community research? I don’t find the links in this paper.

I recognize that my previous questions cover a wide range of topics, and it may not be feasible to address every point exhaustively. Therefore, you can focus your response on clarifying and defending the primary contribution of your work. The key is to demonstrate the unique value proposition of Mol-LLaMA. For instance: If the core contribution is superior molecular representation, please provide more targeted evidence to demonstrate this superiority over existing methods. If the central claim is that Mol-LLaMA is a general-purpose foundational model for chemistry, its evaluation should include a broader set of tasks to prove its versatility and superiority over other general-purpose chemical models. If, however, Mol-LLaMA excels as a specialized model for a particular task, the most convincing comparison would be against state-of-the-art (SOTA) models dedicated to that task. Clearly articulating and substantiating the unique value would strengthen the paper and my assessment of it.

[1] Zhu, Jinhua, et al. "Unified 2D and 3D pre-training of molecular representations." Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2022.

[2] Zhang, Zaixi, et al. "Motif-based graph self-supervised learning for molecular property prediction." Advances in Neural Information Processing Systems 34 (2021): 15870-15882.

[3] Zhang, Di, et al. "ChemLLM: A chemical large language model." arXiv preprint arXiv:2402.06852 (2024).

[4] Li, Junxian, et al. "ChemVLM: Exploring the power of multimodal large language models in chemistry area." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 39. No. 1. 2025.

Limitations

  1. No generation abilities: Mol-LLaMA is designed primarily for molecular analysis and understanding. It currently lacks the capability to perform generative tasks, such as designing novel molecules with desired properties. The authors should consider how this model can help advance chemistry or biology.
  2. Dataset Filtering: While the instruction dataset was filtered for factual accuracy using GPT-4o as a judge, this process is not infallible. There is still a possibility that incorrect or subtly flawed information remains in the samples used for training.

Justification for Final Rating

I still hold the reservation that the proposed method may not be the most effective path toward creating a truly general-purpose foundational model for chemistry. In my view, more convincing routes toward that goal would likely involve continued pre-training on large-scale chemical and biological corpora or learning from real-world feedback. For this reason, I believe the long-term significance of this specific approach is somewhat limited.

Nevertheless, considering the strong performance demonstrated across multiple tasks in the manuscript, I raise my rating to 4.

Formatting Issues

No

Author Response

We sincerely thank you for your constructive and helpful comments. We appreciate the positive comments.

  • The design of the three new data types and the resulting instruction dataset builds a foundational understanding of molecular principles and enhances the reasoning capabilities
  • The proposed 2D-3D blending module is innovative, reduces errors, and improves the understanding of molecular structures.

Due to the character limit, we initially address your major concerns and some of your questions below.


Comment 1: The closed-source model (i.e., GPT-4o) could introduce potential issues: Factual errors or biases could be propagated into the training data.

Response 1: We appreciate your valuable comment. While we aimed to select reliable LLMs to ensure data quality, to the best of our knowledge, alternative models lack extensive analysis of their capabilities for molecular and scientific reasoning. In other words, GPT-4o [1] remains the only model with demonstrated reliability in these domains, making it a more suitable choice despite its closed-source nature.

Further, we investigate the effect of other LLMs in evaluating data quality by filtering with four LLMs (GPT-4o, Qwen1.5-14B, Gemma-1.5-12B-IT, and Llama-3.1-8B-Instruct), selecting samples rated 4 by all. As shown in Table R1, this quadruple-filtered version (Mol-LLaMA^*) yielded no significant gains, suggesting these LLMs may wrongly discard correct data and limit knowledge scope. We will include the results in the final revision.

Despite the limitations of current LLMs, we would like to emphasize that our dataset construction pipeline is a notable contribution to building molecular LLMs, as it can be readily applied with upcoming LLMs exhibiting more extensive knowledge of scientific domains.

| Models | Help | Relev | Acc | Details | Overall | Help | Relev | Acc | Details | Overall | Help | Relev | Acc | Details | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mol-LLaMA | 1.126 | 1.145 | 1.154 | 1.090 | 1.125 | 1.224 | 1.266 | 1.302 | 1.211 | 1.251 | 1.578 | 1.840 | 2.030 | 1.528 | 1.744 |
| Mol-LLaMA^* | 1.004 | 1.024 | 1.035 | 0.941 | 1.006 | 1.143 | 1.193 | 1.255 | 1.113 | 1.179 | 1.605 | 1.858 | 2.097 | 1.569 | 1.801 |

Table R1: Quantitative evaluation on structural (left 5 criteria), chemical (middle 5 criteria), and biological (right 5 criteria) understanding.

[1] Microsoft Research AI4Science and Microsoft Azure Quantum. "The impact of large language models on scientific discovery: a preliminary study using gpt-4." arXiv:2311.07361 (2023).


Comment 2: Using the same family of models for both data generation and evaluation can lead to results that favor their style and knowledge base, potentially inflating performance scores.

Response 2: We thank you for your valuable comment. To investigate whether the same model family favors its own style and knowledge base, we conduct the evaluation using four LLMs: GPT-4o, Qwen3-14B, Gemma-3-12B-IT, and Llama-3.1-8B-Instruct. As shown in Table R2, Mol-LLaMA still outperforms all baselines, including GPT-4o, with trends consistent with Table 3, implying that GPT-4o’s evaluation is not biased but rather agrees with other LLMs. We will include these results in the final revision.

| Models | Help | Relev | Acc | Details | Overall | Help | Relev | Acc | Details | Overall | Help | Relev | Acc | Details | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama2-based |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Llama-2-7B-Chat | 0.372 | 0.374 | 0.264 | 0.359 | 0.328 | 0.534 | 0.520 | 0.405 | 0.551 | 0.492 | 0.479 | 0.408 | 0.345 | 0.598 | 0.441 |
| Mol-Instructions | 0.217 | 0.239 | 0.252 | 0.131 | 0.202 | 0.242 | 0.274 | 0.301 | 0.145 | 0.229 | 0.319 | 0.376 | 0.416 | 0.201 | 0.315 |
| LlasMol | 0.278 | 0.306 | 0.304 | 0.204 | 0.265 | 0.295 | 0.335 | 0.312 | 0.215 | 0.280 | 0.332 | 0.378 | 0.443 | 0.260 | 0.341 |
| 3D-MoLM | 0.674 | 0.658 | 0.564 | 0.719 | 0.643 | 0.784 | 0.788 | 0.701 | 0.798 | 0.762 | 0.898 | 0.945 | 0.953 | 0.911 | 0.923 |
| LLaMo | 0.306 | 0.407 | 0.443 | 0.179 | 0.312 | 0.345 | 0.464 | 0.524 | 0.197 | 0.356 | 0.454 | 0.630 | 0.802 | 0.236 | 0.490 |
| Mol-LLaMA | 1.042 | 1.063 | 1.063 | 0.960 | 1.038 | 1.124 | 1.179 | 1.212 | 1.044 | 1.146 | 1.376 | 1.607 | 1.683 | 1.167 | 1.468 |
| Llama3.1-based |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Llama3.1-8B | 0.669 | 0.680 | 0.560 | 0.648 | 0.634 | 0.698 | 0.693 | 0.600 | 0.695 | 0.667 | 0.705 | 0.663 | 0.617 | 0.732 | 0.674 |
| Mol-Instructions | 0.261 | 0.334 | 0.369 | 0.152 | 0.263 | 0.265 | 0.350 | 0.396 | 0.149 | 0.269 | 0.358 | 0.468 | 0.577 | 0.191 | 0.367 |
| 3D-MoLM | 0.835 | 0.874 | 0.794 | 0.782 | 0.817 | 0.902 | 0.973 | 0.902 | 0.829 | 0.900 | 1.114 | 1.295 | 1.346 | 0.970 | 1.180 |
| LLaMo | 0.480 | 0.616 | 0.560 | 0.301 | 0.473 | 0.401 | 0.534 | 0.560 | 0.245 | 0.414 | 0.596 | 0.788 | 0.834 | 0.311 | 0.603 |
| Mol-LLaMA | 1.052 | 1.081 | 1.092 | 0.964 | 1.049 | 1.139 | 1.192 | 1.225 | 1.068 | 1.163 | 1.459 | 1.699 | 1.801 | 1.210 | 1.555 |
| Mol-LLaMA^* | 1.001 | 1.026 | 1.030 | 0.911 | 0.997 | 1.113 | 1.170 | 1.205 | 1.034 | 1.139 | 1.421 | 1.640 | 1.741 | 1.223 | 1.522 |

Table R2: Quantitative evaluation on structural (left 5 criteria), chemical (middle 5 criteria), and biological (right 5 criteria) understanding using diverse evaluators.


Comment 3: Lack of statistical significance testing makes it harder to assess the robustness and generalizability.

Response 3: We appreciate your valuable suggestion. However, we adopt greedy decoding, as it aligns with real-world applications where users prioritize the most confident and reliable responses from LLMs, especially for scientific problems. Instead, we perform the LLM evaluation three times to assess the quality of the generated responses.

Furthermore, we apply temperature sampling with three runs and report the averaged metrics in Table R3, using the same four evaluator LLMs as in Response 2. Performance is maintained, and Mol-LLaMA consistently outperforms the baselines, including GPT-4o, indicating that its superior performance is robust and general.

| Models | Help | Relev | Acc | Details | Overall | Help | Relev | Acc | Details | Overall | Help | Relev | Acc | Details | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama-3.1-8B-Instruct | 0.650 | 0.671 | 0.555 | 0.631 | 0.623 | 0.699 | 0.691 | 0.619 | 0.719 | 0.675 | 0.724 | 0.704 | 0.657 | 0.761 | 0.706 |
| Mol-Instructions | 0.258 | 0.327 | 0.355 | 0.149 | 0.258 | 0.275 | 0.345 | 0.391 | 0.153 | 0.275 | 0.355 | 0.456 | 0.583 | 0.190 | 0.370 |
| 3D-MoLM | 0.694 | 0.679 | 0.575 | 0.734 | 0.663 | 0.756 | 0.761 | 0.665 | 0.771 | 0.731 | 0.869 | 0.886 | 0.852 | 0.874 | 0.870 |
| LLaMo | 0.475 | 0.592 | 0.543 | 0.306 | 0.465 | 0.405 | 0.526 | 0.554 | 0.241 | 0.413 | 0.595 | 0.746 | 0.795 | 0.311 | 0.588 |
| Mol-LLaMA | 1.035 | 1.046 | 1.068 | 0.993 | 1.040 | 1.118 | 1.164 | 1.202 | 1.066 | 1.140 | 1.390 | 1.588 | 1.693 | 1.206 | 1.483 |

Table R3: Quantitative evaluation on structural (left 5 criteria), chemical (middle 5 criteria), and biological (right 5 criteria) understanding with temperature sampling.
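For concreteness, the two decoding regimes described above differ only in generation arguments; a minimal Hugging Face transformers sketch (checkpoint name and prompt are placeholders):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

inputs = tok("Describe the key functional groups of aspirin.", return_tensors="pt")

# Greedy decoding: deterministic, most-confident continuation.
greedy = model.generate(**inputs, do_sample=False, max_new_tokens=128)

# Temperature sampling: stochastic; repeat several runs and average metrics.
sampled = [
    model.generate(**inputs, do_sample=True, temperature=0.7, max_new_tokens=128)
    for _ in range(3)
]
```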


Comment 4: As a general-purpose tool, what are some concrete examples of how Mol-LLaMA could assist chemists or biologists in their research workflows?

Response 4: We carefully designed the experiments to show the effectiveness of Mol-LLaMA in scientific research workflows, focusing on three key points: 1) Molecular analysis, 2) Interactive manipulation, and 3) Task specialization.

Analysis of De novo Molecules

Chemists, biologists, and drug developers frequently encounter novel molecules that lack annotations in public databases. In such cases, Mol-LLaMA enables initial assessments by providing insights into physicochemical properties and likely biological functions. Tables 3 and 4 demonstrate Mol-LLaMA’s effectiveness in this use case, showcasing its superior ability to accurately predict molecular features with helpful explanations.

Interactive Manipulation

Another key strength of Mol-LLaMA is its interactive capability, allowing users to guide its outputs through prompts. For the PAMPA task, we devise "with task info" prompting that injects task-specific information, reflecting real-world scenarios where users guide LLM outputs. As shown in Table 4, Mol-LLaMA, built upon LLaMA3.1, shows performance gains when prompted with task-specific information, demonstrating its ability to effectively handle diverse user prompts.

Task Specialization

Mol-LLaMA can be further fine-tuned for specific tasks, making it well-suited for researchers and developers working in highly specialized domains such as drug discovery. We validate Mol-LLaMA’s effectiveness in task transfer settings of Tables 5, 12, and 13, demonstrating superior performance compared to baseline models.


Comment 5: To what extent does this in-distribution testing create an advantage for Mol-LLaMA over other models that were not trained on these specific prompts?

Response 5: In the PAMPA task, we assess Mol-LLaMA’s generalization ability using prompts that were entirely different from those seen during training, as shown in Table 32. These settings show that Mol-LLaMA is capable of reasoning and generating accurate answers beyond the training prompts, showcasing its generalization ability.


Comment 6: Have you considered more datasets such as MoleculeNet?

Response 6: We opted for the PAMPA task as it is well suited to evaluating whether molecular LLMs truly understand molecular structures and features. In contrast, tasks in MoleculeNet require external knowledge beyond the molecule itself, such as the structure of the blood-brain barrier for the BBBP task or nuclear receptors for the Tox21 task, and are heavily influenced by the internal knowledge of the base LLM, making fair comparisons difficult.

Nevertheless, we include a comparison on the BBBP task from MoleculeNet in Table 8, where Mol-LLaMA outperforms existing molecular LLMs built upon the same model architectures.


Comment 7: Are there plans to release the model weights, training code, and the Mol-LLaMA-Instruct dataset?

Response 7: We will release the Mol-LLaMA weights, training code, and Mol-LLaMA-Instruct dataset to facilitate further research in molecular foundation models and practical usage for the scientific domain.


We thank the reviewer for your time and feedback and hope that our responses have addressed your concerns. If you have any further questions, we are happy to address them. If your concerns are appropriately addressed, we hope you kindly consider updating the rating accordingly.

Comment

Dear Reviewer 6V81,

Would you mind checking the authors' replies?

AC

Comment

Thank you for your detailed response. Based on your clarifications, I now understand that the positioning of this work is as a general-purpose foundational model for chemistry. Your answers are compelling, and the additional experiments you provided have successfully addressed my previous concerns on several points.

Could you also provide responses to a few other questions related to this "general-purpose foundational model" concept, specifically Questions 2, 8, 9, and 12?

If these points can be satisfactorily addressed, I would be happy to raise my rating for the manuscript.

Comment

We are deeply grateful for your time and effort in reviewing our work and for the constructive feedback that has strengthened our paper. We greatly appreciate your support in recommending our work for acceptance.

Comment

We sincerely appreciate your thoughtful response and engagement in the discussion. We are grateful for the opportunity to address the remaining questions and further deepen the understanding of our work.

We address your remaining questions below.


Comment 8: Regarding the evaluation setting: 1) was GPT-4o consistently used as a reference baseline, and 2) were procedural safeguards implemented?

Response 8: Yes, we use GPT-4o as a consistent reference baseline, and we randomly shuffle the response order for each evaluation, reporting the average of three evaluations to mitigate positional bias.
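A minimal sketch of this protocol, with `judge_fn` as a hypothetical wrapper around the GPT-4o judge that scores a pair of responses in the order given:

```python
import random

def relative_score(judge_fn, reference, candidate, n_rounds=3, seed=0):
    """Average a pairwise judge over rounds with randomized response order.

    judge_fn(first, second) -> (score_first, score_second) is an assumed
    interface; the relative score is candidate / reference, as in
    pairwise LLM-as-a-judge evaluation."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rounds):
        if rng.random() < 0.5:  # randomize order to counter positional bias
            s_ref, s_cand = judge_fn(reference, candidate)
        else:
            s_cand, s_ref = judge_fn(candidate, reference)
        total += s_cand / max(s_ref, 1e-8)
    return total / n_rounds
```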


Comment 9: How robust is this relative scoring method compared to an absolute grading system?

Response 9: While it is difficult to determine which scoring method is generally more robust, we chose the relative scoring method since it best suits our experimental objectives. Specifically, as studied in recent works [2, 3], relative scoring is effective for comparing and identifying better responses, especially in qualitative assessments. As our experiments include quality evaluations such as helpfulness, relevance, level of detail, and fidelity, we adopted the relative scoring method rather than an absolute grading system.


Comment 10: Does the model support inputs involving multiple molecules at once? The general LLMs such as GPT-4o can process SMILES of multiple molecules.

Response 10: Our model can process and understand multiple molecules, owing to the ability of LLMs to understand diverse contexts. Due to the lack of an appropriate benchmark to assess this capability, we experimentally address the challenge by employing a well-established prompting method: in-context learning with examples.

Experimental Setup

  • For the PAMPA task, we randomly selected k pairs of molecules and their corresponding labels from the training dataset (see the sketch after this list).
  • During inference, we used the same input-label pair examples in the same order across models to minimize other biases such as positional, textual, and molecular biases. This setup allows us to focus solely on evaluating each model’s ability to understand multiple molecules.
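A minimal sketch of how such a k-shot prompt can be assembled (label wording and formatting are illustrative assumptions):

```python
import random

def build_pampa_prompt(query_smiles, train_pairs, k, seed=0):
    """Assemble a k-shot in-context prompt for PAMPA permeability.

    train_pairs: (smiles, label) tuples from the training split; fixing the
    seed keeps the same examples in the same order across models."""
    rng = random.Random(seed)
    shots = rng.sample(train_pairs, k)
    blocks = [
        f"Molecule: {smi}\nPAMPA permeability: {'high' if y else 'low'}"
        for smi, y in shots
    ]
    blocks.append(f"Molecule: {query_smiles}\nPAMPA permeability:")
    return "\n\n".join(blocks)

# Toy usage with hypothetical training pairs:
demo = [("CCO", 1), ("c1ccccc1O", 0), ("CC(=O)Oc1ccccc1C(=O)O", 1)]
print(build_pampa_prompt("CCN", demo, k=2))
```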

Results and Analysis

As shown in Table R4, Mol-LLaMA attains consistent performance improvements as the number of in-context examples increases, demonstrating its capability to effectively process and understand multiple molecules and their corresponding labels. In contrast, GPT-4o does not effectively utilize multiple molecules, with performance declining with one or three in-context examples. We will include these results in the final revision.

| Models | 0-shot | 1-shot | 3-shot | 5-shot |
|---|---|---|---|---|
| Mol-LLaMA | 63.55 | 73.46 | 77.64 | 80.10 |
| GPT-4o | 48.65 | 31.45 | 46.68 | 57.74 |

Table R4: Accuracy on the PAMPA task using different numbers of in-context learning examples.


Comment 11: Given the use of a 3D encoder, how sensitive is the model's performance to different input molecular conformations? How effectively does it handle molecules with high conformational flexibility?

Response 11: To demonstrate robustness to diverse molecular conformations, we evaluate Mol-LLaMA using three different conformations for each molecule, generated by RDKit and OpenBabel. The evaluation is conducted on the representative molecules in Table 3, which have an average of 11.11 rotatable bonds and can thus adopt diverse conformations. As shown in Table R5, the model's performance remains largely consistent across conformations overall, indicating strong robustness. We will also add these results in the final revision.

| Models | Help. | Relev. | Acc. | Details | Overall | Help. | Relev. | Acc. | Details | Overall | Help. | Relev. | Acc. | Details | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mol-LLaMA | 1.127 | 1.145 | 1.154 | 1.090 | 1.125 | 1.224 | 1.266 | 1.302 | 1.211 | 1.251 | 1.578 | 1.840 | 2.030 | 1.528 | 1.744 |
| Diverse Conf | 1.058 | 1.070 | 1.112 | 1.010 | 1.065 | 1.171 | 1.225 | 1.301 | 1.138 | 1.213 | 1.641 | 1.895 | 2.203 | 1.577 | 1.841 |

Table R5: Quantitative evaluation with diverse conformations.
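For reference, the RDKit side of such a robustness check can be reproduced along these lines (a minimal sketch; the molecule and parameters are illustrative):

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Generate three distinct 3D conformers for one molecule (aspirin here).
mol = Chem.AddHs(Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O"))
conf_ids = AllChem.EmbedMultipleConfs(mol, numConfs=3, randomSeed=42)
for cid in conf_ids:
    AllChem.MMFFOptimizeMolecule(mol, confId=cid)  # relax each conformer
print(f"Generated {len(conf_ids)} conformers")
```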


Due to the character limit, our responses continue below.

Comment

Comment 12: Additional comparison with larger models such as GPT-4.5 and other existing general-purpose (multi-modal) chemical language models.

Response 12: We sincerely appreciate your valuable suggestion. Unfortunately, GPT-4.5 is not available via API and has usage limitations on ChatGPT. Therefore, we use GPT-4.1, the most recent accessible version apart from GPT-4.5. Additionally, we compare Mol-LLaMA with the suggested models, including ChemLLM-SFT [4], ChemLLM-DPO [4], and ChemVLM [5]. As shown in Table R6, Mol-LLaMA outperforms open-source molecular LLMs such as ChemLLM and ChemVLM, and exceeds GPT-4.1 in biological understanding, showing its effectiveness. We will add these results in the final revision.

| Models | Help. | Relev. | Acc. | Details | Overall | Help. | Relev. | Acc. | Details | Overall | Help. | Relev. | Acc. | Details | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ChemLLM-7B-1.5-SFT | 0.226 | 0.239 | 0.176 | 0.178 | 0.211 | 0.263 | 0.285 | 0.220 | 0.204 | 0.248 | 0.321 | 0.348 | 0.317 | 0.289 | 0.320 |
| ChemLLM-7B-1.5-DPO | 0.202 | 0.196 | 0.163 | 0.171 | 0.190 | 0.213 | 0.212 | 0.185 | 0.205 | 0.200 | 0.267 | 0.260 | 0.255 | 0.265 | 0.258 |
| ChemVLM-8B | 0.214 | 0.240 | 0.165 | 0.129 | 0.205 | 0.220 | 0.236 | 0.172 | 0.133 | 0.208 | 0.241 | 0.263 | 0.290 | 0.188 | 0.248 |
| GPT-4.1 | 1.153 | 1.122 | 1.184 | 1.181 | 1.159 | 1.188 | 1.184 | 1.229 | 1.238 | 1.205 | 1.320 | 1.396 | 1.564 | 1.342 | 1.407 |
| Mol-LLaMA | 1.127 | 1.145 | 1.154 | 1.090 | 1.125 | 1.224 | 1.266 | 1.302 | 1.211 | 1.251 | 1.578 | 1.840 | 2.030 | 1.528 | 1.744 |

Table R6: Quantitative evaluation with additional baselines.


[2] Levtsov, Georgii, and Dmitry Ustalov. "Confidence and Stability of Global and Pairwise Scores in NLP Evaluation." arXiv preprint arXiv:2507.01633 (2025).

[3] Liusie, Adian, Potsawee Manakul, and Mark JF Gales. "LLM comparative assessment: Zero-shot NLG evaluation through pairwise comparisons using large language models." arXiv preprint arXiv:2307.07889 (2023).

[4] Zhang, Di, et al. "Chemllm: A chemical large language model." arXiv preprint arXiv:2402.06852 (2024).

[5] Li, Junxian, et al. "Chemvlm: Exploring the power of multimodal large language models in chemistry area." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 39. No. 1. 2025.


We appreciate your time and effort in reviewing our paper and hope that our responses and additional results have addressed your concerns and provided further clarity. If you have further questions, we would be pleased to respond.

Comment

Thank you for your detailed response. The additional experiments you provided have largely addressed my previous concerns.

However, I still hold the reservation that the proposed method may not be the most effective path toward creating a truly general-purpose foundational model for chemistry. In my view, more convincing routes toward that goal would likely involve continued pre-training on large-scale chemical and biological corpora or learning from real-world feedback. For this reason, I believe the long-term significance of this specific approach is somewhat limited.

Nevertheless, considering the strong performance demonstrated across multiple tasks in the manuscript, I raise my rating to 4.

Review (Rating: 4)

In this work, the authors propose Mol-LLaMA, which could potentially enhance the understanding of molecular features with strong reasoning and explainability. It introduces a novel instruction dataset that includes detailed structural descriptions, structure-to-feature relationship explanations, and comprehensive conversations. The model also integrates complementary information from different molecular encoders, improving the understanding of molecular structures and features. Overall, Mol-LLaMA outperforms existing models in predicting and explaining molecular properties, potentially accelerating scientific discovery in chemistry and biology.

Strengths and Weaknesses

Strengths:

  1. The provided chemistry dataset includes detailed structural descriptions and structure-to-feature relationship explanations.
  2. The blending module integrates information from different molecular encoders, improving the accuracy of molecular analysis.
  3. Mol-LLaMA outperforms existing models in predicting and explaining molecular properties, showcasing its potential to accelerate scientific discovery in chemistry and biology.

Weaknesses:

  1. There is a concern about whether LLMs are fundamentally suitable for certain tasks in the benchmark, such as molecular property prediction, which is why only LLM results were presented.
  2. The molecule-to-text evaluation relies exclusively on text similarity metrics (BLEU, ROUGE, METEOR), which are inadequate for assessing chemical factual accuracy.
  3. The evaluation of LLMs based on helpfulness and relevance scores raises doubts about whether these scores accurately reflect the correctness of the information.
  4. The multimodal model and multimodal text data for molecules appear to lack originality.

[1] Liu P, Ren Y, Tao J, et al. GIT-Mol: A multi-modal large language model for molecular science with graph, image, and text. Computers in Biology and Medicine, 2024, 171: 108073.

[2] Liu S, Nie W, Wang C, et al. Multi-modal molecule structure–text model for text-based retrieval and editing. Nature Machine Intelligence, 2023, 5(12): 1447-1457.

Questions

  1. Use more nuanced evaluation criteria that consider both text similarity and chemical factual accuracy to provide a more balanced assessment of the model's performance.
  2. Include a wider variety of molecular structures and properties in the instruction dataset to ensure comprehensive coverage, like the QM9 [1] or GEOM [2] datasets.

[1] Ramakrishnan R, Dral P O, Rupp M, et al. Quantum chemistry structures and properties of 134 kilo molecules[J]. Scientific data, 2014, 1(1): 1-7.

[2] Axelrod S, Gomez-Bombarelli R. GEOM, energy-annotated molecular conformations for property prediction and molecular generation[J]. Scientific Data, 2022, 9(1): 185.

Limitations

Yes

Justification for Final Rating

The authors effectively addressed concerns regarding the suitability of LLMs for molecular property prediction by highlighting Mol-LLaMA’s strengths in interpretability, flexible interaction, and fine-tuning capabilities, supported by empirical results.

Formatting Issues

No

Author Response

We sincerely thank you for your constructive and helpful comments. We appreciate the positive comments.

  • Constructed instruction dataset includes detailed structural descriptions and structure-to-feature relationship explanations.
  • The proposed blending module improves the accuracy of molecular analysis.
  • Mol-LLaMA outperforms in predicting and explaining molecular properties, showing its potential to accelerate scientific discovery.

We initially address your concerns below.


Comment 1: There is a concern about whether LLMs are fundamentally suitable for certain tasks in the benchmark, such as molecular property prediction.

Response 1: LLMs, and specifically Mol-LLaMA, exhibit significant advantages for scientific domains in three key areas: 1) interpretability, 2) flexible interaction, and 3) fine-tuning capability.

Interpretability

LLMs, including Mol-LLaMA, generate interpretable outputs with detailed explanations, which is critical for scientific applications such as drug discovery, where reliability and transparency are essential. In contrast, traditional predictive models lack interpretability, limiting their practical application. In particular, Mol-LLaMA exhibits interpretability and reasoning capabilities for molecules, providing relevant, helpful, and detailed responses, as shown in Tables 3 and 4. This interpretability can accelerate scientific discovery by fostering trust and insight.

Flexible Interaction

LLMs, and Mol-LLaMA in particular, support interactive and adaptive use cases, such as user-driven prompting and the integration of external domain knowledge. This flexibility is valuable for addressing complex scientific problems: one might provide additional molecule- or task-specific information via prompts to guide the model's predictions, enhancing both performance and interpretability. Mol-LLaMA is particularly effective in this setting, as shown in Table 4; it achieves a significant performance gain and the best overall performance when prompted with task-specific information, while maintaining helpful and interpretable responses.

Fine-tuning Capability

Open-source LLMs can be specialized for specific tasks via fine-tuning. In Table 13, we demonstrate that Mol-LLaMA, fine-tuned on property prediction tasks, significantly outperforms the state-of-the-art non-LLM model (UniMol). This result highlights promising opportunities for further utilization of Mol-LLaMA via fine-tuning. Please note that all datasets, code, and model weights will be publicly released.


Comment 2: The molecule-to-text evaluation relies exclusively on text similarity metrics (BLEU, ROUGE, METEOR), which are inadequate for assessing chemical factual accuracy.

Response 2: In the main experiments of Tables 3 and 4, we do not use similarity metrics such as BLEU, ROUGE, or METEOR; instead, we leverage LLM-as-a-judge to evaluate not only factual accuracy but also quality and informativeness by assessing textual semantics. In Table 12, we strictly follow the evaluation settings of the benchmark to ensure fair comparison with the baselines, showing the superior performance of Mol-LLaMA in task transfer scenarios.


Comment 3: The evaluation of LLMs based on helpfulness and relevance scores raises doubts about whether these scores accurately reflect the correctness of the information.

Response 3: The helpfulness and relevance scores depend on correctness, since these metrics inherently account for factual accuracy. We experimentally validate this by intentionally generating factually incorrect outputs. Specifically, given a pair of the $i$-th molecule $X_i$ and its generated structural description $D_i$, we instruct GPT-4o to evaluate the quality of a permuted pair, defined as $(X_i, D_j)$ with $i \neq j$. As shown in Table R1, factually incorrect outputs (Permuted) significantly degrade overall performance, including helpfulness and relevance, indicating that these metrics are sensitive to factual correctness.

|  | Help | Relev | Acc | Details | Overall |
|---|---|---|---|---|---|
| Original | 1.126 | 1.145 | 1.154 | 1.090 | 1.125 |
| Permuted | 0.353 | 0.299 | 0.219 | 0.346 | 0.300 |

Table R1: Quantitative results on structural understanding
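A minimal sketch of how such mismatched pairs can be constructed, using a derangement so that no description stays with its own molecule (illustrative, assuming more than one sample):

```python
import random

def permute_descriptions(descriptions, seed=0):
    """Return descriptions reordered so index i never maps to itself,
    pairing each molecule X_i with some D_j, j != i."""
    rng = random.Random(seed)
    n = len(descriptions)
    while True:  # rejection-sample a derangement (requires n > 1)
        perm = list(range(n))
        rng.shuffle(perm)
        if all(i != j for i, j in enumerate(perm)):
            return [descriptions[j] for j in perm]
```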


Comment 4: The multimodal model and multimodal text data for molecules appear to lack originality. [1, 2]

Response 4: Compared to the referenced works, our work makes significant contributions in three respects: 1) a dataset construction pipeline tailored to molecules, 2) a model architecture that effectively combines different representations, and 3) application-level contributions.

First, we propose a dataset construction pipeline for molecular understanding, which is not explored in the reference works. We observe that relying on public databases, including the datasets used in [1, 2], limits the knowledge scope of trained models, which consequently generate irrelevant responses, make wrong predictions, and offer limited interpretability [lines 46-52]. To address this limitation, building on our key insight into the hierarchical relationships between molecular structures and their properties [lines 132-136], we design three data types and construct a large, comprehensive instruction dataset. Our dataset not only facilitates an understanding of general molecular features but also enhances explainability and reasoning capabilities [lines 138-159], a substantial contribution over the multimodal text data in [1, 2].

Second, we propose a model architecture that effectively integrates the complementary information from different, essential molecular representations. The main architectural contributions are summarized as follows:

  • We utilize both 2D and 3D molecular representations, as each provides distinct, essential information for modeling molecular structures [lines 167-172]. In contrast, the reference work [1] employs 2D molecular structures together with images of those 2D structures, where the images add no information beyond the 2D structure itself.
  • Our proposed blending module effectively integrates the distinct advantages of the 2D and 3D molecular encoders [lines 173-178], whereas the reference work [1] uses a simple concatenation, which cannot fully leverage the complementary information from different molecular representations, as experimentally validated in Table 6 (Right, 2D+3D (Concat)). A rough sketch contrasting the two fusion strategies follows this list.
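For intuition only, the sketch below contrasts naive concatenation with a learned blend; the cross-attention design is an assumption for illustration and may differ from the module described in the paper.

```python
# Illustrative contrast between naive concatenation and a learned blending
# module for 2D/3D molecular token embeddings (assumed cross-attention design;
# the paper's actual module may differ).
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    def forward(self, h2d: torch.Tensor, h3d: torch.Tensor) -> torch.Tensor:
        # (B, N2, D) and (B, N3, D) -> (B, N2 + N3, D); no cross-modal
        # interaction is learned.
        return torch.cat([h2d, h3d], dim=1)

class BlendingModule(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn_2d_to_3d = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_3d_to_2d = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, h2d: torch.Tensor, h3d: torch.Tensor) -> torch.Tensor:
        # Each modality queries the other, so topological (2D) and geometric
        # (3D) features are mixed before being passed to the Q-Former.
        from_3d, _ = self.attn_2d_to_3d(h2d, h3d, h3d)
        from_2d, _ = self.attn_3d_to_2d(h3d, h2d, h2d)
        return torch.cat([h2d + from_3d, h3d + from_2d], dim=1)
```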

Third, our work and our multimodal model, Mol-LLaMA, offer distinct practical advantages:

  • Model Abilities: Mol-LLaMA can play diverse roles for users and thus facilitate scientific discovery, whereas the reference work [2] is specialized for a few tasks due to its limited model architecture. Please note that the reference work [2] (i.e., MoleculeSTM) is one of our molecular encoders [lines 171-172].
  • Further Utilization: Our dataset construction pipeline and model architecture can be further utilized for other biomolecules, including RNAs, proteins, and their complexes [lines 299-309], which are fundamental components for understanding living organisms.

[1] Liu, Pengfei, et al. "Git-mol: A multi-modal large language model for molecular science with graph, image, and text." Computers in biology and medicine 171 (2024): 108073.

[2] Liu, Shengchao, et al. "Multi-modal molecule structure–text model for text-based retrieval and editing." Nature Machine Intelligence 5.12 (2023): 1447-1457.


Comment 5: Include a wider variety of molecular structures and properties in the instruction dataset to ensure comprehensive coverage like QM9 [3] or GEOM datasets. [4]

Response 5: Our dataset includes 320K molecular structures derived from PubChem, providing extensive coverage comparable to QM9 (133K) and GEOM (450K). However, unlike QM9 and GEOM, which primarily annotate highly specific numerical properties (e.g., quantum chemical properties in QM9 and BBBP or BACE-1 binding affinity in GEOM), our dataset focuses on general molecular properties described textually. The narrow and numeric-specific annotations in QM9 and GEOM limit their applicability for instructional purposes, as discussed in lines 118-125. By leveraging PubChem’s broad textual descriptions, our dataset effectively captures a comprehensive spectrum of molecular properties and ensures suitability for diverse instructional tasks.

[3] Ramakrishnan, Raghunathan, et al. "Quantum chemistry structures and properties of 134 kilo molecules." Scientific data 1.1 (2014): 1-7.

[4] Axelrod, Simon, and Rafael Gomez-Bombarelli. "GEOM, energy-annotated molecular conformations for property prediction and molecular generation." Scientific Data 9.1 (2022): 185.


We thank you for your time and feedback, and we hope that our responses have addressed your concerns. If you have any further questions, we are happy to address them. If your concerns have been appropriately addressed, we would be grateful if you would kindly consider updating the rating accordingly.

Comment

Dear Reviewer JRtJ,

Would you mind checking the authors' replies?

AC

Comment

Dear Reviewer JRtJ,

We sincerely appreciate your time and effort in reviewing our work. To streamline your review, we briefly summarize our rebuttal below.


W1: Importance of LLMs for scientific discovery

Beyond molecular property prediction, LLMs offer capabilities that accelerate scientific discovery: 1) interpretability, 2) flexible interaction, and 3) fine-tuning capability. Mol-LLaMA exhibits all three, and it also outperforms the state-of-the-art non-LLM model (UniMol) on the property prediction task.


W2 & Q1: Reliance on text similarity metrics for evaluation (BLEU, ROUGE, METEOR)

For the main experiments in Tables 3 and 4, we adopt LLM-as-a-judge evaluation to rigorously assess chemical factual accuracy by considering textual semantics. The text similarity metrics in Table 12 are used only to strictly follow the experimental settings of previous works.


W3: Whether qualitative metrics, such as helpfulness and relevance scores, adequately reflect accuracy

An additional experiment (Table R1) showed that factually incorrect responses significantly degrade performance, including helpfulness and relevance scores, indicating that these qualitative metrics do reflect accuracy.


W4: Regarding originality

Our work makes significant contributions, including 1) a novel dataset construction pipeline tailored to molecules, 2) a straightforward yet effective model architecture for combining different representations, and 3) application-level contributions such as versatility and further utilization for other biomolecules such as RNAs and proteins, none of which are covered by the suggested works.


Q2: Wide variety of molecular structures and properties in the instruction dataset

Our instruction dataset contains a number of molecular structures comparable to QM9 and GEOM, annotated with general textual properties that are well studied in the scientific literature, rather than the narrow numerical properties (e.g., quantum properties) found in QM9 and GEOM.


We thank you once again for taking the time to review our work and hope that this summary helps facilitate your review process. As the end of the discussion phase is approaching, we would be grateful if you could let us know whether our responses have addressed your concerns. If there are any additional concerns or questions, we are happy to address them.


Best regards,

Authors

Comment

Dear Reviewer JRtJ,

As the discussion phase will end soon, we would like to kindly remind you of our rebuttal. We have clarified the importance of LLMs and our Mol-LLaMA, the rigor of the metrics used in our paper, the originality of our work, and the diversity of our instruction dataset.

If possible, could you kindly review our rebuttal and consider providing a response to our clarifications? Your feedback would be greatly appreciated.

Sincerely,

Authors

Review
4

This paper introduces Mol-LLaMA, a multimodal LLM designed for general understanding of and reasoning about molecules. The authors create a novel instruction dataset with three types of synthetic data: structural descriptions of functional groups, structure-to-feature relationship explanations, and conversations based on the first two, all generated using GPT-4o.

Architecture-wise, Mol-LLaMA uses 2D (MoleculeSTM) and 3D (UniMol) molecular encoders to capture structural information. These embeddings are aligned through a blending module, and a Q-Former projects the combined embeddings to the LLM.

Detailed experimental results show that Mol-LLaMA outperforms other baselines across various tasks.

Strengths and Weaknesses

Strengths:

  • The authors present a novel and comprehensive instruction dataset that covers structural, chemical, and biological reasoning data about molecules.
  • The proposed 2D-3D blending module effectively aligns information from different molecular encoders, and its benefits over single-encoder approaches are covered in the experimental results.
  • Qualitative and quantitative experimental results show that Mol-LLaMA outperforms baselines on tasks like property prediction and molecular comprehension.

Weaknesses:

  • The instruction dataset is fully synthetic and relies heavily on GPT-4o. Although GPT-4o is used to filter incorrect data, some hallucinations may still persist. A more rigorous evaluation of the dataset quality would strengthen the work.
  • GPT-4o is used both for generating and evaluating the data, which may introduce bias. Including human evaluations would provide a more robust and independent assessment of Mol-LLaMA's capabilities.

Questions

  • Is there any evaluation performed on GPT-4o's accuracy in extracting functional groups and their connectivity information from the IUPAC notation of the molecule when generating the "Detailed Structural Description" data type?

  • Could you include a full ablation study on other benchmarks such as property prediction, MoleculeQA, and 3D-MoIT? This could help to better understand the comprehensive benefits of the Mol-LLaMA architecture and data types.

  • Is there any plan to publicly release the data and code?

Limitations

yes

Final Justification

The authors present a novel method to construct an instruction dataset for molecules, combining structural, chemical, and biological reasoning data. The experimental results are promising and highlight the potential of the method.

While some of my concerns about the evaluation with GPT-4o were addressed, the issues of dataset quality and potential hallucinations due to reliance on IUPAC naming remain unresolved.

Formatting Issues

N/A

Author Response

We sincerely thank you for your constructive and helpful comments, and we appreciate the positive remarks:

  • The proposed instruction dataset is novel and comprehensive, covering structural, chemical, and biological reasoning data.
  • The proposed blending module effectively aligns information from different molecular encoders.
  • Experimental results show that Mol-LLaMA outperforms baselines.

We initially address your concerns below.


Comment 1: The dataset is fully synthetic, relying on GPT-4o.

Response 1: To avoid potential misunderstanding, we would like to clarify that, while our dataset involves synthetic generation with GPT-4o, we mitigate possible hallucinations by grounding molecular feature annotations and IUPAC names using PubChem. This grounding ensures alignment with established chemical knowledge and enhances the reliability of the dataset.


Comment 2: A more rigorous evaluation of the dataset quality beyond using GPT-4o would strengthen the work.

Response 2: We appreciate your valuable suggestion. However, we chose GPT-4o as the primary model because, to the best of our knowledge, it is currently the only LLM whose capabilities in molecular and scientific reasoning have been extensively analyzed and documented [1]. At the time of our study, other LLMs lacked comparably in-depth evaluations in the molecular domain, which limits their reliability as data-quality evaluators.

To validate whether other LLMs actually help to improve the dataset quality, we further filter the data using diverse LLMs (GPT-4o, Qwen3-14B, Gemma-3-12B-IT, and Llama-3.1-8B-Instruct), keeping only samples that receive a score of 4 from every evaluator; a minimal sketch of this consensus filter appears after Table R1. As shown in Table R1, the quadruple filtration (Mol-LLaMA^*) does not bring a significant performance improvement, implying that the other LLMs misjudge correct data as incorrect, removing it and narrowing the knowledge scope of the trained model. We will add these experimental results in the final revision.

Even though the current LLMs have limitations in evaluating data quality, our dataset construction pipeline is readily adaptable with emerging LLMs that exhibit strong domain expertise in molecular sciences, which is a notable contribution for building molecular LLMs.

| Models | Help. | Relev. | Acc. | Details | Overall | Help. | Relev. | Acc. | Details | Overall | Help. | Relev. | Acc. | Details | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mol-LLaMA | 1.126 | 1.145 | 1.154 | 1.090 | 1.125 | 1.224 | 1.266 | 1.302 | 1.211 | 1.251 | 1.578 | 1.840 | 2.030 | 1.528 | 1.744 |
| Mol-LLaMA^* | 1.004 | 1.024 | 1.035 | 0.941 | 1.006 | 1.143 | 1.193 | 1.255 | 1.113 | 1.179 | 1.605 | 1.858 | 2.097 | 1.569 | 1.801 |

Table R1: Quantitative evaluation on structural (left 5 criteria), chemical (middle 5 criteria), and biological (right 5 criteria) understanding.
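For reference, the quadruple filtration above amounts to a simple consensus filter; the sketch below is illustrative, with `score_with` a hypothetical helper that returns an integer score of 1-4 from the named judge model.

```python
# Sketch of the multi-judge consensus filter: keep a sample only if every
# evaluator LLM assigns the maximum score of 4. `score_with(judge, sample)`
# is a hypothetical helper returning an integer score of 1-4.
JUDGES = ["gpt-4o", "Qwen3-14B", "Gemma-3-12B-IT", "Llama-3.1-8B-Instruct"]

def consensus_filter(samples, score_with):
    kept = []
    for sample in samples:
        if all(score_with(judge, sample) == 4 for judge in JUDGES):
            kept.append(sample)
    return kept
```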

[1] Microsoft Research AI4Science, and Microsoft Azure Quantum. "The impact of large language models on scientific discovery: a preliminary study using gpt-4." arXiv preprint arXiv:2311.07361 (2023).


Comment 3: GPT-4o is used both for data generation and performance evaluation, which may introduce bias.

Response 3: Thank you for your valuable suggestion. To robustly evaluate the quality of generated responses, we report the average of all metrics measured by GPT-4o, Qwen3-14B, Gemma-3-12B-IT, and Llama-3.1-8B-Instruct. As shown in Table R2, Mol-LLaMA continues to outperform all baselines, including GPT-4o, and the overall trend remains consistent with Table 3, indicating that the other LLMs' evaluations agree with GPT-4o's. We will include these results in the final revision.

| Models | Help. | Relev. | Acc. | Details | Overall | Help. | Relev. | Acc. | Details | Overall | Help. | Relev. | Acc. | Details | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Llama-2-based** | | | | | | | | | | | | | | | |
| Llama-2-7B-Chat | 0.372 | 0.374 | 0.264 | 0.359 | 0.328 | 0.534 | 0.520 | 0.405 | 0.551 | 0.492 | 0.479 | 0.408 | 0.345 | 0.598 | 0.441 |
| Mol-Instructions | 0.217 | 0.239 | 0.252 | 0.131 | 0.202 | 0.242 | 0.274 | 0.301 | 0.145 | 0.229 | 0.319 | 0.376 | 0.416 | 0.201 | 0.315 |
| LlasMol | 0.278 | 0.306 | 0.304 | 0.204 | 0.265 | 0.295 | 0.335 | 0.312 | 0.215 | 0.280 | 0.332 | 0.378 | 0.443 | 0.260 | 0.341 |
| 3D-MoLM | 0.674 | 0.658 | 0.564 | 0.719 | 0.643 | 0.784 | 0.788 | 0.701 | 0.798 | 0.762 | 0.898 | 0.945 | 0.953 | 0.911 | 0.923 |
| LLaMo | 0.306 | 0.407 | 0.443 | 0.179 | 0.312 | 0.345 | 0.464 | 0.524 | 0.197 | 0.356 | 0.454 | 0.630 | 0.802 | 0.236 | 0.490 |
| Mol-LLaMA | 1.042 | 1.063 | 1.063 | 0.960 | 1.038 | 1.124 | 1.179 | 1.212 | 1.044 | 1.146 | 1.376 | 1.607 | 1.683 | 1.167 | 1.468 |
| **Llama3.1-based** | | | | | | | | | | | | | | | |
| Llama3.1-8B | 0.669 | 0.680 | 0.560 | 0.648 | 0.634 | 0.698 | 0.693 | 0.600 | 0.695 | 0.667 | 0.705 | 0.663 | 0.617 | 0.732 | 0.674 |
| Mol-Instructions | 0.261 | 0.334 | 0.369 | 0.152 | 0.263 | 0.265 | 0.350 | 0.396 | 0.149 | 0.269 | 0.358 | 0.468 | 0.577 | 0.191 | 0.367 |
| 3D-MoLM | 0.835 | 0.874 | 0.794 | 0.782 | 0.817 | 0.902 | 0.973 | 0.902 | 0.829 | 0.900 | 1.114 | 1.295 | 1.346 | 0.970 | 1.180 |
| LLaMo | 0.480 | 0.616 | 0.560 | 0.301 | 0.473 | 0.401 | 0.534 | 0.560 | 0.245 | 0.414 | 0.596 | 0.788 | 0.834 | 0.311 | 0.603 |
| Mol-LLaMA | 1.052 | 1.081 | 1.092 | 0.964 | 1.049 | 1.139 | 1.192 | 1.225 | 1.068 | 1.163 | 1.459 | 1.699 | 1.801 | 1.210 | 1.555 |
| Mol-LLaMA^* | 1.001 | 1.026 | 1.030 | 0.911 | 0.997 | 1.113 | 1.170 | 1.205 | 1.034 | 1.139 | 1.421 | 1.640 | 1.741 | 1.223 | 1.522 |

Table R2: Quantitative evaluation on structural (left 5 criteria), chemical (middle 5 criteria), and biological (right 5 criteria) understanding using diverse evaluators.


Comment 4: Including human evaluations would provide a more robust and independent assessment of Mol-LlaMA’s capabilities.

Response 4: Unfortunately, conducting human evaluation is difficult due to the high complexity of accurately assessing molecular properties. Accurate and reliable evaluations of model-generated responses in the scientific domain require deep interdisciplinary expertise spanning chemistry, biochemistry, biology, and pharmacology. Furthermore, models may generate correct responses that are not included in the annotated descriptions as shown in Table 11, potentially introducing evaluator bias and making an accurate assessment challenging. Consequently, gathering evaluators possessing the required interdisciplinary expertise poses significant practical difficulties.

As an alternative, we rigorously evaluate Mol-LLaMA across diverse and independent tasks, providing a robust demonstration of its capabilities. Specifically, Mol-LLaMA demonstrates its capabilities on understanding general molecular features including structural, chemical, and biological features (Tables 3 and R2 of Response 3), basic and quantum property prediction (Tables 4 and 13), specific biological property prediction (Table 8), and molecular comprehension benchmarks (Tables 5 and 12).


Comment 5: Is there any evaluation performed on GPT-4o's accuracy in extracting functional groups and their connectivity information from the IUPAC notation of the molecule when generating the "Detailed Structural Description" data type?

Response 5: To the best of our knowledge, there is currently no benchmark for evaluating the accuracy of understanding molecular structures from IUPAC names, as annotating IUPAC names requires specialized expertise. Meanwhile, we chose to use IUPAC names and GPT-4o based on experimental analyses reported in the literature. Microsoft Research AI4Science and Microsoft Azure Quantum teams [1] conducted an extensive evaluation of GPT-4 within scientific contexts, providing informative analyses that have not been conducted for other LLMs. Additionally, the work [1] has shown that IUPAC names are the most interpretable string representations, as their tokenization provides subword semantics of molecular substructures. Therefore, our selection of both IUPAC names and GPT-4o is grounded in empirical evidence and scientific relevance.


Comment 6: A full ablation study on other benchmarks.

Response 6: The primary aim of the ablation study on the PAMPA task is to evaluate the knowledge and reasoning capabilities required for predicting molecular properties. Therefore, we focused on data types that significantly influence these capabilities of LLMs.

We further conduct an ablation study in the task transfer scenario using MoleculeQA. As shown in Table R3, each data type contributes incrementally to performance, with steady improvements across data types (i.e., S → S+F → Full). Table R4 further demonstrates that incorporating different molecular representations and using the blending module improves overall performance, suggesting that combining 2D and 3D representations is beneficial and that the blending module effectively integrates their complementary information. Overall, the impact of data types on performance gains is greater than that of model architecture, highlighting the importance of instruction dataset quality in task transfer scenarios. We will add these experimental results in the final revision.

| Method | Struct. | Source | Prop. | App. | Total |
|---|---|---|---|---|---|
| S | 72.74 | 71.41 | 49.36 | 47.27 | 66.84 |
| S+F | 76.33 | 72.19 | 46.31 | 45.91 | 68.42 |
| Conv. | 72.53 | 71.41 | 48.88 | 45.46 | 66.48 |
| Full | 77.81 | 75.50 | 49.63 | 49.30 | 70.76 |

Table R3: Ablation study of data types on MoleculeQA

| Method | Struct. | Source | Prop. | App. | Total |
|---|---|---|---|---|---|
| 2D | 72.26 | 73.79 | 49.59 | 47.75 | 67.21 |
| 3D | 76.20 | 74.42 | 50.94 | 50.45 | 69.93 |
| 2D+3D (Concat) | 76.77 | 75.61 | 49.56 | 48.58 | 70.14 |
| 2D+3D (Blended) | 77.81 | 75.50 | 49.63 | 49.30 | 70.76 |

Table R4: Ablation study of blending module on MoleculeQA


Comment 7: Is there any plan to publicly release the data and code?

Response 7: We will publicly release all materials, including data, code, and model weights, to be broadly utilized for the scientific domain.


We thank you for your time and feedback, and we hope that our responses have addressed your concerns. If you have any further questions, we are happy to address them. If your concerns have been appropriately addressed, we would be grateful if you would kindly consider updating the rating accordingly.

Comment

Thanks for your response and ablation studies.

I understand that GPT-4's capabilities for drug discovery tasks are well studied, and using IUPAC for data generation makes sense since it's more interpretable than SMILES. However, generating structural information from IUPAC using GPT-4o is prone to hallucinations. Since this is the first step in the data generation pipeline, there's a high chance of errors and hallucinated structures in the dataset. Additional analysis - ideally by experts - on the generated data could help improve the dataset's overall quality and reliability.

Comment

We sincerely appreciate your insightful feedback and engagement in the discussion phase. We address your concerns below.


Comment 8: Since this is the first step in the data generation pipeline, there's a high chance of errors and hallucinated structures in the dataset.

Response 8: We believe that our dataset construction pipeline and the resulting dataset make significant contributions to the field, as follows:

  • Model-Agnostic Nature: Our dataset construction pipeline is model-agnostic and can be further utilized with upcoming LLMs that exhibit advanced scientific knowledge.
  • Further Utilization: Our dataset construction pipeline is suited for generating instruction data of biomolecules, where domain-specific knowledge and reasoning are required.
  • Enhancement of understanding and reasoning: The resulting dataset improves the understanding of molecular features and enhances explainability and reasoning capabilities. As a result, our model outperforms existing molecular LLMs, including GPT-4o, highlighting the strength of our datasets.

Comment 9: Additional analysis - ideally by experts - on the generated data could help improve the dataset's overall quality and reliability.

Response 9: We perform an additional analysis by conducting a human evaluation on a subset of our instruction dataset. Specifically, we asked nine experts specializing in chemistry and biology to evaluate the accuracy of fifteen samples of the detailed structural descriptions in our instruction dataset. In Table R5, we report Pearson and Spearman correlation coefficients between the average expert scores and those assigned by GPT-4o. The high correlation indicates that GPT-4o's evaluations closely align with those of experts, demonstrating the reliability of GPT-4o's evaluation. Furthermore, the samples that received a score of 4 from GPT-4o had an average expert score of 3.71, validating the quality and reliability of our instruction dataset.

| Pearson | $p$-value | Spearman | $p$-value |
|---|---|---|---|
| 0.96 | $2.68\times10^{-8}$ | 0.91 | $3.43\times10^{-6}$ |

Table R5: Pearson and Spearman correlation coefficients between human and GPT-4o scores.
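Such correlations are straightforward to compute; the sketch below uses `scipy.stats`, with placeholder score arrays standing in for the fifteen real expert and GPT-4o scores.

```python
# Correlation between averaged expert scores and GPT-4o scores for the same
# fifteen samples (the arrays below are placeholders, not the real data).
from scipy.stats import pearsonr, spearmanr

expert = [3.9, 3.4, 2.1, 3.8, 1.7, 3.6, 2.9, 3.7, 1.2, 3.5, 2.4, 3.8, 3.1, 1.9, 3.6]
gpt4o  = [4.0, 3.0, 2.0, 4.0, 2.0, 4.0, 3.0, 4.0, 1.0, 4.0, 2.0, 4.0, 3.0, 2.0, 4.0]

r, r_p = pearsonr(expert, gpt4o)
rho, rho_p = spearmanr(expert, gpt4o)
print(f"Pearson r={r:.2f} (p={r_p:.2g}), Spearman rho={rho:.2f} (p={rho_p:.2g})")
```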


We sincerely thank you again for your time and effort in reviewing our paper, and hope our responses and the expert evaluation results address your concerns. If you have further feedback and questions, we would be pleased to respond.

Comment

Dear Reviewer QhZK,

We would like to politely draw your attention to our recent responses regarding the reliability of our instruction dataset. In these responses, we demonstrated the reliability of our dataset through human evaluation results and clarified that both our dataset construction pipeline and the resulting dataset make significant contributions.

As the discussion phase is coming to an end, we kindly ask that you review our latest responses. We sincerely thank you again for taking the time to review our paper.

Sincerely,

Authors

Final Decision

The paper introduces Mol-LLaMA, a large molecular language model designed to advance general molecular understanding through a novel instruction dataset and a multimodal architecture integrating 2D and 3D molecular encoders. Reviewers generally agree on the strengths of the work, including the comprehensiveness and novelty of the dataset, the effectiveness of the proposed blending module, and the model’s strong empirical performance across multiple benchmarks. While some concerns were raised regarding reliance on GPT-4o for both data generation and evaluation, the authors provided extensive rebuttals, including additional experiments, multi-model comparisons, and limited human evaluation, which addressed most of the critical points.