MolTextQA: A Curated Question-Answering Dataset and Benchmark for Molecular Structure-Text Relationship Learning
A benchmarking dataset for molecule structure - text relationship learning
Abstract
Reviews and Discussion
This paper introduces a dataset of approximately 500,000 question-answer pairs centered on molecule structure-based questions and textual description-based molecule retrieval, derived from around 240,000 molecules in PubChem. The authors utilize this dataset to benchmark multiple models, including multi-modal architectures and large language models. Notably, they highlight the high performance of Galactica and MoleculeSTM in molecule QA and retrieval tasks, respectively.
Strengths
The paper contributes a novel dataset focused on molecule structure-based questions and textual description-based molecule retrieval tasks. Additionally, several scientific large language models are evaluated on this new dataset, providing initial benchmarks.
Weaknesses
- The writing of the paper requires significant improvement, as it is currently challenging to follow. The paper includes redundant sections while lacking clarity and justifications in key areas.
- Acronyms are defined multiple times throughout the paper; for example, "QA" is defined five times and "LLMs" four times.
- Several statements lack clarity and justification. For example, in line 042, "deep learning methods" is vague in context. Lines 044-046 lack justification, and line 053 references "a surge in the development of models to decipher the complex relationships between molecular structures and textual descriptions" without specifying which models are being referred to.
- The motivation is unclear. Line 073 mentions, "Many existing datasets are based on free-form text generation or molecule/text retrieval, complicating the assessment of a model’s capability to infer specific molecular properties." However, it is unclear which datasets are being referenced. Although the authors mention that Section 2.2 elaborates on current methodological shortcomings, the motivation for proposing a new QA dataset should be explicitly outlined in the introduction, where it is currently missing.
- The authors state, "commonly used evaluation metrics such as the BLEU score are not entirely reliable for this context." However, BLEU scores are typically not used for molecule/text retrieval tasks, which brings into question the relevance of this statement.
- The benchmarking experiments need to add several recent baselines, such as BioT5, BioT5+, and MolecularGPT.
- The authors claim to have conducted "a comprehensive validation process that includes human annotation of a small subset to evaluate data accuracy." However, no details are provided on the human annotation process.
- The paper asserts that it provides "valuable insights into the advantages and limitations of current models," but these insights are not clearly presented.
Questions
- Please clarify the statements mentioned above.
- How were the human annotations conducted?
- What specific advantages and limitations of current models does the analysis reveal?
Ethics Review Details
The authors propose new datasets with human annotations; however, the human annotation details are missing.
We thank the reviewer for their review. We would like to address the questions and weaknesses raised.
- All redefinitions of acronyms have been removed.
- Lines 44–46 requiring justification: The statement, "identifying the potential side effects of a drug might depend heavily on analyzing narrative case studies and patient reports," has been clarified in the updated introduction. Specifically, we refer to databases such as clinicaltrials.gov [1] and FDALabels [2], which include detailed information about molecular structures and biological applications, such as therapeutic uses and adverse reactions in patients. The objective is to use the information present in such databases to enable learning relationships between molecular structure and critical chemical/biological information available in text. These references have been incorporated into the revised introduction to explicitly support this point.
- Unclear motivation: The introduction has been expanded (lines 72–82 in the updated draft) to better highlight the motivation for this work. Additionally, Section 2.2 provides a comprehensive comparison of our dataset with existing ones, including specific examples and shortcomings, as summarized in Table 1. Appendix C further elaborates on these limitations with concrete examples to provide additional context. Here is the paragraph from the revised introduction with citations included:
Despite significant progress in model development, several challenges persist in the evaluation of these models. Existing datasets, such as those in [3], [4], [5], and [6], often rely on free-form text generation or molecule/text retrieval tasks, which hinder the assessment of a model’s ability to infer specific molecular properties. These datasets typically use generic prompts like "Describe the molecule", which fail to elicit precise information. A more effective approach would involve targeted questions, such as "What is the physical state of the molecule at room temperature?". Additionally, widely used evaluation metrics, such as the BLEU score in molecule captioning tasks, are inadequate for this context. Since answers to vague prompts like "Describe the molecule" can vary significantly, ranging from physical properties to industrial applications, BLEU’s reliance on surface n-gram overlap makes it unreliable. Further details on these methodological shortcomings are provided in Section 2.
- Use of BLEU scores: BLEU scores are not typically used for molecule/text retrieval tasks, and text retrieval is not the primary focus of this paper. Instead, our main objective is to evaluate structured inference for directed questions. While question answering (QA) cannot be directly applied to models like CLIP-based frameworks (e.g., MoleculeSTM [5]), we adapted our evaluation to include retrieval for standardizing comparisons between LLMs and multimodal architectures.
Although BLEU scores have been used as evaluation metrics in prior works such as [6][7][8][9][10][11], they are often unreliable due to their lack of specificity. In contrast, our multiple-choice evaluation approach provides precise metrics for model evaluation. Furthermore, our dataset includes answer options in sentence form for those preferring text-based evaluations. We elaborate in Section 2.2 (point 2) and Appendix C with concrete examples demonstrating why BLEU scores are problematic and how our dataset addresses this issue.
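To make the BLEU limitation concrete, below is a small toy illustration (our own example for this discussion, not drawn from the dataset or our evaluation code) in which two factually valid descriptions of the same molecule receive a near-zero BLEU score simply because they share few n-grams:

```python
# Toy example: two valid "Describe the molecule" answers with little n-gram
# overlap. Sentences and tokenization are illustrative placeholders.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "molecule", "is", "a", "colorless", "gas",
              "at", "room", "temperature"]]
candidate = ["it", "is", "widely", "used", "as", "a", "refrigerant",
             "in", "industrial", "settings"]

score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU = {score:.3f}")  # near zero, although both statements could be true
```

A multiple-choice evaluation avoids this failure mode, since correctness reduces to an exact match against the keyed option.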
Please let us know if there are any additional clarity issues or concerns.
[1] CTGov. ClinicalTrials.gov, 2024. URL https://clinicaltrials.gov/.
[3] Kirill Degtyarenko, Paula De Matos, Marcus Ennis, Janna Hastings, Martin Zbinden, Alan McNaught, Rafael Alcántara, Michael Darsow, Mickaël Guedj, and Michael Ashburner. Chebi: a database and ontology for chemical entities of biological interest. Nucleic acids research, 36(suppl_1): D344–D350, 2007.
[4] Bing Su, Dazhao Du, Zhao Yang, Yujie Zhou, Jiangmeng Li, Anyi Rao, Hao Sun, Zhiwu Lu, and Ji-Rong Wen. A molecular multimodal foundation model associating molecule graphs with natural language. arXiv preprint arXiv:2209.05481, 2022.
[5] Shengchao Liu, Weili Nie, Chengpeng Wang, Jiarui Lu, Zhuoran Qiao, Ling Liu, Jian Tang, Chaowei Xiao, and Animashree Anandkumar. Multi-modal molecule structure–text model for text-based retrieval and editing. Nature Machine Intelligence, 5(12):1447–1457, 2023a.
[6] Yin Fang, Xiaozhuan Liang, Ningyu Zhang, Kangwei Liu, Rui Huang, Zhuo Chen, Xiaohui Fan, and Huajun Chen. Mol-instructions: A large-scale biomolecular instruction dataset for large language models. In ICLR. OpenReview.net, 2024. URL https://openreview.net/pdf?id=Tlsdsb6l9n.
[7] Carl Edwards, ChengXiang Zhai, and Heng Ji. Text2mol: Cross-modal molecule retrieval with natural language queries. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 595–607, 2021.
[8] Carl Edwards, Tuan Lai, Kevin Ros, Garrett Honke, Kyunghyun Cho, and Heng Ji. Translation between molecules and natural language. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 375–413, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. URL https://aclanthology.org/2022.emnlp-main.26.
[9] Qizhi Pei, Wei Zhang, Jinhua Zhu, Kehan Wu, Kaiyuan Gao, Lijun Wu, Yingce Xia, and Rui Yan. Biot5: Enriching cross-modal integration in biology with chemical knowledge and natural language associations. arXiv preprint arXiv:2310.07276, 2023.
[10] Qizhi Pei, Lijun Wu, Kaiyuan Gao, Xiaozhuan Liang, Yin Fang, Jinhua Zhu, Shufang Xie, Tao Qin, and Rui Yan. Biot5+: Towards generalized biological understanding with iupac integration and multi-task tuning. arXiv preprint arXiv:2402.17810, 2024.
[11] Sihang Li, Zhiyuan Liu, Yanchen Luo, Xiang Wang, Xiangnan He, Kenji Kawaguchi, Tat-Seng Chua, and Qi Tian. 3d-molm: Towards 3d molecule-text interpretation in language models. In ICLR, 2024. URL https://openreview.net/forum?id=xI4yNlkaqh.
Thank you for bringing this to our attention. We have now included the BioT5 and BioT5-plus models in our benchmark and updated the discussions accordingly. Below is a summary of the metrics:
Molecule QA (accuracy, %):
| Model | Entire Dataset | Physical Properties | Chemical Info | Biological Info | Sources | Uses |
|---|---|---|---|---|---|---|
| BioT5 | 23.54 | 39.79 | 23.89 | 23.36 | 18.18 | 41.18 |
| BioT5-plus | 23.00 | 32.31 | 23.86 | 21.32 | 19.87 | 29.95 |
Molecule Retrieval (accuracy, %):
| Model | Entire Dataset | Physical Properties | Chemical Info | Biological Info | Sources | Uses |
|---|---|---|---|---|---|---|
| BioT5 | 23.34 | 37.63 | 21.29 | 21.14 | 17.26 | 33.16 |
| BioT5-plus | 22.30 | 31.24 | 21.28 | 19.46 | 17.75 | 27.72 |
Additionally, regarding MolecularGPT, we note that this model is primarily designed for property prediction tasks and not for molecule captioning or text retrieval. As the focus of our paper is on directed question answering, MolecularGPT is not directly applicable in this context.
We elaborated on the human evaluation procedure in Section 3.5 and Appendix E. Further, all the evaluation examples, along with the reasons for rejection, were provided with the dataset and benchmark repository at this link.
Below, we provide a detailed explanation of our evaluation process:
Evaluation Criteria
Our human evaluation involved a systematic review of 400 randomly sampled QA pairs from the test set, applying the following critical criteria:
- Caption and Structure Alignment: Questions and their correct answers must be logically inferable from the provided molecular structure and explicitly supported by information in the caption.
- Answer Unambiguity Through Distinct Options: Answer options must be mutually exclusive and clearly distinguishable, avoiding overlapping or ambiguous choices. This ensures there is exactly one correct answer that can be definitively identified. For example, chemical classification options should not include categories that could simultaneously apply to the same molecule.
- Question Relevance: Questions must contribute meaningful chemical or biological insights based on the structure, avoiding trivial or uninformative queries.
Representative Examples
Rejected Examples
- PubChem ID: 54671008
- Question: When did the molecule receive FDA approval?
- Options:
- 10 October 2006
- 12 October 2007
- 12 October 2008
- 10 October 2009
- 12 October 2010
- Rejection Rationale: Relies on temporal metadata rather than molecular properties, which cannot be inferred from structure.
- PubChem ID: 10129
- Question: What type of odor does the molecule have?
- Options:
- Strong
- Mild
- Sweet
- Pungent
- Unpleasant
- Rejection Rationale: Options lack clear differentiation and are potentially overlapping.
- PubChem ID: 101562486
- Question: What is the general class of biomolecules to which the molecule belongs?
- Options:
- Carbohydrate
- Lipid
- Oligopeptide
- Nucleic acid
- Heterocycle
- Rejection Rationale: Multiple options could be technically correct.
- PubChem ID: 53361968
- Question: What characteristic may make the molecule a desirable therapy?
- Options:
- It is less expensive
- It is less likely to generate resistance
- It is only for treatment-naive patients
- It is only for PI-experienced patients
- It is only for HIV-2 infections
- Rejection Rationale: Addresses characteristics, such as price, that are not directly inferable from structure.
Accepted Examples
- PubChem ID: 21580808
- Question: What is the molecule resulting from?
- Options:
- Protonation of the oxygen of the primary amino group of sotalol
- Protonation of the nitrogen of the secondary amino group of sotalol
- Deprotonation of the nitrogen of the primary amino group of sotalol
- Protonation of the oxygen of the secondary hydroxyl group of sotalol
- Deprotonation of the oxygen of the primary hydroxyl group of sotalol
- Accepted because: The question addresses specific chemical modifications with clearly distinguishable options.
- PubChem ID: 47528
- Question: What is the mechanism of action of the molecule on vascular smooth muscles?
- Options:
- Membrane depolarization
- Membrane hyperpolarization
- Increased transmembrane sodium conductance
- Increased intracellular concentration of cyclic AMP
- Reduced transmembrane potassium conductance
- Accepted because: The question relates to structure-function relationships with distinct, non-overlapping answer choices.
- PubChem ID: 1711945
- Question: Where is the molecule naturally found?
- Options:
- Tilia platyphyllos
- Tilia tomentosa
- Sargassum natans
- Sargassum micracanthum
- Sargassum flavescens
- Accepted because: The question has a single, verifiable correct answer among distinct options.
These examples, along with their detailed rationale, have been included in the manuscript (in Appendix E) to clarify the robustness of our evaluation process.
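For additional transparency, the sketch below shows one possible way a single accepted QA pair (PubChem ID 47528 above) could be represented as a record; the field names and the marked answer index are illustrative choices for this discussion, not necessarily the released schema:

```python
# Hypothetical record layout for one multiple-choice QA pair. Field names and
# the answer index shown here are illustrative, not the released schema.
record = {
    "pubchem_cid": 47528,
    "category": "Biological Info",
    "question": ("What is the mechanism of action of the molecule "
                 "on vascular smooth muscles?"),
    "options": [
        "Membrane depolarization",
        "Membrane hyperpolarization",
        "Increased transmembrane sodium conductance",
        "Increased intracellular concentration of cyclic AMP",
        "Reduced transmembrane potassium conductance",
    ],
    "answer_index": 1,  # placeholder index; the true key ships with the dataset
}

# Multiple-choice accuracy then reduces to an exact match over the chosen index.
def is_correct(predicted_index: int, rec: dict) -> bool:
    return predicted_index == rec["answer_index"]
```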
We summarize the key insights and analytical contributions below. These insights are presented in Section 4 of the paper alongside the model benchmarking.
Benchmarking Across Architectures
- Our dataset is the first to systematically compare both popular LLMs and specialized architectures like MoleculeSTM within the same benchmark.
- Existing datasets typically focus on evaluating only one class of models (e.g., LLMs or multimodal systems), limiting their scope.
Architectural Capabilities and Limitations
- LLMs struggled significantly in Molecule Retrieval tasks, performing close to random guessing, even though they outperformed popular multi-modal architectures on Molecule QA.
- Multimodal architectures, particularly MoleculeSTM, excelled in retrieval tasks, achieving over 66% accuracy.
- This reveals that while LLMs are versatile in handling diverse data types described in textual input, multimodal systems are more effective at leveraging structural molecular representations for specific tasks.
Performance Variations Across Task Categories by Architecture
- LLMs demonstrated strong capabilities in predicting physical properties and uses (e.g., appearance, odor), while multimodal architectures excelled in processing chemical structural information.
Fine-Tuning Dynamics and Performance Transformations
- Fine-tuning had a significant effect on LLM performance, with Llama3-8B achieving a 45% accuracy improvement.
- Galactica-6.7B became the top-performing model post-fine-tuning, achieving 69% accuracy.
- The effectiveness of fine-tuning highlights the adaptability of these models for molecular informatics tasks, although these are not specifically designed for molecular modeling.
- Multimodal architectures also showed substantial gains, with performance improving by 20%.
- However, fine-tuning introduced trade-offs, with gains in physical property tasks accompanied by slight declines in chemical property performance, especially in multi-task models. We also observe a similar trend between BioT5 and BioT5-plus, where BioT5-plus demonstrates improved performance in the "Sources" and "Uses" categories but shows a decline in accuracy for "Physical" and "Chemical Properties".
Scaling and Model Potential
- LLM performance correlated with model size, suggesting that scaling can further enhance capabilities.
- The performance gap between BioT5 and MolT5 highlights the potential benefits of integrating diverse model types and training paradigms.
Future Research Directions
- Exploring hybrid architectures that leverage the strengths of both LLMs and multimodal systems.
- Investigating modeling approaches to achieve balanced performance across diverse molecular property categories.
We once again thank the reviewer for their valuable comments. As the end date for the rebuttal phase is approaching, we kindly request your feedback on whether our responses have satisfactorily addressed your concerns. We would be happy to provide any additional information.
Thank you for your detailed response to the review comments and the additional clarifications. While the paper has potential for improvement, some weaknesses remain. For example, the human evaluation lacks sufficient detail. It is unclear who performed the annotation, the level of expertise of the annotators, and whether there were inconsistencies in the evaluation process. The approach of generating data using Model 1, validating it with Model 2, and subsequently verifying it through human annotation is noted, but greater transparency is needed. Additionally, I recommend involving ethical reviewers, as the paper addresses areas with high-risk implications. I have no further questions and will maintain my original rating.
This paper introduces a large, structured question-answering (QA) benchmark for studying the relationship between molecular structures and text descriptions. It contains 500,000 QA pairs covering about 240,000 molecules from PubChem, designed to enhance model evaluation by offering specific, multiple-choice questions across categories like chemical, physical, and biological properties.
Strengths
- The dataset’s structured QA approach and validation process ensure a reliable and robust QA dataset.
- Evaluation across multiple architectures enables direct comparisons in molecular QA.
- The paper is generally well-written and clear in describing the dataset construction, with well-organized figures and tables.
Weaknesses
- The QA generation process relies heavily on LLMs, which may introduce non-factual content if not properly validated.
Questions
See above.
We thank the reviewer for their comments and positive assessment of the paper. We appreciate the acknowledgment of our contributions, including:
- Building a structured approach to QA dataset construction and validation.
- Benchmarking across different architectural classes.
- Ensuring presentation quality with well-organized figures and tables.
We would like to provide additional clarification regarding the stated weakness: "relies heavily on LLMs, which may introduce non-factual content if not properly validated."
- LLM-Generated QA Validation: The dataset mitigates potential inaccuracies from LLM-generated content through a two-stage validation process involving both automated filtering and manual evaluation. This ensures that QA pairs are factually grounded and relevant to the molecular context. This approach is a common practice, such as in [1], where LLM-based validation was used to mitigate the risk of hallucinations.
- Reliability Assurance: Further, manual evaluation on a random sample (showing >98% accuracy on the subset) confirms a >96% accuracy rate on the entire dataset under a hypergeometric test, demonstrating the robustness of the validation pipeline and minimizing the risk of non-factual content impacting model performance. In future iterations, more advanced LLMs will help bridge this gap even further.
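As a rough illustration of the hypergeometric argument (with assumed round numbers: 500,000 QA pairs in total, a 400-pair manually reviewed sample, and 8 observed errors, i.e., 98% sample accuracy; the exact counts in our study may differ), the sketch below computes how unlikely such a clean sample would be if the true dataset error rate were 4% or higher:

```python
# Minimal sketch of the hypergeometric check; all counts are assumed for
# illustration and may differ from the exact numbers used in the paper.
from scipy.stats import hypergeom

N = 500_000           # total QA pairs in the dataset
n = 400               # manually reviewed sample size
errors_in_sample = 8  # 392/400 correct -> 98% sample accuracy

# Null hypothesis: the full dataset contains at least 4% erroneous pairs.
K_null = int(0.04 * N)  # 20,000 hypothesised erroneous pairs

# Probability of observing 8 or fewer errors in the sample under the null.
p_value = hypergeom.cdf(errors_in_sample, N, K_null, n)
print(f"P(X <= {errors_in_sample} | error rate >= 4%) = {p_value:.4f}")
# A small p-value (roughly 0.02 with these numbers) supports an error rate
# below 4%, i.e. dataset-level accuracy above 96%.
```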
References:
[1] Es, Shahul, et al. "Ragas: Automated evaluation of retrieval augmented generation." arXiv preprint arXiv:2309.15217 (2023).
Thanks for your response which clarifies some of my concerns. I will keep my score.
This work is a benchmarking effort on the topic of molecular structure-text data. The paper presents MolTextQA, a dataset and benchmark for evaluating models on molecular structure-text relationship learning. The dataset contains approximately 500,000 question-answer pairs related to 240,000 molecules from PubChem, specifically curated for structured, directed questions and molecule retrieval tasks. The paper benchmarks various architectural approaches, including multi-modal models and large language models, reporting that Galactica and MoleculeSTM perform best on the Molecule QA and Molecule Retrieval tasks respectively, alongside other analyses and a discussion of remaining challenges.
Strengths
- The data size is substantial, with 500k pairs, and, as shown, spans diverse categories.
- The benchmark emphasizes specificity by using targeted questions with multiple-choice answers.
- The details of the data curation process are sufficiently covered in the paper as well as in the README of the code repo. The types of information included in the schema are also reasonable and easy to follow.
- Although LLM generations during the curation process may not be 100% factual, the paper attempts to address these challenges and presents corresponding validation.
Weaknesses
- In the context of LLMs being used in the process, although validation efforts are made, questions about long-term scalability and manual validation persist.
- While the validation steps adopted show high accuracy (around 96%), the remaining error rate could potentially complicate model training, as acknowledged in the paper.
- Some questions may be straightforward or redundant, such as the polarity questions.
Questions
N/A
We thank the reviewer for their detailed feedback and for highlighting both the strengths and areas for improvement in our work. We appreciate the recognition of the strengths and contributions of our paper, including the size and diversity of the dataset, emphasis on question answering, thorough documentation and validation efforts.
Addressing weaknesses:
1. Validation Efforts and Error Rate:
We acknowledge the reviewer’s concern about the reliance on manual validation. As LLMs become more advanced, we believe this error rate will reduce further, thereby decreasing the burden on manual validation. Currently, a hypergeometric test reveals that the dataset has an error rate of less than 4%, ensuring its overall reliability.
2. Impact of Residual Errors on Model Training:
While we recognize the potential concern regarding the ~4% error rate, we believe the dataset’s large size and diversity make it unlikely to significantly impact training. This is supported by the strong performance of models trained on the dataset, such as BioT5, which achieved up to 75% accuracy, indicating the dataset’s effectiveness in enabling robust learning.
3. Redundant or Straightforward Questions:
As acknowledged in the conclusion, a small fraction of the questions may be straightforward or redundant. However, we believe these questions are still valuable as they serve as baseline tasks to benchmark fundamental molecular understanding. In future work, we aim to refine our question filtering and categorization criteria to better evaluate models across different question complexity levels.
Please let us know if additional clarifications are needed.
This paper introduces a question-answer (QA) dataset containing over 500,000 QA pairs associated with 240,000 unique small molecules. The dataset is derived from the PubChem textual corpus and includes multiple-choice questions about chemical structures, physical and biological properties, applications, and manufacturing information. The questions and answers are generated using various large language models (LLMs), including Llama3 and ChatGPT. A subset of the dataset is human-annotated to enable evaluation. The authors also present an evaluation of existing molecule-text multimodal models and LLMs on the proposed dataset.
Strengths
- Textual data can play a valuable role in enhancing molecular prediction tasks. This paper introduces a large-scale dataset to advance the molecule-text research, providing a benchmark for molecule-related question-answering (QA) tasks, where comprehensive resources are currently lacking.
- The dataset is generated using a combination of multiple large language models (LLMs) and includes a subset with human annotation. This approach aims to enhance the dataset’s quality and usability, potentially offering the community a resource to explore QA tasks related to molecular data.
Weaknesses
- The dataset consists solely of multiple-choice questions with sentence-level descriptions, potentially limiting the depth and complexity of the tasks. It is unclear whether the dataset effectively tests cross-modal reasoning between text and molecules, as many examples seem to involve only simple references to the molecule in the text.
- Additionally, the paper does not provide a strong justification for how advancements in this benchmark would lead to meaningful improvements in molecular prediction models. Consequently, it remains uncertain how this dataset would contribute to real-world molecular applications or enhance the predictive capabilities of existing models.
Questions
N/A
"It remains uncertain how this dataset would contribute to real-world molecular applications or enhance the predictive capabilities of existing models."
We believe that this dataset holds significant potential to impact real-world molecular applications and advance the predictive capabilities of existing models in the following ways:
- Augmenting Existing Molecule-Text Models, Datasets, and LLMs:
- Molecule-text relationship learning has garnered significant attention recently, as seen in [1][2][3][4][5][6]. However, these resources are often unsuitable for extracting specific, directed information. The QA structure in this dataset addresses this gap by providing a more focused and structured approach to molecule-text relationships.
- For instance, tasks involving chemical transformations or mechanisms of action directly assess the ability of models to correlate molecular structures with textual descriptions, a crucial requirement for applications like drug discovery and chemical research.
- Large LLMs, while becoming foundational across domains, have shown limited applicability in this area, as demonstrated by their performance in our zero-shot evaluation.
- Our work demonstrates that both existing molecule-text models and LLMs can be supplemented by fine-tuning with our proposed dataset, showcasing its utility in enhancing these models' performance and enabling directed information retrieval.
- Evaluation of Multi-Modal Understanding by Various Model Classes:
- A crucial focus of our paper is to enable the evaluation of molecule-text understanding. With the growing amount of research in this area, it is essential to establish reliable evaluation strategies, which our work aims to address. In Section 2.2 of the paper, we have highlighted the shortcomings of existing evaluation methodologies and demonstrated how our approach can address these gaps.
- Building Molecular Foundational Models:
- Foundational models in NLP have revolutionized many tasks by training on large-scale, multi-task setups and enabling in-context learning. This paradigm facilitates learning across diverse tasks, yielding generalized benefits. In contrast, molecular models have traditionally focused on single-property prediction tasks, limiting their scope and versatility.
- By introducing directed questions, our dataset enables model steerability, paving the way for future foundational molecular models that can be prompted for various molecular prediction tasks. These models would benefit from learning relationships across a wide range of training data, enhancing their applicability and robustness.
- MoleculeSTM [1] has already demonstrated the value of learning representations from diverse textual information, improving the prediction of molecular properties. While molecular property prediction is beyond the scope of our current work, we believe that the directed nature and breadth of questions in our dataset will contribute to this area, and it will be a key focus for future research.
[1] Shengchao Liu, Weili Nie, Chengpeng Wang, Jiarui Lu, Zhuoran Qiao, Ling Liu, Jian Tang, Chaowei Xiao, and Animashree Anandkumar. Multi-modal molecule structure–text model for text-based retrieval and editing. Nature Machine Intelligence, 5(12):1447–1457, 2023a.
[2] Yin Fang, Xiaozhuan Liang, Ningyu Zhang, Kangwei Liu, Rui Huang, Zhuo Chen, Xiaohui Fan, and Huajun Chen. Mol-instructions: A large-scale biomolecular instruction dataset for large language models. In ICLR. OpenReview.net, 2024. URL https://openreview.net/pdf?id=Tlsdsb6l9n.
[3] Carl Edwards, ChengXiang Zhai, and Heng Ji. Text2mol: Cross-modal molecule retrieval with natural language queries. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 595–607, 2021.
[4] Carl Edwards, Tuan Lai, Kevin Ros, Garrett Honke, Kyunghyun Cho, and Heng Ji. Translation between molecules and natural language. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 375–413, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. URL https://aclanthology.org/2022.emnlp-main.26.
[5] Qizhi Pei, Wei Zhang, Jinhua Zhu, Kehan Wu, Kaiyuan Gao, Lijun Wu, Yingce Xia, and Rui Yan. Biot5: Enriching cross-modal integration in biology with chemical knowledge and natural language associations. arXiv preprint arXiv:2310.07276, 2023.
[6] Qizhi Pei, Lijun Wu, Kaiyuan Gao, Xiaozhuan Liang, Yin Fang, Jinhua Zhu, Shufang Xie, Tao Qin, and Rui Yan. Biot5+: Towards generalized biological understanding with iupac integration and multi-task tuning. arXiv preprint arXiv:2402.17810, 2024.
We thank the reviewer for their comments. We would like to address the weaknesses stated by the reviewer.
"The dataset consists solely of multiple-choice questions with sentence-level descriptions, potentially limiting the depth and complexity of the tasks. It is unclear whether the dataset effectively tests cross-modal reasoning between text and molecules."
- We argue that the directed nature of the question-answer pairs adds significant complexity to molecule captioning models, which are commonly available in existing datasets.
- For instance, many existing datasets, such as those in [1][2][3][4], use generic prompts like "Describe this molecule." In such cases, the multi-modal architecture does not need to focus on the question itself, as the task is open-ended and undirected.
- In contrast, our dataset employs more specific and directed questions, such as "What are the physical characteristics of the molecule at room temperature?" or "How can molecule X be manufactured from molecule Y?" (see the prompt sketch after this list). Questions like these, as well as those targeting industrial applications, require models to extract precise and directed features solely from the structure.
- This directed nature is absent in simplistic prompts like "Describe this molecule."
- Consequently, our dataset introduces additional complexity, making it a more comprehensive test of multi-modal capabilities. It is not limiting but rather expands the complexity in comparison to existing datasets.
- Furthermore, the inclusion of multiple-choice questions is critical for evaluation. Unlike BLEU scores, which are widely used in similar contexts but suffer from limitations (as described in Section 2.2 of the paper), multiple-choice evaluation provides a more precise and reliable metric.
- "Many examples seem to involve only simple references to the molecule."
- We argue that the questions in the dataset cannot be answered without leveraging the model's ability to learn and utilize vast molecule-text relationships.
- For example, to provide an answer like "The molecule is in a gaseous state at room temperature and pressure," the model must correlate the structural properties of the molecule with those of similar molecules it has been trained on. This requires extracting and applying learned structural relationships.
- More complex examples, such as "What is the mechanism of action of the molecule on a certain disease target class?" involve even deeper reasoning. Answering such questions requires the model to synthesize information from clinical trials, drug reports, and related datasets to correlate structural properties with therapeutic applications. This goes far beyond simple molecule referencing.
- Our dataset spans a diverse array of such questions across multiple categories, making it a valuable tool for learning and testing molecule-text relationships in ways that simple references cannot achieve.
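To illustrate what a directed, multiple-choice query looks like when posed to an LLM in a zero-shot setting, here is a minimal, hypothetical prompt-construction sketch; the exact template used in our benchmark may differ, and the SMILES string and options below are placeholders:

```python
# Hypothetical prompt template for directed multiple-choice molecule QA.
# The molecule, question, and options are placeholders for illustration.
def build_prompt(smiles: str, question: str, options: list[str]) -> str:
    lines = [
        f"Molecule (SMILES): {smiles}",
        f"Question: {question}",
        "Options:",
    ]
    lines += [f"  ({chr(ord('A') + i)}) {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with the letter of the single best option.")
    return "\n".join(lines)

print(build_prompt(
    "CC(=O)OC1=CC=CC=C1C(=O)O",  # aspirin, used here only as a placeholder
    "What is the physical state of the molecule at room temperature?",
    ["Gas", "Liquid", "Solid", "Plasma", "Supercritical fluid"],
))
```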
[1] Kirill Degtyarenko, Paula De Matos, Marcus Ennis, Janna Hastings, Martin Zbinden, Alan McNaught, Rafael Alcántara, Michael Darsow, Mickaël Guedj, and Michael Ashburner. Chebi: a database and ontology for chemical entities of biological interest. Nucleic acids research, 36(suppl_1): D344–D350, 2007.
[2] Bing Su, Dazhao Du, Zhao Yang, Yujie Zhou, Jiangmeng Li, Anyi Rao, Hao Sun, Zhiwu Lu, and Ji-Rong Wen. A molecular multimodal foundation model associating molecule graphs with natural language. arXiv preprint arXiv:2209.05481, 2022.
[3] Shengchao Liu, Weili Nie, Chengpeng Wang, Jiarui Lu, Zhuoran Qiao, Ling Liu, Jian Tang, Chaowei Xiao, and Animashree Anandkumar. Multi-modal molecule structure–text model for text-based retrieval and editing. Nature Machine Intelligence, 5(12):1447–1457, 2023a.
[4] Yin Fang, Xiaozhuan Liang, Ningyu Zhang, Kangwei Liu, Rui Huang, Zhuo Chen, Xiaohui Fan, and Huajun Chen. Mol-instructions: A large-scale biomolecular instruction dataset for large language models. In ICLR. OpenReview.net, 2024. URL https://openreview.net/pdf?id=Tlsdsb6l9n.
We once again thank the reviewer for their thoughtful comments. With the rebuttal phase deadline approaching, we kindly ask if our responses have adequately addressed your concerns. If there are any remaining questions or additional clarifications needed, please let us know.
Thank you to the authors for their response.
I acknowledge that the proposed dataset is more meaningful than some of the existing ones, such as molecule captioning.
However, I was unable to give a higher score because:
- I am not fully convinced that the dataset requires more molecule-text reasoning than simply replacing mentions of "the molecule" in the text with the actual molecule name.
- The dataset is inherently synthetic given the way it is constructed and is not sufficiently challenging, as it is a multiple-choice, sentence-level QA task.
- As noted by another reviewer, the validation process lacks clarity, raising concerns about the quality of the dataset.
I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.