PaperHub
ICLR 2024 · Poster
4 reviewers · Average rating: 5.8/10 (scores: 6, 5, 6, 6; min 5, max 6, std 0.4) · Average confidence: 3.3

Towards 3D Molecule-Text Interpretation in Language Models

OpenReview · PDF
Submitted: 2023-09-22 · Updated: 2024-03-10

Abstract

Keywords

3D molecules · Large Language Model · 3D-text interpretation

Reviews and Discussion

Review (Rating: 6)

This paper proposes to learn 3D molecule representations that can be used with pretrained large language models. Using a pretrained 3D molecule encoder and a large language model, the proposed method fine-tunes a Q-Former to obtain molecule representations in the language model's space, and fine-tunes the large language model to predict text (i.e., descriptions) given molecule embeddings and SMILES representations. The proposed method shows improvements over using a large language model (e.g., LLaMA) naively.

Strengths

  • I'm not an expert in this area, but the proposed approach of leveraging pretrained language models for downstream tasks related to 3D molecules seems quite novel.
  • The paper is generally well-written and easy to follow.
  • The paper is well-motivated.
  • The paper includes code for reproducibility.

Weaknesses

  • The paper does not include failure cases (e.g., hallucination) of the proposed method. For instance, in Table 3(b), it seems the proposed 3D-MoLM wrongly interprets Globostellatic acid B (predicting C34H48O6 while the ground truth is C33H48O7); the paper needs to discuss when the model tends to predict wrong results.
  • I think the comparison in Table 3 might be somewhat unfair due to the different model sizes used for evaluation. Specifically, MolT5-Large has 800M parameters, while 3D-MoLM mainly uses LLaMA-7B. In this respect, I suspect the improvements might simply come from using language models with a large number of parameters and more training data, rather than from considering the 3D geometry of molecules. A good supporting experiment would use a smaller LM (e.g., FLAN-T5-large).
  • I have a similar concern about Table 2, since the model used in the proposed method is much larger and uses much more pretraining data than the other baselines.

Questions

  • Have the authors tried using a larger model (e.g., LLaMA-13B)? If so, does it show better performance?
Comment

Thank you for your time and constructive feedback! To address your concerns, we present the point-to-point responses as follows. We have carefully revised our paper, taking all your feedback into account.

Q1. Failure case study. The paper does not include failure cases (e.g., hallucination) of the proposed method. For instance, in Table 3(b), it seems the proposed 3D-MoLM wrongly interprets Globostellatic acid B (predicting C34H48O6 while the ground truth is C33H48O7); the paper needs to discuss when the model tends to predict wrong results.

Response: Thank you for your insightful suggestion. We agree that discussing failure cases is crucial for a comprehensive understanding of our model's performance. In response, we have revised our submission to include two failure case studies in Appendix D. Specifically, we show two mistakes of 3D-MoLM: 1) it confuses Globostellatic acid C with Globostellatic acid B; 2) it fails to generate a molecule's correct chemical formula. We attribute these mistakes to the following limitations:

  1. 3D-MoLM's ability to discern molecular structures at a fine granularity requires further enhancement.
  2. Existing language models, including GPT-4, are not yet fully capable of accurately counting atoms and producing the correct chemical formula (see the verification sketch below).
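
The chemical formula, unlike free-text descriptions, is deterministically computable from the molecular graph, so this class of hallucination is easy to detect automatically. Below is a minimal illustrative sketch (ours, not part of the paper) using RDKit to verify a generated formula:

```python
from rdkit import Chem
from rdkit.Chem.rdMolDescriptors import CalcMolFormula

def formula_matches(smiles: str, generated_formula: str) -> bool:
    """Check an LM-generated chemical formula against the formula
    computed deterministically from the molecular graph."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:          # invalid SMILES string
        return False
    return CalcMolFormula(mol) == generated_formula

# e.g., an LM answering "C34H48O6" for a molecule whose true formula
# is "C33H48O7" would be flagged here.
```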

These cases underscore areas where 3D-MoLM can be improved. Thank you again for your constructive critique, which has significantly contributed to enhancing the quality of our paper.

Q2: Performance of 3D-MoLM using smaller base LMs. I think the comparison in Table 3 might be somewhat unfair due to the different model sizes used for evaluation. Specifically, MolT5-Large has 800M parameters, while 3D-MoLM mainly uses LLaMA-7B. In this respect, I suspect the improvements might simply come from using language models with a large number of parameters and more training data, rather than from considering the 3D geometry of molecules. A good supporting experiment would use a smaller LM (e.g., FLAN-T5-large).

Response: Thank you for your insightful comments. We understand your concern about the potential bias in the comparison due to different model sizes. To address this, we have included additional experiments by replacing Llama2 with MolT5 in Appendix F.

Specifically, we have transplanted the 3D molecule encoder and the projector onto MolT5, allowing a fair comparison with the same base LM. As shown in the table below, MolT5 models with 3D perception consistently outperform those without across all scales. This result substantiates our claim that incorporating 3D geometry perception enhances the LM's molecular understanding.

Table 1: Comparison between 3D-MoLM and MolT5 for molecule captioning. ✗: without 3D perception; ✓: with 3D perception.

| Base LM | 3D Perception | BLEU-2 | BLEU-4 | ROUGE-1 | ROUGE-2 | ROUGE-L | METEOR |
|---|---|---|---|---|---|---|---|
| MolT5-small | ✗ | 22.53 | 15.23 | 30.44 | 13.45 | 20.30 | 23.98 |
| MolT5-small | ✓ | 23.73 | 16.57 | 31.57 | 14.34 | 21.63 | 24.81 |
| MolT5-base | ✗ | 24.51 | 16.61 | 32.19 | 14.04 | 21.35 | 26.10 |
| MolT5-base | ✓ | 25.55 | 17.31 | 33.36 | 15.49 | 22.62 | 27.54 |
| MolT5-large | ✗ | 25.87 | 17.28 | 34.07 | 16.42 | 23.41 | 28.04 |
| MolT5-large | ✓ | 26.91 | 18.50 | 35.25 | 17.37 | 25.50 | 29.45 |

We hope this additional experiment adequately addresses your concern and further underscores the value of considering 3D geometry of molecules in our model.
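
For readers less familiar with the captioning metrics reported above, here is a minimal sketch of how BLEU-2/4 and ROUGE scores of this kind are commonly computed (using the nltk and rouge-score packages; the paper's exact evaluation scripts may differ, e.g., in tokenization):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

def caption_metrics(reference: str, prediction: str) -> dict:
    """Compute BLEU-2/4 and ROUGE-1/2/L for one caption pair."""
    ref_tokens, pred_tokens = reference.split(), prediction.split()
    smooth = SmoothingFunction().method1  # avoid zero scores on short texts
    bleu2 = sentence_bleu([ref_tokens], pred_tokens,
                          weights=(0.5, 0.5), smoothing_function=smooth)
    bleu4 = sentence_bleu([ref_tokens], pred_tokens,
                          weights=(0.25, 0.25, 0.25, 0.25),
                          smoothing_function=smooth)
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                      use_stemmer=True)
    rouge = scorer.score(reference, prediction)
    return {"BLEU-2": bleu2, "BLEU-4": bleu4,
            "ROUGE-1": rouge["rouge1"].fmeasure,
            "ROUGE-2": rouge["rouge2"].fmeasure,
            "ROUGE-L": rouge["rougeL"].fmeasure}
```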

Comment

Q3. Influence of model and dataset scale on retrieval performance. I have a similar concern about Table 2, since the model used in the proposed method is much larger and uses much more pretraining data than the other baselines.

Thank you for the suggestion on maintaining similar scales for comparison. We have prepared responses for dataset scale and model scale separately.

Q3.1. Dataset Scale

Response: We have revised Table 2 in our submission to include the retrieval performance of MoMu when trained on the same dataset as 3D-MoLM. The performance is shown in the table below.

Table 2: † denotes our re-implementation using 3D-MoLM's dataset; the others are evaluated on their released checkpoints. Each cell reports Acc / R@20.

| Model | In-batch M2T | In-batch T2M | Test-set M2T | Test-set T2M |
|---|---|---|---|---|
| MoMu-S | 87.58 / 99.24 | 86.44 / 99.38 | 47.29 / 90.77 | 48.13 / 89.92 |
| MoMu-K | 88.23 / 99.41 | 87.29 / 99.42 | 48.47 / 91.64 | 49.46 / 90.73 |
| MoMu-S† | 90.43 / 99.53 | 89.38 / 99.50 | 60.51 / 93.24 | 58.36 / 91.35 |
| MoMu-K† | 90.89 / 99.67 | 90.16 / 99.44 | 62.07 / 93.06 | 59.17 / 92.01 |
| 3D-MoLM | 93.50 / 100.00 | 92.89 / 99.59 | 69.05 / 95.91 | 70.13 / 94.88 |

We can observe that:

  1. 3D-MoLM outperforms both MoMu† and MoMu, consistent with our observation in the original submission. This underscores the advantage of 3D molecule-text modeling over 2D.
  2. MoMu† outperforms MoMu, which suggests that a larger pretraining dataset benefits the retrieval task.

We hope this additional experiment adequately addresses your concern and further highlights the effectiveness of our proposed 3D-MoLM model.
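
For clarity on the metrics above: Acc is top-1 retrieval accuracy and R@20 is recall within the top-20 candidates. A minimal sketch of how both could be computed from a molecule-text similarity matrix (our illustration, assuming the standard setup where the i-th molecule is paired with the i-th text):

```python
import torch

def retrieval_metrics(sim: torch.Tensor) -> tuple[float, float]:
    """Top-1 accuracy and recall@20 from an (N_mol x N_text) similarity
    matrix where the i-th molecule is paired with the i-th text."""
    target = torch.arange(sim.size(0))
    ranks = sim.argsort(dim=1, descending=True)          # texts ranked per molecule
    acc = (ranks[:, 0] == target).float().mean().item()  # top-1 hit rate
    r20 = (ranks[:, :20] == target[:, None]).any(dim=1).float().mean().item()
    return acc, r20

# sim = mol_embeds @ text_embeds.T after L2-normalizing both sides
```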

Q3.2. Model Scale

Response: We recognize the difference in model scales: MoMu (112M) uses SciBERT (110M) and GraphMVP (2M), while our retrieval model (225M) employs the Q-Former (178M) and Uni-Mol (47M). However, creating a new MoMu or 3D-MoLM of equivalent scale is challenging, because both methods rely on specific pretrained models (e.g., Uni-Mol and GraphMVP) for which no versions of comparable scale are readily available. This means we cannot provide a direct scale-to-scale comparison within the constraints of our current resources.

We hope these responses satisfactorily address your concerns and further demonstrate the efficacy of our 3D-MoLM model. Thank you again for your valuable feedback.

Q4. How does the LM's scale influence performance? Have the authors tried using a larger model (e.g., LLaMA-13B)? If so, does it show better performance?

Response: Thank you for your suggestion. To address the question of how a language model's scale influences performance, we have included new experiments (Table 1 above) using MolT5-small (82M), MolT5-base (252M), and MolT5-large (782M). See our response to your Q2 for details. We observe that increasing the LM's scale also improves molecule captioning performance, which aligns with common understanding in LM studies.

While we recognize the potential advantages of employing a larger LM (e.g., Llama-13B), our current study is constrained by limited computational resources. We appreciate your understanding and will certainly consider this in future work as resources permit.

Comment

Thanks for the detailed response. It addressed my concerns well, and I raise my score from 5 to 6.

Comment

Thank you for recognizing our efforts in the rebuttal. We appreciate your decision to increase the rating. Your feedback has been invaluable to our work.

Review (Rating: 5)

This paper proposes a framework to interpret and analyze 3D molecules by equipping an LM with a 3D molecular encoder. The framework integrates a 3D Uni-Mol encoder and a Llama language model; the overall framework is the same as BLIP-2's. Through a pre-train then fine-tune strategy, the authors demonstrate effectiveness on molecule captioning, retrieval, and open QA tasks.

Strengths

  1. The paper proposes a new framework for 3D molecule-text pretraining and fine-tuning to conduct multiple 3D molecular tasks.
  2. The framework is clear and reasonable.
  3. The experiments are conducted on different tasks, and the effectiveness is demonstrated.

Weaknesses

  1. The novelty of this work is hard to establish. It is clear that the framework is almost the same as BLIP-2; the pre-training tasks and the fine-tuning stages are also the same.
  2. Besides, there are several smaller points, mostly unclear descriptions, that I would like to ask about. Please see the questions.

Questions

  1. The first question is about the Q-Former (this may be the same as the Q-Former in BLIP-2, but the process is not clear here). From Figure 2, the caption task and the contrastive task use the text branch for text processing. These tasks are only used in pre-training; hence, in the inference stage, the text branch in the Q-Former will be ignored (if I understand correctly). Therefore, the pre-training and the fine-tuning are mismatched, since we will only use Llama2 for text generation instead of the Q-Former's text branch. What is the performance if we remove these two pretraining tasks and also the text branch in the Q-Former?
  2. In Figure 3, stage 1 is molecule-text retrieval and stage 2 is molecule captioning, but in Section 2.2.1, stage 1 is described as multitask training with all three tasks. See "Following BLIP, we perform multi-task training, including....". Please clarify which is correct.
  3. For the downstream tasks, the model is fine-tuned after different stages, which produces multiple different models, one per task. I am wondering about multitask fine-tuning and the performance of a single model. Or, at the least, after the two/three training stages, fine-tune one model on the downstream tasks.
  4. At the beginning of Section 3.1, the authors mention "we assess the stage-1 checkpoint of ...", but the last sentence on page 6 is "Q-former benefits from the multi-task pre-training"; if stage 1 only contains the retrieval task, this claim is wrong.
  5. From Table 4, the performance gain on open-text QA is not as large as expected; do the authors have any analysis? The 2D results are similar to the 3D ones.
  6. In Appendix C, did the retrieval task use LoRA for fine-tuning?
Comment

We sincerely appreciate your constructive and thorough comments. Your suggestions, especially the ones about including a generalist model and clarifying training stages, have significantly improved our presentation. To address your concerns, we provide responses as follows. If you have additional concerns, we would be pleased to discuss them with you.

Q1: Novelty and contributions. The novelty of this work is hard to establish. It is clear that the framework is almost the same as BLIP-2; the pre-training tasks and the fine-tuning stages are also the same.

Response: We appreciate your comments, but wish to respectfully emphasize our novelties and contributions. Our adaptation of Q-Former to 3D molecule-to-text generation tasks is nontrivial, offering a new dimension to 3D molecule understanding.

Further, we have made significant modifications to BLIP-2 to adapt it to the molecule domain, and devised a dataset for 3D-molecule-centric instruction tuning. Our novelties and contributions, compared to BLIP-2, are summarized below:

  • First 3D molecule-text modeling method. 3D-MoLM is, to our knowledge, the first method to explore 3D molecule-text modeling. We believe 3D-MoLM advances the field by enabling text-based understanding of 3D molecules.
  • Multi-task Generalist Model through Instruction Tuning. Unlike BLIP-2, 3D-MoLM includes a multi-modal instruction-tuning stage. This instruction-tuning phase aims to align 3D-MoLM with human preference and to develop a generalist model that can perform various downstream tasks. The performance of the generalist 3D-MoLM is reported in our updated Table 3 and Table 4. See our response to your Q4 for more details.
  • Instruction-tuning Dataset. We present 3D-MoIT for 3D molecule-text instruction tuning. 3D-MoIT is, to our knowledge, the first 3D molecule-centric instruction tuning dataset. We also propose the GPT-3.5 enriched version of PubChem for 3D molecule-text alignment. We believe these datasets can facilitate further research and development in this area.

We hope this clarifies the unique contributions of our work and differentiates it from existing models like BLIP-2.

Comment

Q2. The first question is about the Q-Former (this may be the same as the Q-Former in BLIP-2, but the process is not clear here). From Figure 2, the caption task and the contrastive task use the text branch for text processing. These tasks are only used in pre-training; hence, in the inference stage, the text branch in the Q-Former will be ignored (if I understand correctly). Therefore, the pre-training and the fine-tuning are mismatched, since we will only use Llama2 for text generation instead of the Q-Former's text branch. What is the performance if we remove these two pretraining tasks and also the text branch in the Q-Former?

Thank you for the valuable question and comment. We have divided this question into three sub-questions and prepared answers separately:

Q2.1. Is the Q-Former's text branch ignored in inference?

Response: The Q-Former's text branch is crucial in the inference of molecule-text retrieval, although it is not used in molecule-to-text generation. Specifically, the text branch is used in both molecule-text contrasting (MTC) and molecule-text matching (MTM), the two essential components of molecule-text retrieval [1]. In MTC, the text branch generates text embeddings to calculate cosine similarities for contrasting; in MTM, the text branch processes text tokens, from which the query tokens obtain text information. Therefore, removing the text branch would impede the molecule-text retrieval functionality.

On the other hand, removing the text branch does not affect molecule-to-text generation at inference time. However, the performance of molecule-to-text generation can be hurt, as explained in the response to Q2.3.
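
To make the text branch's role in MTC concrete, here is a minimal sketch of a molecule-text contrastive loss in the style of BLIP-2 [1] (our illustration, not the authors' implementation): the similarity between a molecule and a text is the maximum cosine similarity over the Q-Former's query tokens, contrasted against in-batch negatives with a symmetric cross-entropy.

```python
import torch
import torch.nn.functional as F

def mtc_loss(query_embeds: torch.Tensor,   # (B, Q, D): Q-Former query outputs
             text_embeds: torch.Tensor,    # (B, D): text-branch embeddings
             temperature: float = 0.1) -> torch.Tensor:
    """Molecule-text contrastive loss: each molecule-text pair is scored by
    its best-matching query token, then contrasted with in-batch negatives."""
    q = F.normalize(query_embeds, dim=-1)
    t = F.normalize(text_embeds, dim=-1)
    # sim[i, j] = max over query tokens of cos(query of molecule i, text j)
    sim = torch.einsum("iqd,jd->ijq", q, t).max(dim=-1).values / temperature
    labels = torch.arange(sim.size(0), device=sim.device)
    m2t = F.cross_entropy(sim, labels)      # molecule-to-text direction
    t2m = F.cross_entropy(sim.t(), labels)  # text-to-molecule direction
    return (m2t + t2m) / 2
```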

Q2.2. The text generation task uses the Q-Former's text branch, which is ignored during the inference of text generation. Therefore, the pre-training and the fine-tuning are mismatched, since we will only use Llama2 for text generation instead of the Q-Former's text branch.

Response: Thank you for kindly identifying the ambiguity! We would like to clarify that the pretraining (using Llama2 for text generation in stages 2 and 3) and the inference (using Llama2 for generation after stage 3) are aligned, not mismatched. Our original Figure 2 may have given the wrong impression that only the Q-Former's text branch is used for text generation; this issue is fixed in the revised Figure 2. The correct training strategy is: the Q-Former's text branch is trained for text generation in stage 1, and Llama2 is trained for text generation in stages 2 and 3. Therefore, Llama2 is fully optimized for generation before inference, and the pretraining and inference are aligned.
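
As a concrete picture of this aligned pipeline, the sketch below (ours, not the released code) shows how projected query tokens can act as soft prompts prepended to Llama2's input embeddings; `projector` stands for the 3D molecule-text projector.

```python
import torch

def molecule_conditioned_forward(llm, projector, query_output, input_ids):
    """Project Q-Former query outputs into the LM's input space and
    prepend them to the embedded text prompt (Hugging Face-style API)."""
    mol_tokens = projector(query_output)                  # (B, Q, d_llm)
    text_embeds = llm.get_input_embeddings()(input_ids)   # (B, T, d_llm)
    inputs_embeds = torch.cat([mol_tokens, text_embeds], dim=1)
    return llm(inputs_embeds=inputs_embeds)               # ordinary forward pass
```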

Q2.3. How will 3D-MoLM perform if we remove the text branch and the objectives of molecule-text contrasting and molecule captioning in training stage 1?

Response: It is important to note that removing the text branch eliminates all three objectives in training stage 1: the molecule-text matching objective is eliminated because it relies on the text branch to process text tokens [1]. Therefore, removing the text branch amounts to removing training stage 1 and using a randomly initialized Q-Former for stage 2. The question then boils down to stage 1's influence on text generation performance, which is shown below:

Table 1: Molecule captioning performance on the PubChem dataset.

| Training stages | BLEU-2 | BLEU-4 | ROUGE-1 | ROUGE-2 | ROUGE-L | METEOR |
|---|---|---|---|---|---|---|
| Only stage 2 | 27.84 | 20.35 | 34.19 | 19.42 | 28.90 | 30.81 |
| Original, stages 1 & 2 | 30.32 | 22.52 | 36.84 | 22.32 | 31.23 | 33.06 |

We observe that including stage-1 pretraining significantly improves molecule captioning performance. This is because stage-1 pretraining acts as a "warmup" before stages 2 and 3 [1]: it prepares the Q-Former to extract the molecule features most relevant to the texts, which is achieved by the three multi-modal objectives in stage 1.

Comment

Q3. Confusion about the three training stages. In Figure 3, stage 1 is molecule-text retrieval and stage 2 is molecule captioning, but in Section 2.2.1, stage 1 is described as multitask training with all three tasks. See "Following BLIP, we perform multi-task training, including....". Please clarify which is correct.

......

At the beginning of Section 3.1, the authors mention "we assess the stage-1 checkpoint of ...", but the last sentence on page 6 is "Q-former benefits from the multi-task pre-training"; if stage 1 only contains the retrieval task, this claim is wrong.

Response: Thank you for identifying the ambiguities. We appreciate the opportunity to clarify the training stages. Stage 1 includes three training objectives: molecule-text matching, molecule-text contrasting, and molecule captioning. These objectives collectively endow the Q-Former with strong molecule-text retrieval ability. Consequently, upon completing stage 1, the Q-Former is fine-tuned for the retrieval task. The term "molecule-text retrieval" in Figure 3 refers to stage 1's downstream task, rather than a specific training objective.

To improve the clarity of our submission, we have made the following revisions:

  • Terminology Consistency. We have refined the use of the term "task" throughout the paper. Initially, it ambiguously referred to both downstream applications and training objectives. After revision, "task" strictly denotes downstream applications, while "training objective" denotes the losses optimized during training.
  • Figure 2 & 3 Updates. Figure 2b now illustrates the objectives and architectures of stages 2 & 3, and Figure 3 uses names for the three training stages that are consistent across the paper.
  • Section 2.2.1 Revision. The description of Stage 2 is now more explicit about its training objective.

We hope these revisions adequately address your concerns and improve the clarity of our paper. Thank you again for your meticulous feedback, which has been instrumental in improving our work.

Q4. A generalist model for all text generation tasks. For the downstream tasks, the model is fine-tuned after different stages, which produces multiple different models, one per task. I am wondering about multitask fine-tuning and the performance of a single model. Or, at the least, after the two/three training stages, fine-tune one model on the downstream tasks.

Response: Thank you for your insightful comments. To address your concern, we have included the performance of both the specialist models, which are fine-tuned for each task individually, and the generalist model, which is a unified model trained on all tasks and evaluated with a consistent checkpoint. The performance is shown in the tables below:

Table 2: One checkpoint for captioning, open-text QA, and computed property QA.

Table 2a: Molecule captioning performance.

| Model | BLEU-2 | BLEU-4 | ROUGE-1 | ROUGE-2 | ROUGE-L | METEOR |
|---|---|---|---|---|---|---|
| Specialist | 30.32 | 22.52 | 36.84 | 22.32 | 31.23 | 33.06 |
| Generalist | 29.25 | 22.07 | 36.48 | 21.80 | 30.95 | 33.12 |

Table 2b: Descriptive open-text QA performance.

| Model | BLEU-2 | BLEU-4 | ROUGE-1 | ROUGE-2 | ROUGE-L | METEOR |
|---|---|---|---|---|---|---|
| Specialist | 32.00 | 26.13 | 40.13 | 25.55 | 34.64 | 52.15 |
| Generalist | 31.61 | 25.88 | 39.93 | 25.68 | 34.78 | 51.74 |

Table 2c: Computed property QA performance (lower is better).

| Model | Weight | LogP | TPSA | Complexity | HOMO | LUMO | H-L Gap | SCF |
|---|---|---|---|---|---|---|---|---|
| Specialist | 14.79 | 0.66 | 9.71 | 44.85 | 0.26 | 0.25 | 0.28 | 0.35 |
| Generalist | 16.58 | 0.78 | 10.90 | 45.49 | 0.35 | 0.36 | 0.32 | 0.38 |

The results show that while the generalist model slightly underperforms compared to the specialist models, it still demonstrates a performance gain over other baselines. This highlights its versatility and ability to handle multiple tasks effectively. Please refer to the performance tables (Table 3 & Table 4) in the revised paper and the training details in Appendix C for more information.

Comment

Q5. Performance gain on open-text QA. From Table 4, the performance gain on open-text QA is not as large as expected; do the authors have any analysis? The 2D results are similar to the 3D ones.

Response: Thank you for your comments. Table 4 has two subtables: Table 4a is about Descriptive Property QA, and Table 4c is about Computed Property QA. We now provide clarifications for the subtables separately:

  1. Table 4c (computed property QA). For tasks related to predicting 3D-dependent properties, 3D-MoLM significantly outperforms baselines that use 1D or 2D molecular representations, demonstrating 3D-MoLM's effectiveness in perceiving 3D molecular geometries.
  2. Table 4a (descriptive property QA). The QAs in this dataset are generated from descriptions in PubChem, where only a small fraction of the descriptions is 3D-dependent. The majority of QAs from PubChem can be answered by reading 1D or 2D molecular representations. This could account for the smaller improvement from 3D perception on this dataset.
  3. Table 4a (descriptive property QA). Although the improvement on this dataset is smaller, 3D-MoLM still surpasses the baselines. This demonstrates the effectiveness of 3D molecule-text interpretation even in tasks without a strong connection to 3D structures.

We hope this explanation provides a clearer understanding of the performance of 3D-MoLM in different tasks.

Q6. In Appendix C, did the retrieval task use LoRA for fine-tuning?

Response: Thanks for the question. No, we do not use LoRA for the retrieval task; the retrieval task uses the stage-1 checkpoint directly, without additional fine-tuning.

We have revised Figure 2 to illustrate 3D-MoLM's architectures at different stages and revised Appendix C to include more experimental details. To clarify, the retrieval task involves only the Q-Former and the Uni-Mol encoder from training stage 1, without the Llama-2 model. Due to the smaller scales of Uni-Mol and the Q-Former, we do not employ LoRA in training stage 1.
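
For context, LoRA is applied when fine-tuning the much larger Llama-2 in the later stages. Below is a minimal sketch with the Hugging Face peft library; the rank, alpha, and target modules are illustrative assumptions, not the paper's reported hyperparameters.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_lm = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora_cfg = LoraConfig(
    r=8,                                  # low-rank dimension (assumed)
    lora_alpha=32,                        # scaling factor (assumed)
    target_modules=["q_proj", "v_proj"],  # attention projections (assumed)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_lm, lora_cfg)  # only LoRA adapters are trainable
model.print_trainable_parameters()
```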

Reference:

[1] BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. In ICML 2023.

Comment

Thank you for your valuable comments and suggestions on our submission. Your suggestions to 1) clarify the novelty and contributions of this work, 2) illustrate and evaluate the functionality of the Q-Former's text branch, 3) clarify the training objectives of the three training stages, and 4) include a generalist model for the generation tasks of molecule captioning and open-text QA have helped us substantially improve the clarity, significance, and comprehensiveness of our submission. We sincerely hope these improvements can be taken into consideration.

We are now approaching the end of the discussion period on November 22nd. If our response has resolved your concerns about our paper, we would greatly appreciate it if you could re-evaluate it. Should you have any further questions or need additional clarification, we are eager and prepared to continue the discussion.

Review (Rating: 6)

This paper presents 3D-MoLM, a novel molecule-text multimodal language model. The model combines two pretrained foundation models, Uni-Mol and Llama2, for molecular and language understanding. The key contribution of this paper lies in its three-stage training process that integrates these two pretrained models.

In the first stage of training, the authors adopt the Q-Former and the multi-task training approach, methods previously proposed for vision-language multimodal models, to align the representation spaces of molecules and language. Afterwards, the Q-Former and Llama2 models are fine-tuned on the molecule captioning task. In the final stage, instruction tuning is performed. To this end, the authors constructed 3D-MoIT, a 3D molecule-centric instruction-tuning dataset.

The proposed method is tested on several molecule-language multimodal tasks, including molecule-text retrieval, molecule captioning, and molecular question answering. The experimental results demonstrate the superiority of the proposed model in comparison with several baselines.

Strengths

As a machine learning researcher, I expected multimodal learning between the molecule and language domains to appear sooner or later. In this light, I think this study is well presented at the right time. The authors effectively adopt several techniques from existing vision-language multimodal learning to combine a molecular domain model and a large language model. In addition, the initial results of instruction tuning via the 3D-MoIT dataset demonstrate the potential of large language models in the molecular domain.

Weaknesses

The main criticism I have is that the motivation and need for a molecule-language multimodal model remain unclear. Of course, multimodality between the molecule and language domains is a promising research direction for machine learning researchers and practitioners. However, consider the situation where scientists actually conduct research using the proposed model. In what scenarios can the proposed model be utilized, and how can it lead to the synthesis or retargeting of novel/existing compounds? The molecular domain is the province of domain experts (i.e., chemists and biologists), not general users. Thus, discussion of the use cases of molecule-language multimodality for domain experts is crucial for setting the ultimate goal of this line of research. Unfortunately, however, this paper, like other work in this field, seems to avoid answering this pivotal question. I hope the authors address this question in the rebuttal or the revised manuscript.

Questions

First of all, as mentioned in the Weaknesses section, I would like to ask the authors about the motivation and need for molecule-language multimodal learning. What are the specific use cases of the proposed molecule-language model? How would scientists or practitioners (i.e., domain experts) harness the proposed method in their research or business?

Regarding the question-answering task, the authors performed molecular property prediction. Properties such as HOMO/LUMO are typically derived from density functional theory calculations or from molecular representation models such as Uni-Mol. Is there any specific rationale or chance that the language model enhances the prediction accuracy of these properties? Additionally, this work only compared prediction accuracy against language models, while the ability to predict the properties comes from Uni-Mol as mentioned above; therefore, I think Uni-Mol should also be a baseline in the experiment, which would demonstrate the effect of the multimodal learning.

Finally, the PubChem database provides a 3D conformation for each molecule. While I am not sure what method was employed to obtain the 3D structures in PubChem, it might be worthwhile to consider leveraging these data instead of MMFF-relaxed structures.

Ethics Review Details

None

Comment

Thank you for the thorough review and positive feedback! Your meticulous suggestions have helped us to improve the clarity and enhance the comparison to baselines. To address your concerns, we present the point-to-point responses as follows.

Q1. The motivation and practical application of studying molecule-text modeling. The main criticism I have is that the motivation and need for a molecule-language multimodal model remain unclear. Of course, multimodality between the molecule and language domains is a promising research direction for machine learning researchers and practitioners. However, consider the situation where scientists actually conduct research using the proposed model. In what scenarios can the proposed model be utilized, and how can it lead to the synthesis or retargeting of novel/existing compounds? The molecular domain is the province of domain experts (i.e., chemists and biologists), not general users. Thus, discussion of the use cases of molecule-language multimodality for domain experts is crucial for setting the ultimate goal of this line of research. Unfortunately, however, this paper, like other work in this field, seems to avoid answering this pivotal question. I hope the authors address this question in the rebuttal or the revised manuscript.

...

First of all, as mentioned in the Weaknesses section, I would like to ask the authors about the motivation and need for molecule-language multimodal learning. What are the specific use cases of the proposed molecule-language model? How would scientists or practitioners (i.e., domain experts) harness the proposed method in their research or business?

Response: Thank you for the insightful suggestion. We believe studying molecule-text multi-modal learning has value for scientists in (but not limited to) two respects: 1) solving scientific tasks that are multi-modal in nature; 2) improving uni-modal scientific tasks by leveraging the scientific knowledge and reasoning ability of LMs. We elaborate below:

  • Solving multi-modal scientific tasks. These tasks are multi-modal in nature and cannot be solved without multi-modal methods. For example, experimental procedure prediction [1] requires reading chemical reactions (molecule modality) and outputting experimental instructions (text modality). This task has significant practical value for automating chemical synthesis experiment design, and 3D-MoLM can serve as a foundation model for fine-tuning on it. Another example is molecule-text retrieval, where chemists can specify desired chemical properties (e.g., biological activity and toxicity) in natural language and retrieve candidates from databases that include unannotated molecules. This task can help retarget existing molecules.

  • Improving uni-modal scientific tasks. Uni-modal tasks (e.g., molecule property prediction) can potentially be improved by exploiting the chemistry knowledge in LMs. [2] shows an example where joint multi-modal molecule-text learning improves molecule property prediction; [3] shows an example where joint multi-modal learning between an LM and a protein encoder improves the protein encoder's protein classification performance.

Comment

Q2: Comparison with Uni-Mol. Regarding the question-answering task, the authors performed molecular property prediction. Properties such as HOMO/LUMO are typically derived from density functional theory calculations or from molecular representation models such as Uni-Mol. Is there any specific rationale or chance that the language model enhances the prediction accuracy of these properties? Additionally, this work only compared prediction accuracy against language models, while the ability to predict the properties comes from Uni-Mol as mentioned above; therefore, I think Uni-Mol should also be a baseline in the experiment, which would demonstrate the effect of the multimodal learning.

Response: Thank you for your insightful question. To address your concern, we have included an additional experiment in which we fine-tuned one Uni-Mol model for the eight computed property prediction tasks. As shown in the table below, 3D-MoLM achieves performance comparable to Uni-Mol and outperforms it in predicting molecular weight, TPSA, complexity, and SCF energy. We provide the following analysis to explain this:

Table: One model trained for all properties. Lower values are better.

| Model | Weight | LogP | TPSA | Complexity | HOMO | LUMO | H-L Gap | SCF |
|---|---|---|---|---|---|---|---|---|
| Uni-Mol | 20.35 | 0.59 | 13.48 | 57.24 | 0.32 | 0.35 | 0.21 | 0.45 |
| 3D-MoLM | 16.58 | 0.78 | 10.90 | 45.49 | 0.35 | 0.36 | 0.32 | 0.38 |
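
The values in this table are prediction errors (lower is better). As an illustrative sketch, such errors can be computed by parsing a numeric value out of the LM's free-text answer and averaging absolute errors; the parsing rule below and the use of mean absolute error are our assumptions, not the paper's exact protocol.

```python
import re

def parse_number(answer: str) -> float | None:
    """Extract the first numeric value from a generated answer."""
    m = re.search(r"-?\d+(?:\.\d+)?", answer)
    return float(m.group()) if m else None

def property_error(predictions: list[str], targets: list[float]) -> float:
    """Mean absolute error over answers containing a parseable number."""
    pairs = [(parse_number(p), t) for p, t in zip(predictions, targets)]
    pairs = [(p, t) for p, t in pairs if p is not None]
    return sum(abs(p - t) for p, t in pairs) / len(pairs)
```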

LM's Chance to Improve. LMs can potentially enhance the accuracy of molecular property prediction by leveraging the rich contextual knowledge found in chemistry literature. For instance, the pretraining corpus of 3D-MoLM contains descriptions about hydrophobicity (LogP) and solubility (TPSA). While Uni-Mol excels at predicting molecular properties by interpreting 3D geometric data, it cannot utilize textual descriptions of chemical properties, which are abundant in chemistry literature. By aligning the representations from Uni-Mol with an LM, the LM can leverage both the geometric knowledge from Uni-Mol and the rich contextual knowledge found in chemical literature. This dual-source knowledge utilization can enhance the prediction of molecular properties.

As per your suggestion, we have included Uni-Mol as a baseline in the revised Table 4c. However, we would like to respectfully highlight the versatility of 3D-MoLM: it extends beyond computed property prediction and can perform captioning (Table 3) and open-text QA (Table 4a), which Uni-Mol cannot. This versatility is a significant advantage of our model, demonstrating the potential of multimodal learning in this domain.

Q3: 3D conformations from PubChem and RDKit. The PubChem database provides a 3D conformation for each molecule. While I am not sure what method was employed to obtain the 3D structures in PubChem, it might be worthwhile to consider leveraging these data instead of MMFF-relaxed structures.

Response: Thank you for your suggestion regarding the use of PubChem's 3D conformations. It is important to note that both PubChem [4] and RDKit [5] use MMFF to obtain 3D conformations, although their MMFF implementations differ in subtle ways. To account for this difference, we have added PubChem's 3D conformations to our dataset as an alternative to the RDKit-generated ones, and we plan to make both available online. The statistics of the PubChem3D conformations are shown below.

Table: The coverage of PubChem3D's conformations on our PubChem dataset.

| Subset | Size | PubChem3D Cover Ratio (%) | RDKit-generated Cover Ratio (%) |
|---|---|---|---|
| Pretrain | 309,689 | 67.2 | 97.4 |
| Train | 12,000 | 61.8 | 100.0 |
| Valid | 1,000 | 63.3 | 100.0 |
| Test | 2,000 | 64.3 | 100.0 |

We can observe that PubChem provides 324,689 molecules with textual descriptions in total, yet only about two thirds of them are paired with 3D information. The RDKit-generated conformations cover 100% of the downstream molecules because our pre-processing for the downstream sets filtered out molecules that fail RDKit's MMFF algorithm (see the sketch below).
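
For reference, a typical RDKit pipeline for producing an MMFF-relaxed conformation looks like the sketch below (standard API calls; the authors' exact settings may differ). Molecules for which embedding or optimization fails are the ones filtered out of the downstream sets.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

def mmff_conformer(smiles: str):
    """Generate one 3D conformation and relax it with MMFF.
    Returns None when any step fails."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None                       # unparsable SMILES
    mol = Chem.AddHs(mol)                 # MMFF needs explicit hydrogens
    if AllChem.EmbedMolecule(mol, randomSeed=42) < 0:
        return None                       # 3D embedding failed
    if AllChem.MMFFOptimizeMolecule(mol) != 0:
        return None                       # relaxation failed or did not converge
    return mol
```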

Reference:

[1] Inferring experimental procedures from text-based representations of chemical reactions. In Nature Communications 2021.

[2] Multi-modal Molecule Structure-text Model for Text-based Retrieval and Editing. In Arxiv 2023.

[3] ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts. In ICML 2023

[4] https://pubchem.ncbi.nlm.nih.gov/docs/pubchem3d

[5] Bringing the MMFF force field to the RDKit: implementation and validation. In Journal of Cheminformatics 2014

Comment

After carefully reading the response from the authors, I generally agree with their perspective.

Nonetheless, the proposed method and dataset do not seem to be capable of dealing with the scientific tasks mentioned in the response at the moment. While I acknowledge that this study is a first step towards the development of a multi-modal scientific language model, I think it is important that the authors specify the limitations of their work. This will not only clarify the contribution of this work, but also provide a guide for future research directions in this area.

Considering these points, I will maintain my initial rating.

Comment

Q4: After carefully reading the response from the authors, I generally agree with their perspective.

Nonetheless, the proposed method and dataset do not seem to be capable of dealing with the scientific tasks mentioned in the response at the moment. While I acknowledge that this study is a first step towards the development of a multi-modal scientific language model, I think it is important that the authors specify the limitations of their work. This will not only clarify the contribution of this work, but also provide a guide for future research directions in this area.

Considering these points, I will maintain my initial rating.

Response: Thanks for your meticulous review and insightful feedback. In response, we have revised our submission and prepared the responses below:

Regarding the capability of solving multi-modal scientific tasks.

  • First, we want to highlight that 3D-MoLM has shown promising results for some (although not all) of the multi-modal scientific tasks mentioned in our previous response. Specifically, 3D-MoLM achieves significant and consistent improvements on molecule-text retrieval (Table 3) and descriptive molecular property prediction (Table 4a); it achieves performance comparable to Uni-Mol on computed molecular property prediction (Table 4c).

  • Second, we acknowledge the limitations of the existing model and dataset, and propose directions for future improvement:

    • More diverse and high-quality datasets. As you might have noticed, while some tasks are interesting and valuable (e.g., experimental procedure prediction [1]), they lacked publicly available datasets (as of our submission time) to support open-source studies. Additionally, existing tasks like molecule captioning and molecule-text retrieval can be further developed by introducing new datasets that focus on more complex and novel chemical properties (e.g., molecular spectra).

    • More fine-grained alignment between 3D molecules and texts. In Appendix D, we have included failure case studies showing 3D-MoLM's limitations in discerning fine-grained small molecular structures. Overcoming this issue is crucial for tasks that demand a precise understanding of 3D molecular structures. One potential direction is to curate a new 3D molecule-text dataset with explicit references to 3D coordinates within the textual descriptions. Another is to explore more powerful 3D molecular encoders and 3D molecule-text projectors.

Regarding the limitation section. In response to your suggestion, we have revised and expanded the limitation section in Appendix G (page 18). This revised section now includes discussion about expanding to more molecule-text modeling tasks and fine-grained 3D molecule-text alignment. We also have a section discussing 3D-MoLM's failure cases in Appendix D, suggesting the exploration of better architectures and datasets for a more fine-grained alignment between 3D molecular geometries and textual concepts.

Your insights have significantly contributed to enhancing the completeness of our work. We hope our explanations and revisions resolve your concerns about our submission. Should you have any other concerns, we are more than happy to continue the discussion on OpenReview.

Reference:

[1] Inferring experimental procedures from text-based representations of chemical reactions. In Nature Communications 2021.

Review (Rating: 6)

This paper focuses on 3D molecule-text interpretation and proposes 3D-MoLM: 3D-Molecular Language Modeling. Specifically, 3D-MoLM enables an LM to interpret and analyze 3D molecules by equipping it with a 3D molecular encoder. This integration is achieved by a 3D molecule-text projector that bridges the 3D molecular encoder's representation space and the LM's input space. Overall, this is an interesting work.

Strengths

This paper is well organized and well written. The proposed 3D-MoLM enables an LM to interpret and analyze 3D molecules by equipping it with a 3D molecular encoder, which is interesting and motivating.

Weaknesses

The analysis of the proposed 3D-MoLM is insufficient. The proposed method is interesting and straightforward; could any insight or analysis be offered to give readers a better understanding?

Questions

Refer to the Weaknesses section.

Comment

We sincerely thank you for your valuable comments and positive feedback! They have helped us to enrich the analysis of 3D-MoLM for a better understanding. To address your concern, we present the following response.

Q1: More insight and analysis. The analysis of the proposed 3D-MoLM is insufficient. The proposed method is interesting and straightforward; could any insight or analysis be offered to give readers a better understanding?

Response: Thank you for your constructive feedback. We agree that an in-depth analysis can help readers better understand 3D-MoLM's performances, and identify potential directions for future research and improvement.

Firstly, we have updated our paper to include a comparison between the specialist 3D-MoLMs, which are fine-tuned for each task individually, and the generalist 3D-MoLM, which is a unified model trained on all tasks and evaluated with a consistent checkpoint. We summarize their performance in the tables below:

Table 1: One checkpoint for captioning, open-text QA, and computed property QA.

Table 1a: Molecule captioning performance.

| Model | BLEU-2 | BLEU-4 | ROUGE-1 | ROUGE-2 | ROUGE-L | METEOR |
|---|---|---|---|---|---|---|
| Specialist | 30.32 | 22.52 | 36.84 | 22.32 | 31.23 | 33.06 |
| Generalist | 29.25 | 22.07 | 36.48 | 21.80 | 30.95 | 33.12 |

Table 1b: Descriptive open-text QA performance.

| Model | BLEU-2 | BLEU-4 | ROUGE-1 | ROUGE-2 | ROUGE-L | METEOR |
|---|---|---|---|---|---|---|
| Specialist | 32.00 | 26.13 | 40.13 | 25.55 | 34.64 | 52.15 |
| Generalist | 31.61 | 25.88 | 39.93 | 25.68 | 34.78 | 51.74 |

Table 1c: Computed property QA performance (lower is better).

| Model | Weight | LogP | TPSA | Complexity | HOMO | LUMO | H-L Gap | SCF |
|---|---|---|---|---|---|---|---|---|
| Specialist | 14.79 | 0.66 | 9.71 | 44.85 | 0.26 | 0.25 | 0.28 | 0.35 |
| Generalist | 16.58 | 0.78 | 10.90 | 45.49 | 0.35 | 0.36 | 0.32 | 0.38 |

The results show that while the generalist model slightly underperforms in comparison to the specialist models, it still exhibits a performance gain over other baselines. This underscores its versatility and capability to effectively handle multiple tasks. For more detailed information, please refer to the performance tables in the revised paper (Table 3 and Table 4 in the revised paper) and the training details in Appendix C.

Secondly, we have revised our draft to include failure case studies in Appendix D, serving as a preliminary examination of 3D-MoLM's limitations. We identify two main limitations:

  • 3D-MoLM's ability to discern fine-grained small molecular structures can be improved. In the revised Table 6, we show an example where 3D-MoLM confuses Globostellatic acid C with Globostellatic acid B; the only difference between them is the position of a methoxy group.

  • We show that 3D-MoLM, and likewise GPT-4, cannot accurately count the number of atoms and generate the correct chemical formula.

We are committed to continuously refining 3D-MoLM to address these issues. For a more detailed analysis and failure case study, please refer to Appendix D in the revised paper. We hope this analysis provides a more nuanced understanding of 3D-MoLM's capabilities and prompts future enhancement.

Comment

We are grateful to all the reviewers for their insightful comments and suggestions. Their reviews have significantly helped us improve this submission. Here, we summarize the major updates incorporated into the revised manuscript:

  • Provision of a Generalist Model for All Generation Tasks [Reviewer Zm2K, HgTh]: In response to the concern raised by Reviewer Zm2K and to substantiate our claim of versatility, we have trained a generalist model. This unified model is trained on all generation tasks and evaluated using a consistent checkpoint. For more information, please refer to the performance tables (Table 3 & Table 4) in the revised paper and the training details in Appendix C.
  • Inclusion of Uni-Mol as a Baseline [Reviewer mT4V]: To address the concern of Reviewer mT4V and to demonstrate the advantage of multimodal learning, we have included Uni-Mol as our baseline in Table 4. We have also provided an analysis of the language model's potential to improve prediction accuracy.
  • Failure Case Study [Reviewer 3SWU, HgTh]: In response to the concern of Reviewer 3SWU and to aid reviewers in understanding our model's performance and limitations, we have revised our submission to include two failure case studies in Appendix D.
  • Revised Paper and Clarifications on the Training Stages [Reviewer Zm2K]: We have revised Figures 2 & 3 and the methodology section to clarify the training stages.

We have made every effort to address the main concerns raised by the reviewers, and we hope that these improvements will be taken into consideration. Updates in the revision are highlighted in orange. We also present point-to-point responses for each reviewer in the sections that follow.

AC Meta-Review

The paper has received three borderline accepts and one borderline reject.

After considering the rebuttal, reviewer Zm2K, who gave the lowest score, is willing to increase the score given the clarifications in the rebuttal, but is still not convinced of the novelty of this work. The AC concurs that the components of this work are similar to BLIP-2, which was designed for text-image multi-modal modeling. Nevertheless, all reviewers agree that a strength of this paper is its novel framework for molecule-text modeling. Reviewer mT4V comments that this paper is well presented at the right time for molecule-text modeling. Reviewer mT4V also raised a concern about the applications of the proposed model; after considering the rebuttal, mT4V generally agreed with the authors' response.

After careful consideration of the reviews and rebuttal, the AC concurs with the reviewers' assessment that this paper proposes a solid framework for advancing the field of molecule-text modeling, while the technical components are mostly inspired by existing work in text-image multi-modal modeling. Overall, the AC does not find that the limited technical novelty overturns the recommendations for acceptance from the majority of reviewers, and recommends accepting the paper.

Why Not a Higher Score

While the paper demonstrates a novel framework for molecule-text modeling, the components are mostly inspired by text-image multi-modal methods. Therefore, the AC recommends accepting the paper as a poster.

Why Not a Lower Score

The majority of reviewers recommend accepting the paper because the overall framework effectively solves downstream tasks in the text-molecule domain. The AC does not find the novelty concern sufficient to overturn the majority recommendation.

Final Decision

Accept (poster)