UniMoT: Unified Molecule-Text Language Model with Discrete Token Representation
Abstract
Reviews and Discussion
The authors introduce UniMoT, a Unified Molecule-Text LLM adopting a tokenizer-based architecture that expands the vocabulary of an LLM with molecule tokens. Specifically, they introduce a Vector Quantization-driven tokenizer that incorporates a Q-Former to bridge the modality gap between molecule and text. This tokenizer transforms molecules into sequences of molecule tokens with causal dependency, encapsulating high-level molecular and textual information. Equipped with this tokenizer, UniMoT can unify molecule and text modalities under a shared token representation and an autoregressive training paradigm: it can interpret molecules as a foreign language and generate them as text. Following a four-stage training scheme, UniMoT emerges as a multimodal generalist capable of performing both molecule-to-text and text-to-molecule tasks.
Strengths
The background and motivation for utilizing LLMs for the learning process are well described, with relevant references.
Several informative experimental tasks are evaluated, including molecular property prediction and molecule captioning. The ablation studies are also informative across the different generation tasks.
Weaknesses
There is not much information about how the design choices of the authors' model represent the natural graph structure of the molecule itself (e.g., atom and bond structure). The following is a recent survey paper missing from the cited work that the authors should include. The LLM itself is known not to capture graph structure as comprehensively.
[IJCAI 2023] Graph-based molecular representation learning. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence (IJCAI '23). Article 744, 6638–6646. https://doi.org/10.24963/ijcai.2023/744
The experiment benchmarks need to include more GNN-based representation models, and the experimental results do not seem conclusive for the molecular property prediction task, perhaps indicating that more explicit learning of graph topological structure is needed.
Questions
See above sections for details.
Thank you for your feedback and detailed review of our work. We appreciate the opportunity to address the key points raised.
W1. There is not much information about how the design choice of the author’s model is representing the natural graph molecular structure of the molecule itself (e.g., atom and bond structure). This is a recent survey paper that is missing in the cited work that the authors should include. The LLM itself is known to not as comprehensively capture graph structure.
Thank you for bringing this excellent work to our attention; we have included it in the related work section.
While LLMs are not inherently well-suited for capturing molecular graph structures, our approach addresses this limitation by utilizing molecule tokens. These tokens embed high-level molecular and textual information and are quantized from causal queries by the tokenizer. During reconstruction, the tokenizer decodes these molecule tokens back into molecular structures, demonstrating that the tokens retain all structural information from the original molecules. We maintain a molecule token vocabulary, with each token uniquely representing structural information from the molecule. These molecule tokens are aligned with text tokens during the molecule-to-text and text-to-molecule autoregressive pretraining.
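As an illustrative sketch of the nearest-codebook quantization step described above (tensor shapes and variable names are assumptions; the full tokenizer also involves the Causal Q-Former and the adapter described in the paper):

```python
import torch

def quantize(queries: torch.Tensor, codebook: torch.Tensor):
    """Map each causal query to its nearest codebook entry.

    queries:  (num_queries, dim)  outputs of the Causal Q-Former
    codebook: (vocab_size, dim)   learnable molecule-token embeddings
    Returns discrete token ids and the quantized embeddings fed to the LLM.
    """
    # Euclidean distance between every query and every codebook entry
    dists = torch.cdist(queries, codebook, p=2)      # (num_queries, vocab_size)
    token_ids = dists.argmin(dim=-1)                 # discrete molecule tokens
    quantized = codebook[token_ids]                  # (num_queries, dim)
    # Straight-through estimator so gradients flow back to the queries
    quantized = queries + (quantized - queries).detach()
    return token_ids, quantized
```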
W2. The experiment benchmarks need to include more GNN based representation models, and experiment results of the model do not seem conclusive for the molecular property prediction task, perhaps indicating that there needs to be more explicit learning of graph topological structure.
We have added more baselines for the molecular property prediction task in Table 15 of Section G. The GNN-based representation models now include EdgePred, AttrMask, InfoGraph, MolCLR, GraphMVP, GraphCL, and Mole-BERT. It is important to note that GNNs are limited to molecular property prediction tasks and cannot handle molecule captioning or molecule generation tasks, as these require the generation of text or molecule sequences. Additionally, GNNs typically struggle with molecule-text retrieval tasks, as they lack the ability to align molecular representations with corresponding text captions.
We acknowledge that UniMoT does not yield state-of-the-art results across all datasets on the molecular property prediction task. This is due to the unavoidable quantization process in Discrete-In-Discrete-Out models, which can introduce minor information loss in discrete tokens during quantization. Consequently, some performance on comprehension tasks is sacrificed. However, adopting discrete token representation is essential to support LLM autoregressive training, allowing us to unify comprehension and generation tasks within a single model. Achieving this unification requires a trade-off on comprehension task performance.
Dear Reviewer yMbG,
Thank you for your time and effort in reviewing our manuscript. We have carefully addressed the points you raised and provided detailed responses. We would greatly appreciate your feedback on whether our clarifications have adequately addressed your concerns. Please let us know if further clarification is needed.
Best regards,
Authors
Thank you for your detailed responses. After reviewing all the feedback, I have decided to keep the score the same since the model is not so novel, and the performance over competitive baselines is not that significant.
Thank you for your feedback and for evaluating our work.
- In our General Response, we have elaborated on the novelty of UniMoT, including how the tokenizer-based architecture unifies the molecule and text modalities, and we have demonstrated significant performance improvements in molecule generation tasks.
- The trade-off in molecule comprehension tasks is necessary to support a unified comprehension and generation framework, which is one of UniMoT’s key contributions.
To improve our work more effectively, we would greatly appreciate it if you could provide more specific details about how you perceive the novelty as insufficient within the molecule-text domain. We are more than willing to address your concerns and enhance our paper based on your detailed feedback.
The paper presents UniMoT, a Unified Molecule-Text Large Language Model that addresses limitations in existing molecular LLMs by employing a tokenizer-based architecture that treats molecule and text modalities equally. By introducing a Vector Quantization-driven tokenizer, UniMoT effectively bridges the gap between these modalities, enabling it to interpret and generate molecules as text. Extensive experiments show that UniMoT achieves state-of-the-art performance in various molecular comprehension and generation tasks.
Strengths
- The paper proposes a novel methodology for chemical Large Language Model (LLM) pretraining which addresses multiple problems related to cross-modal text and molecule pretraining, including long sequence lengths and the lack of left-to-right causal dependency in molecule features.
- Extensive evaluation on MoleculeNet highlights the effectiveness of the proposed pre-training technique on non-generative molecular property prediction tasks.
- The experimental results on molecular retrieval and generation tasks show that the proposed UniMoT model performs on par with or better than the selected baselines.
- The methodological decisions are experimentally supported with a thorough ablation study.
Weaknesses
- From the paper, it is unclear if the improvement over baseline methods is statistically significant.
- On the molecule generation task, UniMoT is mostly compared against general-purpose LLMs which were not specifically designed for chemistry-related tasks. A comparison against more domain-specific methods, such as nach0, GIT-MOL, and Text+Chem T5, could strengthen the paper's claims.
- Lack of comparison with chemical language models (nach0, GIT-MOL, Text+Chem T5, and MolT5, as well as encoder-only ChemBERTa and PubChemDeBERTa) on the MoleculeNet benchmark and the molecule-text retrieval task.
- Lack of ablation study for training objectives. For instance, Equation 2 has 3 terms, what would happen if you remove some of them? How much would the performance drop if the codebook is removed and the model is forced to generate SMILES directly?
Questions
Suggestions:
- Additional MoleculeNet experiments with encoder-only chemical language models, such as ChemBERTa [1] and PubChemDeBERTa [2], would show whether LLMs are needed for simple regression/classification tasks in the chemical domain.
- Provide statistical significance tests' results to explore whether the performance gain over baseline models is statistically significant.
- Why are MolT5, MoMu, MolCA absent from Table 1? For me, it makes sense to compare UniMoT with them on MoleculeNet since they are also chemical language models.
- Add more experiments with other recent state-of-the-art chemical language models: GIT-MOL [3], nach0 [4], and Text+Chem T5 [5]. nach0 and GIT-MOL also report MoleculeNet results as well as performance on molecule generation and molecule captioning.
- Why do you experiment with SMILES only and not adopt SELFIES?
- Line 046: for a reader not familiar with the domain, it would be good to have a brief description of what SMILES is and to provide some examples.
- As far as I know, MoleculeNet mostly covers small datasets. How stable are results across different runs?
Typos:
- Line 159: is not introduced earlier.
References:
[1] Chithrananda, Seyone, Gabriel Grand, and Bharath Ramsundar. "ChemBERTa: Large-scale self-supervised pretraining for molecular property prediction." arXiv preprint arXiv:2010.09885 (2020).
[2] Schuh, Maximilian G., Davide Boldini, and Stephan A. Sieber. "TwinBooster: Synergising Large Language Models with Barlow Twins and Gradient Boosting for Enhanced Molecular Property Prediction." arXiv preprint arXiv:2401.04478 (2024).
[3] Liu, Pengfei, et al. "GIT-Mol: A multi-modal large language model for molecular science with graph, image, and text." Computers in Biology and Medicine 171 (2024): 108073.
[4] Livne, Micha, et al. "nach0: Multimodal natural and chemical languages foundation model." Chemical Science 15.22 (2024): 8380-8389.
[5] Christofidellis, Dimitrios, et al. "Unifying molecular and textual representations via multi-task language modelling." International Conference on Machine Learning. PMLR, 2023.
Q3. Why are MolT5, MoMu, MolCA absent from Table 1?
MolT5 does not report performance on the molecular property prediction task. We have included the performance of MoMu and MolCA on this task in Table 15 of Section G.
Q5. Why don't you adopt SELFIES and experiment with SMILES only?
SELFIES, as an alternate molecular string representation, is indeed a viable option and could be used in place of SMILES within our approach. Adopting SELFIES would not fundamentally change the framework. Exploring the incorporation of SELFIES could be an interesting direction for future work to assess its impact on tasks involving molecule-text alignment and generation.
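For readers unfamiliar with the two string representations, a small example of converting between them with the open-source `selfies` package (the molecule here, aspirin, is chosen arbitrarily for illustration):

```python
import selfies as sf

smiles = "CC(=O)Oc1ccccc1C(=O)O"      # aspirin, written as a SMILES string
selfies_str = sf.encoder(smiles)      # SELFIES token string
roundtrip = sf.decoder(selfies_str)   # decode back to a SMILES string

print(selfies_str)
print(roundtrip)
```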
Q7. As far as I know, MoleculeNet mostly covers small datasets. How stable are results across different runs?
We present the results of UniMoT across 3 runs on the MoleculeNet datasets below. The small variations observed are due to the inherent randomness of LLM generation, which primarily arises from the random sampling strategies used during decoding.
| Model | BBBP↑ | Tox21↑ | ToxCast↑ | Sider↑ | ClinTox↑ | MUV↑ | HIV↑ | BACE↑ |
|---|---|---|---|---|---|---|---|---|
| UniMoT (Llama-2-7B) | 71.37 ± 0.95 | 76.43 ± 0.62 | 65.78 ± 0.78 | 59.79 ± 1.28 | 92.89 ± 1.62 | 75.97 ± 0.84 | 78.49 ± 0.49 | 83.69 ± 0.87 |
Q6 & Typos.
Thank you for your suggestion! We have incorporated these revisions into our manuscript.
Thank you for your feedback and detailed review of our work. We appreciate the opportunity to address the key points raised.
W1 & Q2. It is unclear if the improvement over baseline methods is statistically significant.
We conducted Wilcoxon Signed-Rank Tests to evaluate the statistical significance of UniMoT's improvements. The Wilcoxon Signed-Rank Test does not assume that the data follow a normal distribution, making it more robust for the relatively small sample size and potentially non-Gaussian distributions. The results show statistically significant improvements in the molecule captioning task (Table 2) and molecule generation tasks (Table 4), while the improvements in the molecular property prediction task (Table 1) and molecule-text retrieval task (Table 3) are not statistically significant.
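For reference, a minimal sketch of how such a paired test can be run with SciPy's `wilcoxon` function; the paired score arrays below are placeholder values, not our actual per-dataset results:

```python
from scipy.stats import wilcoxon

# Hypothetical paired metric values (one entry per dataset/metric) for
# UniMoT and the strongest baseline; replace with the real numbers.
unimot_scores   = [0.599, 0.461, 0.474, 0.578, 0.511]
baseline_scores = [0.578, 0.450, 0.459, 0.566, 0.508]

stat, p_value = wilcoxon(unimot_scores, baseline_scores, alternative="greater")
print(f"W={stat:.3f}, p={p_value:.4f}")  # significant if p is below the chosen alpha
```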
We acknowledge that UniMoT does not yield state-of-the-art results across all molecule comprehension tasks. This is due to the unavoidable quantization process in Discrete-In-Discrete-Out models, which can introduce minor information loss in discrete tokens during quantization. Consequently, some performance on comprehension tasks is sacrificed. However, adopting discrete token representation is essential to support LLM autoregressive training, allowing us to unify comprehension and generation tasks within a single model. Achieving this unification requires a trade-off on comprehension task performance.
W2 & W3 & Q4. The comparison against more domain-specific methods on molecule tasks, such as nach0, GIT-MOL, and Text+Chem T5, could strengthen the paper's claims.
We have included GIT-MOL for the molecular property prediction task and Text+Chem T5 for the molecule captioning task in Section G. For molecule generation tasks, GIT-MOL and Text+Chem T5 use the ChEBI-20 dataset, while we use the PubChem dataset, making a direct comparison infeasible. Additionally, nach0 employs different evaluation metrics than those used in our paper, so we have opted not to include this baseline. We report the baseline results as stated in their respective papers in Section G. Given the limited time during the rebuttal period, we were unfortunately unable to rerun these baselines on new tasks.
W3 & Q1. Additional MoleculeNet experiments with encoder-only chemical language models, such as ChemBERTa [1] and PubChemDeBERTa [2].
We were only able to find the performance of ChemBERTa on the molecular property prediction task, and we have included this in Section G. While ChemBERTa and PubChemDeBERTa are strong encoder-only models, they are not suitable for UniMoT. UniMoT is designed to handle both comprehension and generation tasks, which require an autoregressive paradigm that decoder-only models like LLaMA provide. Since ChemBERTa and PubChemDeBERTa do not support autoregressive generation, they cannot replace LLaMA in UniMoT. In Table 5(b), we compared multiple LLM architectures, including Galactica-125M, Galactica-1.3B, Mistral-7B, Llama-2-7B, and Llama-2-13B.
W4.1. Lack of ablation study for training objectives. For instance, Equation 2 has 3 terms, what would happen if you remove some of them?
We cannot remove any term in Equation 2, as doing so would prevent the training process from converging. The first term, the alignment loss, ensures that the molecule tokens (after passing through the adapter) align closely with their corresponding SMILES embeddings. The second and third terms are standard components of VQ-VAE: the second term is a codebook loss that updates the codebook embeddings, while the third term is a commitment loss that encourages the queries to remain close to the selected codebook embeddings. Each term plays a critical role in ensuring the stability of the training process.
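For clarity, Equation 2 presumably takes the familiar VQ-VAE form with an added alignment term; a schematic reconstruction based on the description above (the symbols below are our own shorthand, not necessarily the paper's notation) is:

```latex
\mathcal{L}_{\text{tok}} =
\underbrace{\bigl\| g_{\theta}(z_q) - e_{\text{SMILES}} \bigr\|_2^2}_{\text{alignment}}
+ \underbrace{\bigl\| \mathrm{sg}[z] - e \bigr\|_2^2}_{\text{codebook}}
+ \beta \, \underbrace{\bigl\| z - \mathrm{sg}[e] \bigr\|_2^2}_{\text{commitment}}
```

Here z denotes the causal queries, e the selected codebook embeddings, z_q the quantized molecule tokens, g_θ the adapter, sg[·] the stop-gradient operator, and β the commitment weight.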
W4.2. How much would the performance drop if the codebook is removed and the model is forced to generate SMILES directly?
We conducted an experiment on the Caption-guided Molecule Generation task using the PubChem dataset. By removing the quantization process and turning our model into an adapter-based architecture that directly generates SMILES strings, we observed a significant performance drop, as shown in the table below. This demonstrates that the tokenizer-based architecture is more effective than the adapter-based architecture, particularly in generation tasks.
| Model | Exact↑ | BLEU↑ | Levenshtein↓ | RDK FTS↑ | MACCS FTS↑ | Morgan FTS↑ | Validity↑ |
|---|---|---|---|---|---|---|---|
| UniMoT (w/o tokenizer) | 0.001 | 0.017 | 55.437 | 0.014 | 0.021 | 0.009 | 0.006 |
| UniMoT (w/ tokenizer) | 0.237 | 0.698 | 27.782 | 0.543 | 0.651 | 0.411 | 1.000 |
Dear Reviewer Xqcq,
Thank you for your time and effort in reviewing our manuscript. We have carefully addressed the points you raised and provided detailed responses. We would greatly appreciate your feedback on whether our clarifications have adequately addressed your concerns. Please let us know if further clarification is needed.
Best regards,
Authors
Dear authors,
I have read your clarifications carefully and decided to keep my initial assessment.
Sincerely, reviewer
Dear Reviewer Xqcq,
Thank you for taking the time to carefully review our work and for your thoughtful feedback. We sincerely appreciate your support for our work.
Best regards,
Authors
The paper presents UniMoT, a unified molecule-text language model that leverages discrete token representations to integrate molecular and textual data. UniMoT uses a tokenizer-driven framework, where a VQ-based tokenizer, combined with a causal Q-Former, encodes molecules as discrete tokens. These discrete tokens are aligned with text tokens in a shared vocabulary, enabling UniMoT to operate in an autoregressive training paradigm across both modalities.
Strengths
- The VQ-driven tokenizer for molecules enables unified modeling of molecules and text in a language model.
- UniMoT shows impressive state-of-the-art results across multiple understanding and generation tasks.
Weaknesses
- The core components, such as the Q-Former and VQ-based tokenizer, appear to be direct adaptations from CV, which may limit the paper’s originality within the molecular modeling domain.
- The discrete tokenization approach could lead to information loss, and the model’s effectiveness over directly training with combined molecule SMILES and text is not convincingly demonstrated.
- The multi-stage training process is complex.
- The evaluation on the Mol-Instructions benchmark lacks some important baselines, like DRAK [1].
References
[1] Liu, Jinzhe et al. “DRAK: Unlocking Molecular Insights with Domain-Specific Retrieval-Augmented Knowledge in LLMs.” ArXiv abs/2406.18535 (2024).
Questions
- Could the causal dependency in learnable queries negatively impact molecular understanding compared to bidirectional or other full-information modeling approaches?
- Are the models for each downstream task trained separately or together? The authors should provide a more detailed introduction.
- More details on the adapter and SMILES decoder would strengthen clarity, particularly regarding their roles in aligning molecular representations and decoding.
Q3. More details on the adapter and SMILES decoder would strengthen clarity, particularly regarding their roles in aligning molecular representations and decoding.
To enable molecule decoding, we train an MLP adapter (or another Q-Former) to align the latent space of molecule tokens with the latent space of a molecular generative model. We employ the pretrained ChemFormer as the generative model, utilizing its SMILES encoder and SMILES decoder components. The alignment is achieved by minimizing the distance between the molecule tokens (after passing through the adapter) and the corresponding SMILES embeddings. We achieve this by optimizing the adapter’s parameters using the MSE loss. Once aligned, the molecule tokens processed by the adapter can be effectively decoded by the SMILES decoder.
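A minimal PyTorch sketch of this alignment step, with hypothetical dimensions and module names (`MoleculeAdapter`, `token_dim`, and `smiles_dim` are illustrative; in the actual model the target embeddings come from the frozen, pretrained ChemFormer encoder):

```python
import torch
import torch.nn as nn

class MoleculeAdapter(nn.Module):
    """MLP that maps molecule tokens into the SMILES-embedding space."""
    def __init__(self, token_dim: int = 768, smiles_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(token_dim, 1024), nn.GELU(), nn.Linear(1024, smiles_dim)
        )

    def forward(self, mol_tokens: torch.Tensor) -> torch.Tensor:
        return self.net(mol_tokens)

adapter = MoleculeAdapter()
mse = nn.MSELoss()

# mol_tokens:        (batch, num_tokens, token_dim)  quantized molecule tokens
# smiles_embeddings: (batch, num_tokens, smiles_dim) from the frozen SMILES encoder
mol_tokens = torch.randn(4, 8, 768)
smiles_embeddings = torch.randn(4, 8, 512)

loss = mse(adapter(mol_tokens), smiles_embeddings)  # only the adapter is trained
loss.backward()
```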
W2.2. The model’s effectiveness over directly training with combined molecule SMILES and text is not convincingly demonstrated.
- For molecule comprehension tasks, molecular features are essential for aligning molecule and text modalities. Directly training with SMILES and text significantly harms the performance of comprehension tasks.
- For molecule generation tasks:
  - Adapter-based architectures require the LLM to directly output SMILES strings for molecule generation. This approach heavily depends on strong alignment between SMILES strings and molecule captions during pretraining. In practice, achieving such alignment is challenging, leading to suboptimal performance of adapter-based architectures in text-to-molecule generation tasks, as shown in Table 4.
  - Tokenizer-based architectures like UniMoT leverage a tokenizer to convert molecular features and text captions into molecule tokens. These tokens encapsulate high-level molecular and textual information, offering a richer representation compared to SMILES strings alone. By linking molecule tokens to molecule captions during molecule-to-text and text-to-molecule pretraining, UniMoT enables the autoregressive generation of molecule tokens.
- We conducted an experiment on the Caption-guided Molecule Generation task using the PubChem dataset. By removing the quantization process and turning our model into an adapter-based architecture that directly generates SMILES strings, we observed a significant performance drop, as shown in the table below. This demonstrates that the tokenizer-based architecture is more effective than the adapter-based architecture, particularly in generation tasks.
| Model | Exact↑ | BLEU↑ | Levenshtein↓ | RDK FTS↑ | MACCS FTS↑ | Morgan FTS↑ | Validity↑ |
|---|---|---|---|---|---|---|---|
| UniMoT (w/o tokenizer) | 0.001 | 0.017 | 55.437 | 0.014 | 0.021 | 0.009 | 0.006 |
| UniMoT (w/ tokenizer) | 0.237 | 0.698 | 27.782 | 0.543 | 0.651 | 0.411 | 1.000 |
W3. The multi-stage training process is complex.
Our training strategy follows a common training recipe in multimodal LLMs: tokenizer pretraining, multimodal pretraining, and instruction tuning. Due to our use of a Causal Q-Former, we introduce a pretraining phase to optimize this module before tokenizer pretraining. Directly pretraining the Causal Q-Former together with the VQ tokenizer can significantly impair the model's capacity and task performance. Unlike in text-image LLMs, where pretrained image tokenizers are available, to the best of our knowledge, there are no off-the-shelf molecule tokenizers for LLMs. Thus, we need to train the tokenizer from scratch.
W4. The evaluation on the Mol-Instructions benchmark lacks some important baselines, like DRAK.
Thank you for your suggestion. We have included the Mol-Instructions benchmark results with DRAK in Table 18 of Section G.
Q1. Could the causal dependency in learnable queries negatively impact molecular understanding compared to bidirectional or other full-information modeling approaches?
No. We have compared the performance of a Q-Former with bidirectional self-attention to a Causal Q-Former with causal self-attention in the second and third rows of Table 5(a). The results indicate that queries with causal dependency outperform those with bidirectional dependency. This demonstrates that inputs with left-to-right causal dependency align better with the unidirectional attention mechanism in LLMs, resulting in improved performance.
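To make the distinction concrete, below is a small, generic sketch of the two attention masks (an illustration only, not the paper's exact implementation):

```python
import torch

num_queries = 8

# Bidirectional Q-Former: every learnable query may attend to every other query.
bidirectional_mask = torch.ones(num_queries, num_queries).bool()

# Causal Q-Former: query i may only attend to queries 0..i, giving the
# left-to-right dependency that matches the LLM's unidirectional attention.
causal_mask = torch.tril(torch.ones(num_queries, num_queries)).bool()

print(causal_mask.int())
```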
Q2. Are the models for each downstream task trained separately or together?
- During Stage 1, we pretrain the Q-Former using the objective defined in Equation (7).
- During Stage 2, we pretrain the tokenizer using the objective defined in Equation (2).
- During Stage 3, we perform molecule-to-text and text-to-molecule autoregressive pretraining using the objective defined in Equation (3). These pretraining tasks are conducted sequentially.
- During Stage 4, we perform instruction tuning across seven tasks: molecular property prediction, molecule captioning, molecule-text retrieval, caption-guided molecule generation, reagent prediction, forward reaction prediction, and retrosynthesis. For each task, we train a separate LoRA adapter, with each adapter trained independently using task-specific data and loss functions tailored to the objectives of the respective task. This enhances the efficiency and versatility of UniMoT while ensuring high performance across all tasks.
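As an illustration of what one such per-task adapter could look like, here is a hedged sketch using the Hugging Face `peft` library; the rank, target modules, and other hyperparameters are assumptions for illustration, not our reported settings:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Gated checkpoint; requires access approval on the Hugging Face Hub.
base_llm = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# One LoRA adapter per downstream task; only the adapter weights are trained.
property_prediction_lora = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_llm, property_prediction_lora)
model.print_trainable_parameters()
```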
Thank you for your feedback and detailed review of our work. We appreciate the opportunity to address the key points raised.
W1. The core components, such as the Q-Former and VQ-based tokenizer, appear to be direct adaptations from CV, which may limit the paper’s originality within the molecular modeling domain.
We argue that, to the best of our knowledge, UniMoT is the first to introduce a tokenizer-based architecture in the molecule-text domain that discretizes molecule features into tokens compatible with LLMs, allowing molecules to be processed alongside text tokens. While certain components, such as the Q-Former and VQ tokenizer, are used in other multimodal LLMs, we intentionally integrate them into a molecule-text framework to achieve our goals, as outlined in our Introduction section.
We emphasize several motivations here:
- Tokenizer and Next-Token Prediction: Our approach is centered on achieving a Discrete-In-Discrete-Out (DIDO) model, where a unified token representation for molecules and text enables next-token prediction. This allows UniMoT to autoregressively generate molecule tokens directly, bypassing SMILES string generation and enhancing molecule generation accuracy.
- Causal Q-Former: Directly discretizing molecule features into tokens leads to lengthy input sequences equivalent to the number of atoms. To address this, we introduced the Q-Former, which compresses the input length by mapping molecule features into a limited number of queries (typically 8-32). Moreover, we found that causal self-attention outperformed bidirectional self-attention for our tasks (see Section 4.3), leading us to employ a Causal Q-Former.
- Reconstruction During Inference: To improve pretraining efficiency, we eliminate the need for direct molecule reconstruction during pretraining. We leverage discrete molecule tokens to conduct supervised fine-tuning during pretraining. During inference, our pretrained adapter and decoder reconstruct the final molecule, which enables accurate molecule generation.

The major difference between the molecule-text domain and the image-text domain is that molecules have canonical representations: SMILES strings. We have incorporated several crucial designs specifically for molecules:

- Tokenizer: We use an adapter to align the latent space of molecule tokens with the latent space of SMILES embeddings for reconstruction.
- Pretraining: To capture the sequential information of molecules for better comprehension, we concatenate the molecule token sequence with the SMILES sequence and a prompt to form the multimodal input sequence during molecule-to-text pretraining (an illustrative sketch follows this list).
- Downstream Tasks: The molecule-text domain includes unique tasks, such as Molecular Property Prediction, Caption-guided Molecule Generation, Reagent Prediction, Forward Reaction Prediction, and Retrosynthesis. Our architecture excels in these specific tasks for molecules.
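As a purely illustrative sketch of the interleaving described in the Pretraining bullet above, a molecule-to-text training example might be assembled as follows; the special-token names, token ids, SMILES string, prompt, and caption are placeholders rather than our exact format:

```python
# Discrete molecule tokens produced by the tokenizer (ids into the extended vocabulary)
mol_token_ids = [32001, 32057, 32190, 32004, 32111, 32008, 32302, 32045]
mol_tokens = "".join(f"<mol_{i}>" for i in mol_token_ids)

smiles = "CCO"  # the molecule's SMILES string, appended for sequential information
prompt = "Could you give me a brief overview of this molecule?"
caption = "Ethanol is a primary alcohol that is ..."

# Multimodal sequence for molecule-to-text autoregressive pretraining
training_text = f"{mol_tokens} SMILES: {smiles} {prompt} {caption}"
print(training_text)
```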
W2.1. The discrete tokenization approach could lead to information loss.
We acknowledge that the unavoidable quantization process in Discrete-In-Discrete-Out models can introduce minor information loss in discrete tokens. Consequently, some performance on comprehension tasks is sacrificed. However, adopting discrete token representation is essential to support LLM autoregressive training, allowing us to unify comprehension and generation tasks within a single model. Achieving this unification requires a trade-off on comprehension task performance.
Dear Reviewer 4vC6,
Thank you for your time and effort in reviewing our manuscript. We have carefully addressed the points you raised and provided detailed responses. We would greatly appreciate your feedback on whether our clarifications have adequately addressed your concerns. Please let us know if further clarification is needed.
Best regards,
Authors
A key limitation in current applications of large language models (LLMs) to molecule-text tasks is the imbalance in processing modalities: text is handled directly by the encoder, while molecules require separate adapters. To address this, the authors propose UniMoT, a model that employs a Vector Quantization (VQ) tokenizer to incorporate molecules into the LLM vocabulary as discrete tokens. This tokenizer works with a Q-Former, which transforms molecular features into causal sequences, allowing the model to train on interleaved molecular and textual data with a single next-word prediction objective. A four-stage training process is then introduced. Experimental results across tasks, including molecular property prediction, molecule captioning, molecule-text retrieval, and molecule generation, show that the proposed method improves the understanding of text and molecules, often outperforming established baselines.
Strengths
- The work is well-motivated and builds on a solid foundation, given existing multimodal studies in text-image tasks. The approach to integrating molecules and text in a unified framework is reasonable and leverages known multimodal strategies.
- The experiments cover multiple task types, showing the effectiveness of the proposed method across various molecule-related tasks. In most cases, the proposed approach outperforms established baselines.
- The paper is clearly presented and well-organized, with figures that illustrate key concepts, making the ideas easy to follow.
Weaknesses
- Technically speaking, the proposed method overlaps with existing studies, such as SEED-LLaMA. While UniMoT targets molecule-text modalities, the frameworks overlap in modules like the discrete tokenizer, causal Q-Former, next-token prediction, and reconstruction during inference. This overlap raises questions about novelty.
- Considering the complexity of the four-stage training process, the performance improvements may not fully justify the added effort. In some tasks (e.g., Table 2), its performance is no better than that of baseline methods and is sometimes worse.
- While the focus is on molecule-text, the paper needs a more comprehensive discussion of and comparison with existing multimodal research, even if primarily in text-image or multi-modality settings, such as AnyGPT and MIO.
Questions
The fourth stage includes prompting and instruction following. Is this stage essential, considering the complexities of prompt generation and response parsing? If so, how does prompt engineering influence the final performance?
W3. The paper needs a more comprehensive discussion and comparison with existing multimodal research.
Thank you for your advice! We have added an Additional Related Work section in Section F. The tokenizer, pretraining, and downstream tasks differ from those in the image-text domain, as discussed in our response to Weakness 1.
Q1. Is the instruction tuning stage essential? If so, how does prompt engineering influence the final performance?
Yes, the instruction tuning stage is crucial for developing the instruction-following capabilities of the language model. This stage ensures that UniMoT can accurately interpret and respond to human instructions, making it versatile and effective for a wide range of molecular tasks.
Below, we present the molecule-text retrieval task performance on the PubChem dataset with and without instruction tuning. As shown, there is a noticeable performance drop when using the checkpoint from Stage-3 (without instruction tuning).
| Checkpoint | In-batch M2T Acc (%) | In-batch M2T R@20 (%) | In-batch T2M Acc (%) | In-batch T2M R@20 (%) | Test-set M2T Acc (%) | Test-set M2T R@20 (%) | Test-set T2M Acc (%) | Test-set T2M R@20 (%) |
|---|---|---|---|---|---|---|---|---|
| Stage-3 ckpt | 88.4 | 99.5 | 86.9 | 99.0 | 51.1 | 92.3 | 53.5 | 93.4 |
| Stage-4 ckpt | 93.6 | 100.0 | 92.7 | 99.4 | 69.8 | 96.3 | 69.5 | 94.4 |
The instruction samples for comprehension and generation tasks are presented in Table 7. Due to limited computational resources, we did not perform extensive prompt engineering. However, simple prompts like “Could you give me a brief overview of this molecule?” have proven effective in achieving strong performance on downstream tasks.
Thank you for your feedback and detailed review of our work. We appreciate the opportunity to address the key points raised.
W1. Concern about novelty due to modules overlap with existing works.
We argue that, to the best of our knowledge, UniMoT is the first to introduce a tokenizer-based architecture in the molecule-text domain that discretizes molecule features into tokens compatible with LLMs, allowing molecules to be processed alongside text tokens. While certain components, such as the Q-Former and VQ tokenizer, are used in other multimodal LLMs, we intentionally integrate them into a molecule-text framework to achieve our goals, as outlined in our Introduction section.
We emphasize several motivations here:
- Tokenizer and Next-Token Prediction: Our approach is centered on achieving a Discrete-In-Discrete-Out (DIDO) model, where a unified token representation for molecules and text enables next-token prediction. This allows UniMoT to autoregressively generate molecule tokens directly, bypassing SMILES string generation and enhancing molecule generation accuracy.
- Causal Q-Former: Directly discretizing molecule features into tokens leads to lengthy input sequences equivalent to the number of atoms. To address this, we introduced the Q-Former, which compresses the input length by mapping molecule features into a limited number of queries (typically 8-32). Moreover, we found that causal self-attention outperformed bidirectional self-attention for our tasks (see Section 4.3), leading us to employ a Causal Q-Former.
- Reconstruction During Inference: To improve pretraining efficiency, we eliminate the need for direct molecule reconstruction during pretraining. We leverage discrete molecule tokens to conduct supervised fine-tuning during pretraining. During inference, our pretrained adapter and decoder reconstruct the final molecule, which enables accurate molecule generation.

The major difference between the molecule-text domain and the image-text domain is that molecules have canonical representations: SMILES strings. We have incorporated several crucial designs specifically for molecules:

- Tokenizer: We use an adapter to align the latent space of molecule tokens with the latent space of SMILES embeddings for reconstruction.
- Pretraining: To capture the sequential information of molecules for better comprehension, we concatenate the molecule token sequence with the SMILES sequence and a prompt to form the multimodal input sequence during molecule-to-text pretraining.
- Downstream Tasks: The molecule-text domain includes unique tasks, such as Molecular Property Prediction, Caption-guided Molecule Generation, Reagent Prediction, Forward Reaction Prediction, and Retrosynthesis. Our architecture excels in these specific tasks for molecules.
W2.1. Considering the complexity of the four-stage training process, the performance improvements may not fully justify the added effort.
Our training strategy follows a common training recipe in multimodal LLMs: tokenizer pretraining, multimodal pretraining, and instruction tuning. Due to our use of a Causal Q-Former, we introduce a pretraining phase to optimize this module before tokenizer pretraining. Directly pretraining the Causal Q-Former together with the VQ tokenizer can significantly impair the model's capacity and task performance. Unlike in text-image LLMs, where pretrained image tokenizers are available, to the best of our knowledge, there are no off-the-shelf molecule tokenizers for LLMs. Thus, we need to train the tokenizer from scratch.
W2.2. In Table 2, the performance is no better than that of baseline methods and is sometimes worse.
We believe you are referring to Table 1 or Table 3. We acknowledge that UniMoT does not yield state-of-the-art results across all molecule comprehension tasks. This is due to the unavoidable quantization process in Discrete-In-Discrete-Out models, which can introduce minor information loss in discrete tokens during quantization. Consequently, some performance on comprehension tasks is sacrificed. However, adopting discrete token representation is essential to support LLM autoregressive training, allowing us to unify comprehension and generation tasks within a single model. Achieving this unification requires a trade-off on comprehension task performance.
Dear Reviewer hYSy,
Thank you for your time and effort in reviewing our manuscript. We have carefully addressed the points you raised and provided detailed responses. We would greatly appreciate your feedback on whether our clarifications have adequately addressed your concerns. Please let us know if further clarification is needed.
Best regards,
Authors
We sincerely appreciate the reviewers' feedback and constructive suggestions. We believe UniMoT represents a significant step forward in molecular language modeling by addressing fundamental challenges in multi-modal alignment and molecule generation. Specifically, our tokenizer-based architecture offers a novel framework for unifying molecule and text modalities, demonstrating promising results across various tasks. Unlike other adapter-based models, which lack a supervision signal for the molecule modality and do not treat molecule and text modalities equally, UniMoT ensures an integrated token representation. Below, we outline the core innovations and provide responses to performance-related observations.
Novelty of UniMoT
UniMoT introduces an approach to molecular language modeling by leveraging a tokenizer-based architecture within the molecule-text domain, in contrast to traditional adapter-based or Q-Former-based architectures. Key innovations include:
- Unified Token Representation: Molecule and text modalities are treated equally by converting molecules into discrete tokens. This is achieved through a Vector Quantization (VQ)-driven tokenizer that generates causal molecule tokens, aligning with the autoregressive training paradigm of LLMs.
- Molecule Tokenizer:
  - Causal Q-Former: Incorporates causal masking to create molecule tokens compatible with unidirectional attention in LLMs.
  - Quantization: Uses a learnable codebook to discretize molecule representations, preserving high-level molecular and textual information.
- Autoregressive Pretraining:
  - Molecule-to-Text and Text-to-Molecule: Combines molecule and text inputs in a unified autoregressive framework, enabling bidirectional comprehension and generation.
- Four-Stage Training Scheme: Includes pretraining the Q-Former, tokenizer optimization, unified molecule-text pretraining, and instruction tuning for task-specific applications.
- Enhanced Performance:
  - Demonstrates state-of-the-art results in molecule comprehension (e.g., molecule captioning) and generation (e.g., caption-guided molecule generation), outperforming both projection-based and Q-Former-based models.
  - Establishes a novel method for generating valid and structurally similar molecules as discrete tokens rather than SMILES strings, improving generation accuracy and alignment.
Performance on molecule comprehension tasks
We acknowledge that UniMoT does not yield state-of-the-art results across all molecule comprehension tasks. This is due to the unavoidable quantization process in Discrete-In-Discrete-Out models, which can introduce minor information loss in discrete tokens during quantization. Consequently, some performance on comprehension tasks is sacrificed. However, adopting discrete token representation is essential to support LLM autoregressive training, allowing us to unify comprehension and generation tasks within a single model. Achieving this unification requires a trade-off on comprehension task performance.
Dear Reviewers,
As the discussion phase is approaching its end, we kindly request your feedback on our submission. Your input is invaluable to ensure a constructive and comprehensive discussion.
We understand that some reviewers may have concerns about certain modules, such as the VQ tokenizer and Q-Former, being adapted from multi-modal LLMs. We would like to emphasize that successfully adapting methods or modules from one research community to another is a contribution in itself. Many key innovations in AI have emerged from such cross-domain adaptations.
For example, residual connections, a fundamental component in transformers widely used in NLP, were originally introduced in ResNet for computer vision tasks. Similarly, Masked Autoencoders (MAE) in computer vision draw direct inspiration from BERT, which originated in the NLP community, illustrating how methods developed in one field can address challenges in another. Vision Transformers (ViT), which have become a cornerstone in computer vision, are also adaptations of the transformer architecture initially designed for NLP. These examples illustrate the transformative impact of cross-domain adaptations.
Moreover, many works in the molecule-text domain are similarly adapted from the image-text domain. For instance, MolCA [1], a widely cited baseline in this area, is directly adapted from BLIP-2 [2], while InstructMol [3] is derived from LLaVA [4], demonstrating the effectiveness of leveraging methods from the image-text domain. We hope the reviewers can recognize the contributions of our work in this context and respond to our rebuttal accordingly to ensure an effective and constructive discussion.
Thank you for your time and consideration!
Best,
Authors
[1] Liu, Z., Li, S., Luo, Y., Fei, H., Cao, Y., Kawaguchi, K., Wang, X. and Chua, T.S., 2023. MolCA: Molecular graph-language modeling with cross-modal projector and uni-modal adapter. arXiv preprint arXiv:2310.12798.
[2] Li, J., Li, D., Savarese, S. and Hoi, S., 2023, July. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning (pp. 19730-19742). PMLR.
[3] Cao, H., Liu, Z., Lu, X., Yao, Y. and Li, Y., 2023. Instructmol: Multi-modal integration for building a versatile and reliable molecular assistant in drug discovery. arXiv preprint arXiv:2311.16208.
[4] Liu, H., Li, C., Wu, Q. and Lee, Y.J., 2024. Visual instruction tuning. Advances in neural information processing systems, 36.
The paper introduces UniMoT, a Unified Molecule-Text Large Language Model (LLM) designed to improve how LLMs handle both molecular and textual data. The central claim is that by using a novel tokenizer-based architecture, UniMoT can treat molecules and text equally within the LLM framework. The paper clearly outlines the limitations of existing adapter-based architectures and justifies the need for a unified approach. The paper is generally well-written and organized, making it relatively easy to follow. Some reviewers point out that individual components are adapted from other domains (like computer vision), raising questions about the overall novelty. The quantization process introduces some information loss, leading to a dip in performance on certain molecule comprehension tasks compared to specialized models. The multiple-stage training scheme is somewhat complex, which could be a barrier to wider adoption. A more extensive comparison to relevant domain-specific models would strengthen the paper.
Additional Comments from the Reviewer Discussion
The authors actively engage with reviewers, providing detailed responses and clarifications, and updating the paper accordingly.
Reject