PaperHub
Overall: 5.5/10
Poster · 4 reviewers
Scores: 3, 4, 3, 2 (min 2, max 4, std 0.7)
ICML 2025

Hierarchical Graph Tokenization for Molecule-Language Alignment

OpenReview · PDF
Submitted: 2025-01-20 · Updated: 2025-07-24
TL;DR

We show that integrating the intrinsic hierarchical graph information is essential for molecule-language alignment.

Abstract

Keywords
molecular-language alignment, large language models, hierarchical graph neural networks, tokenization, biomolecular studies, molecule

Reviews and Discussion

Review · Rating: 3

Previous LGLMs usually focus on the node level of molecules, ignoring the structural information within them. To address this, the paper proposes a novel strategy called Hierarchical Graph Tokenization (HIGHT), which uses a hierarchical graph tokenizer to encode the atom, motif, and molecule levels into informative tokens and thereby improve the molecular perception of LLMs. The paper first uses a VQ-VAE to obtain embeddings for atoms and motifs, then concatenates the motif embeddings after the atom embeddings to feed into the models. The paper also uses a new augmented instruction tuning dataset, HiPubChem, enriched with hierarchical graph information, to further enhance molecule-language alignment, and proposes the first hallucination benchmark, MotifHallu, based on common functional groups. Extensive experiments show the effectiveness of HIGHT in reducing hallucination by 40% and yielding significant improvements on various molecule-language downstream tasks.
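To make the summarized pipeline concrete, here is a minimal sketch of the flow described above; `atom_vqvae`, `motif_vqvae`, and `projector` are hypothetical interfaces standing in for the paper's components, not the released code.

    # A minimal sketch, not the authors' code: atom_vqvae, motif_vqvae, and
    # projector are hypothetical stand-ins for HIGHT's components.
    import torch
    from rdkit import Chem
    from rdkit.Chem import BRICS

    def hierarchical_tokens(smiles, atom_vqvae, motif_vqvae, projector):
        mol = Chem.MolFromSmiles(smiles)
        atom_emb = atom_vqvae.encode(mol)            # (num_atoms, d)
        motifs = list(BRICS.BRICSDecompose(mol))     # motif SMILES fragments
        if not motifs:                               # fall back to the whole molecule
            motifs = [smiles]
        motif_emb = torch.stack(
            [motif_vqvae.encode(Chem.MolFromSmiles(m)) for m in motifs]
        )                                            # (num_motifs, d)
        # Motif tokens are appended after the atom tokens, then projected
        # into the LLM's input embedding space.
        graph_tokens = torch.cat([atom_emb, motif_emb], dim=0)
        return projector(graph_tokens)               # (num_atoms + num_motifs, d_llm)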

Questions for Authors

See above.

Claims and Evidence

  1. The paper claims they are the first to incorporate hierarchical graph information into LGLMs. To the best of my knowledge, this seems to be true. However, the paper needs to include a related work section to compare itself against previous models. Some other papers also try to capture hierarchical information in a similar way, such as [1,2]. A related work section can help the authors further clarify their novelty.
  2. The hallucination benchmark MotifHallu seems to be problematic. Given that the proposed method already contains those functional groups in its input, it is relatively more straightforward for such a model to answer correctly.

[1] Li, S., Liu, Z., Luo, Y., Wang, X., He, X., Kawaguchi, K., ... & Tian, Q. (2024). Towards 3D molecule-text interpretation in language models. ICLR 2024.

[2] Liu, Z., Li, S., Luo, Y., Fei, H., Cao, Y., Kawaguchi, K., ... & Chua, T. S. (2023). MolCA: Molecular graph-language modeling with cross-modal projector and uni-modal adapter. EMNLP 2023.

Methods and Evaluation Criteria

The method to obtain node/motif embeddings is based on Hierarchical Molecular Graph Self-supervised Learning (HiMol) https://www.nature.com/articles/s42004-023-00825-5. While it looks fine to use a VQ-VAE to obtain motif embeddings, it is unclear why the paper chose the VQ-VAE to obtain node embeddings instead of regular SMILES/SELFIES strings. It might be better to include an ablation study that uses SMILES/SELFIES strings for atom embeddings to make a fair comparison for the results in Table 2. Additionally, using a VQ-VAE to obtain the embeddings seems to consume more computational resources. The paper needs to report its GPU hours, parameter count, and other computation-related information to make a fair comparison. Otherwise, the proposed method might not be scalable.

Theoretical Claims

N/A

Experimental Design and Analysis

The paper seems to include relatively comprehensive baselines and downstream tasks to compare against. The paper also conducts ablation studies and zero/few-shot testing. However, some problems still exist.

  1. In Table 1, the paper compares the InstructMol variants against the HIGHT variants. However, based on the ablation in this table, it seems that the performance gain comes mainly from the instruction tuning dataset instead of the actual hierarchical tokenization, which decreases the novelty of the proposed model. GIMLET, a method the paper also uses as a baseline in Table 3, should be added as a SMILES-based LGLM in the motif hallucination experiments to show the contribution of the hierarchical tokenization.
  2. In Table 4, the paper claims that it fails to achieve SOTA because some of the previous methods used a large amount of pretraining data. However, the paper also used a large amount of instruction tuning data (MoleculeNet, ChEBI-20, etc.) in the second stage, which seems to be exactly the same as those methods. The paper needs to compare those two sets of instruction tuning data. Otherwise, the previous methods seem to be better due to fewer parameters and simpler architectures.

Supplementary Material

Yes, the paper provides dataset details, additional experiment results, and additional experiment details.

Relation to Prior Work

The paper provides a new representation learning strategy for molecules. However, the paper fails to include a related work section, which makes it confusing for readers.

Essential References Not Discussed

N/A

Other Strengths and Weaknesses

Weaknesses:

  1. The paper did not provide any code for reviewers to check.

Other Comments or Suggestions

N/A

Author Response

Dear Reviewer VbFG,

Thank you for your time and your suggestions on our work. Please find our detailed responses to your questions below:

Related work section

Thank you for acknowledging our contribution as the first to incorporate hierarchical graph information. As for the referenced works, we have revised Sec. 2 into a related work section and included a discussion of existing attempts:

Existing works also try to enrich the molecule-language alignment with additional modalities, such as 2D [2] and 3D [1] information. In contrast, we focus on the intrinsic hierarchical information of the molecules, such as motifs.

It is relatively more straightforward given the functional groups in the inputs

We kindly refer Reviewer VbFG to Table 17 in the appendix, where we can find that, even with the functional groups in the inputs (i.e., HIGHT w/o HiPubChem), it remains challenging for LGLMs to correctly recognize whether the functional groups exist in the input molecule.

In addition, it is not obvious a priori that hierarchical alignment across the atom, functional group, and molecule levels is beneficial to broader downstream tasks.

Why choose a VQVAE to obtain node embeddings

We chose to use a VQVAE because it is one of the simplest and most standard architectures (e.g., Mole-BERT and other LGLMs, as in here). We would like to kindly note that adding SMILES/SELFIES to LLMs does not directly yield node embeddings. Nevertheless, in Tables 16 and 17, we also consider adding SELFIES to LLMs, which slightly decreases the perception of functional groups and increases chemical reaction prediction performance.
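As a rough illustration of the vector-quantization step such a VQVAE tokenizer performs on GNN node features, here is a minimal PyTorch sketch; the GNN encoder producing `z` is assumed, and this is not the authors' implementation:

    # Minimal sketch of the VQ step, assuming a GNN encoder that produces
    # per-node features z; an illustration, not Mole-BERT or HIGHT code.
    import torch
    import torch.nn as nn

    class VectorQuantizer(nn.Module):
        def __init__(self, num_codes, dim):
            super().__init__()
            self.codebook = nn.Embedding(num_codes, dim)

        def forward(self, z):                              # z: (num_nodes, dim)
            dists = torch.cdist(z, self.codebook.weight)   # (num_nodes, num_codes)
            idx = dists.argmin(dim=-1)                     # discrete node token ids
            z_q = self.codebook(idx)                       # quantized embeddings
            # Straight-through estimator: gradients bypass the argmin.
            z_q = z + (z_q - z).detach()
            return z_q, idx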

Computational overhead

We have revised our manuscript to include a report of the computational overhead, including training and inference latency as well as tunable parameters, which can be found here. Although HIGHT requires longer training time and more tunable parameters, the absolute values are not high. Moreover, during inference, as the LLM consumes most of the computation, HIGHT can even reduce inference latency by generating more concise answers.

Performance gain in Table 1

We kindly refer Reviewer VbFG to Table 17 in the appendix, where we can find that InstructMol tuned merely with HiPubChem still suffers from severe hallucination. This means that both the hierarchical tokenizer and the instruction tuning in the alignment phase are necessary to reduce hallucination. In addition, in Table 16, we find similar phenomena in downstream chemical reaction prediction tasks, which further strengthens the necessity of the two components.

GIMLET in MotifHallu

We have evaluated two state-of-the-art SMILES-based LGLMs, i.e., GIMLET and Galactica-6.7B, on MotifHallu. The results are given here. Interestingly, those models demonstrate high hallucination on the negative class. Therefore, we recommend reporting macro F1 scores (F1 scores averaged across classes). HIGHT demonstrates significant improvements in terms of macro F1 over all baselines, by up to 14 points.

Comparison with the instruction data

We provided a comparison between the instruction tuning data used in HIGHT and other LGLMs here. It can be found that HIGHT uses significantly less information and data to achieve relatively good capabilities across multiple tasks.

Code

We have provided the AC an anonymous link to our code, and kindly asked the AC to share it with the reviewers according to the ICML policy.

Reviewer Comment

Thank you for your comments. Unfortunately, I cannot see the updated version of the paper. However, given the updated anonymous link, I think the current clarification is enough. Therefore, I decided to raise my score to 3.

Author Comment

Hi Reviewer VbFG,

Thank you for acknowledging our clarification and for agreeing to change the score. Please rest assured that all the promised revisions and new discussions/results will be included in the updated version (this has already been done in our updated manuscript). Thank you again for your time and constructive comments, which helped a lot to improve our work!

Best, Authors

Review · Rating: 4

This paper proposes HIGHT, a novel molecular graph tokenization and post-training framework for applying large language models to molecular graphs. The paper proposes a novel hierarchical tokenization method incorporating molecular motif information, and uses a novel alignment pretraining strategy to train models before task-specific instruction tuning. Experiments and ablation studies show that this method achieves excellent performance on various molecular property prediction, molecular description, and chemical reaction prediction benchmarks.

Update after rebuttal

My concerns have been well addressed, so I will keep my rating.

Questions for Authors

While the proposed method makes valuable contributions in easing the application of large language models to molecules, one significant gap still exists: the proposed method cannot encode 3D structural information of molecules into sequence representations, which may impact molecular prediction performance (e.g., the results in Table 2 are not as good as state-of-the-art 3D graph neural network models). I would like to hear the authors' opinion on how the proposed method could be extended to encode 3D molecular structures.

Claims and Evidence

Yes, the claims made in the submission are supported by clear and convincing experiment results.

Methods and Evaluation Criteria

Yes, the proposed methods and evaluation criteria make sense for the molecular property prediction, molecular description, and chemical reaction prediction problems.

Theoretical Claims

There are no proofs or theoretical claims in the paper.

Experimental Design and Analysis

Yes, I have checked the experiment settings, as well as the results and analyses of the motif hallucination, molecular property prediction, molecular description, and chemical reaction prediction experiments. All experiments are sound and valid to me.

Supplementary Material

Yes, I have reviewed the part on instruction tuning datasets and prompts in Appendix A.

Relation to Prior Work

The novel contributions and empirical findings are very useful and valuable to the broader chemical and biological science domains, as this paper proposes an effective method to make better use of the power of large language models on molecule data, potentially providing excellent technical tools for scientists in these domains for their domain-specific applications.

Essential References Not Discussed

There are no essential related works not discussed.

Other Strengths and Weaknesses

No other comments on strengths and weaknesses.

Other Comments or Suggestions

No other comments or suggestions.

Author Response

Dear Reviewer zQSo,

Thank you for your time and insightful suggestions, as well as your acknowledgment of the value and convincingness of our work. Following your suggestions, we have revised our manuscript to include a discussion of future work on extending HIGHT to incorporate 3D information.

  • Hierarchical 3D Tokenizer: To incorporate 3D hierarchical molecular information, we need to design a new tokenizer based on GNN backbones compatible with 3D information, such as EGNN and SE(3)-Transformers. For example, we can extend the hierarchical VQVAE in HIGHT to accommodate 3D properties of motifs, such as adding SE(3) equivariance to the motif representations (a minimal layer sketch follows this list).

  • Scaling 3D Data for Alignment: When extending to 3D information, one could curate molecule, protein, and RNA data. The motifs in proteins can then be certain amino acids; in RNA, the motifs can be certain recurrent structures.

  • 3D Tokenizer Training: To train the tokenizer to fully capture the 3D molecular information, we can design several self-supervised learning objectives, such as 3D spatial position recovery and masked atom/motif prediction.

  • 3D Alignment Tuning: To facilitate the alignment of molecule and language, we need to extend the instruction tuning dataset to include descriptions of 3D molecular properties, such as 3D positions. Beyond incorporating the 3D hierarchical information, we also need to incorporate language captions of the 3D hierarchical structures, such as properties of the amino acids. Furthermore, one could also incorporate pairwise hierarchical graph information, such as binding affinities of molecules to proteins.
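To make the first bullet concrete, here is a minimal sketch of one E(n)-equivariant message-passing layer in the spirit of EGNN (Satorras et al., 2021); it is an illustrative assumption of what a 3D-compatible backbone could look like, not part of HIGHT:

    # A minimal E(n)-equivariant message-passing layer; illustrative only.
    import torch
    import torch.nn as nn

    class EGNNLayer(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.msg_mlp = nn.Sequential(nn.Linear(2 * dim + 1, dim), nn.SiLU(), nn.Linear(dim, dim))
            self.coord_mlp = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, 1))
            self.node_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.SiLU(), nn.Linear(dim, dim))

        def forward(self, h, x, edge_index):
            src, dst = edge_index                        # node indices per edge
            rel = x[src] - x[dst]                        # relative positions (E, 3)
            dist2 = (rel ** 2).sum(-1, keepdim=True)     # invariant distances (E, 1)
            m = self.msg_mlp(torch.cat([h[src], h[dst], dist2], dim=-1))
            # Equivariant coordinate update: move x along relative directions.
            x = x.index_add(0, dst, rel * self.coord_mlp(m))
            # Invariant feature update: aggregate messages at destination nodes.
            agg = torch.zeros_like(h).index_add(0, dst, m)
            h = self.node_mlp(torch.cat([h, agg], dim=-1))
            return h, x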

Therefore, it is promising to align 3D graphs with language for broader scientific tasks, and HIGHT provides the foundation for incorporating 3D hierarchical graph information for better alignment of graphs and languages.

Review · Rating: 3

The paper introduces a new representation of graphs (specifically, molecules) for the purpose of tokenization for LLMs. The key aspect of the new representation is that it is not entirely node-based; rather, it captures features in the graph at both the node and the motif level. It is not clear to me whether there are actually more than two levels in the hierarchy, and how the different levels communicate.

The authors demonstrate impressive performance on several benchmark datasets. However, setting up a model of this type involves several components, including the LLM itself and fine-tuning, so it is not entirely clear what fraction of the improvement comes from the hierarchical tokenization itself.

Questions for Authors

  1. How does the hierarchical graph representation relate to BRICS exactly?
  2. Does the representation have just two levels of hierarchy or more?
  3. How do the different levels communicate with each other?

Claims and Evidence

Yes

Methods and Evaluation Criteria

Yes

Theoretical Claims

n/a

Experimental Design and Analysis

Attempted to, to the extent that the experimental process of a complex pipeline involving LLMs can be verified based on just a paper.

Supplementary Material

Briefly

Relation to Prior Work

The authors survey the literature in some depth

Essential References Not Discussed

There is a recent paper titled "Exploring Hierarchical Molecular Graph Representation in Multimodal LLMs" by Hu et al. that addresses the same problem but does not seem to be cited. This might just be because the paper only appeared on arXiv in November of last year. Nonetheless, I would very much like the authors to compare the two approaches.

Other Strengths and Weaknesses

The hierarchical graph representation itself is not described in detail.

Other Comments or Suggestions

  • The intro seems to have been a little hastily written and its English could be improved.
  • The caption of Fig 1 says "all metrics have been improved a bit". What does this mean??
  • Rather than "high order molecular structures" I would say "higher order molecular structures".
  • In line 200, "concrete" should not be used as a verb
Author Response

Dear Reviewer brfq,

Thank you for acknowledging our performance improvements and for your constructive suggestions. Please find our explanations of your questions below:

it is not entirely clear what fraction of the improvement comes from the hierarchical tokenization itself.

We kindly refer Reviewer brfq to Figure 3c and Table 16 in the appendix, where we conduct comprehensive ablation studies to analyze the contributions of each component in HIGHT. The differences between HIGHT and InstructMol lie in the positional encoding, the hierarchical instruction tuning, and the hierarchical tokenizer. Thus, we study the performance of InstructMol plus those components, where each configuration can also be considered a variant of HIGHT with some components removed.

In the experiments, one can find that, compared to HIGHT, merely incorporating positional encodings, using the hierarchical instruction tuning, or using a larger tokenizer in InstructMol cannot bring improvements and can even lead to decreased performance. This demonstrates that, while each component in HIGHT is critical to the performance, the hierarchical tokenizer contributes most to the alignment performance.

In addition, in Figure 3a, we also experiment with different base LLMs such as Vicuna and Llama-2, where HIGHT demonstrates consistent improvements.

Discussion of Hu et al. (2024)

First, we would like to kindly refer Reviewer brfq to the ICML reviewer instructions, which state that "Authors cannot expect to discuss other papers that have only been made publicly available within four months of the submission deadline." Even though Hu et al. came out in November 2024, it is within four months of the ICML deadline.

Nevertheless, we will revise our manuscript to include a discussion of Hu et al. Although both works incorporate hierarchical information into LGLMs,

  • Hu et al. do not consider the influence of the hierarchical information on molecule-language alignment, i.e., hallucinations.
  • They also do not consider incorporating the hierarchical information into the alignment tuning phase.
  • They evaluate the usefulness of the hierarchical graph information in a limited number of tasks.

Questions related to hierarchical graph representation

Hierarchical graph representation refers to aggregating low-level (i.e., node-level) information into meaningful higher-level abstractions, usually via subgraph aggregation. For example, in a social network, the higher-level abstraction could be a small group or a community. Although the number of abstraction levels is not inherently restricted, in molecules the atom-motif-molecule hierarchy is the usual abstraction schema [1]. Hence, we simply follow the common practice in the community to obtain hierarchical molecular representations, i.e., using BRICS to identify common functional groups [2,3]. Nevertheless, one could also apply subgraph learning strategies to enable more flexible hierarchies.

In fact, when going beyond small molecules to biomolecules such as proteins, more abstract subgraphs, such as the 3D folded architecture or the combined 3D structure of small drugs and protein pockets, are also critical to the functionalities of proteins [4].
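As a concrete illustration of the BRICS-based motif extraction mentioned above, the following RDKit snippet decomposes a molecule into motif fragments (aspirin is an arbitrary example):

    # Illustration with RDKit's BRICS API; aspirin is an arbitrary example.
    from rdkit import Chem
    from rdkit.Chem import BRICS

    mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
    fragments = BRICS.BRICSDecompose(mol)              # set of motif SMILES with dummy atoms
    print(sorted(fragments))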

Other minor suggestions

We have revised our manuscript according to your suggestions.

For the caption of Fig 1: HIGHT brings improvements to all tasks. Due to the heterogeneity of the evaluation metrics, we transform the numerical values a bit for better visualization. The new Figure 1b and the corresponding details are given here.

References

[1] The art and practice of structure-based drug design: A molecular modeling perspective. Medicinal Research Reviews 1996.

[2] Motif-based graph self-supervised learning for molecular property prediction, NeurIPS'21.

[3] Molecular representation learning via heterogeneous motif graph neural networks, ICML'22.

[4] Independent SE(3)-Equivariant Models for End-to-End Rigid Protein Docking, ICLR'22.

Review · Rating: 2

This paper aims to address the issue that tokenization in existing LGLMs (large graph-language models) neglects the essential hierarchical structures inherent in molecules. The hierarchical structures are reflected as motifs or functional groups, which are subgraphs within the larger molecular graph.

The proposed solution is to introduce "motif tokens" as input to LLMs, in addition to the atom tokens and text tokens as usual. The "motif tokens" are obtained by an existing VQVAE approach (Xia et al., 2023).

The evaluation of the proposed method is conducted across various downstream tasks, including property classification/regression, molecular captioning, and a custom-designed task focused on answering yes/no questions regarding the presence of functional groups.

Questions for Authors

Regarding "Given the imbalance of positive and negative samples, we separately report the F1 scores for different classes.": How are the F1 scores for the positive and negative classes computed? In a binary classification problem, the F1 score is typically reported as a single value considering both classes. Given "the imbalance of positive and negative samples", what is the ratio of positive vs. negative? If it is highly imbalanced, why is Acc still reported? Does Acc not give a misleadingly high value?

Claims and Evidence

The paper claims that Hierarchical Graph Tokenization could improve LGLM understanding of molecular structures, thereby benefiting various downstream tasks.

However, in molecular property prediction (classification), the proposed method performs significantly worse than other baselines (though slightly better than LLaMA-2 in some cases). A similar lack of supporting evidence is observed in chemical reaction prediction tasks, where the proposed method fails to demonstrate substantial advantages.

This weakens the argument that introducing motif tokens inherently improves molecular understanding across tasks. As shown in Figure 1b, the proposed method significantly outperforms baselines only on MotifHallu, the benchmark dataset designed specifically for motif-related tasks. However, the proposed method has been specifically designed to enhance functional-group-related understanding by introducing "motif tokens" as input to LLMs. By explicitly incorporating these motif tokens, the input itself becomes aware of functional groups, making it understandable that the model performs more accurately on motif-specific questions. But this cannot address the key question: how does the inclusion of motif tokens contribute meaningfully to other downstream tasks beyond functional group recognition?

Methods and Evaluation Criteria

The concept is presented as “Hierarchical Graph Tokenization”, but in practice, its implementation is limited to adding motif embeddings as tokens alongside atom tokens in the input to LLMs. Notably, motifs are pre-identified using the Breaking of Retrosynthetically Interesting Chemical Substructures (BRICS) algorithm, rather than being learned within the model itself.

Furthermore, atom and motif tokens are obtained separately using distinct VQVAEs, meaning that hierarchy is not explicitly reflected in the tokenization process.

Additionally, the tokenization of both atoms and motifs relies on an existing VQVAE approach (Xia et al., 2023), making the primary novelty of this work the incorporation of “k” motif embeddings via VQVAE, rather than a fundamentally new tokenization strategy.

In the evaluation on MotifHallu, the metrics were not clearly defined.

Theoretical Claims

There is no theoretical claim.

Experimental Design and Analysis

The experimental evaluation was conducted on various tasks, which are sufficient.

Supplementary Material

I went through them and have no questions.

Relation to Prior Work

If the claims could be supported, the contribution of this work would be a valuable addition to the field of LLMs for chemical tasks.

Essential References Not Discussed

None

Other Strengths and Weaknesses

None

Other Comments or Suggestions

None

Author Response

Dear Reviewer eiUy,

Thank you for your time and insightful suggestions for our paper. Please find our responses to your concerns below.

In molecular property prediction and chemical reaction tasks, HIGHT does not demonstrate substantial advantages.

We need to clarify that the performance gaps between HIGHT and the baselines stem from differences in the base models. For example, Uni-Mol is built on a T5 architecture pretrained on large-scale 3D molecular data, and GIMLET is built on a T5 pretrained on molecular-property-prediction-focused data.

We present a detailed table listing the information and pretraining data used by HIGHT and the baselines. It can be found that some "baselines" are pretrained on significantly larger datasets with additional information, which are not fair baselines for comparison. We report their performances in order to give an overview of the overall progress.

Therefore, our direct baseline is InstructMol, which uses the same architecture and pretraining data. Compared to InstructMol, HIGHT demonstrates significant improvements.

how does the inclusion of motif tokens contribute meaningfully to other downstream tasks beyond functional group recognition?

The visualization in Figure 1b may introduce some perceptual bias about the improvements, caused by the different scales of the evaluation metrics used across methods. Due to this heterogeneity, we transform the numerical values a bit for better visualization. The new Figure 1b and the corresponding details are given here.

Comparing InstructMol and HIGHT, incorporating motif-related information indeed improves other downstream tasks.

Limitations of tokenization by BRICS; hierarchy in the tokenization process; primary novelty of the work

We need to clarify that the primary novelty of this work lies in the introduction of hierarchical tokenization. It is well known that motif information is critical to molecular understanding; without it, LGLMs exhibit high hallucination and perform subpar on molecule-language alignment.

We acknowledge the limitations of BRICS in tokenization, but it is sufficient to demonstrate the usefulness of hierarchical graph information. Despite the simplicity of our methodology, it brings significant improvements compared to LGLMs without hierarchical graph information. Meanwhile, our method is highly extensible and can be augmented with advanced motif extraction techniques (such as [1,2]) or learnable motif extraction, which could further boost the performance of HIGHT.

In addition, we would also kindly note that a key technical contribution of HIGHT lies not only in the tokenization, but also in the alignment tuning process via HiPubChem. We feel the original title may cause confusion; thus we would like to change our title to "Hierarchical Molecule-Language Alignment" to highlight the key focus of our work, i.e., the importance of hierarchical information for molecule-language alignment.

Metrics in MotifHallu

We have supplemented the details in our manuscript:

  • To calculate the F1 scores for the positive and negative classes, we first divide the MotifHallu dataset according to the class label. For the positive F1 score, the calculation is standard. For the negative F1 score, we reverse the labels and predictions (see the sketch after this list).
  • MotifHallu is quite imbalanced: there are 4,124 positive and 19,800 negative samples.
  • As Acc will be biased due to the imbalance, we updated the table to report the classwise averaged F1 score, i.e., macro F1. To provide an overview of the results, we present a detailed table here with various metrics. It can be found that HIGHT still obtains significant improvements of up to 14 points in macro F1.
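For concreteness, the per-class and macro F1 computation described above can be reproduced with scikit-learn; the labels below are toy values for illustration only:

    # Toy illustration of the metrics described above (labels are made up).
    from sklearn.metrics import f1_score

    y_true = [1, 0, 0, 0, 1, 0]   # 1 = functional group present
    y_pred = [1, 0, 1, 0, 0, 0]

    f1_pos = f1_score(y_true, y_pred, pos_label=1)        # positive-class F1
    f1_neg = f1_score(y_true, y_pred, pos_label=0)        # negative-class F1 (labels reversed)
    f1_macro = f1_score(y_true, y_pred, average="macro")  # classwise average = (f1_pos + f1_neg) / 2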

References

[1] Motif-based graph self-supervised learning for molecular property prediction, NeurIPS'21.

[2] Molecular representation learning via heterogeneous motif graph neural networks, ICML'22.

Final Decision

The paper proposes a graph tokenization approach to enable graph input to LLMs. The proposed approach claims to be specifically hierarchical, yet the paper is missing a discussion of a very related work [1], which also focuses on considering multiple graph levels, as well as others the reviewers mention. In this respect, the claim that this paper is the first hierarchical tokenization seems wrong. The main reviewer criticism concerns the lack of comparison with baselines (some was provided in the rebuttal) and to what extent specific components are critical for performance on specific types of data. The reviewers' opinions are generally mixed/favorable towards the paper, with ratings between weak reject and accept.

[1] LLaMo: Large Language Model-based Molecular Graph Assistant, NeurIPS'24.