PaperHub

Rating: 5.4/10 · Rejected · ICLR 2025
Reviews from 5 reviewers: 6, 5, 6, 5, 5 (min 5, max 6, std 0.5)
Confidence: 2.8 · Correctness: 2.4 · Contribution: 2.4 · Presentation: 3.2

Modeling All-Atom Glycan Structures via Hierarchical Message Passing and Multi-Scale Pre-training

OpenReview · PDF
Submitted: 2024-09-17 · Updated: 2025-02-05
TL;DR

This work proposes an all-atom-wise glycan encoder, GlycanAA, and a pre-trained version of it, PreGlycanAA.

Abstract

Keywords
Glycan Machine Learning · Heterogeneous Graph Modeling · Self-Supervised Pre-training

Reviews and Discussion

Review (Rating: 6)

The paper introduces GlycanAA, a novel model for all-atom-wise glycan modeling, which captures both atomic-level and monosaccharide-level interactions in a hierarchical manner. Unlike previous models that focus on the glycan backbone, GlycanAA uses a heterogeneous graph representation to capture detailed atomic interactions crucial for understanding glycan properties. It introduces hierarchical message passing, modeling atom-atom, atom-monosaccharide, and monosaccharide-monosaccharide interactions. To enhance the model's capabilities, the authors present PreGlycanAA, a pre-trained version of GlycanAA that uses self-supervised learning on a curated glycan dataset. The pre-training involves a multi-scale mask prediction algorithm that helps the model learn dependencies at different scales.
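For readers unfamiliar with the architecture, the hierarchical scheme summarized above can be sketched in a few lines of PyTorch. This is an illustrative reconstruction, not the authors' implementation: the layer names, mean aggregation, and single shared hidden dimension are all assumptions.

```python
# A minimal sketch of hierarchical message passing over a heterogeneous glycan
# graph: atom-atom, atom<->monosaccharide, and monosaccharide-monosaccharide
# updates per layer. Dimensions and aggregation are illustrative assumptions.
import torch
import torch.nn as nn

class HierarchicalGlycanLayer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.atom_msg = nn.Linear(dim, dim)  # atom -> atom messages
        self.up_msg = nn.Linear(dim, dim)    # atom -> monosaccharide messages
        self.down_msg = nn.Linear(dim, dim)  # monosaccharide -> atom messages
        self.mono_msg = nn.Linear(dim, dim)  # monosaccharide -> monosaccharide

    def forward(self, h_atom, h_mono, A_atom, A_mono, M):
        # h_atom: [n_atoms, dim], h_mono: [n_monos, dim]
        # A_atom: [n_atoms, n_atoms] atom adjacency (covalent bonds)
        # A_mono: [n_monos, n_monos] monosaccharide adjacency (glycosidic bonds)
        # M: [n_monos, n_atoms] membership (atom j belongs to monosaccharide i)
        deg_a = A_atom.sum(-1, keepdim=True).clamp(min=1)
        deg_m = A_mono.sum(-1, keepdim=True).clamp(min=1)
        size = M.sum(-1, keepdim=True).clamp(min=1)

        # 1) local atom-level message passing
        h_atom = torch.relu(A_atom @ self.atom_msg(h_atom) / deg_a + h_atom)
        # 2) pool atoms up into their monosaccharide nodes
        h_mono = h_mono + M @ self.up_msg(h_atom) / size
        # 3) global monosaccharide-level message passing
        h_mono = torch.relu(A_mono @ self.mono_msg(h_mono) / deg_m + h_mono)
        # 4) broadcast monosaccharide context back down to atoms
        h_atom = h_atom + M.t() @ self.down_msg(h_mono)
        return h_atom, h_mono

# Example shapes: a glycan with 30 atoms grouped into 4 monosaccharides.
layer = HierarchicalGlycanLayer(dim=64)
h_atom, h_mono = torch.randn(30, 64), torch.randn(4, 64)
A_atom = (torch.rand(30, 30) < 0.1).float()
A_mono = (torch.rand(4, 4) < 0.5).float()
M = torch.zeros(4, 30)
M[torch.randint(0, 4, (30,)), torch.arange(30)] = 1.0
h_atom, h_mono = layer(h_atom, h_mono, A_atom, A_mono, M)
```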

Strengths

  1. The paper provides a novel approach to glycan modeling by introducing GlycanAA, which captures both atomic and monosaccharide-level interactions hierarchically. The introduction of self-supervised pre-training with PreGlycanAA also adds a unique contribution to glycan modeling.
  2. The hierarchical message-passing mechanism and the pre-training strategy are evaluated with extensive benchmarking, which includes ablation studies that highlight the effectiveness of the proposed techniques. The experimental results provide evidence of the model’s performance compared to existing baselines.
  3. The paper is structured clearly, with explanations that are easy to follow for readers.
  4. The paper addresses a notable limitation in glycan modeling by incorporating atomic-level information.

Weaknesses

  1. The technical novelty of the GlycanAA model could be considered incremental. Although the paper introduces a novel application of hierarchical message-passing for modeling all-atom glycan structures, similar hierarchical techniques have already been employed in other biomolecular modeling contexts, such as for proteins and small molecules. The concept of using graph-based models to represent complex biological structures and applying multi-level message passing has been explored in prior works (cited in the paper).

  2. While the paper claims, "Despite these advances, the potential of SSP in glycan modeling remains largely unexplored, presenting a new area of opportunity," it is not clear why the proposed losses from prior works, like those in [1], cannot be effectively leveraged for glycan modeling. A more comprehensive explanation of the unique challenges posed by glycan structures that necessitate a new pre-training approach would strengthen this claim.

  3. This ambiguity raises additional questions about how the existing losses from prior works, if adapted for glycan modeling, would compare in performance to the proposed pre-training method. It would be valuable for the paper to explore this comparison or provide empirical evidence demonstrating the superiority of the proposed approach.

[1] Strategies for Pre-training Graph Neural Networks, ICLR 2020.

Questions

Could you clarify why the self-supervised pre-training (SSP) losses from prior works, such as those in [1], cannot be directly applied or effectively adapted for glycan modeling? Are there specific structural properties of glycans or unique challenges that these methods fail to address? A more detailed comparison or analysis would be helpful in understanding the necessity of your custom pre-training approach.

[1] Strategies for Pre-training Graph Neural Networks, ICLR 2020.

Comment

Thanks for your valuable comments and constructive suggestions! We respond to your questions below:

Q1: The technical novelty of the GlycanAA model could be considered incremental.

We appreciate the perspective on the novelty of the GlycanAA model and would like to emphasize the unique contributions our work brings to the field of glycan modeling. Glycans, unlike proteins or small molecules, possess a distinct hierarchical structure characterized by a diverse array of monosaccharides and intricate branching patterns. The GlycanAA model innovatively captures this complexity by (1) integrating atomic-level and monosaccharide-level interactions within a single heterogeneous graph framework and (2) performing hierarchical message passing to seamlessly bridge these two levels of structural information. The effectiveness of these two contributions is further verified in our ablation study (Section 5.3 of the paper).

Q2: The necessity of the custom pre-training approach should be better justified.

This suggestion is great. Indeed, the pre-training methods proposed in [a], i.e., the attribute masking and context prediction algorithms, can be adapted to pre-train glycan representations. Therefore, we employ them to pre-train the GlycanAA model and name the pre-trained models GlycanAA-Attribute and GlycanAA-Context, respectively. According to the results in Table 1 of the revised paper (updated on OpenReview), both pre-trained models show performance decay compared to the GlycanAA model without pre-training. We believe these two pre-training methods lead to trivial pre-training tasks, which is the main cause of the negative results. Specifically, the attribute masking method does not consider the correlation between atom and monosaccharide nodes during masking, so a masked monosaccharide can be trivially predicted from some of its characteristic unmasked atoms; similarly, the context prediction method could select highly correlated center and anchor nodes in an all-atom glycan graph, again leading to a trivial prediction task. By comparison, the proposed PreGlycanAA model performs multi-scale masking carefully so that as little correlation as possible is left in the unmasked nodes, leading to clearly better performance than GlycanAA without pre-training.

[a] Hu, Weihua, et al. "Strategies for pre-training graph neural networks." ICLR, 2020.
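To make the contrast with attribute masking concrete, below is a hedged sketch of the correlated, multi-scale masking idea described in the response above: whenever a monosaccharide node is masked, its constituent atom nodes are masked with it, so the model cannot trivially recover the monosaccharide from a few characteristic unmasked atoms. The mask ratio, mask token ids, and function shape are illustrative assumptions, not the paper's exact algorithm.

```python
# Hedged sketch: mask a monosaccharide together with all of its atoms,
# so the recovery task is non-trivial at both scales.
import torch

def multi_scale_mask(mono_types, atom_types, membership, mask_ratio=0.15,
                     mono_mask_id=0, atom_mask_id=0):
    # mono_types: [n_monos] long, atom_types: [n_atoms] long
    # membership: [n_monos, n_atoms] bool, True if atom j belongs to mono i
    n_monos = mono_types.numel()
    n_mask = max(1, int(mask_ratio * n_monos))
    picked = torch.randperm(n_monos)[:n_mask]      # monosaccharides to mask
    atom_masked = membership[picked].any(dim=0)    # all of their atoms

    mono_in = mono_types.clone()
    atom_in = atom_types.clone()
    mono_in[picked] = mono_mask_id                 # mask at the global scale
    atom_in[atom_masked] = atom_mask_id            # mask at the local scale
    # recovery targets are the original mono_types[picked], atom_types[atom_masked]
    return mono_in, atom_in, picked, atom_masked
```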

Comment

Dear Reviewer 2kz8,

We thank you again for your contributions to the reviewing process.

The responses to your concerns and the corresponding paper revision have been posted. We kindly remind you that the author-reviewer discussion period will end in two days. We look forward to your reply and welcome any further questions.

Best regards,

Authors of Paper 1249

Review (Rating: 5)

The paper presents a novel multi-scale mask-based pretraining approach for a graph model tailored to glycans, called GlycanAA. This method introduces hierarchical message passing to capture interactions at both local atomic and global monosaccharide levels, enabling a comprehensive all-atom-wise modeling of glycan properties. The GlycanAA model extends GNN pretraining techniques to predict biologically relevant glycan properties, providing a more detailed representation than previous models that focused solely on monosaccharide-level structures. Through extensive evaluation on the GlycanML benchmark, the results demonstrate GlycanAA's superior performance over existing glycan encoders, showcasing its effectiveness in predicting complex glycan properties.

Strengths

  1. The paper represents a significant breakthrough in applying graph pretraining methods to the biological domain, particularly for glycan modeling. By incorporating essential domain-specific information—namely, atomic-level interactions and monosaccharide-level interactions—the proposed approach addresses a critical need for predicting glycan properties, which are highly dependent on these multi-level structures.
  2. The paper provides a comprehensive comparison with baselines, and the experimental results are well-founded. The evaluations demonstrate the model's effectiveness and provide convincing evidence of its superior performance in glycan property prediction.

Weaknesses

  1. The novelty of the paper is limited, as it largely applies existing pretraining strategies to a new graph structure without substantial adaptation, aiming to model a hierarchical glycan property prediction scenario. The approach lacks customized design elements and theoretical innovation, which could have enhanced its contribution beyond a straightforward extension of conventional graph pretraining methods.
  2. In some cases, the experimental results did not reach optimal performance (e.g., GlycanAA outperformed the best baseline in only 7 out of 11 tasks). The paper does not provide sufficient explanation for these discrepancies.

Questions

Why is it considered unpromising to directly apply small molecule encoders or monosaccharide-level glycan encoders to all-atom glycan modeling? Has any theoretical or case analysis been provided to support this claim?

Comment

Thanks for your valuable feedback! We respond to your questions below:

Q1: The novelty of the pre-training method is limited.

We would like to clarify that the novelty of our pre-training method lies in its adaptation and customization to the hierarchical and complex nature of glycan structures. On the one hand, we design the multi-scale masking strategy so that the masking processes of monosaccharides and their corresponding atoms interact, which helps the model understand both the local atomic-level structures and the global monosaccharide-level structures. On the other hand, this pre-training method is tailored to enhance the representation power of the proposed hierarchical encoder GlycanAA. Through the multi-scale masking-recovery process, the GlycanAA model learns to perform hierarchical message passing effectively so as to capture dependencies ranging from the local atomic level to the global monosaccharide level. This claim about the coupling of GlycanAA and the pre-training method is supported by the results in Table 1 of the revised paper (updated on OpenReview), where the competitive baseline RGCN benefits less from the pre-training method.

Therefore, our pre-training method is carefully designed for the all-atom structures of glycans and for the proposed hierarchical encoding approach, showing novelty in both application and customization.

Q2: On some tasks, GlycanAA does not outperform the best baseline, which should be explained.

Thanks for pointing this out. Compared with the best baseline method RGCN, the performance of GlycanAA is equal or lower on 4 out of 11 tasks, i.e., phylum prediction, family prediction, immunogenicity prediction and glycosylation type prediction. Based on a one-tailed t-test (α = 0.025), the performance difference is not significant on phylum, family and immunogenicity prediction, while it is significant on the glycosylation type prediction task. Compared to RGCN, GlycanAA contains more parameters, i.e., 13.3M parameters (GlycanAA) vs. 4.2M parameters (RGCN). Therefore, on a relatively small dataset like that of glycosylation type prediction, with 1,356 training, 163 validation and 164 test samples, GlycanAA more easily overfits the training set, leading to inferior test performance. Such overfitting can be mitigated by the proposed pre-training method: after pre-training, the PreGlycanAA model outperforms RGCN on glycosylation type prediction. We supplement this analysis in Section 5.2 of the revised paper (updated on OpenReview).

Q3: Why is it considered unpromising to directly apply small molecule encoders or monosaccharide-level glycan encoders to all-atom glycan modeling?

For applying small molecule encoders to all-atom glycan modeling, the main obstacle is the gap in molecular scale. These encoders are originally designed to model a small molecule with tens of atoms, while a glycan commonly contains hundreds of atoms, which greatly challenges the encoders by requiring them to model a much larger system. As shown in our experiments (Table 1 of the paper), performant small molecule encoders (e.g., Graphormer and GraphGPS) do not perform well when applied to model the all-atom structures of glycans.

For applying monosaccharide-level glycan encoders to all-atom glycan modeling, the main challenge is also the adaptation to a much larger system. Specifically, the encoders are originally designed to model a coarse-grained glycan structure with tens of elements (i.e., monosaccharides), and it is hard to directly apply them to a fine-grained all-atom glycan structure with hundreds of elements (i.e., atoms). In our experiments (Table 1 of the paper), we verify this point by applying a performant monosaccharide-level glycan encoder RGCN to all-atom glycan modeling (i.e., All-Atom RGCN in the table), where obvious performance decay is observed after such an adaptation from monosaccharide level to all-atom level.

Comment

Dear Reviewer JUXn,

We thank you again for your contributions to the reviewing process.

The responses to your concerns and the corresponding paper revision have been posted. We kindly remind you that the author-reviewer discussion period will end in two days. We look forward to your reply and welcome any further questions.

Best regards,

Authors of Paper 1249

Comment

Based on your insightful comments, we have posted the responses and paper revision. Please check them out.

We welcome any further questions and discussion that may affect your rating.

Review (Rating: 6)

Summary

The topic of this paper is modeling glycan structures. The authors claim that previous papers neglected the atomic structures underlying each monosaccharide. To this end, they propose GlycanAA, which models a glycan as a heterogeneous graph via local and global representations. Then, hierarchical message passing is performed to capture atomic-level and monosaccharide-level interactions. In addition, the authors pretrain the model on a high-quality unlabeled glycan dataset in a self-supervised manner. Extensive experiments demonstrate the superiority of the approach.

Strengths

  1. The research direction is interesting and practical for life science.
  2. The method is easy to follow and understand, e.g., Figure 1.
  3. The experiments are comprehensive, and the analyses are interesting.
  4. The dataset contribution is good.

Weaknesses

  1. From the perspective of self-supervised learning and graph neural networks, the novelty of the proposed method is limited. The techniques already exist, e.g., message passing, masking-recovery, and pretraining-finetuning. But maybe the application scenario is new.

  2. The color presentation in Table 1 is not beautiful. Recommend using bold values and underlined values to represent the best and runner-up results.

  3. Efficiency experiments are missing, e.g., memory cost and time cost during the training and testing stages.

  4. The compared baselines are relatively old. I recommend comparing with newer methods published in 2023 and 2024, like Pre-Training Protein Bi-level Representation Through Span Mask Strategy On 3D Protein Chains.

Questions

See weaknesses above.

Comment

We appreciate your insightful comments and constructive suggestions! We respond to your questions below:

Q1: The novelty of the proposed method is limited.

We acknowledge the observation regarding the established nature of techniques such as message passing, masking-recovery, and pretraining-finetuning. However, we would like to emphasize that the novelty of our work lies in the innovative application and adaptation of these techniques to the complex and underexplored domain of glycan modeling.

Glycans present unique challenges due to their hierarchical structure, comprising both atomic and monosaccharide levels, which are crucial for accurately capturing their biological functions and interactions. The proposed GlycanAA model leverages a hierarchical approach that specifically addresses these challenges by integrating local atomic-level and global monosaccharide-level interactions within a heterogeneous graph framework. This hierarchical approach enables us to capture the intricate covalent and glycosidic bonds that are essential for understanding glycan functionality, which is not achievable with existing models.

Furthermore, our self-supervised pretraining strategy is tailored to harness the rich unlabeled glycan data. The proposed multi-scale masking and multi-scale recovery methods fit the hierarchical architecture of GlycanAA well and guide it to capture the intricate dependencies within glycans, further facilitating the generation of highly informative representations.

Thus, while the underlying techniques are established, their novel application to glycan modeling greatly advances the field, offering new insights and capabilities that were previously unattainable.

Q2: The color presentation in Table 1 should be improved.

Thanks for the good advice. As you suggested, in Table 1 of the revised paper (updated on OpenReview), we use bold, underlined and italic values to represent the best, runner-up and third-place results, respectively.

Q3: Efficiency experiments are missing.

Thanks for pointing this out. In Section 5.4 of the revised paper (updated on OpenReview), we supplement the efficiency comparison between GlycanAA and RGCN, where our proposed GlycanAA performs hierarchical message passing at the all-atom level, while RGCN is a typical method with only monosaccharide-level message passing. Specifically, we evaluate their training and inference speed in terms of throughput (i.e., the number of samples processed per second) and their training and inference memory cost in mebibytes (MiB). The evaluation is performed on the glycan taxonomy prediction dataset for its good coverage of different kinds of glycans (#training/validation/test samples: 11,010/1,280/919; average #monosaccharides per glycan: 6.39; minimum: 2; maximum: 43). All experiments are conducted on a machine with 32 CPU cores and 1 NVIDIA GeForce RTX 4090 GPU (24GB), and the batch size is set to 256 for both models.

According to the results in Table 2 of the revised paper (updated on OpenReview), GlycanAA does not introduce much extra cost compared to RGCN during either training or inference, in terms of both speed and memory. Specifically, GlycanAA is about 22% slower than RGCN in training/inference speed and consumes about 19% more memory. This moderate extra cost buys the superior performance of GlycanAA over RGCN in terms of weighted mean rank on the GlycanML benchmark, illustrating the worth of the hierarchical modeling approach of GlycanAA.
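For reference, throughput and peak-memory numbers of the kind quoted above can be collected with a few lines of PyTorch. This is a generic measurement sketch, not the authors' benchmarking script; it assumes batches are plain tensors with the sample dimension first.

```python
# Minimal sketch: measure inference throughput (samples/s) and peak GPU
# memory (MiB) for any PyTorch model over a data loader.
import time
import torch

@torch.no_grad()
def profile_inference(model, loader, device="cuda"):
    model.to(device).eval()
    torch.cuda.reset_peak_memory_stats(device)
    torch.cuda.synchronize(device)
    n, start = 0, time.perf_counter()
    for batch in loader:                  # assumes batch is a plain tensor
        model(batch.to(device))
        n += batch.size(0)
    torch.cuda.synchronize(device)
    elapsed = time.perf_counter() - start
    peak_mib = torch.cuda.max_memory_allocated(device) / 2**20
    return n / elapsed, peak_mib          # throughput, peak memory (MiB)
```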

Comment

Q4: The compared baselines are relatively old.

In Table 1 of the revised paper (updated on OpenReview), we additionally compare with one recent competitive small molecule encoder, i.e., Uni-Mol+ [a], and three recent competitive protein encoders, i.e., GearNet [b], GearNet-Edge [b] and VabsNet [c]. According to the experimental results, GearNet-Edge performs best among these four models, while it is still clearly inferior to the proposed GlycanAA model in terms of weighted mean rank, i.e., 11.44 (GearNet-Edge) vs. 4.66 (GlycanAA). This result demonstrates that it is unpromising to directly apply small molecule encoders and protein encoders to glycan modeling, where the unique structural complexities of glycans (e.g., a diverse array of monosaccharides and intricate branching patterns) greatly challenge these models originally designed for other biomolecules.

[a] Lu, Shuqi, et al. "Data-driven quantum chemical property prediction leveraging 3D conformations with Uni-Mol+." Nature Communications, 2024.

[b] Zhang, Zuobai, et al. "Protein representation learning by geometric structure pretraining." ICLR, 2023.

[c] Zhuang, Wanru, et al. "Pre-Training Protein Bi-level Representation Through Span Mask Strategy On 3D Protein Chains." ICML, 2024.

Comment

Dear Reviewer JYmY,

We thank you again for your contributions to the reviewing process.

The responses to your concerns and the corresponding paper revision have been posted. We kindly remind you that the author-reviewer discussion period will end in two days. We look forward to your reply and welcome any further questions.

Best regards,

Authors of Paper 1249

Comment

Thanks for your detailed feedback.

Some of my concerns have been addressed (efficiency experiments, compared baselines, presentation).

Regarding the novelty, as I mentioned, I mainly focus on the methodology itself rather than the application. I have decided to keep my score.

Review (Rating: 5)

This paper focuses on the topic of modeling glycans. Glycans are hierarchical in structure, with each glycan containing a number of monosaccharides and each monosaccharide containing many atoms. The authors note that the previous SOTA ignores the atom-level information and only considers the monosaccharides. It has also been shown that methods that do model glycans from the atom level tend to perform poorly. In order to properly account for both the monosaccharide- and atom-level information, the authors propose to model each glycan in a hierarchical nature. Specifically, message passing is done between monosaccharides, between atoms in a monosaccharide, and between monosaccharides and their child atoms. They further propose a pre-training strategy to enhance the downstream performance of their method. They show that their method can achieve improvement on the GlycanML benchmark.

Overall I like the motivation and simple design of this method. However, I find the performance improvements of the proposed architecture (GlycanAA) to be marginal. Furthermore, I think the pre-training strategy should be applied to other models, as it's not unique to GlycanAA once the atoms are ignored. My general feeling is that I'm unconvinced either (a) that this paper can truly leverage the power of the hierarchical structure, or (b) that the atom-level information is very important to the downstream task.

Strengths

  1. The paper is written well. It is very clear and easy to understand.

  2. The motivation behind their framework is good. Given the hierarchical nature of glycans, it makes sense that this information should not be ignored.

  3. The model design is intuitive. It follows directly from the motivation (i.e., hierarchical modeling). I also like that it is simple, as it provides us with an easy to understand method that doesn't contain too many components or parameters that we need to tune.

Weaknesses

  1. The performance gain of GlycanAA is quite small. In terms of the weighted mean rank, it is only slightly ahead of RGCN. This suggests that the current framework may not be able to effectively leverage the atom-level information, or that it may not be crucial to the downstream task. This is noteworthy as I expect GlycanAA to be less efficient than other methods since it incorporates more information (the atoms in addition to the monosaccharides).

  2. The pre-training strategy is not entirely unique to the proposed architecture. Essentially, a similar strategy can be used for other methods like RGCN, where we only mask the monosaccharides. Ideally this would also be included in the paper. For all we know, pre-training RGCN can even outperform PreGlycanAA. As I mentioned in my last point, it may be that considering the atom level isn't that important. Pre-training RGCN would be helpful as it would tell us whether we can achieve comparable improvements by just masking the monosaccharides.

  3. No complexity or runtime analysis is given. I only mention this as I expect GlycanAA to be more computationally expensive than monosaccharide-only methods. Furthermore, because the performance improvement is modest, it's important to consider the efficiency to see if it's "worth" considering the hierarchical structure, e.g., if it is much slower than RGCN, then it may not be worth the additional time to run it.

Questions

  1. Could you compare the runtime of GlycanAA and RGCN?
  2. I'm curious if you can try the pre-training strategy on RGCN (where you of course only mask the monosaccharides). I think if the improvement isn't comparable to PreGlycanAA, that tells us that the hierarchical structure is important to pre-training, which would be a good motivation for this work.
Comment

Thank you for the response. I've raised my score to a 5.

However, I still feel that these results don't demonstrate the importance of the hierarchical structure. The difference between PreRGCN and PreGlycanAA is quite marginal (and the same for the non pretrained versions). To me this doesn't definitively show that all-atom glycan encoders are truly necessary.

Comment

Thanks for your insightful review! We respond to your questions below:

Q1: Could you compare the runtime of GlycanAA and RGCN?

This advice is great. In Section 5.4 of the revised paper (updated on OpenReview), we supplement the efficiency comparison between GlycanAA and RGCN. Specifically, we evaluate their training and inference speed in terms of throughput (i.e., the number of samples processed per second) and their training and inference memory cost in mebibytes (MiB). The evaluation is performed on the glycan taxonomy prediction dataset for its good coverage of different kinds of glycans (#training/validation/test samples: 11,010/1,280/919; average #monosaccharides per glycan: 6.39; minimum: 2; maximum: 43). All experiments are conducted on a machine with 32 CPU cores and 1 NVIDIA GeForce RTX 4090 GPU (24GB), and the batch size is set to 256 for both models.

According to the results in Table 2 of the revised paper (updated on OpenReview), GlycanAA does not introduce much extra cost compared to RGCN during either training or inference, in terms of both speed and memory. Specifically, GlycanAA is about 22% slower than RGCN in training/inference speed and consumes about 19% more memory. This moderate extra cost buys the superior performance of GlycanAA over RGCN on 7 out of 11 benchmark tasks and also on the weighted mean rank, illustrating the "worth" of modeling glycans at the all-atom level.

Q2: How does RGCN perform with the pre-training strategy?

Thanks for this good suggestion. To study the necessity of all-atom glycan modeling in more depth, we additionally pre-train the competitive RGCN model using a similar mask prediction algorithm and evaluate the pre-trained model, denoted as PreRGCN, on the GlycanML benchmark. According to the results in Table 1 of the revised paper (updated on OpenReview), in terms of the weighted mean rank, PreRGCN outperforms RGCN, showing the benefit of pre-training. However, PreRGCN is only comparable to GlycanAA and clearly inferior to PreGlycanAA, and its improvement over RGCN is smaller than the improvement of PreGlycanAA over GlycanAA. These results verify the value of modeling glycans at the all-atom level and also illustrate the importance of hierarchical structures to our pre-training method.

Review (Rating: 5)

The paper introduces GlycanAA for all-atom modeling of glycan structures. GlycanAA uses hierarchical message passing within a heterogeneous graph for representation learning from both atomic and monosaccharide levels. Also, a pre-trained version called PreGlycanAA is developed using a multi-scale masking pre-training method, which leverages unlabeled glycan data to enhance model performance on downstream tasks. The model is evaluated on the GlycanML benchmark, showing competitive performance compared to baselines.

Strengths

  1. A hierarchical message-passing approach is proposed for modeling both local and global structures in glycans, which is crucial for complex biomolecules.
  2. The pretrained version PreGlycanAA based on multi-scale masking can help utilize unlabeled data for more robust representations.
  3. The paper is overall well written and organized.

Weaknesses

  1. While the model performs well on the GlycanML benchmark, there is limited discussion on what structural or relational insights it captures at the glycan level.
  2. The hierarchical process may introduce extra computational complexity, but no experiment is provided regarding efficiency.
  3. It seems that the most recent baseline comes from 2022, which could be a little out-of-date. Are there any more recent and competitive baselines? Besides, the improvements over a typical GNN encoder (RGCN) are not very significant, especially for the version without pre-training.

Questions

  1. Has the model been tested for scalability on larger datasets, where glycan structures may be more diverse or complex?
  2. Are there any differences between the neural architectures for modeling glycans and proteins? And can the recent advances in protein modeling be transferred to this scenario?
Comment

We appreciate your valuable comments and constructive suggestions! We respond to your questions below:

Q1: What structural or relational insights does the proposed model capture at the glycan level?

We provide such an analysis through the visualization experiments (Section 5.5 of the paper). According to the visualization of glycan-level representations on downstream task datasets, we have the following observations and insights: (1) The GlycanAA model with randomly initialized weights can, to some extent, separate glycans with different properties (e.g., immunogenic and non-immunogenic glycans). This result illustrates the ability of GlycanAA to capture the structural differences between glycans with different properties, resulting in discriminative glycan-level representations. (2) The pre-trained PreGlycanAA model can better distinguish glycans with different properties, where samples with the same property are gathered closer together and samples with different properties are separated farther apart. This result demonstrates the effectiveness of the proposed pre-training method, during which the model learns to cluster glycans based on their structural relevance.
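As an illustration of the kind of analysis described above, a typical embedding-visualization pipeline is sketched below. The use of t-SNE and the placeholder embeddings and labels are assumptions; the paper's Section 5.5 may use a different projection method.

```python
# Hedged sketch: project glycan-level embeddings to 2D and color by label
# (e.g., immunogenic vs. non-immunogenic). All data here are placeholders.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

embeddings = np.random.randn(500, 128)      # placeholder glycan embeddings
labels = np.random.randint(0, 2, size=500)  # placeholder binary property

coords = TSNE(n_components=2, random_state=0).fit_transform(embeddings)
plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=8, cmap="coolwarm")
plt.title("Glycan-level representations (placeholder data)")
plt.show()
```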

Q2: The experiments regarding efficiency should be provided.

Thanks for pointing this out. To evaluate the extra computational cost brought by the hierarchical message passing of GlycanAA, we compare its computational efficiency with that of RGCN, a typical method with only monosaccharide-level message passing. This study is supplemented in Section 5.4 of the revised paper (updated on OpenReview). Specifically, we evaluate their training and inference speed in terms of throughput (i.e., the number of samples processed per second) and their training and inference memory cost in mebibytes (MiB). The evaluation is performed on the glycan taxonomy prediction dataset for its good coverage of different kinds of glycans (#training/validation/test samples: 11,010/1,280/919; average #monosaccharides per glycan: 6.39; minimum: 2; maximum: 43). All experiments are conducted on a machine with 32 CPU cores and 1 NVIDIA GeForce RTX 4090 GPU (24GB), and the batch size is set to 256 for both models.

According to the results in Table 2 of the revised paper (updated on OpenReview), GlycanAA does not introduce much extra cost compared to RGCN during either training or inference, in terms of both speed and memory. Specifically, GlycanAA is about 22% slower than RGCN in training/inference speed and consumes about 19% more memory. This moderate extra cost buys the superior performance of GlycanAA over RGCN in terms of weighted mean rank on the GlycanML benchmark, illustrating the worth of the hierarchical modeling approach of GlycanAA.

Q3: Are there any more recent and competitive baselines?

In Table 1 of the revised paper (updated on OpenReview), we additionally compare with one recent competitive small molecule encoder, i.e., Uni-Mol+ [a], and three recent competitive protein encoders, i.e., GearNet [b], GearNet-Edge [b] and VabsNet [c]. According to the experimental results, GearNet-Edge performs best among these four models, while it is still clearly inferior to the proposed GlycanAA model in terms of weighted mean rank, i.e., 11.44 (GearNet-Edge) vs. 4.66 (GlycanAA). This result demonstrates that it is unpromising to directly apply small molecule encoders and protein encoders to glycan modeling, where the unique structural complexities of glycans (e.g., a diverse array of monosaccharides and intricate branching patterns) greatly challenge these models originally designed for other biomolecules.

[a] Lu, Shuqi, et al. "Data-driven quantum chemical property prediction leveraging 3D conformations with Uni-Mol+." Nature Communications, 2024.

[b] Zhang, Zuobai, et al. "Protein representation learning by geometric structure pretraining." ICLR, 2023.

[c] Zhuang, Wanru, et al. "Pre-Training Protein Bi-level Representation Through Span Mask Strategy On 3D Protein Chains." ICML, 2024.

Comment

Q4: Has the model been tested for scalability on larger datasets?

For pre-training, we collect all available glycan data deposited in the public glycan repository GlyTouCan. After filtering for data quality, data integrity and potential data leakage, we obtain a set of 40,781 glycans.

For downstream evaluation, we adopt the GlycanML benchmark datasets for their abundance of annotated glycan data. For example, for the glycan taxonomy prediction dataset, GlycanML collected all glycans with taxonomy annotations available at the time it was curated (around 2024), summing up to 13,209 glycans. Therefore, in downstream evaluation, we choose datasets that contain as many annotated glycans as possible, whose scale is comparable to that of the pre-training dataset, which collects all existing glycans.

Q5: Can the recent advances in protein modeling be transferred to glycan modeling?

Indeed, glycans share some characteristics with proteins, and thus protein modeling techniques can be partially transferred to glycan modeling. Some connections are as below: (1) Both proteins and glycans can be modeled in a hierarchical way. For a protein, each of its amino acids consists of atoms, and different amino acids further make up the protein. Similarly, for a glycan, each of its monosaccharides is composed of atoms, and different monosaccharides further make up the glycan. Therefore, the hierarchical modeling methods developed for proteins [d,e] can also be adapted to model glycans. (2) Both proteins and glycans can be modeled using biological sequence modeling techniques. Proteins can be represented as amino acid sequences, and glycans can be represented as IUPAC-condensed sequences. Therefore, the progress of protein language models [f,g] can, to some extent, be transferred to understanding the language of glycans.

However, the structures of glycans are actually more complex than those of proteins and thus harder to model. Specifically, proteins are made up of 20 types of common amino acids and a single type of peptide bond, while glycans consist of 143 types of monosaccharides and 84 types of glycosidic bonds. Therefore, the structural patterns of glycans are much more diverse than those of proteins, calling for novel techniques dedicated to glycan modeling.

[d] Hermosilla, Pedro, et al. "Intrinsic-extrinsic convolution and pooling for learning on 3d protein structures." ICLR, 2021.

[e] Wang, Limei, et al. "Learning hierarchical protein representations via complete 3d graph networks." ICLR, 2023.

[f] Rives, Alexander, et al. "Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences." Proceedings of the National Academy of Sciences 118.15 (2021): e2016239118.

[g] Elnaggar, Ahmed, et al. "Prottrans: Toward understanding the language of life through self-supervised learning." IEEE transactions on pattern analysis and machine intelligence 44.10 (2021): 7112-7127.

Comment

Dear Reviewer 44rZ,

We thank you again for your contributions to the reviewing process.

The responses to your concerns and the corresponding paper revision have been posted. We kindly remind you that the author-reviewer discussion period will end in two days. We look forward to your reply and welcome any further questions.

Best regards,

Authors of Paper 1249

Comment

Thanks for the extra experiments and the detailed response. I think the additional results can partially address my concerns, but the gains seem a little marginal (I'm not sure how many columns in Table 1 can pass significance tests).

Comment

We have posted the significance test you suggested in the latest post. Please check it out.

We welcome any further questions and discussion that may affect your rating.

Comment

Table A: Results of the one-tailed t-test on performance improvements (significance level α = 0.025). "-" indicates our model does not surpass the best baseline on this task.

| Task | Domain | Kingdom | Phylum | Class | Order | Family | Genus | Species | Immunogenicity | Glycosylation | Interaction |
|---|---|---|---|---|---|---|---|---|---|---|---|
| t-statistic (GlycanAA vs best baseline) | 2.87 | 4.85 | - | 4.98 | 4.92 | - | 3.45 | 3.38 | - | - | 2.80 |
| t-statistic (PreGlycanAA vs best baseline) | - | 3.20 | 3.29 | 5.15 | 2.83 | 4.78 | 3.90 | 3.81 | - | 2.92 | 3.76 |

In Table A, we conduct a one-tailed t-test at significance level α = 0.025 to verify the significance of the proposed models' improvements over the best baseline method on all benchmark tasks, where both our model and the baseline are run 3 times on each task. It is observed that, on 7 out of 11 tasks, GlycanAA outperforms the best baseline with statistical significance (i.e., the t-statistic surpasses the critical value 2.78 of this test), and, on 9 out of 11 tasks, PreGlycanAA outperforms the best baseline with statistical significance. These results illustrate the significance of the performance improvements achieved by our models.
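For clarity, the test in Table A can be reproduced with a standard two-sample t-test: with 3 runs per model, the pooled degrees of freedom are 4, and the one-tailed critical value at α = 0.025 is t ≈ 2.776 (the "2.78" quoted above). The sketch below uses made-up accuracy values, not results from the paper.

```python
# Worked sketch of the one-tailed two-sample t-test with 3 runs per model.
from scipy import stats

ours = [0.912, 0.905, 0.909]      # hypothetical 3-seed results of our model
baseline = [0.881, 0.874, 0.886]  # hypothetical 3-seed baseline results

t_stat, p_two_sided = stats.ttest_ind(ours, baseline)  # pooled, df = 4
p_one_tailed = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
significant = t_stat > stats.t.ppf(0.975, df=4)  # i.e., p_one_tailed < 0.025
print(f"t = {t_stat:.2f}, one-tailed p = {p_one_tailed:.4f}, "
      f"significant: {significant}")
```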

We hope this response can address your concern, and we welcome any further questions.

Comment

We appreciate all reviewers for your valuable suggestions and insightful comments on our paper!

We have posted the responses to your questions and revised the paper for more experimental verification and better presentation, where the revisions are marked in RED in the paper. Here is a brief summary of important points:

  1. Adding efficiency study (Reviewer 44rZ, MwTz, JYmY): We add an efficiency comparison between the proposed GlycanAA and the competitive RGCN on their speed and memory cost during both training and inference.

  2. More recent and competitive baselines (Reviewer 44rZ, JYmY): We additionally compare with one recent competitive small molecule encoder, i.e., Uni-Mol+ (Nature Communications 2024), and three recent competitive protein encoders, i.e., GearNet (ICLR 2023), GearNet-Edge (ICLR 2023) and VabsNet (ICML 2024).

  3. Studying pre-training more in depth (Reviewer MwTz, 2kz8): We compare the proposed PreGlycanAA with (1) a pre-trained RGCN and (2) two GlycanAA models pre-trained with previous pre-training algorithms, which justifies the necessity and importance of our pre-training method.

AC Meta-Review

Overall, the paper is well written and organized. The hierarchical message-passing approach is well motivated and intuitive.

However, the novelty of the methodology is somewhat incremental. Some reviewers also find that the improvements are marginal in some cases, and that the results do not sufficiently demonstrate the importance of the hierarchical structure.

Additional Comments from Reviewer Discussion

While some concerns have been addressed, reviewers mainly have issues with the following:

  1. Novelty of the approach, which appears incremental (Reviewer JYmY, JUXn, 2kz8). My assessment is that the novelty of the paper lies more in the application than in the technical approach. Overall, it is borderline to me in terms of novelty.

  2. Significance of results / margin of improvements (Reviewers 44rZ, MwTz)

Final Decision

Reject