PaperHub
Overall rating: 5.7 / 10 (Poster; 3 reviewers)
Individual ratings: 5, 6, 6 (min 5, max 6, std 0.5)
Confidence: 3.7
Soundness: 2.7 · Contribution: 2.3 · Presentation: 2.7
NeurIPS 2024

ProtGO: Function-Guided Protein Modeling for Unified Representation Learning

Submitted: 2024-05-11 · Updated: 2024-11-06

Abstract

Keywords

Teacher-student · Knowledge Distillation · Protein Function

Reviews and Discussion

Official Review
Rating: 5

The authors distill knowledge from function annotations. ProtGO gains a performance improvement through knowledge distillation. Compared with other methods that rely only on structure and sequence, ProtGO outperforms these baselines.

Strengths

  1. ProtGO introduces a novel method that utilizes function information, leading to improved performance.

Weaknesses

  1. ProtST [1] also utilizes protein function information and should be listed as a baseline. After alignment, ProtST also does not need function information and uses only the protein sequence.

  2. It is confusing that, in the ablation study, ProtGO still outperforms all the baselines without the teacher. The authors should explain the difference between the backbone GNN model and the other baselines. Furthermore, the improvement from the teacher-student module is trivial compared with the improvement from the backbone GNN. This needs to be clarified.

  3. Is it necessary to introduce domain adaptation? The ablation study about this is missing.

[1] ProtST: Multi-modality learning of protein sequences and biomedical texts

Questions

See Weaknesses.

Limitations

None.

Author Response

Dear Reviewer qMp2,

We are grateful for your thorough review. Your comments are highly valued, and we would like to express our heartfelt gratitude. We do our utmost to address the questions you have raised:

Q1 ProtST [1] also utilizes protein function information and should be listed as a baseline.

A1 Thank you for your valuable feedback! ProtGO is not a pre-training model, so a direct comparison with pre-training methodologies would be unfair. To enable such a comparison, we integrate ESM-2 (650M) [1] with ProtGO, denoted ProtGO-ESM, where the ESM embeddings serve as part of the graph node features. We evaluate ProtGO-ESM against pre-training techniques on tasks such as protein function prediction and EC number prediction, covering sequence-based approaches (ESM-1b [2] and ESM-2), a sequence-function model (ProtST [3]), and sequence-structure methods (GearNet-ESM [4], SaProt [5], and GearNet-ESM-INR-MC [6]). The comparative results are detailed in Table 1 of the one-page rebuttal PDF; our proposed model, ProtGO-ESM, is the top performer across sequence-based, sequence-structure, and sequence-function pre-training strategies.
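
As an illustration only (not the authors' released code), the sketch below shows how per-residue ESM-2 embeddings could be obtained with the public fair-esm package and concatenated with other node features; the sequence and `other_node_feats` are hypothetical placeholders.

```python
import torch
import esm  # pip install fair-esm

# Load the public ESM-2 650M checkpoint.
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # toy sequence
_, _, tokens = batch_converter([("query", seq)])

with torch.no_grad():
    out = model(tokens, repr_layers=[33])
# Drop the BOS/EOS positions to get one 1280-d embedding per residue.
esm_embed = out["representations"][33][0, 1:len(seq) + 1]

other_node_feats = torch.randn(len(seq), 21)        # hypothetical extra per-residue features
node_feats = torch.cat([esm_embed, other_node_feats], dim=-1)
print(node_feats.shape)                             # (len(seq), 1280 + 21)
```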

Q2 It is confusing that, in the ablation study, ProtGO still outperforms all the baselines without the teacher. The authors should explain the difference between the backbone GNN model and the other baselines. Furthermore, the improvement from the teacher-student module is trivial compared with the improvement from the backbone GNN. This needs to be clarified.

A2 Thank you for your informative review! (1) The student model is a graph neural network (GNN) that encodes protein sequences and structures simultaneously, as shown in Eq.4 of the manuscript. This protein encoder tightly integrates sequence and structure information and outperforms the alternative sequence-structure methods. A sequence pooling layer is used to condense the sequence length and aggregate salient patterns; in this GNN architecture, each pair of message-passing layers is followed by an average sequence pooling layer, for a total of eight message-passing layers. (2) The sequence average pooling layer performs average pooling over the input tensor according to computed indices, halving the sequence length (flooring the result) and aggregating information through scatter operations. After each pooling layer the number of residues is halved, and the radius threshold r_s used to build the residue graph is expanded to 2r_s, so that the neighbors of a center node can include progressively more distant and less frequent nodes while the computational complexity decreases. These operations enable the model to capture both local and global features. (3) Table 5 of the ablation study reports the student model's standalone performance without teacher guidance, demonstrating that it can model protein sequences and structures on its own. (4) Figure 3 in the manuscript illustrates the gains brought by incorporating functional information, confirming the benefit of this augmentation.
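
A minimal sketch, under our own assumptions rather than the authors' released code, of the halving sequence average pooling described above: residue features are a dense tensor of shape (L, d), the pooled length is floor(L / 2), and for odd L the trailing residue is merged into the last pooled position.

```python
import torch

def sequence_avg_pool(h: torch.Tensor) -> torch.Tensor:
    """Average-pool residue features pairwise along the sequence via scatter ops."""
    L, d = h.shape
    out_len = max(L // 2, 1)
    idx = torch.clamp(torch.arange(L, device=h.device) // 2, max=out_len - 1)
    pooled = torch.zeros(out_len, d, device=h.device, dtype=h.dtype)
    pooled.index_add_(0, idx, h)                                      # scatter-sum
    counts = torch.zeros(out_len, device=h.device, dtype=h.dtype)
    counts.index_add_(0, idx, torch.ones(L, device=h.device, dtype=h.dtype))
    return pooled / counts.unsqueeze(-1)                              # scatter-mean

h = torch.randn(9, 64)                # 9 residues, 64-d features
print(sequence_avg_pool(h).shape)     # torch.Size([4, 64])
# After each pooling, the radius threshold for graph construction would double (r_s -> 2r_s).
```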

Q3 Is it necessary to introduce domain adaptation? The ablation study about this is missing.

A3 Thank you for your reviews! The theory of domain adaptation forms the foundation for deriving Eq.10 and Eq.11, as detailed in Appendix F of the manuscript. In the realm of domain adaptation, employing a supervised loss is preferred when the student model operates with distinct task labels [2].
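
As a minimal sketch, under our own assumptions rather than the exact objective of Eq.10 and Eq.11, the student's training signal can be viewed as a supervised task loss plus a latent-embedding distillation term; the MSE/BCE choices and the weight `alpha` below are illustrative only.

```python
import torch
import torch.nn.functional as F

def student_loss(student_z: torch.Tensor,
                 teacher_z: torch.Tensor,
                 student_logits: torch.Tensor,
                 labels: torch.Tensor,
                 alpha: float = 0.5) -> torch.Tensor:
    """Supervised task loss plus a latent-embedding distillation term."""
    kd = F.mse_loss(student_z, teacher_z.detach())                     # match frozen teacher latents
    sup = F.binary_cross_entropy_with_logits(student_logits, labels)   # multi-label task loss
    return sup + alpha * kd
```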

Thank you again for all the efforts that helped us improve our manuscript! We have tried our best to address your concerns, as this work is important for our graduation; we respectfully thank you for supporting the acceptance of our work. Also, please let us know if you have any further questions. We look forward to further discussions!

[1] Xu, M., et al. Protst: Multi-modality learning of protein sequences and biomedical texts. In International Conference on Machine Learning. 2023.

[2] Berthelot, D., et al. Adamatch: A unified approach to semi-supervised learning and domain adaptation. arXiv preprint arXiv:2106.04732.

Comment

Dear Reviewer qMp2,

We are especially encouraged by your initial review. Thank you for raising the constructive question about the effects of the AlphaFold predictions. Your inquiry gave us the opportunity to clarify this crucial aspect of our study. We have thoroughly addressed these points in the rebuttal and hope our response resolves your concerns.

Once again, we extend our heartfelt thanks for your time and effort during the author-reviewer discussion period.

Sincerely,

The Authors

Comment

Dear Reviewer qMp2,

Thanks for your review. We have tried our best to address your questions, and we respectfully thank you for supporting the acceptance of our work. Also, please let us know if you have any further questions. We look forward to further discussions!

Sincerely,

The Authors

Comment

I appreciate the efforts made by the authors during the rebuttal. Most of my concerns are addressed. I acknowledge that the proposed method is effective for learning protein representations. I will raise my score to positive.

Comment

Thank you so much for your prompt feedback. Your initial review was incredibly encouraging, and your valuable suggestions have been instrumental for us. We sincerely appreciate your time, effort, and kind words.

Official Review
Rating: 6

This paper proposes to learn hybrid embeddings for the protein and the GO terms. By further applying a teacher-student training schedule, the additional GO-term input is not needed during inference. Experimentally, the authors demonstrate that the model achieves better results on several tasks, e.g., folding classification, reaction classification, and GO term classification.

Strengths

[+] ProtGO proposes to improve the protein representation with extra side information, namely the function descriptions.
[+] With the teacher-student approach, additional function input is unnecessary for the student network, simplifying the inference process.
[+] Benchmark experiments demonstrate that ProtGO significantly outperforms state-of-the-art baselines.

Weaknesses

[-] The proposed method only demonstrates results on several small benchmarks. More results on larger-scale benchmarks would be useful, e.g., residue prediction, mutation effect prediction, etc. Functions are related to the folding / reaction / GO term / EC numbers; I wonder whether it would be helpful to demonstrate the performance on other problems, to show the generalization of the proposed method.
[-] The proposed method should learn better functions; therefore, binding affinity prediction may be a better downstream or zero-shot task to demonstrate the model performance.
[-] The experimental results could report error bars.

Questions

I wonder how the authors avoid data leakage. What is the sequence and structure similarity (which can be measured using the AlphaFold DB) between the functional annotation dataset and the other datasets?

Limitations

n/a

Author Response

Dear Reviewer oeK7,

We are grateful for your thorough review. Your comments are highly valued, and we would like to express our heartfelt gratitude. We do our utmost to address the questions you have raised:

Q1 The proposed method only demonstrates results on several small benchmarks. More results on larger-scale benchmarks would be useful, e.g., residue prediction, mutation effect prediction, etc. Functions are related to the folding / reaction / GO term / EC numbers; I wonder whether it would be helpful to demonstrate the performance on other problems, to show the generalization of the proposed method.

A1 Thank you for your valuable feedback! Protein design involves the computational creation of amino acid sequences that fold into a specified protein structure. Methods such as ESM-IF1 [1], PiFold [2], and VFN-IF [3] are dedicated to protein design, which is distinct from protein representation learning for function prediction. Focusing on protein inverse folding, we apply our approach to the CATH 4.2 dataset, as detailed in Table 2 of the one-page rebuttal PDF. Our method performs strongly in this setting, which demonstrates the generalization of the proposed approach.

Q2 The proposed method should learn better functions; therefore, binding affinity prediction may be a better downstream or zero-shot task to demonstrate the model performance.

A2 Thank you for your informative review! We use the binding prediction datasets from [4] and [5]: DNA binding site prediction is trained on DNA-573 Train and tested on DNA-129 Test, and RNA binding site prediction is trained on RNA-495 Train and tested on RNA-117 Test. We use AUC for evaluation. Our proposed method achieves the best results on binding prediction, illustrating its generalization ability.

Method           DNA (AUC)   RNA (AUC)
SVMNUC [6]       0.812       0.729
Coach-D [7]      0.761       0.663
NucBind [6]      0.797       0.715
GraphBind [8]    0.927       0.854
VABS-Net [5]     0.912       0.834
ProtGO (Ours)    0.941       0.878
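
For reference, a toy illustration of the AUC evaluation used above, computed with scikit-learn; `y_true` and `y_score` are hypothetical per-residue binding labels and predicted probabilities, not values from these experiments.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 1, 1, 0, 1])                  # hypothetical binding labels
y_score = np.array([0.10, 0.82, 0.74, 0.31, 0.91])  # hypothetical predicted probabilities
print(roc_auc_score(y_true, y_score))               # 1.0 for this toy example
```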

Q3 The experimental results could report error bars.

A3 Thank you for your review! The mean values are reported; we have now calculated the variance of the results on protein function prediction (given in parentheses below). Our results have small error bars.

Method              GO-BP           GO-MF           GO-CC           EC
ProtGO (Student)    0.464 (0.005)   0.667 (0.002)   0.492 (0.006)   0.857 (0.008)
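
As a toy sketch of how the "mean (spread)" entries above could be aggregated across runs; the three scores are hypothetical, and treating the parenthesized value as a sample standard deviation is our assumption.

```python
import numpy as np

# Hypothetical scores from three training runs with different seeds.
runs = np.array([0.459, 0.464, 0.469])
print(f"{runs.mean():.3f} ({runs.std(ddof=1):.3f})")  # prints "0.464 (0.005)"
```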

Q4 I wonder how the authors avoid data leakage. What is the sequence and structure similarity (which can be measured using the AlphaFold DB) between the functional annotation dataset and the other datasets?

A4 Thank you for your questions! (1) In the teacher-student framework, the student model learns from the teacher model's latent embeddings through a knowledge distillation loss. To ensure data integrity, the test sets used in the downstream tasks are excluded from the functional annotation dataset, addressing concerns about data leakage. (2) For the functional annotation dataset used by the teacher model and the downstream datasets used by the student model, the sequence similarity between it and the EC number prediction dataset is only 25% and the structural similarity is 19%; the sequence similarity between it and the GO term prediction dataset is 75%. (3) Following the GearNet [9] protocol, the test sets for GO term and EC number prediction exclusively comprise PDB chains with less than 95% sequence identity to the training set, allowing varied cutoff splits to be generated.

Thank you again for all the efforts that helped us improve our manuscript! We have tried our best to address your concerns, as this work is important for our graduation; we respectfully thank you for supporting the acceptance of our work. Also, please let us know if you have any further questions. We look forward to further discussions!

[1] Hsu, C., et al. Learning inverse folding from millions of predicted structures. ICML, 2022.

[2] Gao, Z., et al. PiFold: Toward effective and efficient protein inverse folding. ICLR, 2022a.

[3] Mao, W., et al. De novo protein design using geometric vector field networks. arXiv, 2023.

[4] Zhang, C., et al. Us-align: universal structure alignments of proteins, nucleic acids, and macromolecular complexes. Nature methods, 2022.

[5] Zhuang, W., et al. Pre-Training Protein Bi-level Representation Through Span Mask Strategy On 3D Protein Chains. ICML. 2024.

[6] Su, H., et al. Improving the prediction of protein–nucleic acids binding residues via multiple sequence profiles and the consensus of complementary methods. Bioinformatics, 2019.

[7] Wu, Q., et al. Coach-D: improved protein–ligand binding sites prediction with refined ligand-binding poses through molecular docking. Nucleic acids research, 2018.

[8] Xia, Y., et al. Graphbind: protein structural context embedded rules learned by hierarchical graph neural networks for recognizing nucleic-acid-binding residues. Nucleic acids research, 2021.

[9] Zhang, Z., et al. Protein representation learning by geometric structure pretraining. ICML, 2022b.

Comment

Dear Reviewer oeK7,

We express our sincere gratitude for your constructive feedback in the initial review. It is our hope that our responses adequately address your concerns. Your expert insights are invaluable to us in our pursuit of elevating the quality of our work. We are fully aware of the demands on your time and deeply appreciate your dedication and expertise throughout this review.

We eagerly anticipate your additional comments and are committed to promptly addressing any further concerns.

Once again, we extend our heartfelt thanks for your time and effort during the author-reviewer discussion period.

Sincerely,

The Authors

Comment

Dear Reviewer oeK7,

Thanks for your review. We have tried our best to address your questions, and we respectfully thank you for supporting the acceptance of our work. Also, please let us know if you have any further questions. We look forward to further discussions!

Sincerely,

The Authors

Comment

Thanks for your feedback. These baselines and more experiments can address my questions. The discussion about data leakage is helpful. I will change my score to 6.

Comment

Thank you for your review service; your response is greatly appreciated.

Official Review
Rating: 6

In the paper entitled "ProtGO: Function-Guided Protein Modeling for Unified Representation Learning", the authors propose a KD-based framework that incorporates GO knowledge to learn a unified, multi-modal embedding for a given protein. The cross-domain knowledge makes the embeddings perform well in various downstream tasks.

Strengths

  1. The paper is well-written and easy-to-follow.
  2. The proposed unified framework is novel and interesting; it may lay the foundation for future research on protein embedding.
  3. The introduction of KD for transferring GO knowledge could potentially overcome the data insufficiency problem.

Weaknesses

Lack of sufficient baseline methods: I believe that in all modalities (sequence/structure/GO) there exist other powerful methods that the authors may have ignored.

Minor: In line 4, the authors wrote "including sequence, structure, domains, motifs, and..."; generally, domains and motifs are structural annotations and should not be listed together with sequence and structure.

Questions

  1. In Tables 1 and 2, the authors may consider adding ESM-1b/ESM-2/ESM-3 as sequence-modality baselines.
  2. Also in Tables 1 and 2, the authors may consider adding ESM-IF1 as a structural-modality baseline.
  3. Also in Tables 1 and 2, the authors may consider adding GO / sequence+GO modality as input, and adding DeepGO-SE as a baseline. (https://www.nature.com/articles/s42256-024-00795-w)
  4. I am interested in the downstream-task performance differences between proteins with GOs (where the teacher can teach the student well) and proteins without GOs (where the teacher model may fail). Moreover, GO terms consist of three parts: molecular functions, cellular components, and biological processes; I wonder whether the authors could explore the effectiveness of each part through comprehensive downstream experiments.

Limitations

The authors have addressed the limitations.

Author Response

Dear Reviewer kUCf,

We are grateful for your thorough review. Your comments are highly valued, and we would like to express our heartfelt gratitude. We do our utmost to address the questions you have raised:

Q1 In Tables 1 and 2, the authors may consider adding ESM-1b/ESM-2/ESM-3 as sequence-modality baselines.

A1 Thank you for your valuable feedback! ProtGO is not a pre-training model, so a direct comparison with pre-training methodologies would be unfair. To enable such a comparison, we integrate ESM-2 (650M) [1] with ProtGO, denoted ProtGO-ESM, where the ESM embeddings serve as part of the graph node features. We evaluate ProtGO-ESM against pre-training techniques on tasks such as protein function prediction and EC number prediction, covering sequence-based approaches (ESM-1b [2] and ESM-2), a sequence-function model (ProtST [3]), and sequence-structure methods (GearNet-ESM [4], SaProt [5], and GearNet-ESM-INR-MC [6]). The outcomes for ESM-3 [7] are still pending due to resource constraints. The comparative results are detailed in Table 1 of the one-page rebuttal PDF; our proposed model, ProtGO-ESM, is the top performer across sequence-based, sequence-structure, and sequence-function pre-training strategies.

Q2 Also in Tables 1 and 2, the authors may consider adding ESM-IF1 as a structural-modality baseline.

A2 Thank you for your informative review! Protein design involves the computational creation of amino acid sequences that fold into a specified protein structure. Methods such as ESM-IF1 [8], PiFold [9], and VFN-IF [10] are dedicated to protein design, which is distinct from protein representation learning for function prediction. Focusing on protein inverse folding, we apply our approach to the CATH 4.2 dataset, as detailed in Table 2 of the one-page rebuttal PDF. Our method demonstrates strong performance in this context, outperforming nearly all other approaches on this task.

Q3 Also in Tables 1 and 2, the authors may consider adding GO / sequence+GO modality as input, and adding DeepGO-SE as a baseline.

A3 Thank you for your review! We have compared our model, ProtGO-ESM, with pre-training methods that take sequence, sequence-structure, and sequence-function as inputs; the results are shown in Table 1 of the one-page rebuttal PDF.

Q4 I am interested in the downstream-task performance differences between proteins with GOs (where the teacher can teach the student well) and proteins without GOs (where the teacher model may fail). Moreover, GO terms consist of three parts: molecular functions, cellular components, and biological processes; I wonder whether the authors could explore the effectiveness of each part through comprehensive downstream experiments.

A4 Thank you for your reviews! (1) The effectiveness of the teacher model in instructing the student is notably higher for GO terms with a higher frequency. Conversely, our experiments reveal that when the frequency of a GO term falls below 50, the teacher model may struggle. For instance, in the case of the GO term GO:0030027, denoted as lamellipodium, the performance of the teacher model is suboptimal. (2) Through experimental analysis, we segmented GO terms into three categories: molecular functions (MF), cellular components (CC), and biological processes (BP). By utilizing only one category of GO terms as input for the GO encoder of the teacher model, we aimed to assess the efficacy of each category. Our findings indicate that focusing on a single category, such as MF, primarily enhances predictions related to MF in subsequent tasks. While specialization in a specific category can improve accuracy within that domain, it may not directly translate to improved predictions in the other two categories (BP and CC). This divergence stems from the distinct biological aspects represented by each category, each with its unique characteristics and interrelations.

Q5 Minor: In line 4, the authors wrote "including sequence, structure, domains, motifs, and..."; generally, domains and motifs are structural annotations and should not be listed together with sequence and structure.

A5 Thank you for your suggestion! We will rectify this in the revised version.

Thank you again for all the efforts that helped us improve our manuscript! We have tried our best to address your concerns, as this work is important for our graduation; we respectfully thank you for supporting the acceptance of our work. Also, please let us know if you have any further questions. We look forward to further discussions!

[1] Lin, Z., et al. Language models of protein sequences at the scale of evolution enable accurate structure prediction. bioRxiv, 2022.

[2] Rives, A., et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. PNAS, 2021.

[3] Xu, M., et al. Protst: Multi-modality learning of protein sequences and biomedical texts. ICML, 2023.

[4] Zhang, Z., et al. Protein representation learning by geometric structure pretraining. ICML, 2022b.

[5] Su et al. SaProt: Protein Language Modeling with Structure-aware Vocabulary. ICLR, 2024.

[6] Lee, Y., et al. Pre-training sequence, structure, and surface features for comprehensive protein representation learning. 2023.

[7] Hayes, T., et al. Simulating 500 million years of evolution with a language model. bioRxiv, 2024.

[8] Hsu, C., et al. Learning inverse folding from millions of predicted structures. ICML, 2022.

[9] Gao, Z., et al. PiFold: Toward effective and efficient protein inverse folding. ICLR, 2022a.

[10] Mao, W., et al. De novo protein design using geometric vector field networks. arXiv, 2023.

Comment

Dear Reviewer kUCf,

We express our sincere gratitude for your constructive feedback in the initial review. It is our hope that our responses adequately address your concerns. Your expert insights are invaluable to us in our pursuit of elevating the quality of our work. We are fully aware of the demands on your time and deeply appreciate your dedication and expertise throughout this review.

We eagerly anticipate your additional comments and are committed to promptly addressing any further concerns.

Once again, we extend our heartfelt thanks for your time and effort during the author-reviewer discussion period.

Sincerely,

Authors

Comment

The authors have addressed my concerns. I will keep my score unchanged.

Comment

Thank you for your review service; your prompt response is greatly appreciated.

Author Response

First and foremost, we would like to express our sincere gratitude for the insightful and constructive feedback provided by the reviewers on our manuscript. We greatly appreciate their positive reception of ProtGO's potential and its timely relevance in the field of protein research.

We are particularly thankful for Reviewers kUCf and qMp2's recognition of the novelty of our study, which may lay the foundation for future research on protein embedding. We also appreciate the acknowledgment that introducing knowledge distillation for transferring GO knowledge could potentially overcome the data insufficiency problem (Reviewers kUCf, oeK7). The reviewers have acknowledged the promising quality and significance of our work, which achieves better results on tasks such as folding classification, reaction classification, and GO term classification (Reviewers kUCf, oeK7). Additionally, Reviewer kUCf acknowledged that our writing is well-organized and easy to follow.

Comparisons with pre-training methods: We appreciate the feedback received, particularly addressing the absence of comparisons with pre-training methods. It is essential to note that ProtGO does not operate as a pre-training model, making direct comparisons with pre-training methodologies inappropriate. To address this, we integrate ESM-2 (650M) [1] with ProtGO, forming ProtGO-ESM, where ESM embeddings enhance graph node features. Our assessment contrasts ProtGO-ESM with pre-training techniques in protein function and EC number prediction tasks. This evaluation encompasses diverse methods, including ESM-1b [2], ESM-2 for sequence-based approaches, ProtST [3] for sequence-function models, and GearNet-ESM [4], SaProt [5], and GearNet-ESM-INR-MC [6] for sequence-structure methodologies. Detailed comparative results are outlined in Table 1 of the one-page rebuttal PDF. Notably, our proposed model, ProtGO-ESM, emerges as the top performer across various pre-training strategies.

Generalization Ability: In the one-page rebuttal PDF, we present results on protein design, showcased in Table 2, underscoring the generalization capacity of our method. Additionally, we conducted binding affinity prediction experiments on DNA [7] and RNA [8] datasets. As depicted in the table below, our method excels in binding affinity prediction, emphasizing its robust generalization capabilities.

Method           DNA (AUC)   RNA (AUC)
SVMNUC [9]       0.812       0.729
Coach-D [10]     0.761       0.663
NucBind [9]      0.797       0.715
GraphBind [11]   0.927       0.854
VABS-Net [8]     0.912       0.834
ProtGO (Ours)    0.941       0.878

Once again, we sincerely appreciate the reviewers' feedback and remain committed to continuously improving our research and manuscript based on their valuable insights. Thank you again!

[1] Lin, Z., et al. Language models of protein sequences at the scale of evolution enable accurate structure prediction. bioRxiv, 2022.

[2] Rives, A., et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. PNAS, 2021.

[3] Xu, M., et al. Protst: Multi-modality learning of protein sequences and biomedical texts. ICML, 2023.

[4] Zhang, Z., et al. Protein representation learning by geometric structure pretraining. ICML, 2022b.

[5] Su et al. SaProt: Protein Language Modeling with Structure-aware Vocabulary. ICLR, 2024.

[6] Lee, Y., et al. Pre-training sequence, structure, and surface features for comprehensive protein representation learning. 2023.

[7] Zhang, C., et al. Us-align: universal structure alignments of proteins, nucleic acids, and macromolecular complexes. Nature methods, 2022.

[8] Zhuang, W., et al. Pre-Training Protein Bi-level Representation Through Span Mask Strategy On 3D Protein Chains. ICML. 2024.

[9] Su, H., et al. Improving the prediction of protein–nucleic acids binding residues via multiple sequence profiles and the consensus of complementary methods. Bioinformatics, 2019.

[10] Wu, Q., et al. Coach-D: improved protein–ligand binding sites prediction with refined ligand-binding poses through molecular docking. Nucleic acids research, 2018.

[11] Xia, Y., et al. Graphbind: protein structural context embedded rules learned by hierarchical graph neural networks for recognizing nucleic-acid-binding residues. Nucleic acids research, 2021.

Final Decision

This paper proposes a protein representation learning framework based on a knowledge distillation scheme, where a GNN teacher with trimodal knowledge of the protein (sequence, structure, and functionality) distills the knowledge into a GNN student with only sequence-structure information. The idea is that the functional annotations of proteins are important but may be missing in many downstream tasks. The proposed method achieves superior performance over existing methods on multiple benchmarks.

The reviewers found the proposed work well motivated and the framework novel and effective, based on its strong benchmark performance. However, there were concerns regarding missing large-scale experiments, the rationale for why the proposed framework performs well even without KD, and the discussion of a similar work. Those concerns were addressed during the rebuttal period, and there was a consensus to accept the paper.