SaProt: Protein Language Modeling with Structure-aware Vocabulary
Abstract
Reviews and Discussion
Protein language models pretrained on protein residue sequences have become useful primitives in bioinformatics, as, much like their cousins in NLP, they can easily achieve good performance on diverse downstream tasks. In this paper, the authors train a protein language model on sequences of residue tokens augmented with VQ-VAE tokens encoding each protein's structure (or, lacking that, the AlphaFold2 prediction of its structure). Compared to traditional PLMs, the structure-aware language model performs better on a large suite of protein-related tasks.
Strengths
The paper is clear and well-written, the evaluations extensive, and the results very good. The method is well-motivated and a very natural application of the VQ-VAE in Foldseek. Furthermore, some of the results are fairly surprising; it's unintuitive to me, for example, how much better this model is than the largest ESM-2 models. Beyond structural biology, I think this paper is also a nice addition to the literature on multimodal language modeling.
Weaknesses
I don't have much negative to say about this. Insofar as the MSA Transformer is also a "structure-aware" language model, it would be interesting to see comparisons between SaProt and that model. I know they never released those parameters, but if it would be possible to use any of their self-reported performance figures, that would be nice.
It might also be interesting to see Figure 4 include an analogous experiment for the residue tokens.
Questions
Do you think SaProt would do well at this: https://mastodon.cloud/@sokrypton/109560748589299528? It seems to me like adding the structure tokens could eliminate the need to manually account for P(structure | sequence) as Sergey has to here.
Did you ever measure the correlation between AlphaFold2 pLDDT and the success rate of this model on downstream tasks? Also, given that you mask low-pLDDT regions during training, do you do the same during inference?
Dear reviewer LEd3,
Thank you for your support and constructive comments!
I don't have much negative to say about this. Insofar as the MSA Transformer is also a "structure-aware" language model, it would be interesting to see comparisons between SaProt and that model. I know they never released those parameters, but if it would be possible to use any of their self-reported performance figures, that would be nice.
Thanks for this question. We agree with you that the MSA Transformer can be regarded as a "structure-aware" language model. However, searching for a large quantity of MSAs is very costly for us, and the performance of the MSA Transformer heavily depends on MSA quality, so we did not evaluate it in our initial version. Given that the ProteinGym and ClinVar datasets have released MSA files, we were able to evaluate the MSA Transformer on the zero-shot tasks. We have updated our paper to add these additional zero-shot results; please see Table 1 for details. In fact, a NeurIPS 2022 paper [1] also evaluated the MSA Transformer for mutational effect prediction, and its results show that the MSA Transformer may not outperform ESM1b.
[1] Exploring evolution-aware & -free protein language models as protein function predictors. NeurIPS 2022
It might also be interesting to see Figure 4 include an analogous experiment for the residue tokens.
Sure! We have run this experiment and plotted the result in Figure 4 (now Figure 3, as we have rearranged our paper as suggested by reviewer 1519). Please check our updated PDF. As expected, the zero-shot performance decreases when we substitute either residue or structure tokens.
Do you think SaProt would do well at this: https://mastodon.cloud/@sokrypton/109560748589299528? It seems to me like adding the structure tokens could eliminate the need to manually account for P(structure | sequence) as Sergey has to here.
We think that SaProt has an advantage in performing protein inverse folding, as it was pre-trained to predict residue tokens given structures. From this perspective, we think SaProt could eliminate the need to further account for P(structure | sequence). Additionally, we think that the new residue-structure vocabulary can be directly utilized for inverse folding during generation. This is because the vocabulary incorporates structural constraints, allowing structural information to be included automatically when generating residue tokens.
Did you ever measure the correlation between AlphaFold2 pLDDT and the success rate of this model on downstream tasks? Also, given that you mask low-pLDDT regions during training, do you do the same during inference?
Thank you for this great question. While we didn't perform the exact same experiments, we did conduct experiments to assess the impact of masking low-pLDDT regions (using a threshold of 70) in downstream tasks on SaProt's performance. We consistently observed that masking structure tokens with pLDDT < 70 led to improved performance compared to not masking them, which aligns with our pre-training strategy. In both zero-shot prediction and supervised fine-tuning, we maintained the practice of masking all structure tokens with pLDDT values below 70, as described in paragraph 2 of Appendix B.
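For concreteness, below is a minimal sketch of the pLDDT-based masking step described above; the function and variable names are our own illustrations and are not taken from the released SaProt code.

```python
# Minimal sketch of low-pLDDT masking for structure-aware tokens.
# Names and the toy inputs are illustrative only.

def build_sa_tokens(aa_seq, foldseek_seq, plddt, threshold=70.0):
    """Combine residue and Foldseek 3Di tokens into structure-aware tokens,
    masking the structural part with "#" wherever pLDDT < threshold."""
    assert len(aa_seq) == len(foldseek_seq) == len(plddt)
    tokens = []
    for aa, fs, conf in zip(aa_seq, foldseek_seq, plddt):
        struct = fs if conf >= threshold else "#"  # "#" = masked structure token
        tokens.append(aa + struct)                 # e.g. "A" + "p" -> "Ap"
    return tokens

# Toy example: the last residue is low-confidence, so its 3Di token is masked.
print(build_sa_tokens("MKV", "pdq", [92.1, 85.4, 55.0]))  # ['Mp', 'Kd', 'V#']
```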
Thanks again for your support and insightful questions.
Thanks for the reply! Looking forward to seeing this at the conference.
This paper introduces a structure-aware vocabulary approach to incorporate protein structure information into protein sequence representation learning models. It discretizes protein structure information using Foldseek to obtain sequence representations of protein structures. It then combines protein structure sequences with amino acid sequence information to create Structure-aware Sequences. To ensure that the model learns the semantics of amino acid tokens and structure information tokens effectively, the paper designs various masking strategies. The paper validates the model's performance on multiple downstream tasks.
Strengths
- The paper proposes a structure-aware vocabulary that cleverly integrates protein structure information with sequence information, enabling protein sequence models to capture structural semantic information. This is a positive development for protein representation learning models.
- Figure 2 compares various methods for modeling structural semantic information, providing insights into designing better protein structure models.
- Results in Table 1 and Table 2 demonstrate a significant improvement in the performance of multiple downstream protein tasks with the introduction of the structure-aware vocabulary.
Weaknesses
- Lack of ablation experiments: The experimental results of this paper do not effectively validate the model's performance in scenarios where structural information is not provided, such as comparing the performance of SaProt and ESM-2 on contact prediction tasks when structural information is not given.
Questions
- The model structure proposed in this paper is consistent with ESM-2. Did the paper attempt to extend the structure-aware vocabulary on top of a pre-trained ESM-2 model to train a new model? Would this potentially lower the training cost?
- Foldseek focuses more on local protein structure information when constructing structure token sequences. Could this lead to the model not capturing a comprehensive set of structural information?
Dear reviewer WitX,
Thank you for your constructive comments! We have taken them into careful consideration and would like to provide a thorough response addressing your main concerns.
Lack of ablation experiments: The experimental results of this paper do not effectively validate the model's performance in scenarios where structural information is not provided, such as comparing the performance of SaProt and ESM-2 on contact prediction tasks when structural information is not given.
Thanks for this question. In our original submission, we performed ablation experiments to validate SaProt's ability when structures are not available; they are in Figure 4 (now Figure 3, as we have rearranged our paper as suggested by reviewer 1519) and Appendix E.2. We masked a certain percentage (from 0 to 1) of structure tokens and evaluated SaProt's performance. In Appendix E.3, we tested SaProt on supervised mutation datasets where protein structures were not available.
Additionally, we have made updates to the table regarding the contact prediction task. We have included the results of SaProt with all structure tokens masked. Please refer to our updated PDF for the details. Also note that the results differ between the zero-shot and supervised learning tasks. In the supervised learning task, the model is retrained on downstream tasks, and even without the structure tokens, it can still perform well, similar to residue-only ESM models. However, for zero-shot learning tasks, the absence of structural tokens can significantly impact performance.
Furthermore, we have conducted additional experiments replacing AF2 tokens with ESMFold tokens in Appendix E.4. ESMFold predicts protein structures faster but has lower overall accuracy compared to AF2 when MSA data is provided. Please refer to the corresponding section for more information.
The model structure proposed in this paper is consistent with ESM-2. Did the paper attempt to extend the structure-aware vocabulary on top of a pre-trained ESM-2 model to train a new model? Would this potentially lower the training cost?
Thanks for this suggestion. We indeed considered the option of pre-training SaProt with ESM-2 weights before initiating our own pre-training process. However, it would be challenging to seamlessly continue the pre-training of SaProt using ESM-2 weights. ESM-2 was originally pre-trained with an amino acid vocabulary, and its weights were well-aligned with its token embeddings. However, when we replaced the vocabulary with our structure-aware vocabulary, the token embeddings were randomly initialized, resulting in a mismatch with the pre-trained ESM-2 weights. This mismatch had the potential to impede the optimization of the model.
For your question, we conducted an experiment to pre-train the SaProt 35M model both with and without loading ESM-2 pre-trained weights. You can review the results of this experiment through the following link: (https://api.wandb.ai/links/ltenjoy/2u80knzu). The purple curve, representing pre-training without loading pre-trained weights, exhibits a faster decrease compared to the green curve, which represents pre-training with loaded pre-trained weights. This suggests that pre-training SaProt based on ESM-2 pre-trained weights may not be a useful choice.
Foldseek focuses more on local protein structure information when constructing structure token sequences. Could this lead to the model not capturing a comprehensive set of structural information?
Thanks for this insightful question. It is true that Foldseek tokens focus more on local protein structure information, but we feel this does not prevent SaProt from capturing global structure information. Previous GNNs build protein graphs based on k-nearest neighbors and extract information from these neighbors; this can also be thought of as local structure information, yet the information flows from each node to the others and the whole graph is updated as embeddings are passed to the next layer. We think SaProt's attention architecture likewise enables it to aggregate local structure information and eventually obtain a comprehensive understanding of the global structure. Please refer to Section 5.3, as we think the visualization of SaProt embeddings illustrates that it captures a comprehensive set of structural information.
Thank you again for your patience in reading our explanation! We hope the answers above resolve your concerns, and we look forward to your reply for further discussion!
Thank you to the authors for your patient reply. Your reply has solved most of my questions, so I have reconsidered the score. Looking forward to seeing this paper at the conference.
This paper presents a new protein language model called SaProt. SaProt incorporates structural information during pretraining, leveraging the large quantity of predicted structures made available from AF2. As a simple way to incorporate structural information, SaProt leverages Foldseek, a recent approach to translate 3D structures into token sequences. Comparing with approaches from prior work, SaProt performs stronger on a range of protein function prediction tasks. Analysis supports the conclusion that it is the structural information that leads the model to outperform approaches such as ESM from prior work that include only sequence information.
Strengths
- The paper demonstrates that pre-training with structures predicted from AF2 at a large scale can be useful for a wide variety of protein tasks, and that this information is relatively straightforward to incorporate in standard Transformer architectures using Foldseek.
- The analysis in section 5 helps support the conclusion that it is indeed the structural information, and not other differences with prior work, that leads to the improvement in task performance.
- The authors release their code and pre-trained models.
Weaknesses
- Foldseek has several hyperparameters. It would be useful to understand their impact. Since one of the core contributions is the proposed vocabulary, it would have been helpful to see more analysis on the various choices involved (see questions below).
- The presentation of the paper could be improved in several places. For example, section 3.1 seemed to distract from the presentation of SaProt and would be perhaps better presented as additional analysis later in the paper. Using MLM loss across the three modeling approaches did not seem like a clear comparison, given the differences in model architectures and inputs. Additionally, the connection between these findings and the design choices of SaProt was not clear. Additionally, in section 3.3.2, it may be clearer to present just strategy 2 and then the ablation comparison with strategy 1 later. These recommendations would simplify section 3 and the exposition of SaProt.
- With the availability of AlphaFoldDB and Foldseek, leveraging these resources to incorporate structural information into Transformer-based PLMs is a somewhat straightforward step, and prior work has done this, e.g. ProstT5. ProstT5 is a relatively recent preprint, but I'm glad the authors acknowledged this work. However, while experiments show that SaProt outperforms ProstT5, it didn't seem clear to dismiss this work as not a "general-purpose PLM". Regardless, it seems that the main contribution of this work is not in the novelty of approach, but in the empirical results.
Questions
- It seemed unintuitive to use the cross-product of residues and Foldseek tokens to form the vocabulary. This seems to discard information about which vocab elements share the same residue or Foldseek token. Did you try other alternatives, e.g. by concatenating embeddings for the residue and Foldseek token to form the input representations?
- Is the set of proteins used for pre-training SaProt the same as those used for ESM-2?
Nits:
- Introduction paragraph 2 - redidue -> residue
- Introduction paragraph 2 - "real protein structures" -> "experimentally determined protein structures"?
Regardless, it seems that the main contribution of this work is not in the novelty of approach, but in the empirical results.
Sorry, in this context, when you say "this work", do you refer to SaProt or ProstT5?
Regarding our work, one key contribution of SaProt is its ability to replace ESM models in various biological tasks. ESM1b, a protein version of BERT trained solely on amino acid sequences, has become a prominent AI tool for numerous biology tasks, including disease/cancer variant effect prediction, protein/enzyme engineering, protein design, fitness prediction, protein function prediction, and protein-protein interaction prediction. Our SaProt model surpasses ESM1b/-2 (including the 15B ESM-2, see Appendix Table 6) on these ten tasks, making it a compelling alternative.
Furthermore, unlike ProstT5 (which takes either an AA sequence or a 3Di sequence as input), we have developed a novel vocabulary that represents proteins using both residue and structural information. We believe this new vocabulary, although simple and intuitive, will find wide utilization in many other protein-related tasks (e.g. protein design, complex prediction, etc.). Compared to other recent structure-aware protein LMs, SaProt was trained with massive numbers of protein structures; for example, GearNet, a well-known protein structure model, was trained with only 800,000 structures. SaProt can support both protein-level and residue-level protein tasks. A detailed summary of our contributions can be found on page 2 of our paper.
It seemed unintuitive to use the cross-product of residues and Foldseek tokens to form the vocabulary. This seems to discard information about which vocab elements share the same residue or Foldseek token. Did you try other alternatives, e.g. by concatenating embeddings for the residue and Foldseek token to form the input representations?
We suspect there might be some misunderstanding in this particular aspect. Creating a new vocabulary by combining residues and Foldseek 3Di tokens is a straightforward and intuitive approach that allows for effective sharing of information between the residue and structural tokens. For example, tokens like Aa, Ap, and Ac indicate the presence of the same amino acid residue "A," while tokens like Ap, Cp, and Hp indicate the shared structural token "p." This information can be easily learned through pre-training using a mask-and-predict approach. In our paper, we conducted corresponding experiments to validate this. Please refer to Appendix Figure 8(c). In Figure 8(c), we observed a clear clustering pattern among the new residue-structure tokens, where tokens with the same residue or structure tend to cluster together. Kindly note that these tokens have no inherent relationship during the initialization process, but through pre-training, the representations of the new vocabulary tokens clearly capture the corresponding residue and structural information.
Tokens with the same amino acid, such as Aa, Ap, Ac, and A#, or with the same structural token, such as Ap, Cp, Hp, and #p, usually appear in similar contexts: whenever a protein sequence contains Aa, Ap, or Ac, the shared residue "A" makes their surrounding context inherently similar. Through the masked language modeling (MLM) objective, their embeddings therefore gradually become closer. This phenomenon is analogous to word embeddings in NLP, where words with similar meanings often have similar embeddings.
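To make the cross-product construction concrete, a minimal sketch of the combined vocabulary is shown below. The exact letter sets are assumptions for illustration (20 standard amino acids plus "#", and 20 lowercase Foldseek 3Di letters plus "#"), which yields the 441 tokens mentioned above; the released implementation may order or extend the alphabets differently.

```python
# Illustrative sketch of the residue-structure (SA) vocabulary: the cross
# product of 21 residue symbols and 21 structure symbols gives 441 tokens.
# The exact alphabets below are assumptions for illustration only.

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY") + ["#"]   # 20 residues + mask
FOLDSEEK_3DI = list("acdefghiklmnpqrstvwy") + ["#"]  # 20 3Di states + mask

SA_VOCAB = [aa + fs for aa in AMINO_ACIDS for fs in FOLDSEEK_3DI]
SA_TOKEN_TO_ID = {tok: i for i, tok in enumerate(SA_VOCAB)}

print(len(SA_VOCAB))  # 441
# Tokens sharing a residue ("Aa", "Ap") or a structure state ("Ap", "Cp")
# start out unrelated; MLM pre-training pulls their embeddings together.
print(SA_TOKEN_TO_ID["Aa"], SA_TOKEN_TO_ID["Ap"], SA_TOKEN_TO_ID["Cp"])
```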
Did you try other alternatives, e.g. by concatenating embeddings for the residue and Foldseek token to form the input representations?
Thank you for your suggestion. While we haven't tried that approach, intuitively it seems like a viable option. The main difference between the two methods lies in how tokens are merged and embedded. In our approach, we first merge tokens like Aa and Ac and then embed them; during initialization these tokens have distinct embeddings, but through learning they capture the shared information of "A". In your method, A, a, and c are first assigned separate embeddings and then concatenated. During the prediction phase, your approach allows amino acid tokens and structural tokens to be predicted separately, while we predict within a unified new vocabulary space (with 441 new tokens). Due to the very expensive training time and compute resources, we have not yet tried this approach, but it would indeed be interesting to see what happens when the two are embedded and trained separately.
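To make the distinction concrete, here is a minimal PyTorch-style sketch of the two input-construction schemes (our merged-vocabulary embedding vs. the concatenation variant you describe); all dimensions and module names are illustrative and not taken from our implementation.

```python
import torch
import torch.nn as nn

# Illustrative sketch only; sizes are hypothetical.
D, N_AA, N_3DI = 128, 21, 21

# (a) Our scheme: a single embedding table over the merged 441-token vocabulary.
sa_embed = nn.Embedding(N_AA * N_3DI, D)

# (b) The suggested alternative: separate residue/3Di tables, concatenated.
aa_embed = nn.Embedding(N_AA, D // 2)
fs_embed = nn.Embedding(N_3DI, D // 2)

aa_ids = torch.tensor([[0, 5, 17]])   # residue ids for a toy 3-residue protein
fs_ids = torch.tensor([[3, 3, 20]])   # 3Di ids (20 = "#", masked structure)

x_merged = sa_embed(aa_ids * N_3DI + fs_ids)                        # (1, 3, 128)
x_concat = torch.cat([aa_embed(aa_ids), fs_embed(fs_ids)], dim=-1)  # (1, 3, 128)
```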
Thank you again for patiently reading our response! We hope we could address your concerns and we are looking forward to your reply for further discussions!
Thank you for your reply. I confirm my score of leaning towards acceptance.
"this work" do you refer to SaProt, or ProstT5?
I was referring to SaProt. But I acknowledge that ProstT5 was concurrent work, and is different in several ways.
Notes on vocabulary
Let me clarify my original comments. I acknowledge that the model can learn during pre-training that "Aa" and "Ap" correspond to the same amino acid, but I still think it would have been more intuitive for this information to be explicitly provided to the model, e.g. by embedding amino acids and Foldseek tokens separately and then combining them. Regardless, this is only a minor concern, and does not necessarily warrant additional experiments.
The presentation of the paper could be improved in several places. For example, section 3.1 seemed to distract from the presentation of SaProt and would be perhaps better presented as additional analysis later in the paper. Using MLM loss across the three modeling approaches did not seem like a clear comparison, given the differences in model architectures and inputs. Additionally, the connection between these findings and the design choices of SaProt was not clear. Additionally, in section 3.3.2, it may be clearer to present just strategy 2 and then the ablation comparison with strategy 1 later. These recommendations would simplify section 3 and the exposition of SaProt.
Thank you so much for providing so much valuable advice. Please allow us to answer your questions one by one.
Sub-question1: For example, section 3.1 seemed to distract from the presentation of SaProt and would be perhaps better presented as additional analysis later in the paper. Using MLM loss across the three modeling approaches did not seem like a clear comparison, given the differences in model architectures and inputs. Additionally, the connection between these findings and the design choices of SaProt was not clear.
Thanks for this great question. Both the MIF (Masked Inverse Folding) model and Evoformer were trained with the MLM loss for a fair comparison. We use the MLM loss because this objective aligns with amino acid-level prediction tasks such as mutational effect prediction.
Here, the purpose of presenting these models is not to compare which one achieves higher accuracy but to illustrate a key problem that is so far unknown to our community. Our results on the loss behaviors aim to show that the two intuitive ways of modeling protein structures completely fail when directly modeling AF2 structures with MLM. While some related studies employ similar strategies, we find they often use PDB structures instead of AF2 structures; this literature (e.g. MIF, a well-known and effective protein structural model) has not shown readers whether the models still work when using the massive AF2 structures. We aim to highlight the fact that directly incorporating AF2-predicted structures with an MLM pre-training strategy leads to information leakage: the structural prediction traces from AF2 can be captured by these large pre-trained models, rendering the training ineffective. We feel we should mention this before introducing our solution, otherwise readers may be confused as to why the predicted AF2 structures are not added directly but must instead be converted to Foldseek tokens.
In addition, we realize that most existing protein structural models are coupled with contrastive learning. A natural question is why they do not use the MLM loss, given that contrastive learning often focuses only on protein-level tasks and therefore cannot help amino acid-level mutation prediction tasks. To predict amino acid-level tasks, an MLM loss is often required. However, the MLM loss combined with AF2 structures raises the information leakage concern stated above.
We feel that this finding could be very useful for future protein structure modeling, as it has not been clearly mentioned or addressed in the relevant literature. In fact, before we proposed SaProt, we spent nearly a year exploring various structure modeling methods and attempted several intuitive approaches, all of which ultimately failed as long as we used these AF2 structures with the MLM loss. Thus, we want readers to understand our motivation for converting structures into discrete tokens rather than using structural coordinates.
In fact, we only show the loss trend in this section and provide a detailed description of these experiments in the appendix, as you suggested (i.e. Appendix E.1).
Sub-question2: Additionally, in section 3.3.2, it may be clearer to present just strategy 2 and then the ablation comparison with strategy 1 later. These recommendations would simplify section 3 and the exposition of SaProt.
Thank you for your great suggestion. We have rearranged section 3.3.2 and put the comparison between strategies in the Appendix F. Please check our updated PDF for more details.
Dear reviewer 1519,
Thank you for your insightful comments! We have taken them into careful consideration and would like to provide a thorough response addressing your main concerns. Please find our detailed response below.
Foldseek has several hyperparameters. It would be useful to understand their impact. Since one of the core contributions is the proposed vocabulary, it would have been helpful to see more analysis on the various choices involved (see questions below).
Thank you for this question, and we sincerely appreciate your suggestion. We did acknowledge this limitation in the conclusion of our paper. In this study, we utilized the default settings of Foldseek. According to the original Foldseek paper, the authors took into account factors such as accuracy and efficiency, ultimately determining that a 20-state 3Di alphabet produced favorable outcomes. While we recognize that modifying Foldseek's default settings could be interesting, any adjustment would necessitate re-pre-training the SaProt model (if we want to see its impact on performance), incurring substantial computational expense. Specifically, retraining SaProt requires 64 A100 GPUs running for three months, which exceeds a cost of $200,000.
Despite our inability to retrain the pretrained model, we conducted an interesting experiment to understand the significance of structural tokens. We substituted the AF2 structure with the ESMFold structure, as ESMFold offers faster structure prediction but relatively lower accuracy compared to AF2. We provided the corresponding results in Appendix E.4. The results indicate a significant performance advantage of using the AF2 structure over ESMFold in downstream tasks. This underscores SaProt's sensitivity to structural tokens. Hence, if future versions of Foldseek or improved structural prediction models become available, SaProt holds the potential for even more promising performance. We hope this experiment can be useful for readers.
With the availability of AlphaFoldDB and Foldseek, leveraging these resources to incorporate structural information into Transformer-based PLMs is a somewhat straightforward step, and prior work has done this, e.g. ProstT5. ProstT5 is a relatively recent preprint, but I'm glad the authors acknowledged this work. However, while experiments show that SaProt outperforms ProstT5, it didn't seem clear to dismiss this work as not a "general-purpose PLM". Regardless, it seems that the main contribution of this work is not in the novelty of approach, but in the empirical results.
Thanks for your excellent question. ProstT5, which became available online on July 25th, is indeed a remarkable work. Our SaProt work was conducted concurrently with ProstT5, as training the model alone takes about 3 months using 64 A100 GPUs (in the first two months we used 32 A100s, then increased to 64 A100s for another two months; in total, training took about 4 months). In fact, we also had email discussions regarding the SaProt design with John Jumper, the first author of AlphaFold, one month prior to the release of ProstT5.
Second, there are key differences between SaProt and ProstT5. The original ProstT5 paper states on page 6 that "ProstT5 is not a general-purpose pLM." We have further validated this by showcasing the performance of ProstT5 on several popular protein function tasks (see Appendix Table 5). In particular, ProstT5 exhibits unsatisfactory performance in zero-shot mutational effect prediction. This difference arises because ProstT5 is trained on a translation task, where the input is a residue sequence and the output is the 3Di token sequence, and vice versa. In contrast, our SaProt takes both the residue and 3Di token sequences as input, which contributes to its distinct representation learning capabilities. ProstT5 is more of a generative model, while SaProt is a representation (understanding) model.
Is the set of proteins used for pre-training SaProt the same as those used for ESM-2?
We strictly follow the procedure in the ESM-2 paper when constructing the pre-training dataset for SaProt, with the only exception of filtering out sequences without AF2 predicted structures in AlphaFoldDB: https://alphafold.ebi.ac.uk/.
Dear all reviewers,
We want to express our sincere gratitude for your valuable feedback on our paper. Your insights and constructive comments have significantly improved the quality of our work. We appreciate the time and effort you dedicated to the review process.
Best regards
The paper introduces a new structure-aware vocabulary for protein sequence data based upon Foldseek structure representations and shows that this information contributes to performance on downstream protein function tasks.
The reviewers all appreciate this paper and recommend publication.
Why not a higher score
Good but not quite an oral. The idea is quite obvious given the recent success of Foldseek and has also been explored in part elsewhere, e.g. https://scholar.google.com/citations?view_op=view_citation&hl=en&user=BP3ofxcAAAAJ&sortby=pubdate&citation_for_view=BP3ofxcAAAAJ:kUhpeDhEZMUC
Why not a lower score
Could also be a poster.
Thanks for your comments. Please allow us to explain some key distinctions between SaProt and ProstT5 (as you mentioned) in a more concise manner:
(1) The key contribution of SaProt is the introduction of the 3Di+AA token alphabet (Fig 1). In contrast, ProstT5 does not introduce new tokens but directly uses AA sequences as input and outputs 3Di token sequences, and vice versa. Our devised 3Di+AA sequence can serve as a new representation for proteins, naturally incorporating both AA and 3D coordinate information. Kindly note that 3Di+AA here is regarded as one token.
(2) SaProt is based on the BERT model architecture, whereas ProstT5 utilizes a seq2seq T5 model.
(3) SaProt and ProstT5 are concurrent works, as they were both published on bioRxiv within about a two-month timeframe. Kindly note that SaProt underwent three months of training. We had discussed the idea with John Jumper (the first author of AF2) during the development phase, when ProstT5 was not yet available online.
(4) SaProt is a general-purpose model that excels in numerous tasks, including zero-shot mutation effect prediction, supervised tasks, residue-level tasks, and even protein inverse folding. On the other hand, the authors of ProstT5 acknowledged in their paper that it is not a general-purpose model; it is mainly useful as a generative model, e.g. for the inverse folding task (https://www.biorxiv.org/content/10.1101/2024.05.24.595648v1). On certain tasks, SaProt has been shown to outperform ProstT5 several times over, such as mutational effect prediction (e.g., 0.478 vs 0.155 on ProteinGym; Appendix Table 5).
(5) Our paper clearly discussed, compared with, and cited ProstT5 in the initial submission.
Additionally, SaProt ranked #1 on the public blind-test ProteinGym leaderboard, surpassing over 60 well-known baselines: https://github.com/westlake-repl/SaProt/blob/main/figures/proteingym_benchmark.jpg
Hopefully, this provides a clearer and more refined explanation. Undoubtedly, ProstT5 is also an impressive and remarkable piece of work.
Best regards,
Authors
Accept (spotlight)