MSA Generation with Seqs2Seqs Pretraining: Advancing Protein Structure Predictions
We introduce MSA-Generator, a self-supervised model that generates virtual MSAs, enhancing protein structure predictions in key benchmarks.
Abstract
Review and Discussion
The paper introduces MSA-Generator, a novel self-supervised generative protein language model designed to address the limitations in protein structure prediction due to shallow multiple sequence alignments (MSAs). This model, pre-trained on a sequences-to-sequences task using an automatically constructed dataset, incorporates protein-specific attention mechanisms. These features enable MSA-Generator to produce virtual, enriched MSAs from large-scale protein databases, improving the quality of MSAs particularly for proteins without extensive homologous families.
Strengths
- Results on CASP14 and CASP15 seem promising. The method enhances more than one pretrained model, showing that the improvement in MSA quality is general.
- Low-quality MSAs and the lack of MSAs have been a challenging problem in protein prediction. This work addresses an important bottleneck in the field.
- Interesting insights are raised in the analysis. The generated MSAs improve quality in terms of diversity and conserved regions.
Weaknesses
Generally the paper is well-written and performs adequate analysis on important benchmarks. There are a few weaknesses that could be addressed to improve the work:
- Generated MSAs improve performance on large-scale pretrained frameworks that are already pre-trained on MSAs. Specifically, AlphaFold2 has been found to benefit greatly from both the number and the diversity of MSA sequences. However, for traditional models such as Potts/co-evolution-based statistical models, could generated MSAs improve performance? This could show whether generated MSAs can replace real MSAs as a representative input feature.
- JackHMMER is not a commonly used MSA tool and may show degraded performance. Did the authors consider using more advanced tools such as HHblits and MMseqs2? Also, what is the E-value of the generated MSA compared to the real MSA?
- Can this method be used to generate an MSA from scratch rather than augment an existing MSA? The MSA computation bottleneck is an even more severe problem, and generation could help alleviate it.
Questions
See weaknesses.
Limitations
Yes.
We really appreciate your feedback and advice on improving the work; here we provide further discussion:
- Generated MSA for Traditional Models
This is indeed an interesting and valuable question. We adopted CCMpred [1], one of the leading graphical models for protein contact map prediction, to evaluate how our augmented MSA benefits these traditional models. We measured the Average Top L/5 Precision on the real-world challenge set from CASP14, which includes targets with fewer than 10 homologs (T1064-D1, T1093-D1, T1100-D2, T1096-D2, T1099-D1, T1096-D1). The results are as follows:
|  | Original | Augmented |
|---|---|---|
| Top L/5 | 0.176 | 0.205 |

These results suggest that the generated MSAs can also improve the performance of traditional methods, indicating the broad applicability of the proposed model.
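For reference, a minimal sketch of how the Top L/5 precision reported above can be computed from a predicted contact map; the function name, the 8 Å contact definition, and the minimum sequence separation are illustrative assumptions rather than our exact implementation:

```python
import numpy as np

def top_l5_precision(pred_scores, true_contacts, min_sep=6):
    """Fraction of the top-L/5 highest-scored residue pairs that are true contacts.

    pred_scores:   (L, L) symmetric matrix of predicted contact scores.
    true_contacts: (L, L) boolean matrix, e.g. Cb-Cb distance < 8 Angstroms.
    min_sep:       ignore residue pairs separated by fewer than this many positions.
    """
    L = pred_scores.shape[0]
    i, j = np.triu_indices(L, k=min_sep)      # upper-triangle pairs with j - i >= min_sep
    order = np.argsort(-pred_scores[i, j])    # sort candidate pairs by descending score
    top = order[: max(1, L // 5)]             # keep the top L/5 pairs
    return float(true_contacts[i[top], j[top]].mean())
```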
- MSA Search Algorithm
We follow the MSA dataset construction pipeline of AlphaFold2 [2], which employs JackHMMER. We appreciate your suggestion to consider advanced tools like HHblits and MMseqs2 and plan to integrate these tools to construct larger datasets in our future work.
Regarding the E-value, to the best of our knowledge, it measures the likelihood of an individual sequence alignment occurring by chance, not of multiple sequence alignments (MSAs). Therefore, we believe comparing the E-value of one MSA to another is inappropriate. When using JackHMMER with UniRef90 in our setup, the E-value threshold is set to 0.001.
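For concreteness, a hypothetical sketch of the kind of JackHMMER call described above; the file paths and the iteration count are placeholders we assume for illustration, while `-N`, `-E`, and `-A` are standard HMMER 3 options:

```python
import subprocess

# Hypothetical jackhmmer search against UniRef90 with an E-value threshold of 0.001.
# Paths and the number of iterations are placeholders, not the exact pipeline settings.
subprocess.run(
    [
        "jackhmmer",
        "-N", "3",              # assumed number of search iterations
        "-E", "0.001",          # reporting E-value threshold used in our setup
        "-A", "query_msa.sto",  # save the alignment of hits in Stockholm format
        "query.fasta",          # single query sequence
        "uniref90.fasta",       # target sequence database
    ],
    check=True,
)
```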
- Generate MSA from Scratch
We have included results on orphan proteins in our global response, which involves generating MSAs from scratch.
However, generating MSAs from scratch is inherently challenging. Without proper contextual information from an input MSA, synthesizing multiple sequences that share co-evolutionary knowledge becomes difficult. One potential solution is to scale up the training size, with the hope that larger models, having seen more sequences, can achieve zero-shot MSA generation.
Regarding the computation required for MSA generation, given the heavy generation cost of larger models, we do not anticipate that it would be much more efficient than search algorithms when many homologous sequences exist. Instead, we see the value of applying generation methods to proteins that lack homologs.
[1] Seemayer S, Gruber M, Söding J. CCMpred—fast and precise prediction of protein residue–residue contacts from correlated mutations[J]. Bioinformatics, 2014, 30(21): 3128-3130.
[2] Jumper J, Evans R, Pritzel A, et al. Highly accurate protein structure prediction with AlphaFold[J]. nature, 2021, 596(7873): 583-589.
I thank the authors for their endeavor in preparing the rebuttal. The rebuttal largely resolved my questions 1 & 3. I choose to keep my score after reading the responses. I hope the authors can include more experiments on different protein architectures (pre-trained models, models trained from scratch, and traditional Potts models) that involve MSAs, for a comprehensive analysis.
We sincerely appreciate your feedback and thoughtful review. While we understand the suggestion to include experiments on different protein architectures, we respectfully request more detailed guidance on the specific additional experiments you would recommend.
Our work primarily aims to generate MSA for protein structure prediction, as highlighted in the paper’s title. While we recognize the value of exploring broader models requiring MSA, our focus has been on state-of-the-art methods, particularly those built upon deep learning techniques in this domain. We believe the current results sufficiently support our claims within the scope of protein structure prediction.
This work introduces MSA-Generator to generate virtual, informative MSAs. The generated MSAs can advance protein structure prediction.
Strengths
- The MSA generation and protein structure prediction problems studied in this work are important.
- The writing is clear and the method is easy to follow.
Weaknesses
- The technical novelty is somewhat limited. The model architecture mainly follows the MSA Transformer, and the training framework is similar to Seq2Seq.
- I notice that there is related work that also studies the problem of generating MSAs to advance protein structure prediction [1]. The differences and advantages should be discussed.
- The ablation studies are weak. More design choices should be verified.
[1] MSAGPT: Neural Prompting Protein Structure Prediction via MSA Generative Pre-Training
Questions
See weaknesses
Limitations
None
- Novelty
The innovation of our work lies in our pioneering approach to self-supervised pretraining for Multiple Sequence Alignment (MSA) generation. While tied-row attention and self-column attention are similar to the mechanisms in the MSA Transformer, it's important to note that the MSA Transformer is an encoder-only model, primarily designed for learning MSA representations. In contrast, our model is an encoder-decoder model, which is specifically tailored for generating MSAs that are more effective for structural prediction. This distinction leads to significant differences in both architecture design and training approach.
Additionally, our seqs2seqs framework extends the vanilla seq2seq by focusing on generating multiple sequences that share co-evolutionary information in parallel, owing to the cross-row/column attention we introduced. Our contribution is valuable as we demonstrate the efficacy of large-scale self-supervised MSA generation.
- Related Work
We have already incorporated most of the prior protein/MSA generation methods in the related work section. While we appreciate your reference to MSAGPT [1], this work was actually released (8 Jun) after the NeurIPS submission deadline (22 May). Therefore, it was not possible to include a discussion and comparison in our draft. However, we are happy to include it in our revision.
- Ablation Study
Thank you for your feedback regarding the ablation studies. We acknowledge the importance of verifying design choices; however, conducting extensive ablation studies is computationally expensive for our end-to-end pre-training model.
Additionally, each component of our model is indispensable for its proper functioning. We appreciate your understanding and are open to discussing alternative approaches for future work.
We hope the discussion addresses your concerns, and welcome further discussion.
[1] Chen B, Bei Z, Cheng X, et al. MSAGPT: Neural Prompting Protein Structure Prediction via MSA Generative Pre-Training[J]. arXiv preprint arXiv:2406.05347, 2024.
Thank you to the authors for the rebuttal. Unfortunately, I did not find your response very convincing. This makes me hesitant to increase my score.
- I do not believe that the proposed encoder-decoder model has significant novelty compared to the encoder-only MSA Transformer, as I did not notice substantial improvements in the decoder and pre-training loss.
- I realize that MSAGPT was released after the deadline, but I think that comparing it during the rebuttal period would help me better understand the advantages of MSA-Generator.
- It is essential to report the time spent on pre-training for the community. Besides, small-scale pre-training experiments also help in proposing a more solid model [1].
[1] Rives, Alexander, et al. "Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences." Proceedings of the National Academy of Sciences 118.15 (2021): e2016239118.
- Novelty
We respectfully disagree with the argument regarding the novelty of our work. There is a significant distinction between the encoder-only paradigm and the encoder-decoder paradigm. Our model's innovation lies in its specialized approach to generating Multiple Sequence Alignments (MSAs) for structural prediction, an approach that differs markedly from the objectives of the MSA Transformer. Additionally, we disagree with the notion that utilizing a transformer-based architecture lacks novelty. By that logic, seminal models like BERT, GPT-1/2/3, the Vision Transformer, and their successors would also lack novelty due to their reliance on attention mechanisms. Many successful works, such as the MSA Transformer and the ProGen family, also leverage methodologies from the Transformer architecture, and this has not diminished their impact or novelty.
- Comparison with MSAGPT
We would first like to emphasize that, according to the review guidelines, content point 6, papers are not required to compare against contemporaneous works appearing less than two months before the submission deadline, let alone works published after the deadline. Nevertheless, we would like to address your concerns regarding MSAGPT.
MSAGPT is a strong follow-up to our work, with some minor differences. First, MSAGPT incorporates the latest techniques from the NLP community, such as RLHF and RoPE, using rejection sampling from AlphaFold2's feedback and adopting DPO based on AlphaFold2 to further fine-tune the models. Second, MSAGPT is a larger model, with 3B parameters trained on 16M data, while our model is much smaller, with 260M parameters trained on 2M data. One significant difference lies in the architectural design, which we have highlighted as a key contribution of our work. MSAGPT directly adopts the decoder from the Transformer to generate sequences in a 1-dimensional manner. This generation paradigm is highly inefficient for deep MSAs, resulting in overlong output sequences with a complexity of O(M × L), where M is the depth and L is the length. In contrast, our decoder supports generation of the MSA in a parallel manner, significantly reducing the computational cost to O(L). Furthermore, MSAGPT heavily relies on AlphaFold2 during training, which could introduce bias and unintended data leakage. In contrast, our model operates independently of other models, resulting in a more straightforward training process and yielding more reliable results.
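As an illustrative back-of-the-envelope comparison, the depth and length below are assumed values chosen only to make the scaling concrete, not settings from either paper:

```python
# Illustrative count of sequential decoding steps for the two generation paradigms.
M, L = 16, 256                    # assumed MSA depth and sequence length

one_dimensional_steps = M * L     # 1-D autoregressive decoding emits all M*L tokens in sequence
row_parallel_steps = L            # row-parallel decoding emits one token per row at each step

print(one_dimensional_steps)      # 4096
print(row_parallel_steps)         # 256
```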
We would also like to underscore that our work serves as the foundational basis for MSAGPT, as indicated by our model's role as an important baseline in their comparisons. Therefore, we believe it is inappropriate to compare follow-up works while assessing our contributions.
- Training Details
Our model was trained for 200k steps on 8 A100 GPUs, as detailed in line 202, with pretraining taking approximately 100 hours. While we appreciate the reference provided, it is important to note that the referenced work uses 250M sequences and is an encoder-only model, making it neither small-scale nor directly comparable to our work.
Thanks for the rebuttal.
I don't argue that using Transformers lacks novelty. I think the contribution of going from an encoder to an encoder-decoder is not as significant as the authors claim, because both the encoder and decoder are based on the MSA Transformer mechanism. Additionally, no ablation experiments were conducted to explore whether this is the optimal solution for MSA generation.
For the references I provided, I would like to express that a similar approach can be taken to first conduct small-scale experiments to ablate design choices before moving on to large-scale pre-training. This will ensure that the proposed model is solid.
After reading responses to other reviewers, I will maintain my score.
This paper introduces a method for generating multiple sequence alignments (MSAs) using a self-supervised seq2seq task. By leveraging large-scale protein databases, this approach produces virtual, informative MSAs that enhance the performance of protein structure prediction models such as AlphaFold2 and RoseTTAFold. The improvements are especially noticeable for proteins without extensive homologous families, demonstrating that data augmentation techniques can also be effectively applied in the protein domain.
Strengths
- The introduction of a seqs2seqs task for MSA generation is innovative, leveraging the power of self-supervised learning to improve protein structure predictions.
- The method demonstrates marked improvements in lDDT and GDT-TS scores on challenging protein sequences using CASP14 and CASP15 benchmarks.
- The enhanced MSA has the potential to serve as an auxiliary dataset for training on protein-related tasks. It would be beneficial if the authors could make their code and the dataset publicly available, allowing the community to take full advantage of these resources.
Weaknesses
- As the main contribution of this work is generating more homologous sequences based on the original low-quality MSA, it would be valuable to investigate how the MSA generator improves structure prediction for orphan proteins.
- What criteria are used to determine the quality of homologous sequences? Are they based on a small number of sequences, or do they consider homologous sequences with only sequence similarity but not structural similarity? Additionally, are the pre-trained datasets composed solely of high-quality homologous sequences?
- Using MSA as input demands significant memory resources. What thresholds are set for the input source MSA and target MSA depth during training? Is there a relationship between the depth of the source MSA and the depth of the target MSA? For instance, is the depth of the source MSA greater than that of the target MSA?
- I noticed that the tied row attention mechanism is employed in the encoder. To reduce memory usage, have you considered introducing this shared attention mechanism in the cross-row attention within the decoder?
Questions
See above.
Limitations
Yes
We appreciate the reviewer’s feedback and would like to clarify the following points:
- Orphan proteins
We have included results for proteins with only single sequences and presented these findings in our global response. Please refer to it for detailed discussion.
- Criteria used to determine the quality of homologous sequences
To measure the quality of generated homologous sequences, we use AlphaFold2 and evaluate the predicted protein structure accuracy against the reference structure as the metric. The sequences are selected based solely on the depth of the MSA, without considering structural similarity.
For pre-training data construction, we use the protein sequence search tool JackHMMER, following AlphaFold2, and set the E-value to 0.001 to ensure high-quality homologous sequences for pre-training. The search criterion is based on sequence similarity measured by a profile hidden Markov model. For more details, please refer to [1].
- MSA depth
Thank you for raising this valid question. Generating MSAs can indeed be computationally expensive when the input is very deep. However, our goal is to enhance MSAs in situations where rich MSAs are not available, which means the depth of the MSA is typically shallow, making our framework accessible.
During training, we randomly select 10-30 sequences as input and randomly sample another 10-30 sequences as the target from the JackHMMER search results, as detailed in Appendix A. There is no requirement for the input to be deeper than the target, in order to mimic real-world applications.
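A minimal sketch of this sampling scheme; the function name and the exact splitting logic are our simplifying assumptions, and the real data pipeline may differ in details:

```python
import random

def sample_training_pair(msa_rows, min_n=10, max_n=30, rng=random):
    """Split one searched MSA into a (source, target) pair for seqs2seqs training.

    msa_rows: aligned homolog sequences returned by the search tool.
    Source and target are sampled without overlap; neither side is required
    to be deeper than the other.
    """
    rows = list(msa_rows)
    rng.shuffle(rows)
    n_src = rng.randint(min_n, max_n)
    n_tgt = rng.randint(min_n, max_n)
    return rows[:n_src], rows[n_src:n_src + n_tgt]
```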
- Tied-row attention
We appreciate you raising this valuable question. Tied-row attention is designed to compress global information from the entire input MSA, which is suitable for the encoder as it aims to learn a global representation of the input MSA. However, we did not include this shared attention mechanism in the decoder, to avoid assigning the same weight to each sequence during decoding and thereby to ensure the generation of diverse output sequences.
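To make the distinction concrete, here is a simplified sketch of the two row-attention variants; the tensor shapes and the square-root normalization follow the MSA Transformer convention, but the code itself is an illustrative assumption rather than our exact implementation:

```python
import torch

def row_attention_logits(q, k, tied):
    """q, k: (rows, length, dim) per-row queries and keys for one attention head.

    tied=True  -> a single attention map shared by every row (encoder, tied-row attention).
    tied=False -> an independent attention map per row (decoder, so generated rows can diverge).
    """
    rows, _, dim = q.shape
    logits = torch.einsum("rid,rjd->rij", q, k) / dim ** 0.5     # (rows, L, L)
    if tied:
        logits = logits.sum(dim=0, keepdim=True) / rows ** 0.5   # pool rows into one shared map
    return logits
```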
[1] Potter S C, Luciani A, Eddy S R, et al. HMMER web server: 2018 update[J]. Nucleic acids research, 2018, 46(W1): W200-W204.
Dear Reviewer,
Thank you once again for taking the time to review our paper. Could you please review our rebuttal to see if it has addressed your concerns at your earliest convenience? The discussion period ends in approximately 24 hours. If our response resolves your concerns, we kindly ask you to consider adjusting the scores.
Best regards, The Authors
Thank you for the detailed responses. After reviewing them, I believe my original score remains appropriate for this work, and I will maintain it as is. Thank you again for your efforts!
The paper proposes a method to generate MSA sequences to provide more alignments of the MSA. MSA-Generator can increase the depth of the MSA input, and thus incorporate more information. MSA-Generator demonstrates its capacity to synthesize higher-quality MSA via experiments on the CASP dataset.
Strengths
- MSA-Generator can alleviate the burden of MSA data shortage in proteomics research. Protein sequences are available in large amounts, while MSA data are limited or require heavy computational costs to obtain. MSA-Generator tries to resolve this challenge by leveraging generative language models to generate additional/augmented alignments and increase the depth of the MSA.
- The MSA data generated by the MSA-Generator helped existing MSA-based structure prediction models (AlphaFold2 and RoseTTAFold) achieve better performance on the CASP14 and CASP15 datasets.
Weaknesses
- There are some other datasets available for experiments. Have you considered conducting experiments on datasets like CAMEO?
- Figure 4 (c) shows the distribution of LDDT improvement. Can you provide a more precise clarification of the pros and cons of virtual MSAs compared with real MSAs?
- Will the MSA-Generator be helpful for individual protein sequences as input? There are some works on single-sequence protein language models worth discussing in your related work section.
[1] ProteinBERT: a universal deep-learning model of protein sequence and function.
[2] Modeling aspects of the language of life through transfer-learning protein sequences.
[3] Modeling Protein Using Large-scale Pretrain Language Model.
Questions
- See above.
Limitations
- Further experiments on CAMEO can be helpful.
- MSA-Generator's capacity on individual sequences: experiments are not necessary; maybe some case study or analysis.
We appreciate the reviewer’s feedback and would like to clarify the following points:
- CAMEO Results
Thank you for suggesting the inclusion of CAMEO as an additional benchmark. We have taken this into consideration and conducted further experiments. For the CAMEO benchmark, we searched its MSAs against the Uniclust30 database using HHblits and identified protein sequences with fewer than 20 homologs as the Real-World Challenge set from CAMEO, following [1]. The results of our method are presented below:
| CAMEO (avg Depth=8.5) | pLDDT | LDDT | TM-Score | GDT-TS |
|---|---|---|---|---|
| ESMFold | 49.3 | 46.8 | 0.65 | 0.51 |
| OmegaFold | - | 47.9 | 0.59 | 0.47 |
| RoseTTAFold | 69.8 | 57.0 | 0.62 | 0.55 |
| RoseTTAFold+Potts Generation | 69.6 | 56.7 | 0.59 | 0.50 |
| RoseTTAFold+Iterative Unmasking | 70.2 | 60.1 | 0.62 | 0.57 |
| RoseTTAFold+MSA-Generator | 75.6 | 62.9 | 0.69 | 0.62 |
| AlphaFold2 | 72.6 | 59.4 | 0.69 | 0.61 |
| AlphaFold2+Potts Generation | 72.3 | 59.0 | 0.64 | 0.57 |
| AlphaFold2+Iterative Unmasking | 74.2 | 60.6 | 0.70 | 0.63 |
| AlphaFold2+MSA-Generator | 77.2 | 64.2 | 0.73 | 0.67 |

These results are consistent with our findings on the CASP14/15 dataset.
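For reference, a hypothetical helper for the "fewer than 20 homologs" filtering step used to build this challenge set; we assume the HHblits output is an A3M/FASTA-style file whose first '>' entry is the query itself:

```python
def is_challenge_target(a3m_path, max_homologs=20):
    """Return True if the searched MSA holds fewer than `max_homologs` homologs.

    The first '>' entry is assumed to be the query sequence and is excluded.
    """
    with open(a3m_path) as handle:
        n_entries = sum(1 for line in handle if line.startswith(">"))
    return (n_entries - 1) < max_homologs
```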
- Virtual MSAs' pros and cons compared with real MSAs
Thank you for raising this valid question. In section 4.2, we compare the performance of generated virtual MSAs with real MSAs under simulated conditions where we downsample 5 sequences as the baseline MSA and 15 sequences as the real MSA. Our findings in this controlled simulation indicate that virtual MSAs are generally as effective as real MSAs, and in some cases, virtual MSAs even outperform real MSAs.
However, we want to emphasize that our study primarily focuses on situations where no real MSA can be constructed using search algorithms, as discussed in section 4.3. We do not suggest that virtual MSAs are superior to real MSAs overall. Instead, in cases where real MSAs cannot be found, virtual MSAs provide a valuable alternative for enhancing protein structure prediction, and this is where our method makes a significant contribution.
- Individual Protein Sequence
Yes, MSA-Generator also benefits individual protein sequences as input. We have included results for proteins with only single sequences and presented these findings in our global response. Please refer to it for a detailed discussion.
- Relevant Work
Thank you for mentioning these relevant works ([1][2][3] in the comments). We will include them in our revision. However, the works referenced focus on encoder models that aim to learn good representations for protein sequences, whereas our method is an encoder-decoder model focused on generating high-quality MSAs.
We have adopted single-sequence models, including ESMFold [2] and OmegaFold [3], as our baselines in Section 4. We will incorporate a more detailed discussion in Section 2 in our revision.
[1] Chen B, Bei Z, Cheng X, et al. MSAGPT: Neural Prompting Protein Structure Prediction via MSA Generative Pre-Training[J]. arXiv preprint arXiv:2406.05347, 2024.
[2] Lin Z, Akin H, Rao R, et al. Language models of protein sequences at the scale of evolution enable accurate structure prediction[J]. bioRxiv, 2022: 500902.
[3] Wu R, Ding F, Wang R, et al. High-resolution de novo structure prediction from primary sequence[J]. bioRxiv, 2022: 2022.07.21.500999.
Dear Reviewer,
Thank you once again for taking the time to review our paper. Could you please review our rebuttal to see if it has addressed your concerns at your earliest convenience? The discussion period ends in approximately 24 hours. If our response resolves your concerns, we kindly ask you to consider adjusting the scores.
Best regards, The Authors
We appreciate the reviewers' efforts and feedback. We noticed a common interest in whether the proposed method could benefit single protein sequences, also referred to as orphan protein sequences.
To address this, we conducted experiments using the entire CASP14/15 dataset (the dataset used in Section 4.3) with only single protein sequences as input. Additionally, for a more comprehensive understanding, we included results from an orphan protein family, Orphan25 [1]. We used MMseqs2 to search against UniRef30 and ColabFoldDB [2], which was built by expanding BFD/MGnify with metagenomic sequences from various environments, selected sequences with no homologues, and obtained 10 proteins as a test set (6WKY, 6WL0, 6XA1, 6XN9, 6XYI, 7A5P, 7AL0, 7JJV). We conducted the experiment using the same setup as in Section 4.3. The results are detailed below:
| CASP14&15 | pLDDT | LDDT | TM-Score | GDT-TS |
|---|---|---|---|---|
| AlphaFold2 | 43.8 | 26.9 | 0.30 | 0.28 |
| AlphaFold2+Potts Generation | 37.7 | 22.2 | 0.21 | 0.23 |
| AlphaFold2+Iterative Unmasking | 48.2 | 30.8 | 0.32 | 0.33 |
| AlphaFold2+MSA-Generator | 57.2 | 36.2 | 0.39 | 0.37 |
| Orphan25 | pLDDT | LDDT | TM-Score | GDT-TS |
|---|---|---|---|---|
| AlphaFold2 | 77.2 | 61.6 | 0.61 | 0.62 |
| AlphaFold2+Potts Generation | 68.9 | 49.3 | 0.49 | 0.43 |
| AlphaFold2+Iterative Unmasking | 78.9 | 62.5 | 0.64 | 0.63 |
| AlphaFold2+MSA-Generator | 81.8 | 66.4 | 0.69 | 0.67 |
The results with orphan protein sequences on the two datasets suggest that our method can also benefit orphan proteins and be particularly helpful for challenging inputs (CASP), further strengthening the effectiveness of our approach.
[1] Wang W, Peng Z, Yang J. Single-sequence protein structure prediction using supervised transformer protein language models[J]. Nature Computational Science, 2022, 2(12): 804-814.
[2] Mirdita M, Schütze K, Moriwaki Y, et al. ColabFold: making protein folding accessible to all[J]. Nature methods, 2022, 19(6): 679-682.
We recommend acceptance as a poster:
- MSA-Generator can alleviate the burden of MSA data shortage in proteomics research
- Results on CASP14 and CASP15 are promising
- Interesting insights are raised in the analysis
- The authors improved the paper during the rebuttal, with new results on CAMEO and Orphan25
To authors: please make sure you include all reviewer comments in the final revision.