MSAGPT: Neural Prompting Protein Structure Prediction via MSA Generative Pre-Training
We propose a novel MSA generative pre-training framework that yields faithful and informative MSAs to improve structure prediction accuracy in the low-MSA regime. Transfer-learning studies also show its strong potential to benefit other protein tasks.
Abstract
Reviews and Discussion
The paper proposes MSAGPT, a novel method for generating MSAs. Utilizing a 2D evolutionary positional encoding, MSAGPT reformulates MSA generation as a one-dimensional sequence generation task optimized with a simple GPT objective. The model incorporates feedback from AlphaFold2 via DPO fine-tuning to reduce hallucinations during MSA generation. Experimental results on curated datasets demonstrate that MSAGPT enhances protein structure prediction in low-MSA scenarios, achieving improved structural reconstruction scores.
Strengths
- The paper is well-written and easy to follow, and the proposed framework is simple yet effective and straightforward to implement.
- The use of 2D positional encoding to reformulate MSA generation as a one-dimensional sequence generation task is innovative and allows zero- or few-shot MSA generation under a flexible in-context learning framework. It points to a promising direction for handling 2D sequences with novel positional encodings.
Weaknesses
- Efficiency Concerns: Flattening 2D sequences into 1D for self-attention increases time complexity, even with FlashAttention. While Figure 8 shows MSAGPT's generation time is lower than that of the AF2 search pipeline, a comparison with the efficiency of other MSA generative models is necessary.
- More Comparative Analysis: The paper should include a comparison with diffusion-based models for generating protein sequences, such as EvoDiff MSA. Also, could you report the RMSD (GDT_TS) scores?
- Limited Use Case: The practical use of generating virtual MSAs is limited to models that utilize MSAs, such as MSA Transformer or AlphaFold.
Questions
- I feel that most of the innovation is credited to the 2D RoPE positional embedding, limiting the scope of novelty. Also, could you provide a detailed explanation of this method in English? The reference is in Chinese.
- As mentioned in the Weaknesses, please provide a detailed comparison of the efficiency of MSAGPT with other MSA generative models.
- How does the model perform in MSA-abundant conditions? This should also be evaluated.
- For the “prediction accuracy” on lines 218 and 121, could you specify the metrics used (e.g., TM-score, RMSD, lDDT, pLDDT)?
- Minor: How does pLDDT selection help in finding structurally similar sequences, as mentioned on line 329?
- Minor: On line 230, do you mean the DPO dataset contains 11k samples?
Limitations
Please refer to the Weaknesses and Questions sections. I hope the authors can address concerns regarding efficiency, provide more comparative analysis with baseline models, and offer further explanation on the use of 2D RoPE positional embedding.
About Question-1: the explanation of 2D RoPE and the novelty clarification.
2D RoPE Explanations. Rotary Positional Embeddings (RoPE) encode position information of tokens with a rotation matrix that naturally incorporates explicit relative position dependency. First, consider the 1D rotary positional embedding. Given any two-dimensional feature vector $x_m$ at position $m$, the position embedding can be expressed as:
$$f_{\{q,k\}}(x_m, m) = \mathbf{R}_{\theta,m}^2 W_{\{q,k\}} x_m,$$

such that

$$q_m^T k_n = (\mathbf{R}_{\theta,m}^2 W_q x_m)^T (\mathbf{R}_{\theta,n}^2 W_k x_n) = x_m^T W_q^T \mathbf{R}_{\theta,n-m}^2 W_k x_n.$$

Here, $\mathbf{R}_{\theta,\{m,n\}}^2$ is the rotation matrix that depends on the position, and $W_{\{q,k\}}$ is a learnable weight matrix. The key to RoPE embeddings is the rotation matrix:

$$\mathbf{R}_{\theta, m} = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix}$$

After some derivation, given a 2D position $(m,n)$, a solution for the 2D RoPE is obtained as:

$$\mathbf{R}_{\theta,(m,n)} = \begin{pmatrix} \cos m\theta & -\sin m\theta & 0 & 0 \\ \sin m\theta & \cos m\theta & 0 & 0 \\ 0 & 0 & \cos n\theta & -\sin n\theta \\ 0 & 0 & \sin n\theta & \cos n\theta \end{pmatrix}$$

This solution is easy to understand: it is a block matrix composed of two 1D RoPEs, essentially dividing the input vector into two halves, applying the 1D RoPE with position $m$ to one half and with position $n$ to the other. From this form, we can also easily generalize RoPE to 3D, 4D, and higher dimensions.

The novelty clarification. The principle of multi-dimensional positional encoding has been explored across various domains to address challenges inherent to those fields, with different intrinsic design purposes. In the MSA generation scenario, incorporating a dual-axis positional encoding scheme is driven by the unique requirements of modeling the complex dynamics of evolutionary patterns in protein homologous sequences, which involves identifying simultaneous mutations across multiple amino acid sites (columns) in different homologs (rows), among other high-level interactions. Therefore, a multi-dimensional encoding approach, as compared to a decoupled single-dimensional approach, is both distinct and critical. In light of this, we adapt the RoPE-2D relative positional encoding, extended from 1D RoPE, to capture these patterns.
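For concreteness, here is a minimal NumPy sketch of this block-diagonal construction (our own illustration for this response, not the paper's exact implementation): the feature vector is split into two halves, with the 1D RoPE rotation applied at position $m$ to one half and at position $n$ to the other.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Apply standard 1D RoPE to float features x (..., d) at positions pos."""
    d = x.shape[-1]
    pos = np.asarray(pos, dtype=float)
    freqs = base ** (-np.arange(0, d, 2) / d)   # theta_i = base^(-2i/d), shape (d/2,)
    angles = pos[..., None] * freqs             # (..., d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin        # 2x2 rotation of each feature pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(x, row, col):
    """Block-diagonal RoPE-2D: rotate one half of the features by the row
    index and the other half by the column index."""
    half = x.shape[-1] // 2
    out = x.copy()
    out[..., :half] = rope_1d(x[..., :half], row)
    out[..., half:] = rope_1d(x[..., half:], col)
    return out
```

Rotating both queries and keys this way makes each attention logit depend only on the relative offsets $(m - m', n - n')$ between tokens, which is the relative-position property exploited for the flattened MSA.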
About Question-2: the efficiency comparison. We compared the generation speed of MSAGPT and several baseline generative models, including the newly added EvoDiff. All models were run on a single A100 80G GPU under the direct sequential generation regime, and we report the average tokens per second (toks/s) for generating 2k tokens, averaged over three runs.
| Model | Toks/s |
|---|---|
| MSA-Aug. | 0.35 |
| EvoGen | 0.94 |
| EvoDiff | 1.16 |
| MSAGPT | 0.92 |
From this comparison, we can conclude that MSA-Aug. has the lowest inference efficiency due to its encoder-decoder framework. EvoGen and our proposed MSAGPT have similar inference speeds, with the diffusion framework (EvoDiff) showing better inference efficiency. However, EvoDiff shows worse MSA generation quality on the structure prediction tasks (see Tables 1-4 in the attached PDF for the performance comparison).
About Question-3: the performance in MSA-abundant conditions. We compare results for query sequences using abundant natural MSAs alone versus abundant natural MSAs augmented with MSAGPT-generated MSAs on the CAMEO set. For this comparison, we sample 128, 256, and 512 sequences from both the natural MSAs and the generated MSAs. The results are shown in Table 6 in the attached PDF. They indicate that including generated MSAs has no significant effect on performance in MSA-abundant conditions, consistent with previous findings that with more than 64 MSA sequences as input, AF2 predicts a "converged" structure.
About Question-4: the metrics measuring the "prediction accuracy". The prediction accuracy refers to the TM-score, which serves as the gold-standard metric for evaluating predicted protein structures against ground-truth structures. Additionally, we provide other metrics such as pTM, GDT_TS, and LDDT in Tables 1-4 in the attached PDF. The results indicate that improvements in TM-score are consistently accompanied by improvements in the other oracle metrics, i.e., GDT_TS and LDDT, confirming the robustness and reliability of our predictive method across different evaluation criteria.
About Question-5: the pLDDT selection strategy. pLDDT selection helps identify structurally similar sequences by providing a confidence measure in predicted protein structures without needing ground truth. Higher pLDDT scores highlight regions predicted with greater accuracy, indicating that the corresponding virtual MSA is informative and structurally similar. This confidence-based filtering focuses on the most reliable parts of the predicted structures for more accurate identification. For detailed selection processes, please refer to Appendix E.
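As a hedged illustration of this confidence-based filtering (the exact procedure is in Appendix E; `predict_with_af2` and `top_k` below are hypothetical, not an API from the paper):

```python
def select_by_plddt(query, candidate_msas, predict_with_af2, top_k=16):
    """Rank candidate virtual MSAs by the mean pLDDT of the AF2 prediction
    they induce, and keep the most confident ones (no ground truth needed)."""
    scored = [(predict_with_af2(query, msa)["mean_plddt"], msa) for msa in candidate_msas]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [msa for _, msa in scored[:top_k]]
```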
About Question-6: the number of cases used in DPO. We construct the RLAF preference dataset for the DPO training.
Thank you for your detailed responses and additional experiments. I think my concerns are mostly addressed and have raised my score to "accept".
This paper proposes a method to generate multiple sequence alignments for a given protein sequence. To model the co-evolutionary information, the paper proposes a 2D evolutionary positional encoding. After pre-training on the alignment sequences, the models are fine-tuned with AlphaFold2 annotations to avoid hallucinations.
Strengths
- The studied problem, generating multiple sequence alignments, is interesting and novel.
- The proposed method is technically sound.
- The paper is well-written and well-structured.
Weaknesses
- Missing important experimental details. The paper omits crucial experimental details, particularly those related to pre-training. This omission significantly impacts the study's credibility and reproducibility. In particular, the process of hyperparameter selection is not explained in detail.
- Evaluation is not rigorous. There is no clear explanation of steps taken to prevent data leakage or the inclusion of data similar to the evaluation set in the training data. The absence of structural or temporal splits is particularly problematic, as these are crucial for assessing a model's performance in truly novel scientific scenarios.
- Evaluation is limited. It would be interesting to design studies that directly investigate the evolutionary patterns learned by the multiple sequence alignment generation model. Also, the paper fails to include many important protein function tasks in its evaluation.
- The results presented in the paper lack error bars, which is particularly problematic for results that are close, such as Table 2 and 3.
Questions
See above
Limitations
NA
Thank you for your valuable feedback and careful assessment of our work. We address your concerns below.
About Weakness-1: the missing important experimental details. The training details and experimental settings, including the processes for Pre-training, Rejective Finetuning, and DPO, as well as the hyperparameter selection, are thoroughly discussed in Appendix Section C.
About Weakness-2: the data leakage prevention and test data split.
Data Leakage Prevention. As outlined in Section 6.1 of our paper, we implemented a thorough filtering process to eliminate any potential data leakage. Specifically, we removed all MSAs of sequences in the test sets (CAMEO, CASP, and PDB) from the pre-training dataset. Furthermore, we ensured that any sequence in the pre-training set with a similarity greater than 0.9 to a sequence in the test set was excluded. To validate this filtering process, we used the HHblits tool to retrieve sequences from the test set and calculate their maximal similarity distribution with sequences in the pre-training dataset. The results, illustrated in Figure 1 in the attached PDF, show that the maximum similarity is 0.89, confirming that there is no data leakage in the pre-training dataset.
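A minimal sketch of this style of similarity filter (our illustration; `seq_identity` is an assumed pairwise-identity function, whereas the actual pipeline uses HHblits-based retrieval rather than an all-vs-all scan):

```python
def filter_pretraining_set(pretrain_seqs, test_seqs, seq_identity, threshold=0.9):
    """Keep only pre-training sequences whose maximal identity to any test
    sequence does not exceed the threshold (0.9 in our filtering)."""
    return [
        s for s in pretrain_seqs
        if max(seq_identity(s, t) for t in test_seqs) <= threshold
    ]
```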
Temporal and structural splits. For the pre-training dataset, we used the OpenProteinSet, containing protein sequences collected before December 28, 2021. For structural predictions, we followed AlphaFold2's evaluation settings, using PDB datasets from before January 22, 2024, CASP14 from May to August 2020, CASP15 from May to August 2022, and CAMEO after August 20, 2020. Our primary goal is to improve structural prediction accuracy in low-MSA regimes using generated virtual MSAs. Therefore, our methods need to generalize across different protein families and timelines. We ensured that sequences in the test set were not included in the pre-training set, as AlphaFold2 does, and conducted experiments on three well-benchmarked datasets to confirm the robustness and generalizability of our approach.
About Weakness-3: the clarification of evaluation. Our primary goal is to enhance low-resource structure prediction using generated virtual MSA. The main experiments focus on protein structure predictions in zero-shot and few-shot scenarios on a Natural MSA-scarce benchmark. We also present results on artificially MSA-scarce and MSA-abundant scenarios (see Tables 4 and 6 in the attached PDF). Additionally, we demonstrate our model's transferability across four protein tasks, highlighting MSAGPT’s potential to impact a broad range of protein-related tasks with generated MSA.
About Weakness-4: the statistical significance test of results. We have addressed statistical significance in several ways. For results in Tables 1 and 2, t-test results demonstrating significance against baseline methods are in Appendix Table 6. For Table 3, we conducted 5-fold cross-validation and reported average performance to mitigate random effects, detailed in Appendix Table 7.
Thanks for your clarifications! I have updated my score accordingly.
Protein structure prediction tools such as AlphaFold take a query protein sequence, expand it to a multiple sequence alignment (MSA) of related natural sequences, and then feed this alignment into the model. The first expansion step isn't possible, however, for proteins that don't have many natural relatives, and there are many such 'orphans'. This paper demonstrates that the MSA can be replaced with a set of virtual sequences sampled from a generative model. One of the key contributions of the paper is showing how to fine-tune that generative model explicitly to improve downstream protein structure prediction performance, using LM preference optimization techniques such as DPO.
Strengths
Produces impressive performance improvements for protein structure prediction in the important regime where the query protein is far from other natural proteins.
Demonstrates that good performance can be achieved without modeling techniques specific to multiple sequence alignments (such as axial attention). Using a vanilla setup is nice because it enables using off-the-shelf LM systems that have improvements such as flash attention, etc.
Uses modern tricks of the trade for improving generative models using DPO, etc.
The evaluation compares to multiple recent papers for generating virtual MSAs.
Does a systematic investigation into how to select generated virtual sequences to include in an MSA (L296).
Weaknesses
The evaluation sets are very very small (see below). This is why I gave 'soundness' a score of 2.
The advanced fine tuning techniques don't uniformly improve eval metrics (see below).
==Update after author's response== I have raised my score to 'accept' and raised my soundness score, since the response adequately addressed my concerns about evaluation.
Questions
**Tiny evaluation sets** The evaluation metrics are reported on very tiny evaluation sets (L247; 8 from CAMEO, 13 from CASP14&15, 179 from PDB). This makes me worry that the differences in performance are not statistically significant, or don't reflect performance on the broad distribution of orphan proteins that users may want to perform prediction for.
I see two ways forward: (1) demonstrate that the proteins in these tiny eval sets are somewhat representative and that the differences in metrics are statistically significant or (2) change the eval setup to use a bigger eval set.
I think (2) is much easier. To do this, can't you turn any example in the full eval set (e.g., the PDB) into an orphan? You could do the zero-shot eval using just the original sequence, with an MSA containing only virtual sequences. With this, can you report performance on the full eval sets?
**Fine-tuning decreases pLDDT**
I understand the argument that pLDDT decreased and TM-score increased because the fine-tuning targeted improvements to TM-score. However, it's unfortunate that pLDDT decreased. If you fine-tune for pLDDT, would that increase at the expense of TM? Could you extend the fine-tuning to improve both (such as using a composite reward function based on both)?
**Minor points** L24 is inaccurate: "The remarkable success of AF2 can be attributed to its innovative use of co-evolutionary information supported by the Multiple Sequence Alignment (MSA)." Coevolution had been used for structure prediction for many years before AF. AF got better performance because of general model scale and processing the MSA end-to-end instead of using pre-computed features of the MSA.
L113: what does 'homogenous' mean in this context?
L149: As far as I can tell, there are no semantics to the ordering of the rows. The row positional encoding is basically a unique id for which row it belongs to. Perhaps it would be better to use a different representation for the row index that doesn't reflect linear ordering so much.
L180: What order do you use for flattening the axes? Have you tried both?
I'm curious if there is any noticeable difference in the sequences generated after DPO vs. the base model. Are there any sort of degeneracies due to reward hacking, etc?
Limitations
Yes
Thanks for your insightful feedback and constructive suggestions for our work. We addressed your questions as follows:
About Question-1: add more evaluations. We have adopted both suggestions to confirm the superiority of MSAGPT:
- Statistical Significance of Metrics: We conducted a paired Student's t-test between MSAGPT and other baselines. The results, shown in Appendix Table 6 in the paper, indicate that the virtual MSA generated by MSAGPT significantly improves structure prediction accuracy in cases with limited MSA compared to other baselines.
- Evaluation on a Larger Set and More Metrics: We created the Artificial MSA-Scarce benchmark based on the PDB dataset released before 2024-01-22. We collected approximately 8k protein sequences after filtering and performed zero-shot evaluations using these sequences with MSAs containing only virtual sequences. The results, detailed in Table 4 in the attached PDF, demonstrate that the generated MSAs significantly improve structure prediction accuracy in cases with limited MSA compared to other baselines.
About Question-2: the explanation of decreased pLDDT. The predictive metrics estimated by AlphaFold2, such as pLDDT and pTM, measure the confidence level of AlphaFold2's predictions rather than the true structural prediction accuracy, and pLDDT can be easily influenced by adding more MSAs. We conducted a controlled experiment comparing the performance of the original sequences alone, sequences augmented with randomly generated MSAs, and sequences augmented with MSAGPT MSAs, as shown in Table 5 in the attached PDF. Even with randomly generated MSAs, the pLDDT and pTM values are higher than with the original sequences, while the TM-score decreases due to the introduction of noisy MSAs. Therefore, we adopt the TM-score as the primary metric to select high-quality MSAs for the subsequent RFT and DPO processes. Additionally, we provide other oracle metrics, including GDT_TS and LDDT, in Tables 1-3 in the attached PDF. The results show that with TM-score as the reward signal, the other oracle metrics, i.e., GDT_TS and LDDT, also improve, while the predicted metric pTM shows the opposite trend. Thus, the reasonable explanation for the decreased pLDDT and pTM is that the post-alignment process reduces hallucination scenarios.
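To make the reward construction concrete, here is a hedged sketch of how preference pairs for DPO could be built from the TM-score signal (our illustration, not the paper's exact pipeline; `tm_score_of` is a hypothetical scorer that runs AF2 with the candidate MSA and compares against the ground-truth structure, and `margin` is an assumed hyperparameter):

```python
def build_dpo_pairs(query, sampled_msas, tm_score_of, margin=0.1):
    """Turn generated MSAs into (chosen, rejected) DPO examples, using the
    TM-score of the AF2 structure each MSA induces as the reward signal."""
    scored = sorted(((tm_score_of(m), m) for m in sampled_msas),
                    key=lambda pair: pair[0], reverse=True)
    pairs = []
    # Pair the best-scoring MSAs against the worst-scoring ones.
    for (s_hi, m_hi), (s_lo, m_lo) in zip(scored, reversed(scored)):
        if s_hi - s_lo >= margin:  # keep only clearly ordered pairs
            pairs.append({"prompt": query, "chosen": m_hi, "rejected": m_lo})
    return pairs
```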
About Question-3: correcting the clarification on the utilization of MSA in AF2. Thanks for your constructive feedback. We will incorporate your suggestions to clarify this in the next version of our paper.
About Question-4: the typo of "homogenous". Thank you for pointing out the typo. The correct term should be "homologous," which, in the context of biology, refers to organs or sequences that are similar in position, structure, and evolutionary origin, but not necessarily in function.
About Question-5: the semantics of MSA rows. The MSAs obtained by search tools, such as HMM search, inevitably contain noisy co-evolutionary patterns, such as large portions of deletions, insertions, and gaps. Many previous works aim to filter high-quality MSAs by clustering or ranking them based on predefined rules. One primary rule is to find MSA sequences most similar to the query sequence with fewer gaps, as these are more likely to represent informative co-evolutionary patterns. Following this idea, we sample MSA sequences using similarity-based weighted sampling, where sequences more similar to the query protein and with fewer gaps are more likely to be ranked higher and selected, as sketched below. Our ablation study results, shown in Figure 5, confirm that compared to the 2D positional encoding, a 1D positional encoding (which retains only the column position while abandoning the row ordering information) performs worse. This indicates that incorporating row-order semantics through similarity-based sampling improves performance.
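A minimal sketch of such similarity-based weighted sampling (our illustration; the identity and gap scoring and the `temperature` knob are assumptions, not the paper's exact rule):

```python
import numpy as np

def sample_msa_rows(query, msa_rows, k=16, temperature=0.1, seed=0):
    """Sample k MSA rows without replacement, favoring rows that are more
    similar to the query and contain fewer gaps."""
    rng = np.random.default_rng(seed)
    def score(row):
        ident = sum(a == b and a != "-" for a, b in zip(row, query)) / len(query)
        gap_frac = row.count("-") / len(row)
        return ident - gap_frac                  # similar, gap-poor rows score higher
    scores = np.array([score(r) for r in msa_rows])
    probs = np.exp(scores / temperature)
    probs /= probs.sum()
    idx = rng.choice(len(msa_rows), size=min(k, len(msa_rows)), replace=False, p=probs)
    return [msa_rows[i] for i in idx]
```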
About Question-6: the flattening rules. As shown in the overall framework in Figure 2 in the paper, we flatten the MSA along the row axis so that the MSA can be generated sequentially during inference. Flattening along the column axis is also theoretically reasonable, as the ordering information is captured by the 2D evolutionary position embedding. However, when generating the MSA sequentially during inference, we would then need to reverse the positional IDs along the row axis. This differs from the training pattern and could introduce unforeseen errors, which requires further investigation.
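For reference, a small sketch of row-major flattening with the paired (row, column) position IDs that RoPE-2D consumes (token layout assumed for illustration; the paper's special tokens and separators may differ):

```python
def flatten_msa(msa_rows):
    """Row-major flattening of an MSA into one token stream, recording the
    2D (row, column) position of every token for RoPE-2D."""
    tokens, row_ids, col_ids = [], [], []
    for i, row in enumerate(msa_rows):           # row 0 is the query sequence
        for j, residue in enumerate(row):
            tokens.append(residue)
            row_ids.append(i)
            col_ids.append(j)
    return tokens, row_ids, col_ids
```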
About Question-7: the degenerate cases after DPO. The case studies showing the differences in generated sequences before and after DPO are already presented in Appendix Section G and Figures 11 to 15. Generally, we do not observe significant differences in the generated sequences, except for a few degenerate cases, including generated MSAs that do not match the length of the query sequence or that contain only gaps or a single type of residue. The proportion of these degenerate MSAs is approximately 6% for DPO-generated MSAs, compared to 1% for the base model. To address these degenerate cases, online RLHF methods such as PPO, which directly involve AF2 in the training procedure, may be more effective. However, this approach faces efficiency issues due to the low inference speed of AF2 and needs further investigation.
We guarantee that we will include all experimental results and discussions in the next version of our paper. Your feedback has been invaluable in refining our research. If we've addressed your concerns, we hope you might consider raising your score.
Thank you for your thorough response to my questions. I am pleased to see the new results and have raised my score to 'accept'.
Global Response on Newly-Added Comprehensive Evaluations and Claims
Dear Reviewers,
Thank you for your insightful feedback and constructive suggestions. We have incorporated additional experimental results, detailed in the attached PDF, and provided thoughtful discussions to address your concerns point by point, demonstrating the strengths of our work. Specifically, we have:
- Clarified the issue of data leakage in the pre-training data.
- Added statistical significance tests showing that our approach significantly outperforms baselines.
- Introduced a large artificially MSA-scarce zero-shot evaluation benchmark with approximately 8k newly released PDB entries.
- Evaluated our model in MSA-abundant scenarios using the CAMEO set (194 proteins).
- Included more comprehensive evaluations with additional metrics such as pTM, LDDT, and GDT_TS; advanced baselines such as EvoDiff, a diffusion-based sequence generation model; and efficiency comparisons.
We guarantee that all experimental results and critical discussions will be included in the next version of our paper. We appreciate the time and effort you have dedicated to reviewing our work. Your valuable comments have helped us refine our research. If our rebuttals have addressed your critical concerns, we hope you can consider raising your score.
Best regards,
Authors of MSAGPT
The paper introduces MSAGPT, a method that generates multiple sequence alignments (MSAs) by considering it as a 1D sequence generation task, using a 2D evolutionary positional encoding. The paper demonstrates that such generated alignments can substantially improve protein structure prediction accuracy in cases with low-depth MSAs.
The reviewers found the paper well-written and easy to follow, and the method novel and technically sound. They highlighted the impressive performance gains achieved in the shallow-MSA regime. The primary concerns raised were regarding the experiments, requesting more details and clarity regarding potential data leakage. The authors addressed these issues in their rebuttal, and post-rebuttal, all reviewers clearly recommended the paper be accepted.