MutaPLM: Protein Language Modeling for Mutation Explanation and Engineering
A unified framework harvesting protein language models for mutation explanation and engineering
Abstract
Reviews and Discussion
The paper presents MutaPLM, a framework designed to interpret and navigate protein mutations using protein language models. This approach utilizes a protein delta network to capture mutation representations and employs a transfer learning pipeline with a chain-of-thought strategy to leverage knowledge from biomedical texts.
Strengths
- This paper attempts to propose a general interpretable model for protein mutations.
- This paper compiles a mutation-text multimodal dataset, providing an excellent benchmark for future work.
- The code is available. Although I haven't had time to run it yet, I will try to run the code during the rebuttal phase to ensure the reproducibility of the experiments.
Weaknesses
- Are the PLM representations used in this study residue-level or protein-level embeddings? If the mutation affects very few residues, such as a missense mutation, will using protein-level embeddings result in h∆ being too small?
- Is it possible to provide some more practical mutation-related downstream task benchmark results? For example, predicting changes in protein properties or PPI?
- Is it possible to compare the proposed method with the predictive results of embeddings extracted by AF, since the description information of the mutation may already be included in the structural changes predicted by AF before and after the mutation?
- I do not deny that this is good work, but perhaps it is better suited to the benchmark and dataset track: the method offers limited innovation, and its interpretability and performance have not been verified on actual tasks related to protein properties.
Questions
N/A
Limitations
This paper discusses the limitations and points out the direction for future work.
Thank you for your appreciation of our model, dataset, and code. We address your concerns below.
Q1: Details of PLM representations and delta features
As detailed in Appendix A.1, the PLM representations used in this study are residue-level embeddings. Regarding the scale of h∆, we demonstrate the following:
- We calculate the average L2-norm on the MutaPLM dataset, which is 9.90 for the wild-type representations and 0.35 for the mutational representations h∆. Since our dataset only involves single-site (missense) mutations, this indicates that a mutation affecting very few residues will NOT lead to h∆ being too small.
- The protein delta encoder amplifies h∆, resulting in an average L2-norm of 1.04 for its outputs (the delta features). Additionally, we argue that the orientation of h∆ plays a more significant role in elucidating the evolutionary directions of mutations.
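The norm statistics above can be reproduced with a short sketch. The array shapes, the 16-dimensional embeddings, and the single-site perturbation below are purely illustrative placeholders, not the actual PLM dimensions:

```python
import numpy as np

def delta_norms(h_wt: np.ndarray, h_mt: np.ndarray):
    """Average per-residue L2-norms of the wild-type representation
    and the mutational delta h_delta = h_mt - h_wt."""
    h_delta = h_mt - h_wt
    wt_norm = np.linalg.norm(h_wt, axis=-1).mean()
    delta_norm = np.linalg.norm(h_delta, axis=-1).mean()
    return wt_norm, delta_norm

# Toy example: 10 residues, 16-dim embeddings, one perturbed residue
# (mimicking a single-site missense mutation).
rng = np.random.default_rng(0)
h_wt = rng.normal(size=(10, 16))
h_mt = h_wt.copy()
h_mt[3] += 0.1 * rng.normal(size=16)  # perturb residue 3 only
wt_norm, delta_norm = delta_norms(h_wt, h_mt)
```

Even though only one residue changes, the delta norm is small but nonzero, consistent with the observation that single-site mutations still yield usable h∆ signal.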
Q2: Experiments on prior mutation-related benchmarks
Thanks for your suggestions. We have tested MutaPLM on several mutation downstream benchmarks, including Spike-ACE2 [1] and avGFP [2] involving protein-protein interaction (virus-receptor binding) and protein properties (fluorescence intensity). Details of the benchmarks and our implementation are presented in the global rebuttal R2, and the results and case analysis are presented in Table 3 and Figure 1 in the uploaded PDF. We observe that MutaPLM achieves competitive performance with fine-tuned mutation models on these realistic benchmarks. We plan to extend our experiments to more protein fitness benchmarks in a future revision of our paper.
Q3: Comparisons with AlphaFold representations
Thank you for your insights. We manually inspect 500 test samples from our dataset and observe that 23 of them describe specific alterations of the protein structure explicitly.
To explore the impacts of structural representations for mutation explanation, one can use the first row of MSA representations calculated by the EvoFormer of AF2 [3] as the residue-level protein representations and feed the representations of the wild-type and the mutant into an LLM. Unfortunately, calculating AF representations on our dataset would require approximately 4000 GPU hours. Hence, as stated in our limitations, we reserve analyzing alterations of 3D structures for future investigation.
Q4: This paper is more suitable for the benchmark and dataset track
We argue that our contributions are threefold: the protein delta network, the training strategies, and the dataset. We would like to highlight the innovations of our methodology, as recognized by Reviewer eM8T, in the following:
- Compared with prior works, we are the first to model mutations explicitly with delta features.
- We propose a chain-of-thought strategy for explaining and engineering mutations, which has not been explored in previous multi-modal protein-text LLMs [4, 5, 6].
For mutation interpretation, we have validated the interpretability of our model with qualitative evaluations in Figure 2 and Figure A2 in our paper, as well as additional quantitative evaluations on actual protein fitness benchmarks. For mutation engineering, we have reported the fitness optimization results on 6 realistic datasets in Figure 5 in our paper. Hence, we believe that our work bears potential in real-world applications, and is suitable for the main track of the conference.
Refs.
[1] Shifting Mutational Constraints in the SARS-CoV-2 Receptor-binding Domain during Viral Evolution.
[2] Local Fitness Landscape of the Green Fluorescent Protein.
[3] Highly Accurate Protein Structure Prediction with AlphaFold.
[4] Mol-Instructions: A Large-Scale Biomolecular Instruction Dataset for Large Language Models.
[5] ProtLLM: An Interleaved Protein-language LLM with Protein-as-word Pre-training.
[6] ProLLaMA: A Protein Language Model for Multi-Task Protein Language Processing.
In the paper entitled "MutaPLM: Protein Language Modeling for Mutation Explanation and Engineering," the authors propose a multimodal protein-text language model for understanding the effects of mutations and performing protein engineering. They also build MutaDescribe, the first large-scale protein mutation dataset with rich textual annotations.
Strengths
- The paper is generally well-written and easy to follow.
- The authors have constructed the first comprehensive protein mutation dataset enriched with textual annotations. This dataset represents a significant foundation for future research in this field.
- The MutaPLM framework introduced in this paper is innovative, particularly in its explicit modeling of mutations and its use of cross-modal transformers for multi-modal feature integration, enhancing its analytical capability.
- By integrating large language models, the proposed framework significantly simplifies protein engineering, offering an intuitive tool that could be readily adopted by biologists for advanced research.
Weaknesses
- The paper lacks a comparison with fine-tuned protein language models. Fine-tuned PLMs (ESM-1, ESM-2) have been validated as powerful for various downstream tasks, e.g., MLAEP (https://www.nature.com/articles/s41467-023-39199-6) and AugmentedESM (https://www.nature.com/articles/s41587-021-01146-5).
- The paper does not prove why textual annotation is necessary. From the ablation study, one can conclude only that the labeled information in the textual annotations makes the model powerful.
- The paper should add more discussion and experiments on why human-understandable notation is necessary. Human-understandable notations are not more informative than a conventional multi-label dataset. Moreover, LLMs may fail at regression tasks, while fine-tuned PLMs can do better.
Questions
1. The statement that "Protein language models (PLMs) fall short in explaining and engineering protein mutations" may need reconsideration. Recent studies, such as those involving ESM-1/ESM-IF1, have demonstrated these models' effectiveness in zero-shot engineering tasks. This contradicts the assertion of inherent limitations due to architectural design and lack of supervision. See https://www.nature.com/articles/s41587-023-01763-2 and https://www.science.org/doi/full/10.1126/science.adk8946
2. The manuscript would benefit from a deeper discussion of the MLDE methodology, particularly in the context of fine-tuning pre-trained protein language models like AugmentedESM (https://www.nature.com/articles/s41587-021-01146-5). A comparative analysis between MutaPLM and MLDE-based methods (e.g., AugmentedESM) could provide more clarity on their respective performances.
3. Based on point 2, further exploration of the role of textual descriptions in enhancing model performance would be advantageous. Clarification on how these descriptions integrate with the model to improve predictions would be helpful.
4. The performance of the model on regression tasks remains unclear. It would be instructive for the authors to include results or discuss how the model handles quantitative predictions in the context of protein functionalities.
Limitations
The authors addressed the limitations.
Thank you for your positive comments on our presentation, dataset, methodology, and application values. We address your concerns and answer your questions below.
Q1: Misleading statement of PLMs in mutation explanation and engineering
We apologize for this misleading statement in our abstract. We clarify it as follows:
- In our introduction, we argued that while PLMs have shown effectiveness in zero-shot mutation explanation and engineering [1, 2], their implicit modeling of mutations through evolutionary plausibility is not satisfactory for practical needs.
- The inherent limitations due to architectural design and lack of supervision arise in modeling mutations explicitly.
We will change this statement to:
"However, due to architectural design and lack of supervision, PLMs model mutations implicitly with evolutionary plausibility, which is not satisfactory to serve as explainable and engineerable tools in real-world studies."
Q2: Comparison between MutaPLM and MLDE-based methods
Thanks for your insightful consideration. If MLDE refers to Machine-Learning-based Directed Evolution, the discrepancies between MutaPLM and prior MLDE-based methods are as follows:
- Explicit modeling of mutations. MutaPLM models mutations explicitly with delta features via an encoder-decoder architecture. In contrast, existing MLDE-based models [2, 3] either model mutations implicitly with evolutionary plausibility or focus on the wild-type sequence instead of the discrepancies between the wild-type and mutant.
- Textual supervision. MutaPLM is a general framework for mutations that allows knowledge transfer across different wild-type proteins through natural language supervision. However, prior MLDE-based models are fine-tuned on a single wild-type protein with a phenotype of interest.
We perform comparisons between MutaPLM and fine-tuned PLMs on both mutation explanation and engineering. Please refer to our global rebuttal R1 for details of these baselines, and the PDF file for experimental results. We observe that MutaPLM consistently outperforms fine-tuned PLMs on both tasks, which demonstrates the effectiveness of our approach. Unfortunately, MLAEP is not comparable under our experimental settings, as the model is specifically designed for calculating the binding affinities of SARS-CoV-2 and its receptors and antibodies.
Q3: Further justification for the role of textual descriptions
We argue that textual descriptions are indispensable for general protein mutation modeling due to the following:
- Texts connect mutational knowledge from diverse proteins. The mutational effects of proteins are diverse and complicated, and the number of samples with a phenotype of interest is often limited. Therefore, constructing a multi-label dataset would lead to excessive classes and extremely imbalanced label distributions. In contrast, using texts helps combine supervision signals from diverse wild-type proteins and allows knowledge transfer across different phenotypes. This corroborates the intuition of CLIP models [4, 5], which use texts instead of conventional labels for cross-modal supervision.
- Texts provide a user-friendly interface for mutation explanation and engineering. As stated in our introduction, the evolutionary plausibility calculated by PLMs cannot meet the practical needs of studying mutations. Texts allow MutaPLM to interpret novel mutational effects from multiple facets and engineer proteins with user-defined properties even when the phenotype of interest has not been seen during training. Such capabilities cannot be obtained on a multi-label dataset.
For experimental justification, we have shown in our ablation studies that textual instructions bring 5.8% absolute gains in Recall@50 in mutation engineering. We further validate that the delta features have captured mutational knowledge through textual supervision. Specifically, we fine-tune the delta encoder and a regression head on two protein fitness datasets introduced in global rebuttal R2. From the results below we observe that textual supervision brings significant benefits to mutation explanation.
| Model | Spike-ACE2 | avGFP |
|---|---|---|
| w/o textual supervision | 0.401±0.029 | 0.579±0.016 |
| MutaPLM | 0.481±0.028 | 0.593±0.032 |
Q4: Applying MutaPLM on regression tasks
Similar to existing LLMs [6], we acknowledge that directly applying MutaPLM to regression benchmarks may lead to sub-optimal outcomes. We discuss strategies for handling quantitative predictions as follows:
- Discretizing numeric values into several segments and instructing the LLM to predict the discrete values [6].
- Performing comparison (which mutation leads to increased fitness) instead of regression, partly inspired by the reward model in InstructGPT [7].
- Fine-tuning a regression head using delta features as additional inputs. The motivation is that the delta features have captured mutational knowledge from massive biomedical texts that could benefit regression tasks.
We implement the third strategy (fine-tuning with delta features) on two datasets, Spike-ACE2 and avGFP. Please refer to our global rebuttal R2 for implementation details. The results are displayed in Table 2 in the uploaded PDF file, where MutaPLM achieves competitive performance with fine-tuned mutation models.
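As an illustration of the third strategy, here is a minimal numpy sketch of a 2-layer regression head that concatenates pooled delta features with a scalar fitness score. All dimensions and weights are hypothetical stand-ins, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

def regression_head(delta_feat, esm_fitness, W1, b1, W2, b2):
    """2-layer MLP: concatenate pooled delta features with an
    ESM-style fitness score and regress a scalar fitness value."""
    x = np.concatenate([delta_feat, [esm_fitness]])
    h = np.maximum(0.0, W1 @ x + b1)  # ReLU hidden layer
    return float(W2 @ h + b2)

d = 8  # hypothetical delta-feature dimension
W1 = rng.normal(size=(16, d + 1))
b1 = np.zeros(16)
W2 = rng.normal(size=(16,))
b2 = 0.0
y = regression_head(rng.normal(size=d), 0.3, W1, b1, W2, b2)
```

In practice the head would be trained with a regression loss (e.g., MSE against assayed fitness) while the upstream encoders stay frozen or lightly fine-tuned.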
Refs.
[1] Language Models Enable Zero-shot Prediction of the Effects of Mutations on Protein Function.
[2] Learning Protein Fitness Models from Evolutionary and Assay-labeled Data.
[3] Low-N Protein Engineering with Data-efficient Deep Learning.
[4] Learning Transferable Visual Models From Natural Language Supervision.
[5] ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts.
[6] Tx-LLM: A Large Language Model for Therapeutics.
[7] Training Language Models to Follow Instructions with Human Feedback.
I appreciate the efforts made by the authors during the rebuttal. Most of my concerns are addressed. I will raise my score as positive.
Thanks again for your favorable comments on our work! We are glad to have addressed most of your concerns. We are willing to provide additional information if you have any further questions.
The paper proposes a framework to (1) generate text-based mutation effects for mutated proteins and (2) propose new mutated sequences based on function descriptions. The main module is an encoder-decoder network, which encodes the representations of mutated sequences and outputs the position and amino acid of the mutation. The network is first pre-trained on protein literature and then fine-tuned on mutation effects.
Strengths
- The problem studied in this paper is novel and well-motivated: generate mutated sequences conditioning on the instructions, and generate mutation effects conditioning on the sequences.
- The method is technically sound.
- The paper is well-structured
Weaknesses
Most issues are on the evaluation side. Rigorous evaluations are very important for the AI4Science applications.
- Baseline Selection: The paper employs weak baselines for comparison. None of the baselines used have been specifically trained on mutations. This makes it difficult to accurately assess the true effectiveness of the method.
- Lack of Temporal Evaluation: While the paper adopts a structural split for evaluation, which is acceptable, a temporal-based evaluation would be more ideal and realistic. A temporal split, where some proteins are held out based on their discovery time, would more accurately reflect real-world scenarios in scientific applications.
- Weak Evaluation of Mutation Explanations: The use of GPT-4 to assess scientific explanations is not robust or scientifically sound.
- Missing experimental details. The paper omits several crucial experimental details, which harms reproducibility and thorough understanding of the methodology. Specific areas lacking detail include:
- Explain in detail how you tune the hyperparameters.
- What is the dataset for the protein literature?
- When constructing MutaDescribe, did you use only Swiss-Prot or the whole dataset? How did you extract the mutation explanations? How do you know whether it is expert-reviewed?
Questions
See above.
Limitations
NA
Thank you for your appreciation of our task, methodology, and writing. We address your concerns in evaluation as follows.
Q1: Additional supervised baselines.
We have added supervised baselines, including fine-tuned PLMs, for both mutation explanation and engineering. Please refer to our global rebuttal (R1) for more details and Tables 1 and 2 in the uploaded PDF file for experimental results. We observe that MutaPLM consistently outperforms supervised baselines on both tasks, thereby demonstrating the effectiveness of our approach.
Q2: Temporal evaluation.
We appreciate your insightful suggestion. We perform temporal splitting by extracting the publication dates of the corresponding literature from PubMed [1] for each mutation in our dataset. Mutations studied before 2022 are used as training and validation sets, while those studied in 2022 and 2023 comprise the test set. The train/valid/test set comprises 156K, 8K, and 1.6K samples, respectively.
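The splitting rule described above reduces to a filter on publication year; a minimal sketch (the `year` field and record layout are illustrative, not the actual dataset schema):

```python
def temporal_split(samples, cutoff_year=2022):
    """Split mutation records by publication year: entries published
    before `cutoff_year` go to train/valid, the rest to test."""
    train_valid = [s for s in samples if s["year"] < cutoff_year]
    test = [s for s in samples if s["year"] >= cutoff_year]
    return train_valid, test

# Toy records standing in for PubMed-dated mutation samples.
records = [{"id": 1, "year": 2019}, {"id": 2, "year": 2022}, {"id": 3, "year": 2023}]
train_valid, test = temporal_split(records)
```

The train/valid partition within `train_valid` would then be drawn by the usual random split.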
The experimental results for mutation explanation are below:
| Model | BLEU-2 (%) | ROUGE-L (%) |
|---|---|---|
| ProLLaMA | 0.69 | 0.80 |
| GPT4-0613 (5-shot, kNN) | 9.30 | 11.84 |
| Fine-tuned ESM | 6.90 | 12.62 |
| MutaPLM (Ours) | 10.83 | 16.51 |
The experimental results for mutation engineering are below:
| Model | Accuracy (%) | Recall@50 (%) |
|---|---|---|
| Random | 4.40 | 0.81 |
| ESM2-650M | 34.76 | 24.02 |
| ESM+BioMedBERT | 55.78 | 44.04 |
| MutaPLM (Ours) | 58.50 | 46.05 |
We observe that MutaPLM achieves promising performance on the temporal split and outperforms strong baselines, showcasing its potential in assisting real-world scenarios. We are working on evaluating other baseline models on the temporal split.
Q3: GPT-4 evaluation of mutation explanations.
We take 500 samples from our test sets and recruit a postgraduate student majoring in biology from a top university to assess the mutation explanations of MutaPLM, following the same categorization protocol as GPT-4. Below is the confusion matrix for manual and GPT-4 annotations.
| Human (rows) / GPT-4 (columns) | Accurate | Relevant | Opposite | Irrelevant |
|---|---|---|---|---|
| Accurate | 60 | 27 | 3 | 7 |
| Relevant | 7 | 86 | 0 | 19 |
| Opposite | 0 | 1 | 28 | 9 |
| Irrelevant | 1 | 7 | 5 | 240 |
We observe that GPT-4 evaluation is consistent with human experts in 82.8% of cases, although it occasionally misclassifies accurate predictions as relevant, and relevant or opposite predictions as irrelevant. We will report and discuss the manual evaluation results in a future revision of our paper.
Q4.1: Hyperparameters
Given the computational expense of the experiments, we do not specifically tune our hyperparameters. The rationales for our hyperparameter settings are as follows:
- The learning schedule and LoRA rank are derived from prior LLMs [2, 3].
- The batch size is selected to maximize GPU memory usage.
- The number of pre-training steps is determined based on convergence observations.
- The number of fine-tuning steps is based on evaluating the validation loss every 10K steps.
Q4.2: Protein literature dataset
As detailed in Appendix B.1, the protein literature dataset is collected from the Publication entry of proteins within UniProtKB/SwissProt [4] and PubMed [1]. We plan to publicly release this dataset in the future.
Q4.3: Details about MutaDescribe construction
All mutations are collected from UniProtKB Reviewed (Swiss-Prot), ensuring each sample has undergone expert review. The mutation explanations are obtained from the Phenotypes and Variants -> Description entry for each protein.
Refs.
[1] PubMed: The Bibliographic Database.
[2] ProtLLM: An Interleaved Protein-Language LLM with Protein-as-Word Pre-Training.
[3] BioMedGPT: Open Multimodal Generative Pre-trained Transformer for BioMedicine.
[4] UniProtKB/Swiss-Prot, the Manually Annotated Section of the UniProt KnowledgeBase: How to Use the Entry View.
Thanks for your newly-added experiments and clarifications! I have updated my score.
Thank you again for your positive feedback and insightful suggestions for improving our evaluation! Should you have any additional questions or require further clarification, please do not hesitate to let us know.
We extend our gratitude to all reviewers for their positive comments and constructive feedback. We hope that our responses and additional experiments could address the shared concerns satisfactorily.
(R1) Additional supervised baselines
While no prior work is specifically designed for text-based mutation explanation and engineering, we implement additional supervised baselines by fine-tuning existing protein language models or large language models.
For mutation explanation, we incorporate the following models:
- Fine-tuned ESM. We translate each residue representation of ESM2-650M [1] using a linear projection layer and fine-tune BioMedGPT-LM to explain mutation effects based on the translated features of the wild-type and mutant.
- AugmentedESM. We modify the regression model in the original paper [2] by feeding the adaptive fitness score calculated by ESM2-650M and the amino acid sequence into BioMedGPT-LM for fine-tuning.
For mutation engineering, we implement:
- ESM+BioMedBERT. We apply a cross-attention layer that takes the last hidden representations of ESM2-650M as queries and the BioMedBERT [3] encodings of textual descriptions of mutational effects as keys and values. The outputs are fed into the language modeling head of ESM2-650M to calculate the probability distribution for each mutation.
- BioMedGPT. We directly input the amino acid sequence of the wild-type protein and the desired mutational effects into BioMedGPT-LM [4]. We instruct the model to suggest a plausible mutation or an amino acid at the mutated position and perform fine-tuning.
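The cross-attention step in the ESM+BioMedBERT baseline can be sketched as a single-head attention over hidden states. This is a simplified stand-in with toy dimensions; the actual baseline uses ESM2-650M residue states and BioMedBERT text encodings with learned projections:

```python
import numpy as np

def cross_attention(protein_h, text_h):
    """Single-head cross-attention: protein residue states as queries,
    text encodings as keys and values. Returns text-conditioned
    per-residue states of the same shape as protein_h."""
    d = protein_h.shape[-1]
    scores = protein_h @ text_h.T / np.sqrt(d)          # (L_prot, L_text)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ text_h                             # (L_prot, d)

rng = np.random.default_rng(0)
out = cross_attention(rng.normal(size=(5, 8)), rng.normal(size=(3, 8)))
```

The resulting states would then be passed to the PLM's language modeling head to score candidate mutations at each position.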
The experimental results are displayed in Table 1 for mutation explanation and Table 2 for mutation engineering in our uploaded PDF. We observe that MutaPLM consistently outperforms supervised baselines on both tasks, demonstrating the effectiveness of our approach.
(R2) Evaluation on protein fitness benchmarks
To further justify the effectiveness of MutaPLM in interpreting mutations, we evaluate our model on two protein fitness regression datasets including:
- Spike-ACE2 [5]: This is a deep mutational scanning dataset that aims to predict the binding strength between SARS-CoV-2 variants and the receptor ACE2, which is critical for identifying potentially dangerous strains of the virus.
- avGFP [6]: This benchmark aims to predict the fluorescence intensity of GFP variants, which is beneficial for developing biomarkers.
We first visualize MutaPLM's explanations for several mutations within the two datasets, as shown in Figure 1 in the uploaded PDF, finding them reasonable and insightful. Then, following prior works [7, 8], we adopt a low-n setting with 192 randomly sampled training samples and 48 validation samples. We perform fine-tuning by feeding the adaptive fitness of ESM2-650M and the delta features of MutaPLM into a 2-layer MLP to predict the fitness scores. We compare our model with Ridge regression, ESM2-650M, AugmentedESM [2], Augmented EVmutation [9], ConFit [8], and Tranception_L [10]. The experimental results, displayed in Table 3 in the uploaded PDF, show that MutaPLM achieves competitive performance with fine-tuned protein mutation models, indicating that the delta features have captured protein mutational knowledge from natural language supervision. We observe that MutaPLM demonstrates performance comparable to Tranception_L on the Spike-ACE2 dataset and surpasses it on the avGFP dataset. This outcome is partly due to the backbone PLM in MutaPLM, and we speculate that substituting the current PLM with Tranception_L could yield further performance improvements. We plan to address these aspects in a future version of our paper and extend our experiments to additional protein fitness benchmarks.
Refs.
[1] Language Models of Protein Sequences at the Scale of Evolution Enable Accurate Structure Prediction.
[2] Learning Protein Fitness Models from Evolutionary and Assay-labeled Data.
[3] BioMedBERT: A Pre-trained Biomedical Language Model for QA and IR.
[4] BioMedGPT: Open Multimodal Generative Pre-trained Transformer for BioMedicine.
[5] Shifting Mutational Constraints in the SARS-CoV-2 Receptor-binding Domain during Viral Evolution.
[6] Local Fitness Landscape of the Green Fluorescent Protein.
[7] Low-N Protein Engineering with Data-efficient Deep Learning.
[8] Contrastive Fitness Learning: Reprogramming Protein Language Models for Low-n Learning of Protein Fitness Landscape.
[9] Mutation Effects Predicted from Sequence Co-variation.
[10] Tranception: Protein Fitness Prediction with Autoregressive Transformers and Inference-time Retrieval.
The paper presents a method called MutaPLM for interpreting mutations with protein language models, and a new dataset called MutaDescribe, containing data of mutations and associated textual annotations. Reviewers found the paper well-written, and were generally positive about the method and the work put into providing the new dataset. Some concerns were raised over a lack of sufficiently strong baselines, and the validity of using GPT4 to assess scientific explanations, but the authors provided a convincing rebuttal to address these issues. Ultimately, all reviewers recommended the paper be accepted.