Venus-MAXWELL: Efficient Learning of Protein-Mutation Stability Landscapes using Protein Language Models
Abstract
Reviews and Discussion
The paper introduces Maxwell: a framework to fine-tune a pre-trained model to predict the ΔΔG values of all single-substitution mutations of a reference protein, moving from sequence-label to sequence-landscape modeling. An unmasked reference protein is fed through a pre-trained protein language model, which returns per-position logits. The reference-sequence logits are subtracted, and the resulting matrix is interpreted as a mutation landscape. The logits are trained via a composite loss function to approximate experimental values. The paper curates a representative stability dataset which is used for benchmarking, where the proposed model outperforms other models.
Strengths and Weaknesses
Strengths
- Well-motivated problem to move from sequence-label to sequence-landscape mappings. I appreciate the efficiency comparisons.
- Simple idea of learning multiple values via regression on a full backbone.
- Beats ThermoMPNN across metrics.
- Well-written paper with clear notation.
- Rigorous handling of datasets and curation of novel stability dataset.
Weaknesses
- The novelty of the approach and the methodological contributions are limited. Perhaps this is outweighed by the increased utility of the model given its rapid inference. I am however missing clear cases where Maxwell would have a substantial impact. Directed evolution is used as an example, but it is unclear to me exactly how this would work given the lack of multi-mutant predictions.
- Confusing ad-hoc loss function where the Pearson correlation between experimental labels and model “logits” is maximized while the MSE between the labels and MLP-transformed logits is minimized. I do not understand the use of the MLP. If the role of the MLP is to map from logits to ΔΔGs, why is this not a core component of the model? See related questions. Perhaps I’m misunderstanding something.
- I am similarly missing a qualitative analysis of when the model performs well or poorly. Perhaps the model is just consistently better across proteins and dataset sizes, but it would nonetheless be illustrative to see.
- No code in the supplementary materials for reproducibility.
I will consider raising my score if the weaknesses - particularly 1 and 2 - are adequately and convincingly addressed.
Questions
- L44-46: As you point out later, most PLMs are either encoder-only or decoder-only. To state that most methods fail to harness the learned evolutionary patterns due to them only using the encoder while ignoring the decoder is therefore not entirely accurate in my opinion. For encoder-only models, the language model heads are typically deliberately simple MLPs which map from the latent representation to logits.
- L65: I think that the claim that existing stability datasets exhibit significant redundancy, data leakage, and misalignment should be quantitatively supported. Presumably, this should be easy to show. I think a figure - perhaps just in the appendix - showing these existing issues would show the benefit of the introduced dataset.
- L129: Phrasing: “primarily composed of mainly consisting of”
- Figure 2: I think the figure is slightly misleading w.r.t. the one-hot encodings as it shows these being fed directly into the PLM which would be odd. From the appendix, I can see that this is not actually the case as the input sequence is tokenized in two ways: regular tokenization for the PLM and one hot for the PLM output. I think this should be clarified in L148-150.
- Loss function. I am generally confused by the loss function, particularly the inconsistent use of the MLP and the motivation to optimize the Pearson correlation.
- Regarding equation 10, the Pearson correlation coefficient captures the linear correlation between the predicted and ground truth values. In that case, it is not actually a ranking, right? I think labeling the loss as a ranking loss is incorrect and misleading. Have you explored using ranking losses?
- L186: I am confused about the role of the MLP. The aim of the main architecture is to predict ΔΔG values, yet an MLP is used in the loss function to map from a logit-like space to ΔΔG space to compute the MSE loss. This necessarily means that the model outputs are not actually ΔΔGs. Why is that? Wouldn’t it make more sense to have the MLP as a component after the element-wise subtraction?
- In a similar vein, why is the MLP only used for the MSE loss and not the Pearson loss?
- Is the MLP in the MSE loss trained jointly with the rest of the architecture or is it trained separately?
- Section 5.2: How are the zero-shot scores computed? For zero-shot scoring using sequence-only PLMs, a position would typically be masked, and the logits of the masked position conditioned on the non-masked residues would typically serve as the fitness proxy. Is this how the values are computed? Or are full-sequence (pseudo-)likelihoods computed? If so, how? If the sequences are non-masked, does this not result in very peaked distributions, and wouldn’t it benefit encoder-only models to be scored using masking?
- L190: The weighting factor of the losses is 0.1. Why this value? Did you experiment with others?
- Figure 4: In the related works, many predictors are referenced. I think it could be beneficial and fairly straightforward to include more of these in the results to add more granularity.
- Is Maxwell consistently better than the other baselines? Or are there instances where it is outperformed? I think an analysis or discussion of some qualitative results would be interesting to see, such as potential failure modes or instances where Maxwell is superior.
- I have some questions regarding the role of Maxwell for directed evolution, which is proposed as a use case for the model. Since Maxwell cannot handle multiple mutations, what does DE with Maxwell actually look like? At each step, would a single mutation be suggested, the resulting variant would then be used as input for the next round? ESM-IF1 which is used as PLM is structure conditioned. This presumably means that the structure would change over time, particularly if stability is optimized. Wouldn’t structure prediction be a potential bottleneck which would overrule the inference-speed gains of the proposed architecture? I think some more motivation for how Maxwell is used in a DE pipeline would help show the utility of the model.
- You mention on lines 330-331 that Maxwell can potentially be used for diverse protein function prediction. How would this work? Would it be using the same framework but just substituting the stability data with other datasets?
Limitations
I think that the limitations of the framework should be discussed slightly more. This relates to the mentioned weaknesses and raised questions. E.g., when does the model fail? How sensitive is it to structure quality? How can it be used for DE if it cannot handle multiple mutations?
Final Justification
I am pleased with the authors’ rebuttal, which has addressed my raised weaknesses, questions, and limitations. I particularly appreciate the authors' acknowledgement that their model isn't necessarily ideal for multi-mutant modeling despite its capacity for it. The quantitative justification of redundancy in existing datasets, the detailed analyses across diverse proteins, the inclusion of additional baselines and experiments, and the clarifying explanations have, in my opinion, strengthened the quality and utility of the paper and its methods, and I therefore think the paper is of interest to the community and should be presented at this year's NeurIPS.
Formatting Issues
No formatting issues.
Thank you for carefully reviewing our manuscript and providing constructive feedback. Below we respond to your questions and concerns. We would appreciate it if you could let us know whether your concerns are addressed by our response.
Response to weakness 1.
(1) Applications. Directed evolution (DE) often begins with single-point mutations. MAXWELL can identify stable single mutations of the target protein for experimental validation, which can then be combined into multi-point mutants. The ΔΔG of multi-point mutants can be predicted by existing "train-on-single-and-predict-on-multi" models such as ECNet [1] or ProteinNPT [2]. Unlike these methods, MAXWELL requires no prior mutational data, making it useful in early-stage DE.
(2) Multi-Point Mutation Prediction. Although MAXWELL isn't trained on multi-point mutants, it is able to predict their stability via additive scoring: $\Delta\Delta G_{\text{multi}} \approx \sum_i \Delta\Delta G_i$, where $\Delta\Delta G_i$ is the individual predicted effect of each single-point mutation. On the multi-point mutations from the public Mega-scale ΔΔG dataset and the M1261 dataset, MAXWELL significantly improves prediction accuracy over the PLM baseline, increasing Spearman correlation from 0.317 and 0.336 to 0.597 and 0.492, respectively (Table R1).
Table R1. Multi-point mutant prediction performance
| Model | Mega-scale | M1261 |
|---|---|---|
| ESM-IF | 0.317 | 0.336 |
| MAXWELL (ESM-IF) | 0.597 | 0.492 |
| ThermoMPNN | Not Supported | Not Supported |
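For concreteness, additive scoring over the predicted landscape can be implemented in a few lines. The sketch below is our illustration rather than the authors' released code; the array shapes and names are assumptions.

```python
import numpy as np

def score_multi_mutant(landscape: np.ndarray, mutations: list[tuple[int, int]]) -> float:
    """Additive multi-point score: sum the predicted single-mutation effects.

    landscape: (L, V) matrix of predicted per-mutation stability effects.
    mutations: (position, mutant_token_index) pairs defining the multi-mutant.
    """
    return float(sum(landscape[pos, aa] for pos, aa in mutations))

# Toy usage with a random landscape (120 positions, 20 amino acids).
rng = np.random.default_rng(0)
landscape = rng.normal(size=(120, 20))
print(score_multi_mutant(landscape, [(10, 3), (42, 17)]))  # hypothetical double mutant
```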
Note that MAXWELL and other ΔΔG models (ThermoMPNN, StabilityOracle) primarily target single-point mutants, because:
- DE workflows prioritize reliable single-mutant prediction.
- Accurate multi-point predictions usually need prior single-mutant data, often unavailable.
- Multi-point datasets are typically smaller, restricting model effectiveness.
We hope this clarifies MAXWELL's design principles.
Response to weakness 2.
MLP serves as an auxiliary component rather than the core predictor for several reasons:
(1) The MLP is randomly initialized and lacks pretraining, whereas the PLM output head is pretrained and estimates ΔΔG via log-probability differences between mutant and wild-type residues.
(2) The MLP outputs ΔΔG values, but ranking is more relevant in practice, and rankings from the language-model head are more accurate. For instance, MAXWELL (ESM-IF) trained only with MSE loss achieves a Spearman correlation of 0.492, lower than the 0.506 achieved by the language-model head trained solely with ranking loss (Table 2 in the main text).
(3) The MLP mainly aids training by providing a ΔΔG-based loss. It improves PLM head performance (e.g., Spearman corr. from 0.506 to 0.517, Table 2 in main text).
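To make the two branches concrete, here is a minimal PyTorch sketch of a composite objective of this form: a Pearson-correlation (ranking-style) loss on the PLM-head scores plus a λ-weighted MSE on the MLP branch. Function names and the exact sign/weighting conventions are our assumptions, not the paper's released implementation.

```python
import torch
import torch.nn.functional as F

def pearson_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """1 - Pearson correlation; minimizing this maximizes linear correlation."""
    pred = pred - pred.mean()
    target = target - target.mean()
    r = (pred * target).sum() / (pred.norm() * target.norm() + 1e-8)
    return 1.0 - r

def composite_loss(head_scores, mlp_ddg, labels, lam=0.1):
    """Ranking-style loss on PLM-head scores + lambda-weighted MSE on MLP outputs."""
    return pearson_loss(head_scores, labels) + lam * F.mse_loss(mlp_ddg, labels)

# Toy usage on a batch of 32 observed mutations.
scores = torch.randn(32, requires_grad=True)
ddg_hat = torch.randn(32, requires_grad=True)
labels = torch.randn(32)
composite_loss(scores, ddg_hat, labels).backward()
```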
Response to weakness 3.
We performed a qualitative analysis to understand MAXWELL's behavior across proteins. We found a strong correlation (ρ = 0.685, R² = 0.469) between MAXWELL's performance and the PLM's zero-shot accuracy, confirming that our method effectively leverages pretrained knowledge. Correlations with protein length (ρ = -0.378, R² = 0.058) and dataset size (ρ = 0.307, R² = 0.022) were weak, suggesting robustness to these factors. Additionally, we examined structure quality using AlphaFold pLDDT scores and observed only a weak correlation with performance (ρ = 0.264, R² = 0.070), suggesting that MAXWELL is relatively insensitive to moderate structure noise. The slight size trend is driven by a few large datasets (>200 mutants) where MAXWELL performs consistently well. These findings will be included in the revision.
Response to weakness 4. We apologize for not including the code in the supplementary materials. Code and data will be released in the next version.
Response to question 1. Though simple, PLM output heads are pretrained to map representations to logits, enabling zero-shot ΔΔG prediction. To validate their importance, we re-initialized the head randomly. Table R4 shows that performance drops sharply, confirming its critical role.
Table R4. Performance of MAXWELL with and without the pretrained head (Spearman corr. on Test12K).
| Model | Spearman |
|---|---|
| MAXWELL (ESM-IF) with pretrained head | 0.517 |
| MAXWELL (ESM-IF) with randomly re-initialized head | 0.432 |
Response to question 2. We quantified the occurrences of duplicates and misalignments in publicly available stability datasets, including Myoglobin, S669, S8754, M1261, vb1432, Fireprotdb, and Thermomutdb. The results are presented in Table R5.
Table R5. Summary of Data Issues in Public Stability Datasets
| Dataset | Seq. Mismatch | Missing PDB ID | Missing Type | Duplicate Rows | Duplicate Mutant |
|---|---|---|---|---|---|
| Myoglobin | - | - | - | - | 21 |
| S669 | 333 | - | - | - | - |
| S8754 | - | 506 | - | - | 3912 |
| M1261 | - | - | - | - | 447 |
| vb1432 | - | 131 | - | - | 4 |
| Fireprotdb | - | 34 | 19 | 39600 | 5890 |
| Thermomutdb | - | 373 | - | - | 2425 |
Response to question 3. We apologize for the typo and will correct it.
Response to question 4. Thank you for pointing this out. The PLM input is indeed an integer token sequence. We will update the figure.
Response to question 5. In short, the MLP is only used for auxiliary tasks. Pearson correlation serves as an alternative ranking loss.
5.1 Pearson correlation serves as a simple yet effective ranking loss. We tested other ranking losses and observed similar performance (see Table R6).
Table R6. Performance of different ranking losses.
| Loss | Spearman |
|---|---|
| Pearson | 0.517 |
| ListMLE | 0.512 |
| Soft Spearman | 0.514 |
5.2 MAXWELL aims to rank ΔΔG values rather than predict their exact magnitudes, as ranking is more important in practice. The MLP is auxiliary (see our response to Weakness 2). Applying the MLP after the element-wise operation yields similar results (Spearman: 0.514).
5.3 Adding Pearson loss to the MLP branch had negative effect (Spearman: 0.507).
5.4 They are trained jointly. Using only MSE loss reduces performance, and MAXWELL (ESM-IF) drops to Spearman corr. 0.496.
Response to question 6. For zero-shot scoring, we input the full, unmasked protein sequence into a BERT-style PLM, which outputs an amino acid probability distribution at each position. The log-probability difference between the mutant and wild-type residues serves as the fitness proxy. We do not apply masking because (1) BERT-style models can handle unmasked inputs due to their pretraining (which includes random replacements and unchanged tokens), and (2) masking each position individually greatly increases computation and breaks the ability to compute the full mutational landscape in a single forward pass. In practice, the performance difference is negligible: for example, with ESM2, the Spearman correlation is 0.273 with masking vs. 0.268 without.
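A minimal sketch of this unmasked scoring scheme, assuming the PLM returns an (L, V) logit matrix for the wild-type sequence; names and shapes are illustrative.

```python
import torch

def zero_shot_landscape(logits: torch.Tensor, wt_tokens: torch.Tensor) -> torch.Tensor:
    """Log-probability differences vs. the wild-type residue at every position.

    logits: (L, V) PLM outputs for the unmasked wild-type sequence.
    wt_tokens: (L,) wild-type token indices.
    Returns (L, V) where entry [i, a] = log p(a | x) - log p(wt_i | x).
    """
    logp = torch.log_softmax(logits, dim=-1)           # (L, V)
    wt_logp = logp.gather(1, wt_tokens.unsqueeze(1))   # (L, 1)
    return logp - wt_logp                              # broadcast subtraction

# Toy usage: one forward pass yields the entire single-mutation landscape.
landscape = zero_shot_landscape(torch.randn(120, 20), torch.randint(0, 20, (120,)))
```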
Response to question 7. We tested various λ values (Table R7), and λ = 0.1 yields the best performance.
Table R7. MAXWELL (ESM-IF) Performance Across λ
| λ | Spearman | Pearson |
|---|---|---|
| 0.0 | 0.506 | 0.528 |
| 0.1 | 0.517 | 0.542 |
| 0.2 | 0.514 | 0.540 |
| 0.4 | 0.504 | 0.538 |
| 0.8 | 0.494 | 0.523 |
| 1.0 | 0.486 | 0.519 |
Response to question 8. We have evaluated additional models. Results are shown in Table R8:
Table R8: Performance of Additional Baseline Models.
| Model | Spearman | Pearson |
|---|---|---|
| FoldX | 0.410 | 0.431 |
| ProGen2 | 0.187 | 0.191 |
| Tranception | 0.244 | 0.254 |
Unfortunately, we were unable to reproduce StabilityOracle. The official code could not be executed.
Response to question 9. While MAXWELL outperforms baselines on average, it underperforms ThermoMPNN on 143 of 308 proteins (~46%). These often overlap with proteins where the language model's zero-shot performance is low (114/143 below average), highlighting the importance of the PLM's intrinsic capability. Fortunately, our method is general and can improve as language models advance.
Response to question 10.
We appreciate your valuable suggestion.
(1) MAXWELL's role in DE is discussed in our response to Weakness 1.
(2) MAXWELL focuses on single-point mutation prediction and does not directly address iterative design or multi-site optimization. Such tasks are better suited to models like ECNet [1] or ProteinNPT [2] that integrate modeling with experimental feedback. MAXWELL can be trained with experimental data, and we plan to explore this in future work.
(3) Although ESM-IF is structure-aware, its mutant scoring only requires the wild-type sequence and structure, not mutant structures. So it introduces no computational bottleneck.
We will clarify MAXWELL's role in directed evolution more explicitly in the revised Introduction. Thank you again for the insightful feedback.
Response to question 11. Yes, your intuition is correct. Theoretically, our framework is task-agnostic and can be adapted for other functional prediction tasks by substituting the training data. We chose ΔΔG as our primary focus because it is currently the ideal use case, benefiting from both the availability of large-scale public data and a theoretical link to PLM log-likelihoods [3]. Extending this framework to other tasks is a future direction, contingent on the curation of suitable datasets and potential loss function adjustments.
Response to limitations
We appreciate your suggestion to clarify the limitations. As discussed in Response to Weakness 3, MAXWELL's performance primarily depends on PLM prior quality and is relatively robust to structure quality. Its role in directed evolution is addressed in Response to Weakness 1. For more advanced DE tasks such as iterative design or combinatorial optimization, we plan to explore these directions in future work. We will make these points clear in the revised manuscript.
[1] ECNet is an evolutionary context-integrated deep learning framework for protein engineering. Nature Communications, 2021.
[2] ProteinNPT: Improving Protein Property Prediction and Design with Non-Parametric Transformers. NeurIPS, 2023.
[3] Zero-shot protein stability prediction by inverse folding models: a free energy interpretation. arXiv, 2025.
I am pleased with the authors’ rebuttal which has addressed my raised weaknesses, questions, and limitations. I appreciate the acknowledgement that while this single-point focused model is capable of multi-mutant modeling, iterative design and multi-site engineering efforts should ideally use dedicated modeling approaches instead. I will raise my score from 3 to 5 to reflect the updated manuscript, and I hope that the ablation results and further explanations presented in the rebuttal will feature in the final paper (if only in the appendix).
We are very grateful for your supportive feedback and for increasing your score. It's encouraging to hear that our rebuttal successfully addressed your concerns. We confirm that all of the ablation results and further explanations presented in our discussion will be integrated into the final version of the paper.
This manuscript addresses the problem of predicting protein mutant stability across a sequence of interest. Existing frameworks model the relationship between each mutant sequence and its corresponding stability score for every mutant/score pair, resulting in prohibitively high training and inference costs.
The authors propose a mutation landscape matrix constructed by subtracting the wild-type reference matrix from the logit matrix produced by a protein language model. In computational experiments, this approach is shown to perform better than existing frameworks such as Rosetta, ThermoMPNN, and ESM-IF+MLP. Moreover, the effectiveness of the proposed approach is demonstrated by its fast prediction capability.
Strengths and Weaknesses
Strengths
- Demonstrated efficiency and performance
Questions
This reviewer can understand that the proposed method is more efficient than existing methods, but cannot understand why it outperforms the existing methods in terms of prediction accuracy, since the existing methods explicitly learn from mutated sequences, whereas the proposed method learns them only implicitly through language models.
Limitations
The authors addressed the limitations of their work.
Final Justification
The authors have addressed my concern regarding the reason for the improved performance over existing embedding methods. I find their explanation convincing and therefore assign a final score of +1.
Formatting Issues
No concerns regarding paper formatting.
Thank you for your valuable feedback to help us improve our paper. We detail our response below, and please kindly let us know if you have any further questions.
Response to the question. We thank you for this thoughtful question, highlighting the need to explain why MAXWELL can outperform methods that explicitly regress from the embeddings of mutated sequences.
While it may seem counterintuitive that an implicit method outperforms an explicit one, the superior accuracy of MAXWELL stems from two key points:
- MAXWELL leverages the zero-shot mutation prediction capabilities of pretrained protein language models, providing a stronger starting point than methods that regress directly from the embeddings of mutated sequences. Protein language models (PLMs) are pre-trained on millions of naturally occurring proteins, learning statistical patterns that reflect evolutionary constraints -- unstable proteins are unlikely to appear in nature. Consequently, the probability ratio (or log-probability difference) between a mutant amino acid and the wild-type residue at a given position can serve as a proxy for evolutionary preference. Larger differences suggest mutations that are more evolutionarily plausible and potentially more stable.
Importantly, MAXWELL retains every component of the protein language model, including its output head, thereby preserving the model's full zero-shot capability. Its initial performance is therefore equivalent to the zero-shot performance of the PLM itself. In contrast, methods that regress from the embeddings of mutated sequences typically discard the original language model head and instead attach a randomly initialized MLP regressor. As a result, their initial predictions are nearly random, and they lose the valuable prior knowledge encoded in the pretrained output head. To validate the importance of the pretrained output head, we include an ablation study in which the language model head is randomly re-initialized. As shown in Table R1, this leads to a drop in performance, confirming the critical role of the pretrained head in enabling MAXWELL's accuracy. Additionally, as demonstrated in our paper, if the entire PLM is randomly initialized, performance drops dramatically, falling even below the zero-shot baseline. Taken together, these results prove that MAXWELL’s primary advantage lies in its ability to effectively leverage the evolutionary information captured by the PLM during pre-training. Because MAXWELL is compatible with various PLMs, its performance is poised to improve as this foundational technology continues to advance.
Table R1. Performance of MAXWELL with and without the pretrained PLM
| Model | Spearman |
|---|---|
| MAXWELL (ESM-IF) with pretrained head | 0.517 |
| MAXWELL (ESM-IF) with randomly initialized head | 0.432 |
| MAXWELL (ESM-IF) with randomly initialized whole PLM | 0.202 |
| ESM-IF zero-shot | 0.375 |
- MAXWELL utilizes a matrix-based fine-tuning algorithm that enables efficient training on protein mutation ΔΔG data and supports direct landscape-level prediction for unseen proteins. In contrast, traditional methods regress from the embeddings of individual mutated sequences and produce a scalar prediction for each mutation. MAXWELL instead reformulates the task as a two-dimensional mutational landscape regression problem. This matrix formulation treats the set of single-point mutations as a structured landscape matrix, enabling the model to learn a protein's mutational landscape within a single forward pass. This design not only accelerates training but also improves generalization across diverse proteins. To assess the impact of the matrix formulation, we conducted an ablation study in which the model was fine-tuned using a conventional pointwise regression approach. Each individual mutation was scored using a mean squared error (MSE) loss in this setting, corresponding to likelihood-based fine-tuning without the matrix structure. As shown in Table R2, removing the matrix-based structure resulted in a clear drop in performance and slower convergence, with the Spearman correlation decreasing from 0.507 to 0.401.
Table R2. Performance of matrix-wise (MAXWELL) and pointwise regression on ΔΔG landscape prediction (Base model: ESM-IF).
| Fine-tuning Method | Landscape Structure | Spearman | Training Speed |
|---|---|---|---|
| MAXWELL | 2D landscape | 0.507 | Fast |
| Point-wise, likelihood-based | 1D (single mutation) | 0.401 | Slow |
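As an illustration of the matrix-wise formulation, the training loss can be computed over only the experimentally observed entries of the landscape in a single pass. This is our sketch under assumed names (e.g., `observed_mask`), not the released training code.

```python
import torch

def masked_landscape_mse(pred_landscape, ddg_matrix, observed_mask):
    """MSE restricted to the experimentally observed landscape entries.

    pred_landscape, ddg_matrix: (L, V) tensors; observed_mask: (L, V) bool.
    """
    diff = (pred_landscape - ddg_matrix)[observed_mask]
    return (diff ** 2).mean()

# Toy usage: roughly 10% of entries carry experimental labels.
L, V = 120, 20
mask = torch.rand(L, V) < 0.1
loss = masked_landscape_mse(torch.randn(L, V, requires_grad=True), torch.randn(L, V), mask)
loss.backward()
```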
We will revise the main paper to explicitly clarify why MAXWELL outperforms methods that regress from mutated sequence embeddings.
Thank you for the detailed explanation and the additional results. I am convinced by your response and will therefore raise my score.
Thank you for taking the time to review our response. We're happy to hear that our explanations and new results addressed your points. We appreciate your engagement and support in helping us improve the paper.
This paper introduces MAXWELL, a framework for predicting protein mutation stability. Instead of predicting individual mutation effects separately, MAXWELL predicts an entire mutation landscape matrix in a single forward pass. The authors demonstrate improved computational efficiency while maintaining competitive predictive performance compared to existing sequence-to-label methods.
Strengths and Weaknesses
Strengths:
- The proposed method is simple yet novel and sound. This matrix-based approach allows the model to learn relationships between different mutations more effectively, as mutations at different positions can influence each other through the shared representation. Moreover, the reported test-time speedups (10× faster than ThermoMPNN, orders of magnitude faster than MLP baselines) make this approach valuable for some protein engineering applications.
- The dataset construction is rigorous. The dataset curation process is well-documented and addresses a common issue in many papers of the field: test data leakage. The authors use mmseq to filter the sequences with >30% identity between train/test sets and then conduct further verifications using pairwise global sequence alignment.
Weaknesses:
- Several aspects of the experimental setup lack proper justification or transparency. For example, the weighting factor λ = 0.1 in the joint loss appears arbitrary, and the MLP architecture details (dimensions of W₁, W₂) are not specified. It's also unclear whether "optimal performance" in tuning learning rates refers to validation loss (and how is the validation set constructed?) or, inadvertently, test-set performance, which would be problematic.
- The introduction is not well-written and weakens the motivation for the project. Overall, the introduction fails to adequately motivate why learning mutation landscapes jointly is superior to independent predictions beyond computational efficiency. I am confused by the "initial performance of ΔΔG prediction is equivalent to the zero-shot prediction performance of the PLM" (lines 62-64) since the model is also fine-tuned in the proposed method.
Questions
- Experiment details: Can you provide more details on the hyperparameter tuning process? Specifically:
- What was the search space for the learning rate grid search?
- How was "optimal performance" defined during hyperparameter selection?
- Can you provide ablation studies for the loss weighting factor λ?
- What are the specific dimensions used in the MLP projection layer? How sensitive is performance to these choices?
- Can you detail the advantage of your method compared to the conventional approach beyond computational efficiency? Can you clarify the statement about "equivalent to zero-shot prediction performance"?
Limitations
Yes
Final Justification
The author's response has resolved my concerns about the rigor of the experimental design. I believe the proposed approach will benefit future work in this area. Hence I recommend acceptance.
Formatting Issues
NA
Thank you for the insightful comments. We have addressed each of your concerns below and will incorporate these clarifications into the revised manuscript. We hope our responses resolve your concerns and are happy to discuss further if any points remain unclear.
Response to weakness 1: To answer your question thoroughly, we'll address it in points.
Point 1.1 “The weighting factor λ = 0.1 in the joint loss appears arbitrary.”
Response: We agree the choice of hyperparameters requires justification. The value for λ was determined empirically through an ablation study, with the results shown in Table R1.
The model achieved the best performance on the test datasets when λ = 0.1. This result suggests that the framework performs optimally when the ranking loss serves as the primary training objective and the MSE loss acts as a beneficial auxiliary objective to align the scale of the predictions. We will add this ablation study to the appendix of our revised manuscript.
Table R1. Performance of MAXWELL (ESM-IF) for various λ values
| λ | Spearman | Pearson |
|---|---|---|
| 0.0 | 0.506 | 0.528 |
| 0.1 | 0.517 | 0.542 |
| 0.2 | 0.514 | 0.540 |
| 0.4 | 0.504 | 0.538 |
| 0.8 | 0.494 | 0.523 |
| 1.0 | 0.486 | 0.519 |
Point 1.2 “The MLP architecture details (dimensions of W₁, W₂) are not specified.”
Response: The MLP consists of two hidden layers. The dimensions are $W_1 \in \mathbb{R}^{V \times D}$ and $W_2 \in \mathbb{R}^{D \times V}$ with $D = V$, where $V$ represents the vocabulary size of the protein language model (PLM) (e.g., 35 for ESM-IF and 25 for ProSST). A hidden activation function $\sigma$ is applied between $W_1$ and $W_2$.
The MLP's operation is defined as $\mathrm{MLP}(X) = \sigma(X W_1)\, W_2$. We will revise this section in the paper.
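In PyTorch terms, this corresponds to something like the following; the specific activation function is our assumption.

```python
import torch.nn as nn

V = 35  # vocabulary size for ESM-IF (25 for ProSST); D = V in the paper
mlp = nn.Sequential(
    nn.Linear(V, V),  # W1: logit space -> hidden (D = V)
    nn.ReLU(),        # hidden activation (exact choice assumed)
    nn.Linear(V, V),  # W2: hidden -> ddG-scale outputs per residue type
)
```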
Point 1.3 “It's unclear whether "optimal performance" in tuning learning rates refers to validation loss (and how is the validation set constructed?), or inadvertently test set performance, which would be problematic.”
Response: We selected hyperparameters based on 5-fold cross-validation. All validation sets were derived solely from the training set, ensuring no leakage of test data into the training or validation process. Our procedure for selecting the learning rate (lr) and training epochs is as follows:
TrainingSet = {T1, T2, T3, T4, T5}
For each lr in {1e-3, 1e-4, 1e-5, 1e-6, 5e-3, 5e-4, 5e-5, 5e-6}:
    Initialize scoreset as an empty list
    For i from 1 to 5:
        InternalTrainingSet = TrainingSet \ {Ti}
        InternalValidationSet = Ti
        Train model on InternalTrainingSet for 10 epochs:
            For each epoch in 1..10:
                Record Spearman correlation score s_j on InternalValidationSet
                Track the epoch with the best score
        Add the best score of this fold to scoreset
    CurrentLrScore = Average(scoreset)
End For
Table R2 illustrates the performance of different learning rates during 5-fold cross-validation. We selected 5e-5 as the optimal lr, as it demonstrated the best performance across the 5-fold cross-validation. The best Spearman correlations on the validation sets for this lr were achieved at epochs {7,7,6,8,7} respectively, averaging 7 epochs. Subsequently, we trained the final model on the entire training dataset for 7 epochs, without any intermediate evaluation during this phase. After training, the model's performance was directly evaluated and reported on the test set. Throughout this entire process, the test set (Test12K) was used exclusively for reporting the final scores and was not involved in any hyperparameter tuning, thereby preventing any data leakage.
We will supplement the details of our hyperparameter selection process in the Appendix.
Table R2. Performance of MAXWELL (ESM-IF) with different learning rates in 5-Fold cross-validation
| Lr | Avg Spearman |
|---|---|
| 1e-3 | 0.278 |
| 1e-4 | 0.499 |
| 1e-5 | 0.492 |
| 1e-6 | 0.501 |
| 5e-3 | 0.184 |
| 5e-4 | 0.416 |
| 5e-5 | 0.517 |
| 5e-6 | 0.508 |
Response to weakness 2: To answer your question thoroughly, we'll address it in points.
Point 2.1 On Motivation Beyond Computational Efficiency
Response: Our framework offers several key advantages:
Versatility with Protein Language Models: As PLMs continue to advance, our model is designed as a universal fine-tuning framework, compatible with diverse protein language models (PLMs), including masked language models and inverse folding language models. This allows researchers to easily integrate the latest or most suitable PLM for their target protein without re-designing the architecture.
Novel Fine-tuning Methodology: We introduce a novel matrix-based fine-tuning paradigm. Unlike existing approaches that often involve adding separate regression heads or utilizing non-matrix-based likelihood methods, our method directly and elegantly fine-tunes the PLM's output logits, offering a more streamlined way to adapt PLMs for specific tasks.
Maximized PLM Knowledge Utilization: MAXWELL fully capitalizes on the knowledge embedded within PLMs. The insights gained during the unsupervised pre-training phase of PLMs are largely preserved and directly integrated. By inheriting the pre-trained PLM, MAXWELL starts from a significantly advanced foundation.
Superior Accuracy: These advantages translate to superior empirical performance. MAXWELL (ESM-IF) outperforms the current state-of-the-art model, ThermoMPNN. Furthermore, ensembling our model with ThermoMPNN boosts the Spearman correlation to 0.548 on the Test12K dataset.
We will elaborate on these model advantages in the Introduction section.
Point 2.2 On Clarifying "Initial Performance"
Response:
The phrase "initial performance" refers to the predictive ability of our framework before any fine-tuning begins. This intrinsic capability is equivalent to the zero-shot ΔΔG prediction ability of the underlying PLM itself.
MAXWELL is initialized with the parameters of a pretrained PLM. Our framework's architecture is designed to be a differentiable, matrix-based version of the PLM's standard zero-shot scoring function. PLMs, pre-trained on millions of protein sequences, can compute a likelihood for a given protein sequence. The difference in likelihood between a mutant sequence and its wild-type counterpart can then be used to estimate the mutation's effect and ΔΔG. This approach, which doesn't require training on mutation-specific data, is known as zero-shot prediction. MAXWELL inherits this powerful zero-shot capability and can further enhance it through fine-tuning.
From a mathematical perspective, each element of the output matrix $L$ from an un-fine-tuned MAXWELL model is equivalent to the log-probability difference $\log p(x_i = \text{mut}) - \log p(x_i = \text{wt})$, which represents the PLM's zero-shot output for ΔΔG estimation. The relationship between ΔΔG and this logarithmic ratio has also been mathematically derived [1].
Response to question 1: We primarily focused on optimizing the learning rate and number of epochs. The adjustment process relied on 5-fold cross-validation. This involved splitting the training set into five distinct folds. In each iteration, we used four of these folds for training and reserved the remaining one for validation. This process was repeated five times, and the average score from these five evaluations was used to determine the best hyperparameter combination.
Once the optimal hyperparameters were identified, we trained the final model on the entire training dataset using these tuned parameters. Afterward, the model's performance was evaluated on the test set. We strictly ensured that only the training set was used during the hyperparameter tuning process, thereby preventing any data leakage.
Specifically:
Point 1.1 What was the search space for the learning rate grid search?
Response: The lr search range was set to {1e-3, 1e-4, 1e-5, 1e-6, 5e-3, 5e-4, 5e-5, 5e-6}.
Point 1.2 How was "optimal performance" defined during hyperparameter selection?
Response: Optimal performance refers to the mean Spearman score achieved across the validation sets in a 5-fold cross-validation, given a specific set of hyperparameters. A higher mean Spearman score indicates a better hyperparameter combination.
Point 1.3 Can you provide ablation studies for the loss weighting factor λ?
Response: Yes, we also conducted an ablation study on the loss weighting factor λ, with the results presented in Table R1 (see also our response to Point 1.1 above).
Point 1.4 What are the specific dimensions used in the MLP projection layer? How sensitive is performance to these choices?
Response: The MLP consists of two hidden layers. The dimensions are $W_1 \in \mathbb{R}^{V \times D}$ and $W_2 \in \mathbb{R}^{D \times V}$, where $V$ represents the vocabulary size of the protein language model (PLM) (e.g., 35 for ESM-IF and 25 for ProSST). A hidden activation function $\sigma$ is applied between $W_1$ and $W_2$, so the MLP's operation is defined as $\mathrm{MLP}(X) = \sigma(X W_1)\, W_2$.
The MLP's dimensions are designed to align with the protein language model's (PLM) vocabulary size. Here, V is a fixed vocabulary size, but we can consider adjusting the dimension D. In our paper, we set D=V.
We conducted an ablation study on the size of D, and the results are presented in Table R3. Taking ESM-IF as an example, we performed six sets of experiments with D ∈ {V, 64, 128, 256, 512, 1024}, while keeping other hyperparameters constant. The results in Table R3 indicate that the model's performance is relatively insensitive to the intermediate layer dimension. Specifically, the model achieved its best performance when the intermediate layer size was set to D = V.
Table R3. Model performance with varying hidden layer sizes
| D | Avg Spearman |
|---|---|
| V(35) | 0.517 |
| 64 | 0.506 |
| 128 | 0.516 |
| 256 | 0.515 |
| 512 | 0.505 |
| 1024 | 0.487 |
Response to question 2: Please refer to our response to Weakness 2.
[1] Zero-shot protein stability prediction by inverse folding models: a free energy interpretation. arXiv, 2025.
Thank you for your response! My concerns about the experiments have been resolved after reading your rebuttal. I have increased my score to 5.
We are very grateful for your supportive feedback. Thank you for taking the time to review our rebuttal and for increasing our score. We're delighted that we were able to resolve your concerns.
This paper proposes Matrix-wise landscape learning(MAXWELL), a novel framework for efficiently learning protein mutation stability landscapes. MAXWELL transforms the mutation prediction task from the traditional sequence-to-label approach to a sequence-to-landscape paradigm, leveraging protein language models to predict the ∆∆G of protein mutations in a matrix-driven manner. The framework enables the learning of an entire mutation landscape with just a single forward and backward propagation, significantly improving computational efficiency. Additionally, the authors constructed a large-scale protein mutation ∆∆G dataset with strict controls on data leakage and redundancy to ensure robust model evaluation. Experimental results demonstrate that MAXWELL outperforms existing state-of-the-art methods in both prediction accuracy and computational efficiency.
Strengths and Weaknesses
Strengths
1. MAXWELL employs a matrix-driven scoring approach, transforming mutation prediction into comprehensive landscape prediction. By integrating both ranking loss and MSE loss during training, it enables more holistic optimization of model performance.
2. Compared to traditional methods, MAXWELL demonstrates significant speed improvements in both training and inference, allowing efficient processing of large-scale mutation data.
3. The framework is compatible with various PLMs, effectively leveraging evolutionary information from pre-trained models to enhance prediction accuracy.
Weaknesses
1. MAXWELL primarily focuses on modeling single-point saturation mutations, and does not yet address the more complex multi-point mutation combinatorial effects, which limits its applicability in complex mutational scenarios.
2. The paper lacks a detailed theoretical explanation for why the probabilities of mutant and wild-type proteins can be used to predict mutation scores and ΔΔG values. Although a probability-based scoring formula is proposed, the underlying theoretical basis and assumptions are not sufficiently elaborated.
3. The experimental validation in this study was conducted only on the author-constructed Test12K dataset, without additional testing on publicly available benchmarks.
4. The paper only compares the proposed framework with zero-shot and MLP baselines, omitting comparisons with other fine-tuning frameworks.
Questions
1. In Equation (2), the order of probability calculations for the mutant and wild-type proteins appears inconsistent with the sequence described in the "Method" section. Could this be a typographical error?
2. Does the training set contain a sufficient number of mutant variants to ensure scalability across all mutation sites and amino acid types? If the MASK matrix is overly sparse, could this adversely affect the model's generalization capability?
3. Did the authors attempt to combine MSE and ranking loss during MLP training to validate the effectiveness of introducing matrix transformations and mutation scoring?
4. MAXWELL fine-tunes all parameters of the PLM, whereas the MLP baseline freezes the PLM's parameters. Could this discrepancy lead to performance differences, and if so, has this been quantified to isolate the contribution of the proposed architecture?
Limitations
Yes, the authors have discussed limitations in the paper.
Formatting Issues
No
We would like to appreciate the reviewer for the evaluation and comments. We would appreciate it if you could let us know whether our response addresses your concerns.
Response to weakness 1: We agree that MAXWELL is primarily designed for training on single-point mutations. However, it can still make predictions for multi-point mutations using the additive prediction formula $\Delta\Delta G(M) \approx \sum_{(i,a) \in M} L_{i,a}$, where $M$ represents a multi-point mutant constructed from single-point mutations and $L$ is the mutation landscape matrix as defined in the paper.
Table R1 presents MAXWELL's performance on the multi-point mutation subset of the public Mega-scale ΔΔG dataset and M1261 dataset. The results confirm that MAXWELL effectively enhances multi-point mutation prediction capabilities, demonstrating its practical utility in this regard.
Table R1. Multi-point mutation prediction performance of related models
| Model | Mega-scale | M1261 |
|---|---|---|
| ESM-IF | 0.317 | 0.336 |
| MAXWELL (ESM-IF) | 0.597 | 0.492 |
While MAXWELL can handle multi-point mutation prediction tasks, our primary objective is to contribute a high-performance single-point mutation prediction model. This focus aligns with typical directed evolution workflows, which often begin with single-point mutations and then leverage experimental data to guide multi-point exploration. MAXWELL can provide initial stability predictions without experimental data. These recommendations can then be integrated with other models, such as ECNet and ProteinNPT, to guide the design of multi-point mutations, creating a powerful, iterative optimization cycle when combined with wet-lab validation.
Response to weakness 2: The idea that "probabilities of mutant and wild-type proteins can be used to predict mutation scores" is a well-validated conclusion in protein language models (PLMs). ProteinGym's benchmark quantifies the correlation between these PLM-derived probabilities and mutant phenotype, clearly demonstrating a notable association. This has become a key indicator for evaluating the performance of PLMs. Intuitively, pre-trained PLMs have learned the natural distribution of numerous protein sequences. Their calculated probabilities essentially reflect a mutant's likelihood of existing in nature—an inherent indicator of mutation stability, as unstable mutants rarely persist in natural contexts.
Theoretically, prior research [1] has established that the predictive power of these probabilities for stability stems from an approximate mathematical relationship: the change in thermodynamic stability (ΔΔG) correlates with the logarithmic ratio of probabilities derived from inverse folding models, expressed (up to sign convention) as $\Delta\Delta G \propto \log \frac{P(\text{mutant} \mid \text{structure})}{P(\text{wild-type} \mid \text{structure})}$.
To enhance clarity, we will provide a detailed theoretical explanation of this relationship at the beginning of the Methods section. We believe this will help readers better grasp the underlying mechanism.
Response to weakness 3: Test12K is a curated meta-benchmark composed exclusively of established, publicly available benchmarks to ensure a comprehensive and challenging evaluation. To directly address your concern, we have tested the performance of MAXWELL on several public benchmarks, comparing it against current state-of-the-art models.
As shown in Table R2, MAXWELL consistently outperforms ThermoMPNN and Stability Oracle on well-known benchmarks including p53, S669, and Myoglobin, all of which are subsets of Test12K. The evaluation protocol is as follows: for each dataset, we calculate the correlation score on a per-protein basis and then average these scores. A full evaluation against Stability Oracle on the complete Test12K was not possible, as the officially provided source code was non-functional.
The results demonstrate that MAXWELL achieves state-of-the-art performance on these public benchmarks. We deliberately excluded the popular ProteinGym benchmark from our evaluation. This exclusion is due to ProteinGym's overlap with the training data of both our model and the ThermoMPNN model, which would lead to data leakage and invalidate the conclusions.
Table R2. Model Performance on Public Benchmarks (Mean Per-Protein Spearman)
| Dataset | MAXWELL | ThermoMPNN | Stability Oracle |
|---|---|---|---|
| Test12K | 0.517 | 0.508 | - |
| p53 | 0.751 | 0.702 | 0.727 |
| s669 | 0.565 | 0.545 | 0.503 |
| Myoglobin | 0.724 | 0.620 | 0.669 |
| Ssym | 0.602 | 0.516 | 0.567 |
Response to weakness 4: Thanks for the valuable reminder. We have additionally compared MAXWELL with other fine-tuning methods (all based on the ESM-IF): likelihood-based fine-tuning and one-hot augmented fine-tuning.
As shown in Table R3, MAXWELL consistently achieves the optimal performance. It's worth noting that while these alternative methods can be adapted for this task, they were originally designed for intra-protein learning (predicting new mutations on a protein seen during training) rather than generalizing to entirely new proteins, which is the focus of our benchmark.
We will add these new comparative results to the baseline section of our revised paper to provide a more comprehensive evaluation.
Table R3. Performance comparison between MAXWELL and other fine-tuning frameworks
| Method | Spearman |
|---|---|
| MAXWELL | 0.517 |
| Likelihood-based fine-tuning | 0.401 |
| One-hot augmented fine-tuning | 0.442 |
Response to question 1: Thank you for your reminder. There was indeed a typographical error in Equation 2; the order of the mutant and wild-type probability terms has been corrected in the revised manuscript.
Response to question 2: Thank you for this important question. We agree that data sparsity is a key consideration. Our training set, Train226K dataset, is substantial, containing over 226,000 mutation entries across 255 proteins, with an average of around 887 mutations per protein. While the mutation landscape for any single protein is sparse (due to the experimental infeasibility of measuring all possible mutations), the scale and diversity of the combined dataset allow the model to learn generalizable patterns of sequence-stability relationships. MAXWELL's strong performance on the Test12K dataset, which exhibits low sequence and structural similarity to the training set, clearly demonstrates its robust generalization, indicating it's not merely memorizing sparse landscapes.
To verify how sparsity affects model generalization, we conducted ablation experiments. Using ESM-IF as the base model, we randomly removed p% of the data from the training set, where p = 0% corresponds to the original training setup. The results, presented in Table R4, show that our framework is exceptionally robust to data sparsity. Performance remains remarkably stable, with a Spearman correlation of 0.501 even when using only 10% (23K mutants) of the training data. A significant performance drop only occurs at the extreme of removing 95% of the data. This demonstrates that our model learns generalizable stability principles from the diverse collection of proteins, rather than overfitting to the sparse landscape of any single protein.
Table R4. Effect of training set sparsity on model performance
| Removed | Training mutants | Spearman |
|---|---|---|
| 0% | 226K | 0.517 |
| 20% | 181K | 0.517 |
| 40% | 136K | 0.515 |
| 60% | 90K | 0.512 |
| 80% | 45K | 0.509 |
| 90% | 23K | 0.501 |
| 95% | 11K | 0.485 |
Response to question 3: Thank you for this insightful question. We performed the suggested experiment, and the results validated our framework's design. As shown in Table R5, when we added the ranking loss to the MLP's output, the model's performance degraded. We hypothesize that primary evolutionary information resides within the language model's output head, and its score rankings correlate with protein mutation ΔΔG. This correlation makes ranking loss suitable for the language model's direct output. Conversely, the randomly initialized MLP lacks this inherent evolutionary information, rendering it unsuitable for ranking loss application. We will include these results in our ablation experiments.
Table R5. Impact of loss in MLP on model performance
| Setting | Spearman |
|---|---|
| Only use MSE loss during MLP training | 0.517 |
| Combine MSE and ranking loss during MLP training | 0.507 |
Response to question 4: We acknowledge this potential risk. We did not perform full training for all models due to prohibitive computational cost. While MAXWELL only requires a single computation per protein during full training, MLP baselines demand feature computation for each individual mutant sequence, making their full training exceptionally slow. We have supplemented full-training experiments for ESM-IF + MLP, including a LoRA variant (configured with lora_r = 8 and lora_alpha = 16), as presented in Table R6. (Note: full training for all baselines was limited by computing power and time constraints.)
The results in Table R6 indicate that the performance of ESM-IF (freeze) + MLP is slightly lower than that of ESM-IF (LoRA) + MLP but notably higher than that of ESM-IF (no freeze) + MLP. For the fully trained ESM-IF (no freeze), performance drops significantly after training. We speculate this decline is due to overfitting caused by the excessively large number of trainable parameters. We will include these results in our ablation experiments.
Table R6. Training results for ESM-IF + MLP
| Model | Spearman |
|---|---|
| MAXWELL | 0.517 |
| ESM-IF (freeze)+MLP | 0.302 |
| ESM-IF (no freeze)+MLP | 0.121 |
| ESM-IF (LoRA)+MLP | 0.327 |
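For reference, the stated LoRA hyperparameters correspond to a configuration like the following (using the HuggingFace peft library; the target modules are illustrative, not the authors' exact choice).

```python
from peft import LoraConfig

# r and lora_alpha as stated in the rebuttal; other fields are our assumptions.
lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.0,
                      target_modules=["q_proj", "v_proj"])
```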
[1] Zero-shot protein stability prediction by inverse folding models: a free energy interpretation. arXiv, 2025.
Dear reviewer y7hf
Have you had a chance to look at this rebuttal? Do you find that the authors address the weaknesses that you identified?
-AC
This paper introduces MAXWELL, a new method for prediction of protein stability. Unique features of the model include the ability to predict an entire landscape of mutations with a single forward pass, which provides considerable computational speedups compared to existing approaches. The authors also present a new ΔΔG benchmark set with better control of data leakage, and report state-of-the-art results. The reviewers raised some initial concerns regarding the original submission: the lack of theoretical foundation, the lack of testing on established benchmarks and relevant baselines, lack of justification of the experimental setup, and lack of convincing arguments for impact on the community. The authors provided an elaborate rebuttal, which was received positively, and all reviewers who responded to the rebuttal raised their score to a 5, recommending acceptance of the paper.