From Likelihood to Fitness: Improving Variant Effect Prediction in Protein and Genome Language Models
Abstract
Reviews and Discussion
The paper introduces Likelihood-Fitness Bridging (LFB), a conceptually simple method to improve the zero-shot fitness prediction capabilities of pre-trained protein language models. Instead of relying on a single estimate for the query sequence, homologs are retrieved and aligned, and the fitness effect is estimated as an average over these homologs. Focusing on single-sequence models at varying scales, the paper systematically shows that LFB improves zero-shot capabilities using as few as 10 homologous sequences. Notably, LFB alleviates the plateauing of zero-shot performance with model scale: with LFB, larger models tend to produce higher-quality predictions. The method is straightforward to conceptualize and implement, yet it is theoretically supported in the paper. The method is applied to predict the effects of substitution mutations and the pathogenicity of human variants.
Strengths and Weaknesses
Strengths
- Conceptually simple yet theoretically supported contribution with potential for real world impact.
- Thorough experimentation and result analysis.
- Addresses an important gap in the current protein language modeling literature, where model scaling doesn't inherently correlate with downstream performance - a phenomenon alleviated by the proposed method.
- Code included for reproducibility.
Weaknesses
- Averaging over effects of homologous sequences is not necessarily groundbreaking and parallels can be drawn to retrieval-augmented fitness predictors like Tranception and PoET. However, the simplicity, theoretical foundation, and performance improvements are highly appealing.
- No major weaknesses to point out. See questions.
Questions
- The hyperlinks should be double-checked - some point to main text figures instead of appendix figures (e.g., L38, L108, L150)
- Figure F.3: Does exact/inexact correspond to masked-marginal and unmasked-marginal scoring? It is not immediately clear from the legend or caption.
- Some figures can be tricky to read given the small font sizes, which often appear blurry despite zooming.
- The proof-reading seems a bit rushed, particularly in section 5.
- Why are the error bars not included in the main text figures?
- While Figure 4a shows that improvements occur using as few as 10 sequences, increasing this number does help performance. How many sequences are typically used? I think it could be interesting to see the performance difference for individual datasets as a function of the number of sequences used.
- In L46, it is stated that the cost is approximately 10x that of a single sequence. Is the suggestion for LFB then to use 10 sequences for estimation?
- L52: I’m not sure I entirely agree that 10x the inference cost is lightweight. Particularly if many more sequences, e.g., hundreds, are used and need to be aligned.
- Equation (2) details how fitness is estimated using masked marginals, yet in L155-158 it is stated that unmasked marginal scoring ends up being used for ESM-2. For completeness, I also think the resulting equation should be included - perhaps only in the appendix.
- Is there a way indels can be handled by LFB, e.g., for autoregressive models which typically manage indels more gracefully than encoder-only models?
- In the discussion it is mentioned that the tendency to overpredict pathogenicity can be overcome using LFB. Is there a risk of potentially underpredicting pathogenicity?
- Figure F.8: In some instances, the performance drops when averaging over homologous sequences. Why might that be? What characterizes these failure modes?
- Given the promising results for sequence-only models, would you expect similar improvements for hybrid models, e.g., those leveraging structure (ProSST, SaProt) or those leveraging alignments (Tranception, PoET, TranceptEVE)?
Limitations
The limitations are adequately described in the main text of the paper.
Final Justification
My raised questions have been adequately addressed and the resulting revisions have led to a more convincing manuscript. I will happily leave my score at 5 and would like to see the paper presented at this year's NeurIPS as I believe it is of interest to the community.
Formatting Issues
No formatting issues.
We are very grateful for your thorough review and interesting questions on the work. We address your questions below:
Questions 1, 2, 3, 4:
Thank you very much for pointing out these issues in the manuscript. We will fix these in the camera ready version.
While informative in some respects, we also worry that these error bars may be misinterpreted. As such, we opted to remove them from the main text but nevertheless keep them in the supplement. The error bars in Figure F4b are obtained by resampling the DMS assays; however, the problem is that the assays are diverse in numerous regards. For instance, the correlation between experimental replicates is low in some cases and higher in others. This means that the maximum possible Spearman correlation between a model and an experiment varies between assays, and hence there is a tendency for all models to correlate well with some assays and poorly with others. When resampling we get a sense of this variability, but this has little to do with the statistical significance of the performance gain between one model and another, and we worry that some readers may interpret them as such. We will clarify this in the text.
To complement this analysis, we performed two-sided permutation tests (100,000 iterations) for the significance of the mean difference in Spearman correlation between LFB and the baseline method, and found significance for both large models (p < 10⁻⁵). We propose to add this analysis to the manuscript.
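For concreteness, one standard way to implement such a paired test is sketched below (a sign-flip variant; function and variable names are illustrative, not our exact analysis script):

```python
import numpy as np

def paired_permutation_test(spearman_lfb, spearman_base, n_iter=100_000, seed=0):
    """Two-sided sign-flip permutation test for the mean difference of
    paired per-assay Spearman correlations."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(spearman_lfb) - np.asarray(spearman_base)
    observed = diffs.mean()
    # Under the null, each paired difference is symmetric around zero,
    # so its sign is exchangeable.
    signs = rng.choice([-1.0, 1.0], size=(n_iter, diffs.size))
    null = (signs * diffs).mean(axis=1)
    return float((np.abs(null) >= np.abs(observed)).mean())
```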
Results presented in Figures 2 and 3 and Table 1 used the full sequence neighborhood, without downsampling, except for the Evo 2 results, which were based on 9 randomly selected sequences (as outlined in Section 3.2). We agree that a non-aggregated picture of the effect of subsampling on individual datasets/proteins could be interesting. We will add it as a supplementary table.
Our statement is that light versions of LFB with 10 sequences yield significant performance gains. While, according to Figure 4a, 10 sequences seems like a reasonable choice for some applications, it was not our intention to make a blanket recommendation. We envisage practitioners performing their own cost-benefit analysis in the context of the application of interest.
Perhaps we are not always sufficiently precise about what we mean by lightweight and would be better off making comparative statements. If the model of interest is computationally heavy in the first place, then a 10x increase in the cost of scoring a sequence is not a negligible difference. On the other hand, scoring a sequence is generally a negligible cost compared to the cost of developing the model in the first place. Thus, as a strategy for improving performance, LFB is clearly extremely cheap compared to most, if not all, alternatives, such as developing a better model.
Regarding the computational cost of building sequence alignments: in the case of protein models, as far as we are aware, this is rarely a significant bottleneck. For instance, we found that building a UniRef50 alignment with MMseqs2 took around 20 seconds per sequence on 20 CPU cores, and with GPU-accelerated MMseqs2 this is reported to drop to around a second (Kallenborn et al. 2024). We are aware of some recent work, Protriever, that uses vector search to rapidly retrieve related protein sequences. This could be an interesting future direction for LFB or LFB-like models.
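As a rough illustration, the retrieval step can be scripted along these lines (the paths are placeholders, and this is a sketch rather than our exact pipeline):

```python
import subprocess

# query.fasta: the reference protein; uniref50_db: a prebuilt MMseqs2
# database of UniRef50. --min-seq-id mirrors the >=30% identity threshold.
subprocess.run(
    ["mmseqs", "easy-search", "query.fasta", "uniref50_db",
     "hits.m8", "tmp/", "--min-seq-id", "0.3"],
    check=True,
)
```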
For genomic language models the situation is different, as whole-genome alignment is extremely challenging. Fortunately, publicly available precomputed alignments exist, such as the Zoonomia alignment used in this work.
Thank you for the suggestion. We show the unmasked-marginal formula in Appendix D.2, Eq. 7, but it is currently easy to miss since we did not name it there as unmasked-marginal scoring; we will make this clearer and link to that section from the main text.
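For readers of this thread, the unmasked-marginal score takes the standard wildtype-marginal form used for masked language models (restated here in our own notation, not quoted verbatim from Eq. 7):

$$
\sigma_{\text{unmasked}} = \sum_{i \in M} \Bigl[ \log p\bigl(x_i = x_i^{\text{mut}} \mid x^{\text{wt}}\bigr) - \log p\bigl(x_i = x_i^{\text{wt}} \mid x^{\text{wt}}\bigr) \Bigr],
$$

where $M$ is the set of mutated positions; unlike the masked-marginal score of Eq. (2), all logits come from a single forward pass over the unmasked wildtype sequence.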
Yes, LFB can indeed be used for scoring indels. We have explored this briefly with Evo 2 using the performance assessment from the original paper: (non-coding indels: baseline 0.913, LFB 0.932) and (coding indels: baseline 0.812, LFB 0.816). We have some reservations about the quality of this performance assessment, which is why we did not include these results in the manuscript. We could also in principle explore this with the ProteinGym indel benchmark and the ProGen2 family. We have not performed this analysis yet, though, as our priority so far has been on assessments which can be used across model families.
Given that performance improves in terms of area under the ROC curve, the intention of this discussion was to explore how this improvement is achieved. There will certainly be some cases where pathogenicity is underpredicted, but overall, in this case, the gain in performance seems to come largely from a reduced tendency to overpredict pathogenicity. That said, we think the situation is more nuanced, and how the distributions shift under LFB also depends on the scoring method. We are tempted to remove this paragraph.
This is a very interesting question. We did some analysis to see whether particular kinds of experimental assays performed worse (Figure F9), and we also looked at the relationship to alignment depth (Figure F10), but found no notable trend. One hypothesis may be that among these proteins, the fitness landscape changed rapidly across evolution, invalidating the assumptions underlying LFB. Any suggestions for ways of checking this, or other possibilities, would be welcome.
The method was developed based on considerations of the data generation process, and hence of why likelihood does not purely reflect fitness. However, we agree that there are plenty of reasons to think the method might extend to hybrid models. In particular, given the nature of the retrieval mechanism used in Tranception and TranceptEVE, it seems very likely that the two approaches are complementary and that further performance gains will come from combining them. The situation with PoET, ProSST and SaProt is less clear and warrants investigation.
I would like to thank the authors for adequately addressing my raised questions - my assigned score will remain at 5.
This work proposes a test-time augmentation scheme called LFB for improving the performance of PLMs (and GLMs) on variant effect prediction. Given the wildtype sequence and a variant sequence characterized by the mutation a -> b at site i of x, to compute the fitness effect of this mutation, the mutation is applied to a set of suitably similar sequences (percent identity threshold >= 30%) to obtain an augmented set of sequences. The fitness effect of the mutation is then taken to be the average fitness effect over this augmented set. LFB gives broad improvements across the ProteinGym clinical and DMS benchmarks, notably making larger PLMs perform better, reversing a trend observed in the literature.
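To check my understanding, the scheme amounts to something like the following sketch (the interface and names are mine, not the authors'; the neutral handling of gaps is my reading of the supplement):

```python
import numpy as np

def lfb_score(score_fn, neighborhood, wt, mut):
    """Average a single-sequence variant score over aligned homologs.

    score_fn(seq, pos, wt, mut) -> float is any per-sequence scorer,
    e.g. a masked-marginal log-likelihood ratio from a PLM.
    neighborhood is a list of (homolog_seq, aligned_pos) pairs, where
    aligned_pos is None if the site falls in an alignment gap.
    """
    scores = []
    for seq, pos in neighborhood:
        if pos is None:
            scores.append(0.0)  # gapped homologs assumed to score neutrally
            continue
        # Insert the reference (wildtype) allele before scoring, so every
        # homolog is evaluated on the same wt -> mut substitution.
        seq_wt = seq[:pos] + wt + seq[pos + 1:]
        scores.append(score_fn(seq_wt, pos, wt, mut))
    return float(np.mean(scores))
```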
Strengths and Weaknesses
Strengths:
- The method seems to work well in practice.
- The method is relatively simple, as it relies on a test-time augmentation scheme. No model retraining is required.
- The effectiveness of LFB will likely prompt further research into why it works so well, and lead to improved variant effect predictors (whether by use of LFB itself or by the insights derived from trying to understand it).
Weaknesses:
- The theoretical explanation provided in the paper seems flaky to me. In section 3.1, LFB is motivated under an OU process. I don't see how the OU process is related at all to the original motivation of addressing phylogenetic structure and sampling biases, as this OU "evolutionary model" doesn't contain any of these effects: there is no phylogeny in the model (as the OU samples x_i are assumed to be iid) and there is no sampling bias either (as the x_i are drawn from the exact stationary distribution). In fact, it seems to me that this OU example is just a glorified instantiation of the straightforward fact that "averaging iid observations produces a lower-variance mean estimate". In particular, it seems to me that you could have just started the example by saying that the x_i are drawn iid from a Normal distribution, and skipped the OU dynamics completely. In the Weinstein paper, an OU process over a tree is considered, which is much more reasonable. In summary, the example presented does not highlight LFB's ability to account for phylogenetic structure and sampling biases -- it just captures the motivation of any generic test-time augmentation method, which is reducing the variance of the model predictions.
- While LFB is supposed to reduce variance (line 123), the distribution of scores shifts noticeably to the right when LFB is applied (Figure 2b). So, clearly, there is more at work than just variance reduction (while still being a TTA scheme). This adds to my criticism that the theoretical explanations are flaky: I think LFB is interesting, but my impression is that the work doesn't really explain what it's doing in practice.
- LFB is a test-time augmentation (TTA) scheme. TTA is standard in other areas like computer vision. There is no mention of this.
- The augmented set includes sequences which do not contain the wildtype allele. I only realized this while reading the supplement. I think this is important and should be mentioned in the main text. My original expectation was that LFB would contain only sequences with the wildtype allele.
Questions
- I would appreciate it if the OU example was explained in more detail. How does it capture any of the issues relating to phylogenetic structure and sampling bias? It seems to me like LFB is just a TTA scheme motivated by variance reduction (as you say in line 123). If this is the case, you can basically remove the section on "A simple model of evolution" since nothing interesting is really going on there.
- Do you agree that Figure 2b, which shows a shift in the scores, is at odds with the equation in line 124 posing LFB as a variance reduction scheme? I don't mind this -- I just think this discrepancy should be mentioned.
- Do you agree that LFB is a form of test-time augmentation? Would it be valuable to mention this in the paper to communicate it more effectively?
Limitations
I feel the work oversells their theoretical explanation. LFB as motivated in the OU example doesn't seem designed to address phylogenetic correlations nor sampling bias, but rather generic model prediction noise (like TTA in computer vision). If this is the case, being clear about this limitation in our theoretical understanding of LFB would be valuable. In particular, the last sentence in the abstract seems to me like it is misrepresenting the contributions of LFB: "These results suggest that accounting for phylogenetic and sampling biases is essential to realizing the full potential of large sequence models in variant effect prediction" -- LFB doesn't seem designed to handle either.
Final Justification
This work is interesting and I think it will lead to a lot of discussion in the field and move things forward. Empirical results are strong. The only reason I do not score the work as a 5 is because I am not quite convinced by the underlying theoretical explanation.
Formatting Issues
No concerns.
Thank you for your thoughtful comments and for drawing the connection between Likelihood-Fitness Bridging (LFB) and test-time augmentation (TTA). We agree there is an interesting parallel, though we also see an important distinction. TTA is typically used in supervised learning settings to reduce variance in predictive outputs due to model noise, whereas LFB operates in an unsupervised setting, where our goal is to extract an estimate of biological fitness from the sequence likelihoods of language models. The model's likelihood is not necessarily noisy, but rather reflects both fitness and other factors, including biologically meaningful structure such as phylogeny. The motivation for LFB is therefore not the removal of generic prediction noise, but rather the extraction of a fitness-related component from the model's likelihood outputs. This is why we chose to describe it as a modified fitness inference approach. In practice, in this work, we model these likelihoods as noisy estimates of fitness, so we recast the problem in a form where averaging predictions, just as in TTA, appears to be well motivated. We will revise the manuscript to make the connection more explicit.
With the OU model example, we aimed to show how LFB could reduce the effects of drift or phylogenetic non-fitness signals on our fitness estimates. Our objective was to lay out a simple model of evolution to build intuition, and therefore, for clarity, we decided to leave out correlations among the sequences, as they do not qualitatively alter the conclusions. But we agree that the lack of phylogenetic correlation structure among the related sequences makes this example less compelling, so we propose to include these correlations, using the same Ornstein-Uhlenbeck tree model as in Weinstein et al. 2022. In the text (L140, and L559 / Eq. 25), we currently have
$$
\text{Var}(\sigma_{\rm LFB}) = \frac{4\delta^2}{n^2} \sum_i \text{Var}(\varepsilon_i) = \frac{2\delta^2 \sigma^2}{\alpha n},
$$
and by adding in the correlations arising due to the phylogenetic structure in the OU tree process we have
$$
\begin{aligned}
\text{Var}(\sigma_{\rm LFB}) &= \frac{4\delta^2}{n^2} \Bigl( \sum_i \text{Var}(\varepsilon_i) + \sum_{i\neq j}\operatorname{cov}(\varepsilon_i, \varepsilon_j) \Bigr) \\
&= \frac{2\delta^2 \sigma^2}{\alpha} \Bigl( \frac{1}{n} + \frac{1}{n^2} \sum_{i\neq j}\exp(-\alpha\, t_{i,j}) \Bigr) \\
&< \frac{2\delta^2 \sigma^2}{\alpha} = \text{Var}(\sigma_{\rm LL}),
\end{aligned}
$$
where $t_{i,j}$ is the distance between sequences $i$ and $j$ on the phylogenetic tree. The reduction in variance is bounded by the contribution of the correlations, $\frac{1}{n^2}\sum_{i\neq j}\exp(-\alpha\, t_{i,j})$. In practice, we choose sequences which provide less correlated predictions; in other words, we choose well-spread-out representatives across the tree by using the clustered/redundancy-reduced UniRef50 database. We also saw that including sequences at larger Hamming distances from the reference sequence improved performance (Figure F2), although there appears to be a tradeoff between including distant, less correlated sequences and not including sequences which are too distant, possibly within different local fitness maxima (a breakdown of the assumption that the sequences are drawn from an OU tree process with one maximum). We hope that in this way the connection with phylogeny is clearer in the example. As you suggest, this example doesn't rule out other explanations but rather presents how LFB would mitigate the effects of genetic drift on fitness prediction in this scenario.
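This inequality is also easy to check numerically; below is a self-contained toy simulation (not part of our experimental code) in which tree distances are taken as distances between points on a line, which guarantees a valid OU covariance:

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha, sigma, delta = 20, 1.0, 1.0, 0.1

# Toy tree distances: placing sequences on a line makes
# exp(-alpha * t_ij) a valid (positive-definite) covariance.
u = rng.uniform(0.0, 3.0, size=n)
t = np.abs(u[:, None] - u[None, :])
cov = (sigma**2 / (2 * alpha)) * np.exp(-alpha * t)

# Correlated OU noise on each sequence's score; the single-sequence
# estimator carries noise 2*delta*eps_i, while LFB averages over all n.
eps = rng.multivariate_normal(np.zeros(n), cov, size=200_000)
var_single = np.var(2 * delta * eps[:, 0])      # ~ 2 delta^2 sigma^2 / alpha = 0.02
var_lfb = np.var(2 * delta * eps.mean(axis=1))  # strictly smaller
print(var_single, var_lfb)
```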
On the shift in scores on Figure 2b:
We show similar distributions in Figure F6, with masked-marginal scoring on the top row and unmasked-marginal scoring on the bottom row (as included in Figure 2b). Although there is a systematic improvement in the separation between benign and pathogenic variants for LFB with masked-marginal scoring (improved AUC), the overall shift in distributions is not present. This shift appears to be specific to unmasked-marginal scoring: scores are lower because the likelihoods of non-reference amino acids are lower when the reference amino acids are not masked during scoring. The usefulness of unmasked logits has been explained in Gordon et al. 2024, and we also provide an analysis using masked-marginal scoring in Figure F3. We will clarify this point in the manuscript.
On "The augmented set includes sequences which do not contain the wildtype allele":
As specified at L117 and in Alg. 1, as well as in Figure 1, the sequences used in LFB contain the reference allele of the query sequence inserted at the studied position. We will draw more attention to this around L117.
Thanks to the authors for your response to my questions.
I would need to think more about the extent to which LFB is handling phylogenetic correlations and sampling biases (instead of just density estimation noise). I agree the use of clustering/redundancy-reduced data does a lot in this direction. That's pretty much the standard in the field AFAIK, and IMO not quite a principled approach, e.g. identity cutoffs have always seemed arbitrary to me (a method with no cutoffs and full use of data would be great).
I continue to think this is an interesting paper that will lead to significant discussion and follow up work. I still recommend acceptance and raised my confidence from 3 to 4. The reason I do not rate it as 5 is because I am still not fully convinced by the underlying theory -- it still seems to me like this could just be a variance reduction mechanism as in a traditional TTA scheme.
The paper proposes a "Likelihood-Fitness Bridging" (LFB) strategy to improve the prediction of variant effects in protein and genomic language models by averaging model scores across sequences under similar selection pressures. The authors find that the likelihood estimates of existing models are affected by phylogenetic structure and sampling biases, leading to a gap with biological fitness. Motivated by an Ornstein-Uhlenbeck evolutionary model, LFB does not require retraining the model and reduces estimation variance through post-processing alone. Experiments show that LFB significantly improves the performance of protein language models such as ESM-2 and ProGen2, as well as the Evo 2 genomic model. In particular, it reverses the performance plateau of large models, with the largest models achieving the best accuracy.
Strengths and Weaknesses
Strengths:
- This paper presents a novel perspective on the gap between fitness and likelihood, highlighting potential systematic biases in protein language model training, a topic that is seldom addressed in related work.
- The authors conduct thorough experiments to demonstrate how these biases hinder pLM performance and how the proposed LFB estimator can effectively mitigate these limitations.
Weaknesses:
- As noted in the limitations section, the sampling biases are not explicitly modeled, which leaves a gap in the proposed approach. Additionally, the methodology for selecting related sequences is somewhat unclear and could benefit from further clarification. Thirdly, it is suggested that the authors clarify whether LFB can still achieve good performance for sequences lacking homologous information.
- Although the method is computationally lightweight, would introducing MSAs increase computational time? It is suggested that the authors add an analysis comparing the computational time of LFB and fine-tuned language models.
Questions
- How are "related sequences" defined in the LFB framework? While the authors mention using the Zoonomia 447-way alignment, it would be helpful to clarify whether there are specific criteria for including or excluding certain species to enhance performance or interpretability. How sensitive is LFB to the size and composition of the selected sequence neighborhood? Could poorly aligned or divergent neighbors introduce new biases? Would integrating structure-based neighborhoods (e.g., using FoldSeek) instead of MSAs improve robustness for low-homology proteins?
- How does the method handle cases where the mutation site in a related sequence does not align with the reference nucleotide in the human sequence? There must be some filtering or alignment criteria - could the authors elaborate on this?
- The performance gain for the Evo 2 gLMs was more modest, and the scaling trend was not reversed. Could this be related to the scope of homologous sequences used? For gLMs, the analysis relied on a 447-way mammalian alignment from Zoonomia, whereas the protein models drew from the much broader UniRef50 database, which spans the entire tree of life. Could a more taxonomically diverse genomic alignment potentially unlock larger gains for gLMs?
- The LFB estimator uses a simple, unweighted average of log-likelihood scores across all selected homologs. Have the authors considered a weighted average? For instance, weighting sequences based on their evolutionary distance from the reference sequence or by their model perplexity score could potentially yield a more refined and robust fitness estimate.
- The authors should provide access to code so that other researchers can reproduce the results.
Limitations
Yes
Final Justification
The authors have addressed my questions in the rebuttal.
Formatting Issues
No
We are very grateful for your thorough review and interesting questions on the work. We address your questions and comments below:
Regarding not addressing sampling bias:
While we haven't explored the sampling bias theoretically, we believe that the strategy of averaging over sequences under the same selective pressures should in principle remove multiple sources of noise in the fitness predictions – both the effects of phylogeny, which we explore more systematically, and other effects arising from over- or under-representation of certain sequences in the database. Our subsampling analysis (Fig. 4a) seems to point in this direction, in the sense that the approach delivers performance improvements even in low-sample regimes. To what extent exactly LFB handles the effects of sampling bias is an interesting question for future work.
Regarding computational times of including MSAs and comparison to finetuning:
While LFB does require retrieving homologous sequences via MMseqs2, this step is relatively efficient: a single search against UniRef50 takes approximately 20 seconds on 20 CPU cores and can be reduced to ~1 second using GPU-accelerated MMseqs2 (Kallenborn et al., 2024). This cost is incurred once per input sequence and is trivially parallelizable across a large dataset. Comparing this to the cost of running a protein language model – for a typical protein of length 700, a single forward pass takes approximately 1.0 seconds with ESM-15B and 0.03 seconds with ESM-650M. Consequently, scoring substitutions across 10 homologs can range from 0.3 seconds (e.g., one-site scoring with ESM-650M: 0.03s × 10) to up to 133,000 seconds in the worst-case scenario (full mutagenesis of a 700-residue protein using ESM-15B: 700 × 19 substitutions × 10 homologs × 1s).
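In sketch form, using the approximate timings quoted above (illustrative figures, not benchmarks):

```python
# Approximate per-forward-pass timings for a length-700 protein (seconds).
t_650m, t_15b = 0.03, 1.0
n_homologs = 10

one_site_650m = t_650m * n_homologs            # ~0.3 s
full_dms_15b = 700 * 19 * n_homologs * t_15b   # ~133,000 s (worst case)
print(one_site_650m, full_dms_15b)
```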
If we were to estimate the computational cost of a fine-tuned model trained on sequences similar to the query, a comparable MMseqs2-style search would still be required to construct the training set, in addition to the cost of training the model and performing inference. The precise cost would depend on the training configuration and hardware, and thus a direct comparison would require additional assumptions.
We note, however, that fine-tuning strategies involve limitations beyond computational expense. As discussed in Gordon et al. (2024), they require careful design to prevent overfitting, including curated training data, tuning schedules, and model selection. In contrast, LFB alters only the inference procedure and keeps the underlying model fixed, avoiding these risks while remaining broadly applicable and computationally lightweight.
Question 1.
Regarding the definition of “related sequences”: The definition and composition of “related sequences” in the LFB framework depend on the underlying model type (protein language model vs. genomic language model), as discussed in Section 3.2 and Appendix D.1. This reflects both the distinct practical challenges of building alignments for proteins vs. genomes and the data currently available for doing so.
For pLMs, we construct multiple sequence alignments (MSAs) using MMseqs2 searches against UniRef50. We define the sequence neighborhood by selecting homologous sequences with at least 30% sequence identity to the reference. This threshold was chosen based on an empirical analysis of LFB's sensitivity to neighborhood size, as shown in Figure F2. Furthermore, we demonstrate in Figure 4a that LFB is robust to subsampling: selecting as few as ~10 random homologs is sufficient to recover nearly the full performance gain provided by the complete sequence neighborhood. Motivated by this, for gLMs, we randomly chose 9 species from the Zoonomia 447-way alignment (in addition to the human sequence for each gene).
Regarding sensitivity to composition and size: Overall, we find minimal sensitivity to the composition of the neighborhood (as seen in the subsampling experiment, Figure 4a), and, as mentioned, we studied the overall sensitivity to the size of the neighborhood in Figure F2. That said, we are very interested in the fact that the ideal neighborhood for each individual gene is not always at 30% identity, and see this as an exciting avenue for improving LFB in future.
Regarding sequences with poorer homology: We studied the sensitivity of LFB to the size of the MSA in Figure F10, and found minimal sensitivity, but we agree it is important to clarify how LFB behaves when homologous information is absent. LFB is designed to fall back to the base pretrained language model (pLM) predictions when no homologous sequences are available. In such cases, no averaging is applied, and the output corresponds to the standard pLM marginal scores. While LFB does not improve upon the base model in the absence of homologs, it does not degrade performance either.
The challenge of poorly aligned regions is a great point. We would be interested to see whether structure-based alignment techniques could improve LFB in these cases, though in the hardest to align cases such as disordered regions they would unlikely bring an advantage.
Question 2.
In cases where the mutation site does not align to the reference, i.e., where there is a gap in the alignment, we take the score for that sequence at that position to be neutral (i.e., zero), as described in Appendix D.3.
Question 3.
This is a great point of discussion that we would like to include in the text. In contrast to the protein alignments, we only consider sequences across mammals for the gLM. This could limit the benefits of LFB, as we saw when filtering the protein alignments to overly local neighborhoods (Figure F2). It is exciting that the data needed to explore this idea in more detail will become available in the near future through the Earth Biogenome Project.
Question 4.
This is an interesting point. Rather than reweighting, we have focused on strategies to identify the relevant sequence neighborhood for LFB; we have approached it from the perspective of filtering based on distance to the reference sequence (or identity, see Figure F2) and model perplexity (Figure 4b). In both cases we see that filtering improves performance, suggesting that reweighting schemes might also be a fruitful approach (though random subsampling shows low sensitivity to the actual composition of the averaging within such a sequence neighborhood, Figure 4a).
Question 5.
We will make the code publicly available in a clean and reusable format upon conclusion of the review process, as we are not allowed to update GitHub repos during review, and we are keen for the method to be adopted. For now, we have provided the code for peer review in the Supplementary Materials.
Thanks for the explanation and conducting additional experiments. I will raise the score accordingly.
This paper proposes "Likelihood-Fitness Bridging (LFB)", which improves the accuracy of variant effect prediction by averaging model scores across sequences undergoing similar selective pressures. This method can be directly applied to existing protein and genomic language models without requiring retraining. The authors further experimentally validate the effectiveness of their approach.
Strengths and Weaknesses
Strengths:
(1) This paper proposes "Likelihood-Fitness Bridging (LFB)", which improves the accuracy of variant effect prediction by averaging model scores across sequences undergoing similar selective pressures.
(2) The authors further experimentally validate the effectiveness of their approach.
Weaknesses:
Firstly, it should be noted that my research area is not directly related to this paper, and the following comments are provided for reference only and to assist in improving the manuscript.
(1) The paper lacks a clear explanation of its motivation. In the abstract, the authors state that "variant effects are typically estimated by comparing sequence likelihoods, assuming that likelihood serves as a proxy for fitness" and that this assumption often fails in practice. However, the paper does not provide sufficient experimental evidence to support the claim that this assumption indeed breaks down in real-world applications. It is recommended that the authors include additional analysis or experiments to clarify the specific reasons behind the failure of this assumption, thereby strengthening the motivation of the work.
(2) Table 1 reports unexpectedly lower performance for the 15B model compared to smaller-scale models. Could the authors clarify this counterintuitive observation?
(3) The methodology presented by the authors seems straightforward. To enhance reproducibility, we recommend making the implementation code available for peer review.
Questions
Please see the above weakness.
Limitations
Yes
Final Justification
The author's response addressed my concerns. I will raise my score.
Formatting Issues
None
Thank you for your thoughtful feedback and for engaging with the paper, although the topic lies outside your main area of expertise. Ensuring the work is accessible and well-motivated to researchers from adjacent fields is an important goal for us.
(1) Motivation and the fitness-likelihood gap: We believe the manuscript already lays out both the conceptual and empirical rationale, though we agree this can be explained more clearly. Specifically, in the introduction, we describe how likelihoods from generative sequence models are now used across fields to estimate fitness (a proxy for biological viability: a quantitative measure of how well a protein variant is expected to function or be tolerated in a biological system), and outline why this assumption may not always hold (due to phylogenetic structure and sampling artifacts, L30). We then discuss a significant piece of empirical evidence for a gap between likelihood and fitness: as the field fits better density estimators on datasets of biological sequences (obtaining better sequence reconstructions on validation-set sequences), they do not perform any better, and sometimes indeed worse, on fitness effect prediction tasks (L36). Examples of this can be seen in Figures 2, 3, and F1, as well as in Table 1. This phenomenon has been noted across multiple studies (e.g., Nijkamp et al. 2023, Gordon et al. 2024, Hou et al. 2025), and motivates the need for a method like LFB. It directly presents a problem in practical applications such as predicting disease-causing variants (Figure 2, Sec. 4.2).
Section 2.3 is then devoted to exposing this gap between fitness and likelihood, which our approach aims to mitigate. We provide further evidence that likelihood reflects more than fitness alone through references which deal with this issue explicitly: in particular, Steinhardt et al. 2024 cover species compositional bias in likelihoods, Weinstein et al. 2022 cover the gap between fitness and likelihood introduced by phylogenetic structure, and Gordon et al. 2024 and Hou et al. 2025 cover the worsening performance of bigger pLMs on fitness estimation.
(2) Performance of the 15B model in Table 1: In Table 1, the relatively poor performance of the 15B model is not an anomaly, but rather a known and pressing problem for the field, as discussed above – larger models are not necessarily better for fitness prediction, despite their improved likelihood estimation. This counterintuitive trend is a central motivation for our work, and one that LFB is designed to address.
(3) Code availability: Reproducibility and adoption of the method is very important for us. The current implementation supporting our experiments is included in the Supplementary Materials for peer review. We are not allowed to update the repository during the review window, but will make the code publicly available in a clean and reusable format upon conclusion of the review process.
Thank you for your reply, which addressed my concerns. I will raise my score.
This submission addresses the empirical paradox that larger protein language models with lower perplexity obtain poorer zero-shot variant effect prediction performance. The authors find that by applying mutations to homologs and averaging predicted effects, the performance drop can be erased. Reviewers generally appreciated the significance of these results and their ability to inspire discussion within the community. However, reviewers raised concerns about the theoretical justification provided by the authors. Authors were able to partly resolve these concerns in the discussion phase, raising 2 out of 4 scores. I lean towards acceptance due to the significance of the results, although I concur that the theoretical justification is not fully convincing.