Ultrafast classical phylogenetic method beats large protein language models on variant effect prediction
Abstract
Reviews and Discussion
The paper explores a method for estimating the transition matrix from multiple sequence alignments (MSAs). It utilizes a phylogenetic tree model, which is parameterized by the transition matrix and site rates. To estimate the transition matrix via maximum likelihood, an alternating optimization method is employed: first, the model parameters are fixed, and a set of phylogenetic trees is constructed from the MSAs. Then, given the MSAs, the trees, and the site rates, the transition matrix is estimated by maximizing the likelihood of the data given the tree.
A recent method, CherryML, speeds up this process by replacing the likelihood of the data given the tree with the composite likelihood of the cherries (pairs of adjacent leaves) given the tree. This likelihood can be simplified using time-reversible models and can be learned from the tree, MSA, and leaf sequences.
Building on the concepts of CherryML, this paper introduces a new method called FastCherries, which further simplifies the composite likelihood estimation by eliminating the need for tree construction. Instead, it focuses on creating disjoint pairs of similar sequences and estimating the composite likelihood from these pairs. Additionally, the paper extends the original WAG phylogenetic model by considering per-site transition matrices, with all parameters estimated using maximum likelihood.
The authors demonstrate the significance of their work through two different evaluation tests. The first test evaluates the speedup achieved by FastCherries compared to CherryML. The second test examines variant prediction performance on ProteinGym, comparing their method to baseline approaches such as ESM-1v and ESM-IF1.
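The pairing step summarized above can be illustrated with a small sketch: greedily match each sequence to its most similar unpaired partner under Hamming distance. The function name `greedy_pairing` and the quadratic nearest-neighbor search are our simplifications for illustration; the actual FastCherries pairer is reported to run in near-linear time.

```python
def hamming(a: str, b: str) -> int:
    """Number of differing positions between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b))

def greedy_pairing(seqs: list[str]) -> list[tuple[int, int]]:
    """Greedily partition sequences into disjoint pairs of similar sequences.

    Illustrative O(n^2) version: repeatedly take the lowest-index unpaired
    sequence and pair it with its nearest unpaired neighbor.
    """
    unpaired = set(range(len(seqs)))
    pairs = []
    while len(unpaired) > 1:
        i = min(unpaired)
        unpaired.remove(i)
        j = min(unpaired, key=lambda k: hamming(seqs[i], seqs[k]))
        unpaired.remove(j)
        pairs.append((i, j))
    return pairs
```

Each resulting pair then plays the role of a cherry when maximizing the composite likelihood.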
Strengths
The paper addresses an important application problem, though its relevance to the machine learning community may be limited. It is noteworthy that a phylogenetic model can achieve results comparable to, and sometimes slightly better than, pretrained PLMs in a specific use case. The paper is well-written, providing careful explanations of technical points that require a background in protein science.
Weaknesses
The idea of simplifying the trees by partitioning the MSA into pairs of similar sequences is an approximate representation of the cherries in the tree. Since the partitioning algorithm is a greedy one, it may overlook the full structure of the MSA, reducing the relationships between sequences to binary relations between pairs. That also explains why CherryML with FastCherries loses accuracy compared to CherryML with FastTree in Figure 1b.
This work represents an incremental improvement over the original CherryML proposal, and therefore, its novelty is somewhat limited. Additionally, the results demonstrated in the variant prediction benchmark show only marginal improvements compared to language models. For a NeurIPS paper, I would expect more significant advancements, both in methodology and in the results presented.
For the machine learning community, it is not sufficient to demonstrate that one method is better than another through benchmark results alone. A deeper analysis explaining why the phylogenetic model performs better than language models in certain benchmarks and when it fails to do so would provide valuable insights for the community to learn from.
Questions
- How could the authors provide a deeper analysis of why the phylogenetic model performs better than language models in certain benchmarks?
- Could you please explain why, in Figure 1b, the error rate seems to increase for CherryML as the number of families increases?
Limitations
NA
Thank you for your thoughtful review. Please find our response below:
The idea of simplifying the trees by partitioning the MSA into pairs of similar sequences is an approximate representation of the cherries in the tree. Since the partitioning algorithm is a greedy one, it may overlook the full structure of the MSA, reducing the relationships between sequences to binary relations between pairs. That also explains why CherryML with FastCherries loses accuracy compared to CherryML with FastTree in Figure 1b.
The whole purpose of developing FastCherries is to avoid doing expensive computations while assuring little reduction in accuracy. It is a worthwhile tradeoff that can benefit applications requiring scalability, such as the one considered in our manuscript. Indeed, for the variant effect prediction task, it can be seen in Supplementary Figure S1 that FastTree is too slow to scale up to the full dataset size. In contrast, FastCherries provides no loss in performance and can crunch all the available data, yielding the best results.
This work represents an incremental improvement over the original CherryML proposal, and therefore, its novelty is somewhat limited.
We respectfully disagree with this view. As our results clearly demonstrate (Figure 1, Supplementary Figure S1, and Supplementary Table S1) and as the other reviewers have noted, the performance of CherryML with FastCherries is comparable to that of CherryML with FastTree while being one to two orders of magnitude faster. Furthermore, we are proposing a novel method (SiteRM) to estimate site-specific rate matrices and this framework has the potential to advance phylogenetic inference significantly; software such as IQTree and its partition model puts this application well within reach. When compared to the seminal LG model, SiteRM provides a substantial improvement in variant effect prediction, as seen in Supplementary Table S1, suggesting that it can model site-specific evolutionary constraints more accurately than previous approaches.
For the machine learning community, it is not sufficient to demonstrate that one method is better than another through benchmark results alone. A deeper analysis explaining why the phylogenetic model performs better than language models in certain benchmarks and when it fails to do so would provide valuable insights for the community to learn from.
How could the authors provide a deeper analysis of why the phylogenetic model performs better than language models in certain benchmarks?
The improved performance of our method comes from conditioning on the wildtype to obtain a local fitness function. Please see our response to Reviewer 2 for more discussion of the intuitions. We plan to make this clearer in the final version. We do agree with the reviewer that deeper error analysis of large language models would be a valuable research direction. Ultimately, we believe that the best variant effect predictors will combine our evolutionary modeling with protein language modeling.
Could you please explain why, in Figure 1b, the error rate seems to increase for CherryML as the number of families increases?
The fact that the (rounded) error rate is 1.7 at one of the two largest numbers of families and 1.8 at the other is just a small sampling effect. At these larger sample sizes, the statistical risk of the estimator has converged to its bias term.
Thank you for your response. I still have concerns about reducing the dependency to pairs, as this seems like a rough approximation. Even though empirical results indicate only a small reduction in accuracy for certain test cases, this approach should be supported by a theoretical analysis. Specifically, it should demonstrate which types of data distributions result in an acceptable loss of accuracy on average, as well as the extent of the accuracy loss in the worst-case scenario. Therefore, I would prefer to maintain my score as it is.
While we agree that theoretical results would be nice, the matrix exponential is an unwieldy object and therefore deriving quantitative theoretical results regarding the asymptotic bias or sample complexity of our method is challenging. These challenges are not unique to our work but rather true of the whole field of statistical phylogenetics which relies on such continuous-time Markov chain models, such as the seminal work of Whelan and Goldman or that of Le and Gascuel (neither work provides theoretical results). Following best practice in the field, we showcased our work on a variety of standard simulated and real-data benchmarks — as well as the novel variant effect prediction benchmark — where we showed that our method has comparable or superior performance at a fraction of the computational cost.
The paper proposes a fast method for phylogenetic estimation from MSAs called FastCherries, which significantly speeds up the computational process while maintaining high accuracy. The method was demonstrated to be orders of magnitude faster than existing methods while achieving similar statistical efficiency. Further, the authors propose the SiteRM model for variant effect prediction, which also outperforms existing models, particularly large language models.
Strengths
- The proposed methods are significantly faster than existing methods and scale to large sets of MSAs. Meanwhile, the methods maintain a high level of accuracy, making a good contribution to the field of protein evolution.
- The methods achieve superior performance in variant effect prediction compared to large protein language models, which is especially valuable when GPU resources are limited.
- The methods were tested rigorously on both simulated and real data.
Weaknesses
- The paper lacks a comparison to other methods in terms of computational resources, such as memory usage. Could the author provide some results?
- The model is still highly parameterized. Can the SiteRM model avoid overfitting with extensive datasets?
- How does the method perform with datasets that have a large number of gaps in MSAs (for example, the sequence has very low homology)?
- Can the methods be extended to incorporate other types of biological data, such as DNA/RNA sequence or structural alignment?
Questions
See above.
Limitations
Yes, the limitations are well described.
Thank you for your thoughtful review. We respond below:
The paper lacks a comparison to other methods in terms of computational resources, such as memory usage. Could the author provide some results?
Our method uses linear space; the logarithmic factors in the computational runtime come from divide-and-conquer and binary searches. This makes our method space efficient (up to a constant). We will discuss this and add practical results on memory usage in the final version. In more detail, the space usage decomposes into the following parts. The precomputed matrix exponentials are shared across all MSAs, and one MSA at a time is loaded into memory. The pairing step stores the distances between a sequence and all other sequences, plus the stack of recursive calls from the divide-and-conquer. Computing the initial site rates requires a count matrix recording the number of times each character occurs at each position. Finally, computing site rates given branch lengths, and branch lengths given site rates, requires tracking only the best rate or best branch length for the current site.
The model is still highly parameterized. Can the SiteRM model avoid overfitting with extensive datasets?
As described in Section 3.3 of our manuscript, we avoid overfitting by using pseudocounts drawn from the LG model, which serves as a prior. For the results described in the paper, we find that mixing the pseudocounts with the data at a 1:1 ratio works well in practice, but it should be possible to tune the regularization coefficient using cross-validation. Other options we did not explore that would benefit from large datasets would be some form of weight-sharing of the rate matrices across different positions.
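The 1:1 pseudocount mixing described above can be sketched as follows. The function name `regularized_counts` and the exact way the prior counts are rescaled to match the data mass are our assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def regularized_counts(data_counts, prior_counts, mix=0.5):
    """Mix observed substitution counts with pseudocounts from a prior model.

    data_counts: matrix of observed substitution counts at one site.
    prior_counts: pseudocounts derived from a general model (e.g. LG),
        rescaled here to the same total mass as the data.
    mix=0.5 corresponds to a 1:1 data-to-pseudocount ratio.
    """
    prior_scaled = prior_counts * (data_counts.sum() / prior_counts.sum())
    return (1 - mix) * data_counts + mix * prior_scaled
```

A larger `mix` pulls the per-site estimate toward the prior, which is what prevents overfitting at sites with few observed substitutions; `mix` plays the role of the regularization coefficient one could tune by cross-validation.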
How does the method perform with datasets that have a large number of gaps in MSAs (for example, the sequence has very low homology)?
The QMaker "insect" dataset is especially notorious for containing a large number of gaps. We observe a similar performance to FastTree in Supplementary Figure S1.
Can the methods be extended to incorporate other types of biological data, such as DNA/RNA sequence or structural alignment?
Indeed, by using compound states one can model the joint evolution of amino acids and structural states. Our software is not limited to amino acid alphabets and can be applied to learn site-specific rate matrices for other kinds of molecular data such as DNA/RNA or structural states. We plan to pursue this in future work.
Thanks to the authors for their detailed rebuttal. After reviewing the responses to my questions, I checked the answers provided and decided to keep my score. This paper is suitable for acceptance.
The authors devise a fast and accurate phylogenetic inference algorithm based on CherryML. Its scalability allows them to fit flexible models to large protein families, which was prohibitively expensive for previous methods. As an example, they estimate site-specific substitution rates and use these rates to predict the effects of mutations. This method of mutation effect prediction is competitive with some state-of-the-art generative modeling methods. This suggests scalable and accurate phylogenetic inference as a potentially important component of future mutation effect prediction methods.
Strengths
The writing is very clear. In particular, the authors describe CherryML very clearly.
The authors' method is efficient and accurate (Figure 1).
Weaknesses
The authors state "The first probabilistic model of protein evolution was proposed by Whelan and Goldman". It would strengthen the paper to better cite the models that Whelan and Goldman were inspired by.
I would appreciate the author comparing to more state-of-the-art mutation effect predictors.
Questions
How does this work relate to the conclusions of Weinstein, Eli N., Alan N. Amin, Jonathan Frazer, and Debora S. Marks. 2022. “Non-Identifiability and the Blessings of Misspecification in Models of Molecular Fitness and Phylogeny.” Advances in Neural Information Processing Systems, December.?
How do you parameterize the substitution matrix?
Limitations
Addressed.
Thank you for your thoughtful review. Please see our response below:
The authors state "The first probabilistic model of protein evolution was proposed by Whelan and Goldman". It would strengthen the paper to better cite the models that Whelan and Goldman were inspired by.
Thank you. We will cite the earlier models in the final version, in particular the Dayhoff and JTT models.
I would appreciate the author comparing to more state-of-the-art mutation effect predictors.
In Supplementary Table S1, we show results including all variant effect predictors from the ProteinGym paper.
How does this work relate to the conclusions of Weinstein, Eli N., Alan N. Amin, Jonathan Frazer, and Debora S. Marks. 2022. “Non-Identifiability and the Blessings of Misspecification in Models of Molecular Fitness and Phylogeny.” Advances in Neural Information Processing Systems, December.?
While we agree with the theoretical results presented in the above paper, our work challenges the assumption in their theorems that there exists a unique, "global" fitness function f. Instead, our intuition is that the fitness function is contextual, i.e. it is different in different parts of the evolutionary tree. For example, in some subtree, a positively charged amino acid may be required at a given position, while in another subtree a negatively charged one may be required. This may be because of the different contexts the protein is evolving in, due to e.g. protein-protein interactions. Our approach to variant effect prediction captures this intuition by using the "local" fitness function at each point in the tree. We do not think that the theoretical results in the referenced paper necessarily explain our strong performance on variant effect prediction. Instead, we believe that it is the use of "local" fitness functions that is leading to our improved performance. There is certainly exciting theoretical and empirical work to be done in this direction.
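The "local" fitness idea can be illustrated with a small sketch: given a site-specific rate matrix, condition on the wildtype state and score a mutant by its transition probability. The function `local_variant_score` and the log-ratio scoring rule below are our illustrative choices and may differ from the paper's exact rule.

```python
import numpy as np
from scipy.linalg import expm

def local_variant_score(Q, wt_idx, mut_idx, t=1.0):
    """Score a variant at one site by conditioning on the wildtype residue.

    Given a site-specific rate matrix Q, the transition kernel
    P(t) = expm(Q * t) gives the probability of evolving from the wildtype
    state to each alternative over time t. We score the mutant by the
    log-ratio against remaining at the wildtype.
    """
    P = expm(Q * t)
    return np.log(P[wt_idx, mut_idx]) - np.log(P[wt_idx, wt_idx])
```

The key point is that the score depends on the row indexed by the wildtype, so the same rate matrix yields different rankings for different starting sequences; a "global" fitness view would ignore this conditioning.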
How do you parameterize the substitution matrix?
We use the parameterization from the CherryML paper: the off-diagonal elements of $Q$ are given by $Q_{ij} = S_{ij}\pi_j$, where $\pi = \mathrm{softmax}(w)$ for an unconstrained vector $w$ (here $\pi$ will be the stationary distribution of $Q$), and $S$ is a symmetric matrix given by $S = A + A^\top$, where $A$ is an unconstrained upper triangular matrix. This unconstrained parameterization allows CherryML to use unconstrained first-order optimizers to quickly estimate $Q$, as implemented by libraries such as PyTorch and TensorFlow.
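A reversible parameterization of this kind can be sketched in NumPy as follows. Where the text is ambiguous, the details here are our assumptions: we take the stationary distribution as a softmax of an unconstrained vector, build a symmetric matrix from an unconstrained upper-triangular part, and exponentiate the upper-triangular entries to keep the rates positive.

```python
import numpy as np

def build_rate_matrix(w, a_upper):
    """Build a reversible rate matrix Q from unconstrained parameters.

    w: unconstrained vector; pi = softmax(w) is the stationary distribution.
    a_upper: unconstrained strictly-upper-triangular entries; exponentiated
        here (our assumption) so that all off-diagonal rates are positive.
    Off-diagonal Q_ij = S_ij * pi_j with S symmetric; diagonals are set
    so that every row sums to zero.
    """
    n = w.shape[0]
    pi = np.exp(w - w.max())
    pi /= pi.sum()
    A = np.zeros((n, n))
    A[np.triu_indices(n, k=1)] = np.exp(a_upper)
    S = A + A.T
    Q = S * pi[None, :]              # Q_ij = S_ij * pi_j for i != j
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, -Q.sum(axis=1))
    return Q, pi
```

By construction the chain satisfies detailed balance, pi_i Q_ij = pi_j Q_ji, so pi is indeed the stationary distribution, and any gradient with respect to `w` and `a_upper` remains unconstrained.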
It seems I maybe didn't understand the parameterization of Q. I thought there was a single substitution matrix learned for each site that was constant across the tree, or "local". Is this not the case? If not, could you point me to where in the paper you describe learning a substitution matrix that changes across the tree?
We apologize for the confusing statement. We are indeed estimating a single rate matrix for each site, and it is assumed to be constant along the tree. By "local" fitness function, we meant that the distribution is conditioned on the wildtype; i.e., we are explicitly modeling the probability of transitioning from a given wildtype sequence.
It seems to me you're still supposing a global fitness function but evolving according to it for a short time. It would be interesting to compare your method with evolution for different amounts of time. However, this doesn't have to do with the soundness of the paper. I keep my score and recommend accept.
This paper introduces a new method for estimating amino acid substitution rate matrices from multiple sequence alignments, speeding up computation by orders of magnitude. The method, called SiteRM, outperforms traditional methods and large protein language models in variant effect prediction, showing its speed and accuracy in evolutionary biology.
Strengths
- the proposed method to calculate amino acid substitution rate matrices is designed to be computationally efficient, with a near-linear runtime while maintaining comparable performance. This efficiency enables the analysis of extremely large MSAs, making it suitable for high-throughput applications.
- SiteRM has shown superior performance in variant effect prediction compared to large protein language models that incorporate complex residue-residue interactions, which can be attributed to conceptual advances in the probabilistic treatment of evolutionary data.
- by estimating site-specific rate matrices for each column of a multiple sequence alignment, SiteRM captures the evolutionary dynamics at a finer resolution, which allows for a more accurate assessment of the impact of variants on protein function.
- SiteRM can deal with large datasets with millions of sequences, which is particularly useful for handling the vast amount of data generated in clinical and deep mutational scanning studies, where comprehensive variant effect prediction is crucial.
Weaknesses
- while the paper demonstrates the effectiveness of SiteRM in variant effect prediction, further exploration of its applicability to other evolutionary biology tasks or datasets could further understand its capabilities and limitations.
- the paper lacks detailed information on the benchmarking process, including the datasets used and potential biases in the evaluation; providing this would improve the transparency and reproducibility of the results.
Questions
- it would be better to provide a more vivid and more understandable method to explain the provided algorithm, such as drawing some diagrams or flowcharts of the pipeline.
- although the provided method improves the end-to-end runtime of the process, its performance drops slightly compared with the original method. Could you please list some possible reasons for the performance decrease, and some possible solutions?
Limitations
- the performance of the method may vary across different protein families or evolutionary contexts. It would be better to assess the generalizability of the approach to diverse datasets and evolutionary scenarios.
- the accuracy of the estimated rate matrices and predictions might depend on the quality of the input multiple sequence alignments. Are there any ways to address potential biases or errors in the MSAs to improve the robustness?
- the benchmark used to evaluate the performance may have limitations or biases. It would be better to compare the performance on other benchmarks.
Thank you for your detailed review. We respond below:
while the paper demonstrates the effectiveness of SiteRM in variant effect prediction, further exploration of its applicability to other evolutionary biology tasks or datasets could further understand its capabilities and limitations.
We agree and plan to explore other applications of SiteRM such as improving phylogenetic tree inference in future work. Software such as IQTree and its partition model puts this application within reach.
the paper lacks detailed information on the benchmarking process, including the datasets used and potential biases in the evaluation; providing this would improve the transparency and reproducibility of the results.
We focused on standard benchmarks from prior work (CherryML and ProteinGym) with the hope that this maximizes transparency and reproducibility. We also provided code with detailed instructions to reproduce all our results. Nonetheless, we agree that our work is not self-contained and more details on the benchmarks could be included. We plan to incorporate these more detailed descriptions of the benchmarks into the final version.
it would be better to provide a more vivid and more understandable method to explain the provided algorithm, such as drawing some diagrams or flowcharts of the pipeline.
This is a great suggestion. The closest we currently have to this is the runtime analysis in Appendix A.4.2. We will include a more graphical depiction of the algorithm in the final version.
although the provided method improves the end-to-end runtime of the process, its performance drops slightly compared with the original method. Could you please list some possible reasons for the performance decrease, and some possible solutions?
FastCherries shows a small asymptotic bias due to the use of Hamming Distance (HD) in the pairing step. We explored other alternatives during the project which we decided not to include in the paper, such as (1) pairing based on maximizing the composite likelihood of the pairing, (2) pairing based on minimizing the MLE distance between pairs, and (3) even a random pairer. The random pairer actually worked quite well for the WAG model of protein evolution on some benchmarks, showing asymptotic consistency with a relative statistical efficiency of ~1/8th, but worked poorly for the LG model due to the challenges of estimating site rates. We found that approaches (1) and (2) can indeed provide more accurate estimates in some cases. Since pairing based on HD worked very well already, we decided to make it the focus of the paper. We plan to explore other pairing methods in future work. It would be exciting to find a variant of FastCherries that retains the near-linear runtime while showing an even smaller error.
the performance of the method may vary across different protein families or evolutionary contexts. It would be better to assess the generalizability of the approach to diverse datasets and evolutionary scenarios.
Our benchmark on the QMaker datasets (Supplementary Figure S1) shows that our method performs well on datasets from diverse parts of life. We agree that more in-depth error analysis may provide further insights and improvements to the method, but we leave this for future research.
the accuracy of the estimated rate matrices and predictions might depend on the quality of the input multiple sequence alignments. Are there any ways to address potential biases or errors in the MSAs to improve the robustness?
This is an important question which pertains to much work done in statistical phylogenetics. Although we did not explore it in our work, the speed of our method makes it possible (unlike prior work) to obtain bootstrap confidence intervals for the rate matrix estimation, which should enable users to understand the extent to which the estimates are stable to e.g. subsampling the MSA (whether rows or columns) or changing the MSA building algorithm.
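A column-level bootstrap of the kind described above can be sketched as follows. Here `estimator` is a hypothetical stand-in for the rate-matrix fitting routine; resampling rows instead gives a sequence-level bootstrap.

```python
import numpy as np

def column_bootstrap(msa, estimator, n_boot=100, seed=0):
    """Bootstrap an MSA-based estimator by resampling alignment columns.

    msa: (n_sequences, n_sites) array of residue indices.
    estimator: callable mapping an MSA to a parameter estimate
        (hypothetical stand-in for the rate-matrix fitting routine).
    Returns the list of bootstrap estimates, from which confidence
    intervals can be read off.
    """
    rng = np.random.default_rng(seed)
    n_sites = msa.shape[1]
    estimates = []
    for _ in range(n_boot):
        cols = rng.integers(0, n_sites, size=n_sites)  # sample with replacement
        estimates.append(estimator(msa[:, cols]))
    return estimates
```

With a near-linear-time estimator, running a few hundred replicates remains cheap, which is what makes this feasible where slower tree-based pipelines would not be.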
the benchmark used to evaluate the performance may have limitations or biases. It would be better to compare the performance on other benchmarks.
As mentioned above, the QMaker benchmark provides evidence in this direction showing generalization to diverse domains of life.
We thank all the reviewers for taking the time to carefully read our manuscript and provide thoughtful feedback. Different reviewers have asked different interesting questions, to which we have replied individually; we will also clarify them in the final version of our paper. The only major criticism is from reviewer #4, who states that our contributions are incremental over the original CherryML. As we explain in our rebuttal to the reviewer, we respectfully disagree with this view. In fact, the other reviewers have pointed out that our contributions are significant. For clarity, we highlight our contributions below:
- Our new FastCherries algorithm significantly speeds up the tree estimation step of the CherryML framework, giving a near-linear time algorithm for end-to-end rate matrix estimation from MSAs. As we have pointed out in our response to reviewer #3 – whom we thank for bringing up space complexity – our method further has linear (and thus optimal up to constants) space complexity. While there is a small loss of accuracy, the whole purpose of developing FastCherries is to avoid doing expensive computations while assuring little reduction in accuracy. It is a worthwhile tradeoff that can benefit applications requiring scalability, such as the one considered in our manuscript.
- In applications, as our results clearly show (Figure 1, Supplementary Figure S1, and Supplementary Table S1) and the other reviewers have highlighted, the performance of CherryML with FastCherries is comparable to that of CherryML with FastTree, while being one to two orders of magnitude faster. These results include diverse MSAs (Supplementary Figure S1 uses the QMaker MSAs which come from diverse areas of life). Furthermore, the performance of SiteRM over LG in variant effect prediction is far from incremental, as can be seen in Supplementary Table S1.
We agree that there is much interesting work to be followed up, such as applying our method to improve phylogenetic tree inference. We plan to pursue this research in future work.
We thank all reviewers again for their time and dedication.
All reviewers find this paper strong. Minor critical remarks/questions were made related to the significance of the improvement over prior SOTA, applicability in challenging cases (very large datasets, large number of gaps, ...), and theoretical analysis to complement empirical validation. Nearly all issues were resolved in the rebuttal and discussion.
Overall, the decision is unanimously to accept. The authors can hopefully make good use of the detailed reviews in subsequent work.