Can a Crow Hatch a Falcon? Lineage Matters in Predicting Large Language Model Performance
Explicitly modeling LLM **lineage** boosts performance prediction accuracy. Our lineage-regularized matrix factorization leverages ancestry to outperform standard methods when predicting new or merged models with minimal evaluation data.
Abstract
Reviews and Discussion
This paper proposes Lineage-Regularized Matrix Factorization (LRMF), which incorporates ancestral model information into a matrix factorization framework. LRMF models how well a particular model performs on a specific task using latent embeddings, but adds graph Laplacian regularization terms based on model lineage and instance similarity.
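For concreteness, a minimal sketch of what such a Laplacian-regularized objective could look like (variable names and the exact loss form are illustrative, not the authors' implementation; setting both regularization weights to zero recovers plain matrix factorization):

```python
# Illustrative sketch only -- not the paper's code. U: model embeddings, V: instance
# embeddings, Y: observed 0/1 accuracy matrix, mask: which (model, instance) cells are
# observed, L_M / L_X: graph Laplacians of the lineage and instance-similarity graphs.
import numpy as np

def lrmf_loss(U, V, Y, mask, L_M, L_X, lam_M=1.0, lam_X=1.0):
    R = (U @ V.T - Y) * mask                  # reconstruction error on observed cells only
    recon = np.sum(R ** 2)
    lineage_reg = np.trace(U.T @ L_M @ U)     # pulls embeddings of lineage-related models together
    instance_reg = np.trace(V.T @ L_X @ V)    # pulls embeddings of similar prompts together
    return recon + lam_M * lineage_reg + lam_X * instance_reg
```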
I found the method to be well-grounded and novel enough to warrant publication. The experimental evaluation is comprehensive, covering different model sets, cold-start scenarios, and data sparsity impacts. The paper reads well and has a logical structure. Their results back up their claims nicely. The practical value here is significant. By showing how important lineage is, the paper offers a more efficient approach to model selection and tuning.
Reasons to Accept
- This work addresses the high cost of LLM evaluation, a practical bottleneck faced by the LLM community, by developing tools to predict performance beforehand.
- The core idea of leveraging lineage information through graph regularization is both novel for LLM performance prediction and intuitively makes sense. It fills a gap in existing work that typically ignores these relationships.
- The research claims are adequately backed by experiments on representative datasets. I appreciate that they considered diverse scenarios including mixed model types, cold-starts, and varying data availability.
- The gains shown over relevant baselines are meaningful - LRMF clearly outperforms both NCF with factors and their simpler lineage averaging method.
- The methodology is explained clearly enough that I could probably reproduce this work if needed, with sufficient details on data, implementation, and hyperparameters.
Reasons to Reject
- The NCF with factors baseline only uses a limited set of factors (architecture type, model type, parameter size). While I understand the challenges with collecting comprehensive factors (and they cite Zhang et al., 2024 on issues with dominant factors), I would have liked to see a comparison against a baseline with richer factors where possible, such as training tokens, dataset mix, or detailed architecture information where available. This would give a more confident sense of LRMF's advantage.
- The quality of lineage information feels like a potential weak spot. The approach relies on lineage data extracted from Hugging Face metadata (using "base model:" tags and config files). How reliable and complete is this metadata? The authors acknowledge this limitation but do not quantify how missing or incorrect lineage data might impact results.
- The current LRMF model treats all lineage connections the same way, whether fine-tuning, merging, or different degrees of fine-tuning, and handles different path lengths only implicitly through the graph Laplacian. This seems like an oversimplification of what can be complex inheritance patterns. The authors mention this as future work, but it remains a limitation of the current approach.
- The negative correlation on GPQA needs to be addressed. LRMF consistently shows poor (negative) correlation on this benchmark in both experiments (Tables 1 & 2), even performing worse than their simple Model Lineage Averaging baseline. This suggests the LRMF assumptions might break down for certain task types, and it deserved deeper discussion and analysis than just reporting the number.
- The construction of the instance similarity graph (A^(X)) depends on a specific embedding model and a fixed k=20. I would have liked to see some exploration of how sensitive the results are to these choices.
- The evaluation focuses on relative performance (using Pearson correlation) to rank models correctly. While this makes sense, some discussion of predicting absolute performance scores would be valuable for certain applications.
- The paper is missing citations for several models it mentions: Llama 2 (which it also capitalizes incorrectly), Pythia, Qwen, etc. And when specific models such as Qwen-2.5-7B-Instruct are mentioned, hyperlinks to the Hugging Face models would be helpful.
Questions to Authors
- Why are we seeing negative correlation by LRMF on the GPQA benchmark across both experiments? Even the simpler Model Lineage Averaging baseline does better here. Do you have any insights into why LRMF might struggle specifically with this benchmark?
- Could you tell us more about the specific factors used in the NCF baseline and how you chose them? Did you try incorporating potentially stronger factors, even if they were only available for some models, to further assess what lineage regularization uniquely contributes?
- I'm wondering how sensitive the LRMF results are to how you constructed the instance similarity graph A^(X). Have you experimented with different embedding models for prompts or different values of k when defining neighbors?
Thank you for the detailed review. We reran ablations and robustness checks; answers are below.
Responses to Questions
Q1. Low Performance on GPQA and MuSR
Why are we seeing negative correlation by LRMF on the GPQA benchmark across both experiments?...
- Further analysis of model performance on GPQA and MuSR revealed that, for many models, these benchmarks were excessively difficult, resulting in accuracy on multiple-choice questions that was close to random chance.
- This substantial difficulty hindered clear differentiation between models on these benchmarks.
- Additional experiments with five different train/test splits showed extremely high variance in correlation coefficients for GPQA and MuSR, suggesting instability in predictions.
- We attribute this instability to the excessive difficulty of these benchmarks, where irrelevant data from other models contributes noise instead of useful signals.

| Method | all | bbh | gpqa | ifeval | math | mmlu-pro | musr |
|---|---|---|---|---|---|---|---|
| LRMF | 0.587(±0.032) | 0.644(±0.021) | −0.110(±0.135) | 0.406(±0.018) | 0.536(±0.020) | 0.653(±0.015) | 0.094(±0.194) |
| MLA | 0.414(±0.014) | 0.396(±0.017) | 0.261(±0.106) | 0.283(±0.026) | 0.421(±0.034) | 0.405(±0.033) | 0.296(±0.058) |
| NCF_F | 0.242(±0.065) | 0.295(±0.054) | 0.137(±0.056) | 0.082(±0.061) | 0.190(±0.036) | 0.283(±0.062) | 0.104(±0.077) |
Q2. Factor Selection
Could you tell us more about the specific factors used in the NCF baseline and how you chose them?...
- We used only the factors consistently available across all models in the Hugging Face model card JSON: “architecture type,” “model type,” and “parameter size.” We didn’t include factors like “dataset size” or “batch size” because they weren’t reliably present. In the homogeneous setting, such missing factors would likely be similar across models anyway, so excluding them has little impact.
Q3. Sensitivity of A^(X)
I'm wondering how sensitive the LRMF results are to how you constructed the instance similarity graph A^(X). Have you experimented with different embedding models for prompts or different values of k when defining neighbors?
- We compared several widely used embedding models, including Snowflake/snowflake-arctic-embed-l-v2.0, intfloat/e5-mistral-7b-instruct, and all-mpnet-base-v2, and found that the choice of embedding model made almost no difference to the results. We attribute this to the fact that the essential requirement is to connect the top-k tasks with rough similarity; any model capable of capturing coarse-grained similarity is sufficient for this purpose.
- For the parameter k, we experimented with values ranging from 2 to 100, selecting the value that yielded the best AUC-ROC on the validation set. However, differences in performance across k values were minimal.

| embedding model | k=2 | 5 | 10 | 20 | 50 | 100 |
|---|---|---|---|---|---|---|
| Snowflake/snowflake-arctic-embed-l-v2.0 | 0.57 | 0.57 | 0.57 | 0.57 | 0.57 | 0.57 |
| intfloat/e5-mistral-7b-instruct | 0.57 | 0.57 | 0.57 | 0.57 | 0.57 | 0.57 |
| all-mpnet-base-v2 | 0.57 | 0.57 | 0.57 | 0.57 | 0.57 | 0.57 |
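For readers following along, a minimal sketch of how such a k-NN instance-similarity graph can be built from prompt embeddings (the helper name and defaults are illustrative, not the exact implementation; all-mpnet-base-v2 and k=20 are taken from the discussion above):

```python
# Illustrative sketch of constructing an instance-similarity graph A^(X)
# from prompt embeddings; not the paper's actual code.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sentence_transformers import SentenceTransformer

def build_instance_graph(prompts, model_name="all-mpnet-base-v2", k=20):
    emb = SentenceTransformer(model_name).encode(prompts, normalize_embeddings=True)
    nn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(emb)
    _, idx = nn.kneighbors(emb)                  # the first neighbor of each point is itself
    n = len(prompts)
    A = np.zeros((n, n))
    for i, neighbors in enumerate(idx):
        A[i, neighbors[1:]] = 1.0                # connect each prompt to its top-k neighbors
    return np.maximum(A, A.T)                    # symmetrize for an undirected graph
```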
Other Comments
The quality of lineage information feels like a potential weak spot ... how missing or incorrect lineage data might impact results.
The reliability of lineage metadata extracted from Hugging Face is indeed a critical issue, as this metadata often lacks completeness—for instance, merged models frequently omit details about their base models. To quantitatively assess the impact of noise/incompleteness of lineage data, we performed additional experiments by randomly modifying lineage data (adding or removing lineage links). Notably, we observed distinct responses between the two methods:
- LRMF:
- Robust to incomplete (missing) lineage data (40% removal led to only a ~10% decrease in correlation).
- Highly sensitive to incorrect lineage additions (40% addition caused a ~50% decrease in correlation).
- Implication: For LRMF, accuracy is better preserved by omitting uncertain lineage information rather than risking incorrect additions.
- MLA:
- Demonstrated robustness against both random additions and removals of lineage data.
- The resilience likely stems from random lineage connections averaging out model-specific variations, pulling results toward a global mean.
Detailed results from these experiments are available here.
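Schematically, the perturbation amounts to something like the following (an illustrative sketch, not the exact implementation; here the fraction of added spurious edges is expressed relative to the number of true edges):

```python
# Randomly drop existing lineage edges and/or add spurious ones in a 0/1 adjacency matrix.
import numpy as np

def perturb_lineage(A, remove_frac=0.0, add_frac=0.0, seed=0):
    rng = np.random.default_rng(seed)
    A = A.copy()
    iu = np.triu_indices_from(A, k=1)            # treat the lineage graph as undirected here
    edges = [(i, j) for i, j in zip(*iu) if A[i, j] > 0]
    non_edges = [(i, j) for i, j in zip(*iu) if A[i, j] == 0]
    for i, j in rng.permutation(edges)[: int(remove_frac * len(edges))]:
        A[i, j] = A[j, i] = 0.0                  # remove a fraction of true lineage links
    for i, j in rng.permutation(non_edges)[: int(add_frac * len(edges))]:
        A[i, j] = A[j, i] = 1.0                  # add spurious links between unrelated models
    return A
```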
Absolute Performance Metrics
The evaluation focuses on relative performance (using Pearson correlation) to rank models correctly. While this makes sense, I think some discussion on predicting absolute performance scores would be valuable for certain applications.
- We also used AUC-ROC as an absolute performance metric for binary predictions (solvable or not) and observed high scores (0.8–0.9) across all methods.
- However, high AUC-ROC can occur even when true correlation is zero, especially in homogeneous settings. Therefore, we report correlation coefficients as the primary evaluation metric in the main paper.
Thank you for the updated results and detailed reply. I hold my scores.
We appreciate your positive feedback and constructive comments. Thank you for your time and consideration.
The paper studies the effect of lineage on predicting model performance before running large-scale evaluation. The authors show that existing prediction techniques fail to account for the lineage relationships between models and the ancestors from which they are derived via techniques such as fine-tuning and model merging. Taking lineage features into account helps improve the correlation between actual and predicted performance. Existing techniques make use of scaling laws or adaptations that consider parameter size and token count. The authors propose a novel lineage-based matrix factorization technique that treats known graph paths across original and derived models as lineage dependencies. This technique also addresses the cold-start problem of LLM performance estimation.
Reasons to Accept
- The paper addresses the cold-start problem, where only very few evaluation results are available for a model.
- The paper provides a large-scale evaluation of the LRMF technique.
- Accounting for lineage information does seem like an interesting idea.
Reasons to Reject
While the paper demonstrates the value of incorporating lineage, it doesn't extensively analyze the sensitivity of its LRMF model to varying degrees of noise or incompleteness in this foundational lineage data. A critical aspect would be understanding at what threshold of data inaccuracy the benefits of lineage regularization begin to diminish or even become detrimental. Furthermore, over smaller subsets of data this effect might be more pronounced.
For both homogeneous and heterogeneous cases, how does one decide whether to use Model Lineage Averaging or Lineage-Regularized Matrix Factorization? The paper does not discuss the failure modes in depth or provide insights from the empirical analysis. Further, the paper still needs to shed light on the poor performance on GPQA.
Questions to Authors
Please read weaknesses.
We would like to express our sincere gratitude for your careful review and insightful comments on our research. The feedback from the reviewers has significantly improved the quality of our paper. Based on your suggestions, we have conducted several additional experiments, which we invite you to examine along with our responses.
What if there is noise in the model lineage?
While the paper demonstrates the value of incorporating lineage, it doesn't extensively analyze the sensitivity of its LRMF model to varying degrees of noise or incompleteness in this foundational lineage data. A critical aspect would be understanding at what threshold of data inaccuracy the benefits of lineage regularization begin to diminish or even become detrimental. Furthermore, over smaller subsets of data this effect might be more pronounced.
To quantitatively assess the impact of noise/incompleteness of lineage data, we performed additional experiments by randomly modifying lineage data (adding or removing lineage links). Notably, we observed distinct responses between the two methods:
- LRMF:
- Robust to incomplete (missing) lineage data (40% removal led to only a ~10% decrease in correlation).
- Highly sensitive to incorrect lineage additions (40% addition caused a ~50% decrease in correlation).
- Implication: For LRMF, accuracy is better preserved by omitting uncertain lineage information rather than risking incorrect additions.
- MLA:
- Demonstrated robustness against both random additions and removals of lineage data.
- The resilience likely stems from random lineage connections averaging out model-specific variations, pulling results toward a global mean.
Detailed results from these experiments are available here.
LRMF vs. MLA
For both homogeneous and heterogeneous cases, how does one decide whether to use Model Lineage Averaging or Lineage-Regularized Matrix Factorization? The paper does not discuss the failure modes in depth or provide insights from the empirical analysis.
- LRMF generally achieves superior performance compared to MLA across most evaluation settings.
- We recommend selecting the method with the best validation performance, as no inconsistencies were observed between validation and test rankings.
- Based on further experiments above, MLA is preferable in scenarios where lineage information contains a large number of unreliable entries.
Low Performance on GPQA and MuSR
Further, the paper still needs to shed light on the poor performance on GPQA.
- Further analysis of model performance on GPQA and MuSR revealed that, for many models, these benchmarks were excessively difficult, resulting in accuracy on multiple-choice questions that was close to random chance.
- This substantial difficulty hindered clear differentiation between models on these benchmarks.
- Additional experiments with five different train/test splits showed extremely high variance in correlation coefficients for GPQA and MuSR, suggesting instability in predictions.
- We attribute this instability to the excessive difficulty of these benchmarks, where irrelevant data from other models contributes noise instead of useful signals.
| Method | all | bbh | gpqa | ifeval | math | mmlu-pro | musr |
|---|---|---|---|---|---|---|---|
| LRMF | 0.587(±0.032) | 0.644(±0.021) | −0.110(±0.135) | 0.406(±0.018) | 0.536(±0.020) | 0.653(±0.015) | 0.094(±0.194) |
| MLA | 0.414(±0.014) | 0.396(±0.017) | 0.261(±0.106) | 0.283(±0.026) | 0.421(±0.034) | 0.405(±0.033) | 0.296(±0.058) |
| NCF_F | 0.242(±0.065) | 0.295(±0.054) | 0.137(±0.056) | 0.082(±0.061) | 0.190(±0.036) | 0.283(±0.062) | 0.104(±0.077) |
Thank you for the discussion and detailed response. I have no other questions.
We appreciate your positive feedback and constructive comments. Thank you for your time and consideration.
This paper proposes to leverage lineage relationships between models to predict the performance of fine-tuned or merged LLMs. It demonstrates improvement over a scaling-laws-based method and provides interesting observations, with practical implications, on the effect of lineage on model performance.
Reasons to Accept
The use of the readily available model lineage information for performance prediction is a unique and novel approach to an important problem in current LLM research. In addition, the paper highlights intriguing characteristics of the relationship between model lineage and its performance, which would be interesting to explore further.
Reasons to Reject
A few clarifications and additions would be helpful:
- When a new model is to be evaluated, does the entire optimization problem have to be re-run or is there a more efficient approximation that can be done?
- How stable are the obtained model embeddings? When adding a new model or several models for example, how different will the results be?
- The comparison is done only with one previous work that relies on scaling laws. Comparing to another performance prediction technique, e.g. from the model routing works, would help highlight the importance of the model lineage information and could increase the impact of the paper.
Questions to Authors
- Are the 1,000 random samples the same for all models or are they randomly selected for each model? What is the resulting size of the instance matrix?
We would like to express our sincere gratitude for your careful review and insightful comments on our research. The feedback from the reviewers has significantly improved the quality of our paper.
Adding New Models in Our Method
When a new model is to be evaluated, does the entire optimization problem have to be re-run or is there a more efficient approximation that can be done? How stable are the obtained model embeddings? When adding a new model or several models for example, how different will the results be?
This is a situation that frequently occurs in the large-scale recommendation domain, and it can be optimized at low cost using various methods. For example, a popular approach is to warm-start the optimization using the previously obtained feature vectors as initial values. This enables obtaining solutions much faster than cold-start. In that case, we can also expect the embedding variations to be minimized.
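As an illustration of the warm-start idea (the optimizer fit_lrmf and the shapes are hypothetical placeholders, not the actual code):

```python
# Minimal sketch: when a new model row is appended, reuse the trained embeddings as
# initialization so that only the new row starts from scratch.
import numpy as np

def add_model_warm_start(U_old, V_old, Y_new, mask_new, L_M_new, L_X, fit_lrmf, rank):
    rng = np.random.default_rng(0)
    U_init = np.vstack([U_old, 0.01 * rng.standard_normal((1, rank))])  # append one new model
    V_init = V_old                                # instance embeddings carry over unchanged
    return fit_lrmf(Y_new, mask_new, L_M_new, L_X, U_init=U_init, V_init=V_init)
```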
Comparison with Existing Routing Methods
The comparison is done only with one previous work that relies on scaling laws. Comparing to another performance prediction technique, e.g. from the model routing works, would help highlight the importance of the model lineage information and could increase the impact of the paper.
We are tackling a new problem: 'When creating a new model that can effectively solve a certain problem through fine-tuning or merging, which model should be used as the base?' Our goal is to estimate the performance superiority of such unknown new models. This differs from the problem setting of 'Routing (determining which existing model should be used to solve a new problem),' so, to the best of our knowledge, no existing method proposed for Routing can be directly applied. Routing assumes that candidate models are already given and that these candidate models have already been evaluated on several samples, whereas we are estimating the performance of unknown models from cold start.
However, the kNN Routing concept proposed by Shinitzer et al. (2024) can be applied and extended to incorporate lineage. While they use the average of scores over the top-k most similar tasks (effectively a Task Lineage Average), Model Lineage Averaging applies the same idea in the model direction. MLA underperforms LRMF in most cases but performs well in some, which we take to indicate that methods that worked well for routing can also be effective in some of our settings.
Questions to Authors
Are the 1,000 random samples the same for all models or are they randomly selected for each model? What is the resulting size of the instance matrix?
- The same 1,000 random samples are consistently used across all models, ensuring fair comparison. The resulting instance matrix size is [number of models] × 1,000.
- In Appendix C's setting (each model evaluated only on t unique samples), the matrix size remains the same but becomes sparse; LRMF still achieves superior performance.
Thank you for the response to my questions and concerns, and for clarifying the differences and parallels between the focus of the paper and model routing. I have raised my score since I believe that these clarifications and the additions in the responses to the other reviewers are very helpful to better understand the contribution and power of the proposed approach.
Thank you for your valuable feedback and thoughtful suggestions. We sincerely appreciate the time and attention you have devoted to our work.
This paper studies the problem of large language model performance prediction, and proposes to incorporate lineage information (e.g., the model is fine-tuned from a base model, or merged from two models) into the performance prediction model. The proposed methods are validated with model performance on the Open LLM benchmark, in two different settings, (1) homogeneous, with 145 models derived from Qwen 2.5 7B; and (2) heterogeneous, with 2534 models based on various training recipes. The proposed lineage-regularized matrix factorization (LRMF) approach outperforms the baseline of collaborative filtering with factors substantially, with up to 7-10 percentage point increases.
Reasons to Accept
- The paper explores an important research question and provides a reasonable solution that improves on existing methods.
- Simple and effective method that leads to substantial performance improvement over the baseline.
Reasons to Reject
- Evaluation could be strengthened with additional baselines and ablation studies.
- The paper currently uses one baseline method (NCF with factors). This is concerning since there are multiple prior works on model performance prediction (e.g., Ruan et al., 2024; Zhang et al., 2024; Polo et al., 2024; Kipnis et al., 2025; etc as discussed in Section 2). In the case that methods in prior work are not applicable to the problem here, the reasons should also be explained.
- The proposed method LRMF uses both “lineage regularization” and “task regularization”, with two separate loss terms in the final loss (line 244). It’s unclear whether the performance improvements come from “lineage regularization” or “task regularization”. If the paper claims that the former is important (as done in the title), I would expect an ablation study on the two loss terms to show that.
- Confusing claim in the abstract and confusing paper organization
- An improvement of “up to 7-10 percentage points” is described in the abstract and conclusion. The proposed method improves from 0.460 to 0.719 in Table 1 and 0.256 to 0.579 in Table 2. It’s confusing how the number of 7-10 was derived.
- The paper claims that the method addresses the cold-start problem in the abstract and Line 251, however I only find the relevant results in appendix C. The paper should guide the reader more explicitly if it's a main advantage of the method.
Questions to Authors
Question
- Line 302: Clarification question about evaluation metrics.
- Did you predict instance-level accuracy (i.e., either 0 or 1), aggregate them to dataset-level accuracy, and compute the correlation (using multiple test models)?
- Also by “across various benchmarks”, is it referring to experimenting with each benchmark independently?
- Line 314: Is there one fixed train/test split or do you sample multiple times?
- Line 341: I don’t quite understand the reasoning of “this limited factor diversity explains …” Could you please elaborate on this?
Discussion
- Line 393: I’m curious about your thoughts on how to incorporate weighted connections (i.e., edge information) and maybe multi-hop relations on the model fine-tuning/merging graph. What are methods you feel suitable for this extension?
Thank you for the detailed review. We reran ablations and robustness checks; answers are below.
Limited Baselines
The paper currently uses one baseline method (NCF with factors) ...
- We address the problem of predicting model performance before any post-training steps such as fine-tuning or model merging. This can be viewed as a model-side cold-start problem.
- Polo et al. (2024) and Kipnis et al. (2025) cannot be used under this problem setting, because they require the target model to have already been evaluated on some samples.
- Similarly, scaling law methods (e.g., Ruan et al., 2024) are not suitable for performance estimation in the context of post-training data or model merging, since they extrapolate performance from factors like parameter count or pretraining data volume, which are essentially unchanged in these scenarios.
Regularization Term Ablation
The proposed method LRMF uses both “lineage regularization” and “task regularization”, ... I would expect an ablation study on the two loss terms to show that.
- Most correlation improvement is due to the Model Lineage regularization term.
- Appendix B.1 (Fig. 4a/4b) shows AUC-ROC ablation for each regularization term.
- In response to feedback, we expanded the ablation study to more comprehensively explore and visualize regularization effects using correlation coefficients (see the new figure here).
- The case λ_M=0,λ_X=0 (i.e., NCF/MF) fails in cold-start (correlation = 0).
- Model lineage (λ_M > 0) enables strong performance even without task lineage (λ_X = 0).
Evaluation Metrics and Consistent Reporting
An improvement of “up to 7-10 percentage points” is described in the abstract and conclusion. ... It’s confusing how the number of 7-10 was derived.
- We use two metrics: Pearson correlation (for model ranking) and AUC-ROC (for instance-level accuracy).
- The “7–10 percentage points” originally referred to AUC-ROC improvement.
- For consistency, we will instead report Pearson correlation improvement: “up to a 25-point increase.”
Cold-Start Problem
The paper claims that the method addresses the cold-start problem in the abstract and Line 251, however I only find the relevant results in appendix C...
- All experiments address the cold-start problem: predicting test-model performance with no evaluation data based solely on training-model evaluation results and lineage information.
- Appendix C shows results under a stricter setting where training-model evaluation data are limited (5–500 samples per model). Even then, LRMF outperforms MLA and NCF with factors in nearly all cases.
Responses to the Questions
Line 302: Clarification of Metrics
Line 302: Clarification question about evaluation metrics. Did you predict instance-level accuracy (i.e., either 0 or 1), ...
- That is correct. We first predict instance-level accuracy (0/1), then compute average scores at the overall or benchmark level and calculate correlations. Thus, the constrained matrix factorization is performed once on the entire set of benchmarks.
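In sketch form, the metric computation described above looks roughly like this (function and variable names are illustrative, not the actual evaluation code):

```python
# Average instance-level predictions per benchmark, then correlate the resulting
# per-model accuracies with the true per-model accuracies across test models.
import numpy as np
from scipy.stats import pearsonr

def benchmark_correlation(pred, true, bench_ids, bench):
    """pred/true: (n_models, n_instances) arrays; bench_ids: benchmark label per instance."""
    cols = bench_ids == bench
    pred_acc = pred[:, cols].mean(axis=1)        # predicted accuracy per test model
    true_acc = true[:, cols].mean(axis=1)        # observed accuracy per test model
    return pearsonr(pred_acc, true_acc)[0]
```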
Line 314: Train/Test Split
Line 314: Is there one fixed train/test split or do you sample multiple times?
- In response to feedback, we conducted five different samplings and computed mean and variance. Trends remained consistent, although score stability varied greatly across benchmarks. For GPQA and MuSR—where LRMF underperforms—variance was about ten times larger than in other benchmarks. We found these benchmarks are very difficult for most models, with performance close to random; including other model scores in such cases adds noise and degrades prediction.
| Method | all | bbh | gpqa | ifeval | math | mmlu-pro | musr |
|---|---|---|---|---|---|---|---|
| LRMF | 0.587(±0.032) | 0.644(±0.021) | −0.110(±0.135) | 0.406(±0.018) | 0.536(±0.020) | 0.653(±0.015) | 0.094(±0.194) |
| MLA | 0.414(±0.014) | 0.396(±0.017) | 0.261(±0.106) | 0.283(±0.026) | 0.421(±0.034) | 0.405(±0.033) | 0.296(±0.058) |
| NCF_F | 0.242(±0.065) | 0.295(±0.054) | 0.137(±0.056) | 0.082(±0.061) | 0.190(±0.036) | 0.283(±0.062) | 0.104(±0.077) |
Line 341: Factor Diversity
Line 341: I don’t quite understand the reasoning of “this limited factor diversity explains …”...
- NCF with factors relies only on whether a model is FT, merged, or base. Information about which data or learning method was used is all collapsed to “trained,” so on instruction-following benchmarks like IFEval—where data or method differences strongly affect scores—NCF cannot capture feature differences, resulting in lower predictive performance.
Line 393: Weighted Connections
Line 393: I’m curious about your thoughts on how to incorporate weighted connections...
- We can represent weighted edges by using merge ratios for model merging or fine-tuning token counts as weights. This can be implemented by using a soft adjacency matrix A.
- The graph kernel in the regularization term already reflects multi-hop relations: if edges exist between A–B and B–C, A and C embeddings become closer indirectly.
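As a toy illustration of both points (the weights below are made up, not taken from the paper): a soft adjacency matrix whose entries encode, e.g., merge ratios, and the Laplacian regularizer that propagates multi-hop relations:

```python
import numpy as np

A = np.array([[0.0, 0.7, 0.0],      # model 0 -- model 1: e.g. a 0.7 merge ratio
              [0.7, 0.0, 0.3],      # model 1 -- model 2: a weaker lineage link
              [0.0, 0.3, 0.0]])
L = np.diag(A.sum(axis=1)) - A      # graph Laplacian used in the regularizer
# tr(U^T L U) = 0.5 * sum_ij A_ij * ||u_i - u_j||^2, so heavier edges pull embeddings
# closer; models 0 and 2 are tied only indirectly, through their shared neighbor model 1.
```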
Thanks for the clarification!
In my initial review there was some misunderstanding regarding the problem setting. It seems that this paper pursues a cold-start setting (predicting a model's performance given its "lineage" information, before actually training the model) that is very different from prior work (predicting a model's performance based on its performance on other tasks, or scaling-law-style pre-training metadata), and thus methods in prior work are not very suitable as baselines. Hence I think the baseline selection in the paper is acceptable.
Also I appreciate (1) clarification on the metric, (2) clarification and improved ablation study on task/lineage regularization (3) additional experiments on multiple train-test splits.
Happy to adjust my ratings. I hope our discussion gets incorporated into your revised paper.
Thank you very much for your positive feedback. We greatly appreciate your insightful suggestions, which motivated us to conduct additional experiments and further clarify the problem setting. We will incorporate these enhancements and the points raised in our discussion into the revised version of the paper.
This paper studies the task of predicting performance of LLMs on downstream tasks without running the evaluation. The key insight is to use information about lineage, i.e., what base model the LLM in question is derived from. The proposed lineage-aware method based on matrix factorization substantially improves prediction accuracy. All four reviewers agreed that the paper was insightful and achieved substantial gains on this task. The authors should incorporate their comprehensive rebuttal materials into the final version of the paper.