PaperHub
Overall score: 7.3/10 (Poster, 4 reviewers)
Ratings: 4, 4, 5, 5 (min 4, max 5, std 0.5)
Confidence: 3.8
Novelty: 2.8, Quality: 3.8, Clarity: 3.0, Significance: 3.0
NeurIPS 2025

How Benchmark Prediction from Fewer Data Misses the Mark

Submitted: 2025-05-12, Updated: 2025-10-29

Abstract

Keywords
Evaluation, Large language model

Reviews and Discussion

Official Review
Rating: 4

The paper conducts a large‐scale evaluation of 11 micro-benchmarking methods for predicting full benchmark performance from a small subset of data across 19 diverse tasks. The paper shows that a simple baseline—randomly sampling 50 examples and fitting a regression model (Random-Sampling-Learn)—outperforms most sophisticated core-set selection techniques when source and target models are drawn from the same distribution. Critically, they find that nearly all methods break down when predicting the performance of stronger, unseen models, as their accuracy hinges on similarity to the source models. The only approach that remains consistently advantageous in this frontier setting is an augmented inverse propensity weighting (AIPW) estimator.
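For concreteness, the Random-Sampling-Learn baseline amounts to something like the following (my own illustrative sketch with hypothetical variable names; I regress on the per-item coreset results here, and the paper's implementation may differ in details):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical 0/1 correctness matrices: (models x benchmark items).
source_scores = rng.integers(0, 2, size=(80, 5000))   # source models, fully evaluated
target_scores = rng.integers(0, 2, size=(20, 5000))   # target models (full scores known here only for checking)

# 1) Randomly sample a coreset of 50 items.
coreset = rng.choice(source_scores.shape[1], size=50, replace=False)

# 2) Fit a regression from coreset results to full-benchmark accuracy on the source models.
X_train = source_scores[:, coreset]
y_train = source_scores.mean(axis=1)
reg = LinearRegression().fit(X_train, y_train)

# 3) Evaluate target models only on the coreset and predict their full-benchmark accuracy.
pred = reg.predict(target_scores[:, coreset])
gap = np.abs(pred - target_scores.mean(axis=1)).mean()  # average estimation gap
print(f"average estimation gap: {gap:.3f}")
```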

Overall, the manuscript delivers a rigorous, timely reality-check with clear metrics.

Strengths and Weaknesses

Strengths:

  1. The paper conducts extensive experiments across a diverse set of benchmarks and methods, yielding a comprehensive and robust empirical evaluation.

  2. The paper introduces an AIPW-based micro-benchmarking approach that maintains at least the performance of random sampling even in the “new models are strictly better” (extrapolation) regime.

  3. The paper offers a thorough quantitative analysis of the extent to which benchmark-prediction methods depend on similarity between target and source models, clearly elucidating why methods fail when that similarity is low.

Weaknesses:

  1. The manuscript does not report how well each method preserves the relative ranking of target models (e.g. via Kendall’s τ). Adding these experiments—at least in an appendix—would strengthen the evaluation without incurring excessive additional effort.

  2. There is no analysis of method performance when the number of source models or the size of the selected coreset is small. In particular, IRT-based approaches might be unstable with few source models; highlighting whether AIPW or RANDOM-SAMPLING-LEARN becomes even more advantageous in this low-data regime would be valuable.

Questions

The authors do not appear to report the variance of each method across multiple repeated experiments; could they please provide these variance estimates?

Limitations

yes

Final Justification

The authors' response has resolved my issue. Since my score is already positive, I have decided not to make any further changes to it.

Formatting Concerns

No Paper Formatting Concerns

Author Response

Dear Reviewer k5n7,

We appreciate your insightful comments. We are encouraged that you found our experiments extensive and robust, our quantitative analysis thorough, and the manuscript to provide a rigorous, timely reality check. Below, we address your specific concerns and suggestions.

Q1: Ranking preservation. We appreciate your valuable suggestion. We follow your recommendation and calculate Kendall's tau between the real ranking and the predicted ranking of target models. The results are averaged across 100 trials for each benchmark. We further average the results across all benchmarks, and the results are shown below. Our conclusions mostly remain unchanged, with all benchmark prediction methods outperforming Random Sampling under interpolation, while none can surpass Random Sampling under extrapolation. We will include the detailed results in the camera-ready version.

Method                     Average Kendall's τ under interpolation split (corresponding to Figure 2, last row)
Random-Sampling            0.705
Random-Sampling-Learn      0.748
PCA                        0.725
AIPW                       0.721
Random-Search-Learn        0.765
Lasso                      0.774
Double-Optimize            0.758
Anchor-Points-Weighted     0.763
Anchor-Points-Predictor    0.768
P-IRT                      0.711
GP-IRT                     0.72

Method                     Average Kendall's τ under extrapolation split (corresponding to Figure 3, last row)
Random-Sampling            0.534
Random-Sampling-Learn      0.185
PCA                        0.206
AIPW                       0.511
Random-Search-Learn        0.198
Lasso                      0.121
Double-Optimize            0.437
Anchor-Points-Weighted     0.504
Anchor-Points-Predictor    0.318
P-IRT                      0.419
GP-IRT                     0.454
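For clarity, the ranking agreement above is computed per trial roughly as follows (illustrative snippet; the arrays are placeholders, not actual results):

```python
import numpy as np
from scipy.stats import kendalltau

# Placeholder values: true and predicted full-benchmark accuracies of the target models in one trial.
true_acc = np.array([0.62, 0.71, 0.55, 0.80, 0.68])
pred_acc = np.array([0.60, 0.69, 0.58, 0.78, 0.70])

tau, _ = kendalltau(true_acc, pred_acc)   # rank agreement between true and predicted model orderings
print(f"Kendall's tau = {tau:.3f}")
```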

Q2: Analysis when the number of source models is small / when the coreset is small. We appreciate your suggestion. We conduct an experiment on the HELM-GSM8K benchmark under the interpolation setting. Following your recommendation, we use only 25% of all models as source models, resulting in 21 source models and 62 target models in each random trial. The results are presented below. Interestingly, most benchmark methods perform worse than Random-Sampling. The exceptions include AIPW and Double-Optimize.

Method                     Average estimation gap in HELM-GSM8K under interpolation split with fewer source models
Random-Sampling            4.4
Random-Sampling-Learn      4.5
PCA                        6.7
AIPW                       3.8
Random-Search-Learn        4.4
Lasso                      7.9
Double-Optimize            4.2
Anchor-Points-Weighted     4.9
Anchor-Points-Predictor    4.9
P-IRT                      4.8
GP-IRT                     4.4

We also conduct another experiment in which the only change relative to the Figure 2 setup is reducing the coreset size from 50 to 10. The results are presented below. When the coreset is small, Random-Search-Learn remains the best-performing method, while the IRT-based methods degrade to perform no better than Random-Sampling.

Method                     Average estimation gap in HELM-GSM8K under interpolation split with coreset size n=10
Random-Sampling            10.1
Random-Sampling-Learn      8.2
PCA                        9.9
AIPW                       9.1
Random-Search-Learn        7.6
Lasso                      14.6
Double-Optimize            8.4
Anchor-Points-Weighted     9.7
Anchor-Points-Predictor    7.9
P-IRT                      10.2
GP-IRT                     10.1

These results further highlight the potential limitations of existing benchmark prediction methods. We appreciate your insightful suggestion and will include it in the camera-ready version.

Q3: The variance of each method. We report the standard errors in Figures 6-11 in the Appendix. We thank you for the question and will be more explicit about this in the revised version.

Thank you again for your valuable feedback. We hope these clarifications address your concerns, and we look forward to further discussion.

Comment

Thank you for your reply. Your response has resolved my issue. Since my score is already positive, I have decided not to make any further changes to it.

Official Review
Rating: 4

This paper studies a specific paradigm for efficient model evaluation, which they refer to as benchmark prediction. In this setting, a predictor h is trained to estimate the performance of a model across an entire test dataset given only the performance of the model on a select subset of data points in the test set. They study prediction performance both with and without train-test distribution shift. They find that in the easier setting (no distribution shift) all methods perform well, while in the harder setting (with distribution shift) most methods fail to generalize. Two methods do generalize to some degree: the first is the random-sampling baseline, where a random subset of the test set is selected and the mean performance across the subset is used as the estimate, while the second is the Augmented Inverse Propensity Weighting (AIPW) method.
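For reference, the generic form of the AIPW estimator for a benchmark mean under uniform random subsampling is (my notation; the paper's exact variant may differ):

\[
\hat{\mu}_{\mathrm{AIPW}} = \frac{1}{N} \sum_{i=1}^{N} \hat{f}(x_i) + \frac{1}{N} \sum_{i=1}^{N} \frac{Z_i}{\pi_i}\bigl(y_i - \hat{f}(x_i)\bigr), \qquad \pi_i = \Pr(Z_i = 1) = \frac{n}{N},
\]

where \(Z_i\) indicates whether item \(i\) is in the evaluated subset of size \(n\), \(y_i\) is the target model's score on item \(i\), and \(\hat{f}\) is a score predictor fitted on the source models. With known uniform propensities the estimator is unbiased regardless of the quality of \(\hat{f}\), which is consistent with its reported robustness under extrapolation.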

Strengths and Weaknesses

Strengths

  • Thorough experimental results for subset selection approaches to efficient evaluation on a diverse set of benchmarks and models

Weaknesses

  • The random-sampling baseline from this work is a common baseline studied in other efficient evaluation works. Both [1] and [2] show this baseline (referred to as “uniform” and “filled subset”, respectively) is quite bad compared to bandit algorithms. The subset selection paradigm for efficient evaluation is certainly not the state of the art, and a comparison to state-of-the-art bandit algorithms is necessary to make a statement regarding efficient evaluation techniques.
  • The paper should include a list of the models used in the analysis somewhere in the main paper or appendix.

[1] Shi, Chengshuai, et al. "Best arm identification for prompt learning under a limited budget." arXiv preprint arXiv:2402.09723 (2024).

[2] Zhou, Jin Peng, et al. "On Speeding Up Language Model Evaluation." arXiv preprint arXiv:2407.06172 (2024).

Questions

My understanding of the literature is that in any setting where you might apply this subset-selection benchmark prediction, you could also use bandit algorithms, which have been shown to be much more effective than the random-sampling baseline in your work. The literature contradicts your recommendation that the random-sampling baseline should be used, and without a comparison to bandit algorithms it is impossible to draw a conclusion about AIPW. Please correct me if I have missed something here; otherwise, we need a comparison to the state of the art to understand the contribution of your analysis -- I would be happy to raise my score in either case.

Limitations

Yes

Final Justification

This work provides important evidence of the misleading nature of current benchmark prediction literature, which is a valuable service to the field.

Formatting Concerns

None

Author Response

Dear Reviewer 39bg,

We appreciate your insightful comments and your recognition of the thoroughness of our experimental results. We address your concerns and suggestions below.

Q1: Comparison to Bandit Algorithms. We thank you for pointing out the related literature [1] [2] on bandit algorithms. We agree that a comparison is important for clarity. However, we would like to clarify a key distinction between the research objectives.

Benchmark prediction aims to select a small subset of evaluation points to predict overall benchmark performance for any new model. This is useful when one wishes to obtain an accurate estimate for all models, not just the best one. In contrast, the bandit literature assumes a pool of target models and focuses on identifying the best model, rather than accurately estimating all models’ performance. By design, bandit algorithms will focus almost entirely on the best few models, which leads to bad performance estimation for all other models.

This paper focuses on the benchmark prediction task and explores when and how benchmark prediction is effective. All three other reviewers acknowledge the importance of addressing the problem. In the benchmark prediction literature, the state-of-the-art method, as far as we know, is the GP-IRT method, which we examined in our experiment.

To further address your suggestion, we experiment with the code from [2] out of the box within the benchmark prediction context. Specifically, we apply both the UCBE method and the UCBE-LRF method to the HELM-GSM8K benchmark under the interpolation model split. The results are shown below. Both bandit methods yield significantly larger estimation gaps compared to methods tailored for benchmark prediction. This supports our claim that while bandit algorithms excel at identifying the best model, they are less effective for accurate performance estimation across all models.

We appreciate your valuable suggestion and will include the citations and discuss the differences in the related work.

Method                     Average estimation gap under interpolation split on HELM-GSM8K
Random-Sampling            4.4
Random-Sampling-Learn      4.0
PCA                        3.7
AIPW                       3.8
Random-Search-Learn        3.9
Lasso                      6.8
Double-Optimize            3.9
Anchor-Points-Weighted     5.6
Anchor-Points-Predictor    6.1
P-IRT                      4.3
GP-IRT                     4.0
UCBE                       17.2
UCBE-LRF                   21.6

Q2: List of Models Used. We appreciate your suggestion and will include the model list and characteristics in the revised version.

Thank you again for your valuable feedback. We hope these clarifications address your concerns, and we look forward to further discussion.

[1] Shi, Chengshuai, et al. "Best arm identification for prompt learning under a limited budget." arXiv preprint arXiv:2402.09723 (2024).

[2] Zhou, Jin Peng, et al. "On Speeding Up Language Model Evaluation." arXiv preprint arXiv:2407.06172 (2024).

Comment

Thank you for the helpful clarification. Your explanation of the problem setting makes the contribution much clearer. Your rigorous empirical results effectively demonstrate the issues with existing benchmark prediction literature. I'll raise my scores accordingly.

I appreciate you taking the time to benchmark the bandit algorithms mentioned. As you rightly point out, I wouldn't expect them to recover the absolute scores well, but I do think they might perform reasonably at recovering relative rankings among models. That comparison could be interesting, though I understand it's somewhat tangential to your main contribution.

Official Review
Rating: 5

This paper discusses the problem of benchmark prediction: using model performance on fewer test samples from a benchmark to estimate performance on the full benchmark. The paper motivates this problem as arising due to the high costs of evaluating models on large-scale benchmarks. It then discusses several recent methods that have been proposed to tackle the benchmark prediction problem. Then, a systematic comparison of 11 different benchmark prediction methods (both previously proposed and some new methods) is conducted under two regimes: interpolation and extrapolation. The main finding of the paper is that most benchmark prediction methods underperform a simple random selection + regression baseline under the extrapolation regime. This undercuts claims from recent work suggesting that sophisticated coreset selection methods for finding optimal test samples to evaluate models are required for strong benchmark prediction performance. Finally, the paper discusses how most benchmark prediction methods exploit model similarity, showcasing that benchmark prediction might not truly yield the promise it holds---evaluating stronger models efficiently from a database of weaker model evaluations.

Strengths and Weaknesses

Overall, I think this paper makes an extremely important point regarding the recent literature around the benchmark prediction problem. I appreciate the authors conducting a thorough empirical analysis of recent methods under the two different settings of interpolation and extrapolation. Overall, I think the paper is quite strong. I however have some important concerns regarding some of the experimental results and overall structure of the paper, which I note down below.

  • Citing prior relevant work. There are two important prior works that have been missed in the paper references. Prabhu et al., NeurIPS'24, proposed a similar method called Sort & Search that leverages prior evaluations to regress newer sample-level predictions. Another relevant work is MetaBench, ICLR'25, which employs IRT-based methods to tackle the benchmark prediction problem.

  • Similarity of key takeaways to prior work. One of the key takeaways mentioned in the paper in lines 62-63 and 221-225 is also one of the suggested takeaways in Prabhu et al (see fig 5 and its caption): “This suggests generalizing beyond a single rank ordering, not better sampling strategies, should be the focus of subsequent research efforts.” I would appreciate if the authors can provide a discussion on the key similarities and differences between the main takeaways.

  • Validity of aggregate accuracy metric for the benchmark prediction problem. In Prabhu et al, they suggested that aggregate metrics like accuracy are not great for the benchmark prediction task since they can easily be manipulated (see section 2.1 "Why Adopt Sample-wise Prediction Metrics instead of Overall Accuracy Prediction?"). Sample-wise metrics are more important as they can enable precise characterization of the models on a per-sample basis. Again, a key discussion around this point would be appreciated.

  • Regarding performance on "Near-Extrapolation" settings. What does the extrapolation performance look like if you consider “near-extrapolation” instead of on-average extrapolation? Essentially, the motivation would be that in the real world, in most cases, we want to efficiently evaluate newer models which are a class above, but not significantly ahead of, the current generation of models. Hence, in most cases, we care about predicting “near-extrapolation” (for example, performance of target models in the 70%-80% accuracy bin from source models with accuracy 20%-70%, rather than target models with accuracy in the 90%-95% bins). Hence, I would suggest employing a bin-windowed evaluation strategy where you bin the performance of all models into K bins, and report the “near-extrapolation” benchmark prediction performance using bins 1 to k as source models to predict the (k+1)-th bin as target models (see the sketch below). If this indeed results in performance that is similar to interpolation, I would temper down the claims in the paper regarding the usefulness of the benchmark prediction task and methods.
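To make the suggested protocol concrete, here is the kind of bin-windowed split I have in mind (illustrative code; names are hypothetical):

```python
import numpy as np

# Hypothetical: full-benchmark accuracy of every available model.
full_acc = np.random.rand(100)

K = 5
# Assign models to K equal-frequency bins by full-benchmark accuracy.
edges = np.quantile(full_acc, np.linspace(0, 1, K + 1))
bin_ids = np.clip(np.digitize(full_acc, edges[1:-1]), 0, K - 1)

for k in range(1, K):
    source_models = np.where(bin_ids < k)[0]   # all weaker bins seen so far
    target_models = np.where(bin_ids == k)[0]  # the next-strongest bin ("near extrapolation")
    # run each benchmark prediction method with this source/target split
    # and report the estimation gap for bin k
```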

Questions

I've noted down all my concerns in the previous section.

Limitations

Yes

Final Justification

I am recommending acceptance. The authors have sufficiently addressed all my concerns, in my eyes making the paper much stronger. Note that the original submission was quite strong to begin with. Additionally, I think the topic explored in the paper is of extreme importance for many researchers working on foundation model evaluations, hence I believe that the paper will have sufficient audience at the conference.

Formatting Concerns

NA

Author Response

Dear Reviewer V6AP,

Thank you for your detailed and thoughtful review. We are encouraged that you found the problem we addressed important, the empirical analysis thorough, and the paper overall strong. Below, we will address your specific concerns and suggestions.

Q1: Citing Prior Relevant Work. Thank you for highlighting the missing citations. Among them, MetaBench closely resembles the P-IRT method we discussed in our paper. There are two main differences: First, MetaBench employs lower-dimensional IRT models and selects the coreset based on information filtering. Second, MetaBench jointly fits its models on multiple benchmarks, making a fair comparison to other methods difficult. Prabhu et al. propose a method called Sort & Search for efficient lifelong model evaluation. Rather than predicting benchmark performance, their method focuses on predicting the performance on individual data points. We further address the similarities and differences to this work in Q2 and Q3.

We will include the appropriate citations and discussions for both papers in the revised manuscript, and we appreciate your suggestion.

Q2: Similarity of Key Takeaways to Prior Work. We appreciate your valuable question. Prabhu et al. select the coreset by first sorting all data points by difficulty and then uniformly sampling from various difficulty levels. They assert that this method is more effective than random sampling for sample-wise performance estimation, as suggested in their Figure 4(d).

However, in Figure 5, they demonstrate that the difference between sampling methods is smaller than the so-called “aleatoric sampling error,” which arises from their single rank order design, as defined in their equation 1. Therefore, they conclude that generalizing beyond a single rank ordering, rather than using better sampling strategies, is more effective in reducing overall error. However, this conclusion only applies to the framework defined by their equation 1 and does not generalize to other benchmark prediction methods (which do not use a “single rank ordering”).

In contrast, we have conducted an extensive empirical study on a set of benchmark prediction methods and found that, across methods, coreset selection plays a minor role compared to the prediction step. We appreciate your suggestion and will acknowledge this relation to Prabhu et al. in our revised version.

Q3: Validity of Aggregate Accuracy Metric. We appreciate your insightful question. The goal of benchmarking is to obtain an aggregated score, such as accuracy, from the testing data. According to Hoeffding's inequality, this score estimates the model's average performance on the data distribution and generalizes to unseen IID data. Therefore, the aggregated score serves as an effective indicator for comparing models and estimating performance. Benchmark prediction aims to estimate benchmark performance using a few data points to reduce computational burden, so it’s valid to use the aggregated metric.
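Concretely, for $N$ i.i.d. test points with scores in $[0,1]$, Hoeffding's inequality gives

\[
\Pr\bigl(\lvert \hat{\mu}_N - \mu \rvert \ge \epsilon\bigr) \le 2\exp\bigl(-2N\epsilon^2\bigr),
\]

where \(\hat{\mu}_N\) is the benchmark score computed on the \(N\) test points and \(\mu\) is the expected score on the underlying data distribution, so the aggregate benchmark score concentrates around population performance at rate \(O(1/\sqrt{N})\).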

Theorem 2.1 in Prabhu et al. essentially states that it is possible to predict aggregate performance without correctly predicting performance for individual samples. This is indeed true (most of the methods tested in our work do not even attempt to make performance predictions for individual samples). In our view, Section 2.1 argues that sample-wise prediction metrics are preferable if one cares about sample-wise prediction quality, rather than directly motivating the focus on sample-wise quality in the first place. While sample-wise quality metrics can be useful, machine learning, for better or worse, mostly relies on aggregate benchmark scores, which further motivates our focus on aggregate metrics.

We appreciate the comment and will add this discussion to the camera-ready version.

Q4: Performance on "Near-Extrapolation" Settings. We appreciate your insightful suggestion. In the rolling bin setting, without modifying the methods, we must evaluate all models in the previous bin using the full dataset for the next stage to proceed. However, the main purpose of benchmark prediction is to begin with the already evaluated models, rather than assessing new models on the entire dataset.

Therefore, we conduct an experiment on the HELM-GSM8K benchmark under the single-bin near-extrapolation setting. Different from the paper setup (lowest 50% as source models and top 30% as target models), we now use the lowest-performing 75% of models as source models and the remaining 25% as target models. The results are shown below. Consistent with the findings in the paper, most methods fail to outperform Random Sampling, except for AIPW. We appreciate the valuable question and will include these results in the revised version.

Method                     Average estimation gap in HELM-GSM8K under near-extrapolation split
Random-Sampling            3.2
Random-Sampling-Learn      5.9
PCA                        5.1
AIPW                       2.9
Random-Search-Learn        6.2
Lasso                      13.3
Double-Optimize            4.7
Anchor-Points-Weighted     5.4
Anchor-Points-Predictor    14.1
P-IRT                      4.4
GP-IRT                     3.5

Thank you again for your valuable feedback and for highlighting these important points. We hope these clarifications and additions address your concerns, and we look forward to further discussion.

Comment

Dear Reviewer V6AP,

We include additional experimental results of MetaBench. The original code repository is implemented in R, and we could not find the corresponding code for constructing the coreset and training the IRT model. Therefore, we implemented the method ourselves. We developed a 2PL IRT model, filtered the coreset using Fisher information, and employed a GAM model to predict performance, as suggested in the paper. Due to time constraints, we experimented with the method on three small to medium-sized benchmarks. The results are presented below.
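For transparency, our re-implementation uses the standard 2PL formulation and item Fisher information (our notation):

\[
\Pr(y_{ij} = 1 \mid \theta_i) = \sigma\bigl(a_j(\theta_i - b_j)\bigr), \qquad I_j(\theta) = a_j^2\, P_j(\theta)\bigl(1 - P_j(\theta)\bigr),
\]

where \(\theta_i\) is the ability of model \(i\), \(a_j\) and \(b_j\) are the discrimination and difficulty of item \(j\), and \(\sigma\) is the logistic function. Items with the highest Fisher information are retained in the coreset, and a GAM then maps coreset performance to full-benchmark performance.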

Average estimation gap under interpolation split:

Method                     HELM-GSM8K   GLUE-RTE   OpenLLM-MUSR
Random-Sampling            4.4          5.2        5.4
Random-Sampling-Learn      4.0          2.3        1.9
PCA                        3.7          2.3        2.3
AIPW                       3.8          2.6        3.3
Random-Search-Learn        3.9          2.2        1.7
Lasso                      6.8          2.4        1.7
Double-Optimize            3.9          2.8        2.6
Anchor-Points-Weighted     5.6          2.2        5.0
Anchor-Points-Predictor    6.1          2.6        2.3
P-IRT                      4.3          2.3        2.9
GP-IRT                     4.0          2.2        2.8
MetaBench                  5.0          3.5        1.8

Average estimation gap under extrapolation split:

Method                     HELM-GSM8K   GLUE-RTE   OpenLLM-MUSR
Random-Sampling            3.5          4.9        5.5
Random-Sampling-Learn      10.5         12.9       6.1
PCA                        14.1         11.2       6.1
AIPW                       3.3          4.3        4.0
Random-Search-Learn        9.2          12.1       5.6
Lasso                      22.0         12.3       5.7
Double-Optimize            8.0          6.6        4.2
Anchor-Points-Weighted     4.2          4.6        4.3
Anchor-Points-Predictor    16.7         13.1       7.0
P-IRT                      5.3          6.3        3.8
GP-IRT                     4.1          5.6        3.6
MetaBench                  17.1         13.3       6.1

Similar to other methods, MetaBench can reduce the estimation gap compared to Random-Sampling under the interpolation model split, while it fails under the extrapolation split.

Once again, we thank you for your valuable comments. We hope that these additional results address your concerns. If there are no further issues, we would greatly appreciate it if you could consider raising your score.

Comment

I thank the authors for their efforts in the rebuttal phase. I particularly appreciate the comparison with Metabench, especially given the additional work needed to implement it. All my concerns have been addressed and I am happy to recommend the paper for acceptance. I think the paper will make for some very exciting discussions at the conference.

Official Review
Rating: 5

This paper is a critique of a popular research thread on efficient benchmarking or benchmark prediction (subsampling instances in a benchmark such that evaluating on that subset with a particular method gives a similar result to evaluating on all instances). The paper presents experiments showing that simple methods, such as taking the mean of randomly sampled instances as the prediction, or simply using a linear regression of full performance (y) onto subsample performance (x), perform as well as, if not better than, many more sophisticated techniques, such as those using IRT. They consider two scenarios: interpolation, in which the source and target models come from the same population, and extrapolation, in which the target models have higher ability than the models used for benchmark prediction. In interpolation they find that instance selection isn't needed; simple baselines like the linear regression work great. In extrapolation they find that most methods perform poorly except for a new method, AIPW. The key insight is that benchmark prediction quality depends on the target model (for which we only evaluate a subset of instances) being similar to the observed population in terms of its item response pattern.

Strengths and Weaknesses

Strengths

  1. The paper's core findings (the weakness of myriad methods relative to random baselines) are compelling, and the experiments demonstrating this are convincing.

  2. The insight of identifying these two settings (interpolation and extrapolation) is very useful, showing where these methods work and where they break. It's a shift in thinking away from "must identify high-quality items that characterize a benchmark" toward thinking more about the set of models used to perform the prediction. This will change how people think in this area of benchmark prediction.

Weaknesses

Honestly, the paper is quite good; I don't see a reason to reject it. If I had to pick something, it would be some issues with framing:

The techniques in benchmark prediction (e.g., IRT) also have another motivation besides strictly worrying about efficient benchmarking. Methods that learn the relationship of item-level results to full benchmark results can help us better understand which items are difficult or noisy (so we might correct them if they are mislabeled) or produce a (weighted) aggregate score that is actually more reliable than the full benchmark score (not predicting the benchmark but making it even better). It’s definitely fine if the paper wants to consider this out of scope, but I think it should mention that there are other ways in which the critiqued methods are valuable beyond the scope considered in this work.

Questions

Figures 2 and 3: The figures provide very detailed information, but there isn't really a clear takeaway. It seems like there are two views here: a visualization to answer which benchmarks are easiest to estimate (sort rows by sum and just visualize the estimation gap), and a visualization to answer which estimation methods are the best (sort columns by sum and use the reduction over random-sampling). Also, having the takeaways written into the captions would be great.

Paragraph 1 in related work (line 96) seems a bit tangential to the paper since subsampling items doesn’t really do anything about annotation cost or LLM-as-judge reliability etc.

Lines 184-192: the source model set for different benchmarks is not the same. This makes it hard to compare which benchmarks are easy to predict, because this will depend a lot on the source model set. Can you characterize the population of models? What is the range of compute scales? What is the range of performances? How does this differ across the different model populations you use?

Line 202: how is information from the 100 random trials used? Are the estimated gaps in Figure 2 etc. actually the mean estimated gap over 100 trials? If so, it would be great to see a measure of variance over the trials as well.

One additional thing you could look at: for the methods that do select the set of core items, how similar are the sets of items they select across the different methods? How does that set differ between the interpolation and extrapolation settings? To be clear, the paper is already good without this, but I just think it would be an interesting additional line of inquiry.

Line 261: the explanation here could be given a succinct intuition before getting into the details of the notation, something like: c_obs is the observed agreement and c_exp is the chance agreement rate.
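(If the quantity here is the usual chance-corrected agreement, the intuition would be

\[
\kappa = \frac{c_{\mathrm{obs}} - c_{\mathrm{exp}}}{1 - c_{\mathrm{exp}}},
\]

i.e., the fraction of above-chance agreement actually achieved; stating something like this upfront would help.)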

I would love to see some of the figures from 4.4 in the body of the paper if there’s space in a camera ready.

Since the subsample size (n) is always in {50, 100, 200}, the ratio of n to N (the full benchmark size) can be very different for different benchmarks. This makes it rather difficult to compare which benchmarks are harder or easier to estimate. This paper is more focused on which prediction methods are better or worse across benchmarks. But if there are any benchmark-specific dynamics, do you think it might be better to size the subset as a constant ratio of the full benchmark size, so that each benchmark contributes similarly to the estimation of the prediction method’s success?

Limitations

Yes

Final Justification

I enjoyed this work, and most of my questions were informational in nature and were mostly answered. None of the clarifying answers, at least, would cause me to think any experiments (and thus findings) were invalid, and as such I don't feel the need to revise my rating.

Formatting Concerns

n/a

Author Response

Dear Reviewer iudt,

Thank you for your thoughtful and constructive review. We are encouraged that you found our core findings compelling, experiments convincing, and insights useful. We are glad that you recognized our potential impact on changing how people think about benchmark prediction. Below, we address your specific questions and suggestions.

Q1: Value of IRT methods. Thank you for highlighting the broader utility of IRT-based methods beyond efficient benchmarking. We agree that these methods can provide valuable insights, and we will acknowledge these additional motivations in the camera-ready version.

Q2: Presentation of Figures 2 and 3. Thank you for your suggestions regarding figure clarity. Our primary focus is to compare different benchmark prediction methods in both interpolation and extrapolation settings. We will clarify this in the camera-ready version and add explicit takeaways to the captions when we have more space.

We did not intend to compare the estimation difficulty of different benchmarks in these two figures, as this difficulty is largely determined by the score distribution. Consider the Random-Sampling method; the estimation difficulty for any target model is bounded by its standard deviation and the coreset size, as stated in line 271 of the paper. Consequently, the estimation difficulty for any benchmark largely depends on its model score distribution, since the standard deviation of a Bernoulli variable is $\sqrt{p(1-p)}$. If most models have very high or very low accuracy, the benchmark is easier to estimate; conversely, if accuracies are clustered around 50%, the benchmark is harder to estimate. We appreciate the suggestion and will incorporate this discussion into the revised version of the paper.
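To make the dependence concrete, for a coreset of size \(n\) the standard error of the Random-Sampling estimate of a model with accuracy \(p\) is

\[
\mathrm{SE} = \sqrt{\frac{p(1-p)}{n}},
\]

so with \(n = 50\), a model near \(p = 0.5\) has a standard error of about 7.1 percentage points, while a model near \(p = 0.9\) has about 4.2 points (illustrative numbers).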

Q3: Writing inaccuracy in paragraph 1 of the related work. We appreciate the comment and will revise this in the camera-ready.

Q4: Source model set not the same for different benchmarks. We mainly collect model scores from existing benchmarks, so it’s hard to use the same model list for all benchmarks. Furthermore, our work extends beyond LLM benchmarks; we also include ImageNet. We appreciate your suggestion and will include the model list and characteristics in the revised version. Please refer to the discussion about the comparison of benchmark estimation difficulties in response to Q2.

Q5: How is the information from 100 trials used? We report the average estimation gap over all target models in 100 trials, and the standard error is reported in Figures 6-11 in the Appendix. We thank you for the question and will be more explicit about this in the revised version.

Q6: Coresets selection similarity. We agree this is an interesting direction. However, we find it challenging to quantify the coreset similarity using a specific metric. This difficulty arises because we conducted 100 different random trials for each method, and each trial often involved a different selected coreset. For instance, on ImageNet, we examined the coreset selected by AnchorPoint under random seeds 0 and 1, and we found that only three data points overlapped out of 50.

Q7: Intuition of Notation. We appreciate your feedback on notation clarity, and we will provide more intuition in the revised version.

Q8: Placement of ablation results. We will move the results to the main body in the camera-ready version. Thanks for the suggestion.

Q9: Coreset size being a static number instead of a ratio. We appreciate your insightful question. In our experiments, the dataset sizes of the benchmarks vary significantly, ranging from 277 for RTE to 50,000 for ImageNet. Consequently, it is challenging to find a ratio that applies universally to all benchmarks. Additionally, as mentioned in the response to Q2, the difficulty of estimating the benchmark is partly influenced by the coreset size rather than the ratio. We believe it is more effective to compare benchmark estimation difficulties by keeping the coreset size consistent across all benchmarks, while this is beyond our primary focus.

Thank you again for your valuable feedback and suggestions. We hope these clarifications and planned revisions address your concerns, and we look forward to further discussion.

Final Decision

Summary

The paper evaluates existing benchmark prediction methods for efficient model evaluation, comparing simple techniques like random sampling with more complex methods across scenarios of interpolation and extrapolation. The findings suggest that basic approaches, such as linear regression on random samples, often outperform sophisticated methods like IRT-based ones, highlighting the importance of target model similarity in benchmark prediction quality (interpolation). The study underscores the limitations of current strategies in predicting the performance of stronger, unseen models, with AIPW emerging as a consistently advantageous method in this challenging extrapolation scenario.

Strengths

  • The observation that random sampling + linear regression can outperform sophisticated approaches for benchmark prediction is important and novel.
  • The insight that benchmark prediction performance can differ between interpolation and extrapolation further pinpoints the key challenges in the studied problem.
  • It identifies the advantages of AIPW in extrapolation scenarios.
  • Extensive experiments across a diverse set of benchmarks and methods.

Weaknesses

  • Evaluation metrics do not include rank correlation of the benchmark evaluations.
  • Missing analysis and comparison of benchmark prediction methods in more scenarios, e.g., when the number of source models or the size of the selected coreset is small.
  • There are additional related works with similar observations and baselines (e.g., multi-armed bandit) that need to be discussed or compared with.
  • More justification of the evaluation metrics and settings can strengthen the paper.
  • There are several places in the paper where the presentation can be improved.

Reasons to Accept

  • The paper presents an important discovery and a novel, comprehensive evaluation scheme of benchmark prediction.
  • The paper also points out strategies that may overcome the weaknesses of existing approaches.
  • Extensive experiments.
  • All reviewers support acceptance. They are satisfied with the rebuttal.

Discussion Summary

  • In the rebuttal, the authors provided more experimental results to answer the questions from the reviewers. Extensive discussions also help reviewers to better understand the details and potential impact of the work.
  • All reviewers responded with positive feedback to the paper and the rebuttal.