Thank you for your constructive and encouraging review -- we appreciate your comments and the time you took to evaluate our work!

Benefits of benchmark refinement

Thank you for raising this matter.

First, while a single evaluation run may appear inexpensive relative to pretraining, evaluation becomes significantly more costly when it is repeated frequently, as is typical in modern LM workflows. For example, evaluating a single model across the full HELM benchmark suite can cost over 4,000 GPU hours or $10,000 in API usage (Liang et al., 2023). Costs escalate further in in-loop setups, where frequent evaluations are needed during pretraining. In fact, when pretraining a 1B model and evaluating every 100 steps using the OLMES evaluation framework (Gu et al., 2025), the evaluation time is longer than the pretraining time.

Second, benchmark refinement is not only about efficiency, but also about improving other aspects, particularly validity, variance, and saturation -- critical properties for meaningful model evaluation. Gains in these dimensions can be orthogonal to efficiency: for example, even when evaluating on all items of a benchmark, Fluid Benchmarking puts less weight on mislabeled items, leading to a higher validity empirically (see the next point below).

We appreciate the opportunity to clarify and will include a more explicit discussion of these points in the final version of the paper.

Full-benchmark evaluation

This is an excellent question -- thank you for raising it.

When evaluation is performed on all items of a benchmark, the dynamic item selection of Fluid Benchmarking is neutralized. In this scenario, our method effectively reduces to the Random IRT baseline, which applies IRT-based ability estimation across all items. Crucially, even in such a full evaluation regime, IRT-based ability estimation offers significant advantages over raw accuracy. Specifically, while accuracy treats all items equally, IRT takes into account item difficulty and assigns less weight to mislabeled items.

Empirically, this distinction is reflected by consistent gains across validity, variance, and saturation when comparing IRT-based ability estimation vs. standard accuracy on all items:

Evaluation measure	Accuracy	Ability (IRT)
Validity (rank distance; lower is better)	9.1	7.9
Variance (total variation; lower is better)	10.0	8.7
Saturation (rank correlation; higher is better)	0.85	0.88

Comparing these numbers with Fluid Benchmarking (Table 2 in the paper), we observe that Fluid Benchmarking achieves similar (validity, saturation) or even better results (variance) than the full-benchmark IRT ability estimate, with much fewer items. The advantage becomes even more striking when comparing Fluid Benchmarking to full-benchmark accuracy: with just 50 items, Fluid Benchmarking outperforms full-benchmark accuracy across all three evaluation dimensions: validity (8.8 vs. 9.1), variance (6.5 vs. 10.0), and saturation (0.86 vs. 0.85). In other words, even if efficiency is not a concern, Fluid Benchmarking can improve evaluation quality.

We agree that these results provide a valuable contextualization, and we will include them in the final version of the paper.

Generalization beyond train models

Please note that IRT ability estimates are not upper bounded by the abilities of the train models (i.e., the models used to estimate item parameters). For instance, if the most capable train model correctly answered items from a benchmark, a new model that correctly answers those plus at least one additional item will receive a higher ability estimate.

That being said, you are absolutely right that fixed IRT models can be limited in how they adapt to better models. Specifically, consider the subset of items of a benchmark that were not answered correctly by any train model. These items are effectively assigned the same maximum difficulty. If we conduct Fluid Benchmarking with a model that is better than any train model, evaluation will quickly move to those most difficult items. However, the fixed IRT model cannot distinguish finer levels of difficulty among them. Therefore, in this extreme region, ability estimation reduces to something close to accuracy over those hardest items, rather than leveraging graded item properties to make finer distinctions.

This does not invalidate the IRT estimates and hence is not a terrible mode of failure, we believe, but it should be clearly understood by users of the method. Future work could explore extensions such as dynamic recalibration of item parameters. We will include a discussion in the final version of the paper.