From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline
Arena-Hard-Auto: An automatic LLM benchmark.
Abstract
Reviews and Discussion
This paper proposes Bench-O-Matic, a pipeline for curating high-quality benchmarks (Eval-O-Matic) from large volumes of crowdsourced queries. The pipeline combines hierarchical clustering with a set of LLM-based filters keyed to “prompt quality” dimensions (e.g., complexity, specificity, domain knowledge). The resulting prompt set is evaluated with a novel suite of metrics—like confidence agreement—that emphasize how effectively a benchmark can distinguish models and align with human preferences. Experimental comparisons show Eval-O-Matic outperforms or rivals popular benchmarks such as MT-Bench and AlpacaEval on both model separability and alignment with human preference rankings, all at lower cost.
Questions for the Authors
Could you intuitively show why the proposed metrics can outperform Spearman correlation? Does it really matter, given your experiment on 20 models?
Claims and Evidence
Yes. The experiments are extensive.
Methods and Evaluation Criteria
Yes.
Theoretical Claims
NA.
Experimental Design and Analysis
Yes.
Supplementary Material
Yes. I reviewed them all.
Relation to Prior Work
The paper proposes a pipeline to construct benchmarks from crowdsourced data selected by an LLM with customized prompts, which is of practical value in general.
Important References Not Discussed
No.
Other Strengths and Weaknesses
- Clear, Straightforward Method: Despite involving multiple steps (clustering, filtering, LLM-based annotation, and evaluation), the overall pipeline is easy to follow.
- Novelty of Automated Curation: Reliance on LLMs for prompt selection is an interesting approach that addresses a real challenge—continual benchmark development without human-in-the-loop curation.
- Empirical Gains & Metrics: The benchmark produced (Eval-O-Matic) yields high model separability and near-human ranking alignment. The authors propose new measures (e.g., confidence agreement) to quantify a benchmark’s ability to differentiate and rank models reliably.
- Potential Impact: By open-sourcing the pipeline, others can frequently generate fresh benchmarks, mitigating the common pitfalls of data leakage and benchmark saturation.
Other Comments or Suggestions
- Reliance on LLMs in Multiple Stages: The pipeline depends heavily on LLMs both for scoring prompt quality and for evaluating final responses. While the authors attempt validation, additional depth—e.g., human judgments or expanded evidence for each “key quality”—could improve trust.
- Limited Generalization Evidence: Demonstration of generalizability is mainly restricted to one additional dataset (WildChat) with a relatively simple baseline for comparison.
- Missing Ablations and Details: The paper does not fully specify how many random seeds are used, how exactly bootstrapping is conducted, or how the confidence intervals are chosen, leaving readers uncertain about the exact procedure for computing separability and the sensitivity of that metric. Key design choices (like cluster count, weighting of quality dimensions, or dropping certain qualities) are not rigorously tested, which might leave readers unsure about the pipeline’s sensitivity to different configurations.
- Minor Presentation Issues: Figure 1 and the key qualities are unreferenced, which may impede clarity. More elaboration on the seven quality criteria and how they were selected would be helpful.
We thank the reviewer for the constructive feedback. Below, we address your concerns and propose our revisions.
W1: We acknowledge that our pipeline heavily utilizes LLMs for prompt quality scoring and response evaluation. To address this concern and enhance credibility, we validated annotation quality through majority voting among state-of-the-art models, achieving an 85.6% agreement rate with GPT-4-Turbo on 200 prompts in the validation set.
After reading your review, we integrated human judgments by manually labeling a subset (50 out of 200) of the validation set, yielding an 84.3% agreement between GPT-4-Turbo and human labels, and a 94.9% agreement between majority votes from LLMs and human labels. While OpenReview rules currently prevent us from updating during the discussion period, we assure you this important validation step will be included in the final manuscript to enhance reproducibility and credibility (in Section 4.2).
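For completeness, the agreement rates above are simple per-prompt label matches, with majority voting over the LLM annotators' binary labels; the sketch below is illustrative only (function and variable names are placeholders, not our exact code).

```python
def majority_vote(votes):
    """Binary majority label across LLM annotators for a single prompt."""
    votes = [bool(v) for v in votes]
    return sum(votes) > len(votes) / 2

def agreement_rate(labels_a, labels_b):
    """Fraction of prompts on which two annotation sources give the same label."""
    assert len(labels_a) == len(labels_b)
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)

# e.g., agreement_rate(gpt4_turbo_labels, human_labels) over the 50 labeled prompts,
# or agreement_rate([majority_vote(v) for v in llm_votes], human_labels).
```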
W2: We agree that demonstrating broader generalization is important. Although our primary experiments focused on Chatbot Arena, we also evaluated the pipeline on the WildChat dataset (Wild-O-Matic), obtaining comparable improvements in benchmark quality. Both datasets are crowdsourced from real-world users, aligning with our goal of creating benchmarks reflective of genuine user interactions. By validating our method on two distinct real-world datasets, we demonstrate sufficient generalization of our approach.
We further highlight that benchmark validation is resource-intensive, as each evaluation of 20 models incurs significant costs (~$600 per validation cycle). Hence, we strategically focused our ablations on different LLM judges, quality annotation models, and datasets (Section 4).
W3:
- We used a fixed random seed (42) across all experiments, including bootstrapping, and will explicitly document this detail in the revised manuscript (Section 6.1).
- Confidence intervals were calculated using 100 bootstrap iterations and are 95% unless specified otherwise; this clarification will also appear in Section 6.1 (a minimal sketch of the procedure is included below).
- We will document the minimum cluster size (8) and note that the number of resulting clusters varies with the dataset and the minimum cluster size (Section 4.1).
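For concreteness, the sketch below illustrates how per-model scores can be bootstrapped and confidence-interval overlap checked for separability; the toy data and helper names are placeholders rather than our exact implementation, but the seed (42), 100 bootstrap iterations, and 95% intervals match the settings described above.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed used across experiments

def bootstrap_ci(scores, n_boot=100, alpha=0.05):
    """Percentile bootstrap CI over per-prompt judge scores for one model."""
    scores = np.asarray(scores)
    means = [rng.choice(scores, size=len(scores), replace=True).mean()
             for _ in range(n_boot)]
    return np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])

def separability(score_matrix, n_boot=100):
    """Fraction of model pairs whose bootstrap CIs do not overlap."""
    cis = [bootstrap_ci(s, n_boot) for s in score_matrix]
    n = len(cis)
    non_overlapping = sum(
        cis[i][1] < cis[j][0] or cis[j][1] < cis[i][0]
        for i in range(n) for j in range(i + 1, n)
    )
    return non_overlapping / (n * (n - 1) / 2)

# Toy example: 3 models x 500 prompts of per-prompt scores in [0, 1].
toy_scores = rng.beta([8.0, 6.0, 5.0], 5.0, size=(500, 3)).T
print(f"separability: {separability(toy_scores):.2f}")
```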
Regarding ablations on dropping qualities, we clarify that our experiments (Section 4.3, Figure 3) effectively cover this aspect by demonstrating improved separation between strong and weak LLMs as the number of included qualities increases. As we drop qualities, we see less distinction between strong and weak LLMs (e.g. GPT-4 vs Llama-2-70B and Claude-Opus vs Claude-Sonnet). We will further emphasize this connection in the revision.
W4: Thank you for the suggestions. In our revised paper:
- Figure 1 will be referenced in the related work section.
- Elaboration on the selection of the quality criteria, and their specific contributions, will be detailed in Appendix C (p. 15).
W5: Spearman correlation only tells you whether two rankings share a similar order: if Model A is above Model B under one benchmark, does that same ordering hold under human judgments? However, it glosses over two crucial aspects:
- Confidence or “Separation” Among Models: Even if two benchmarks induce the same ordering of models, some may do so with very different levels of certainty. In practice, it matters whether “Model A beats Model B” is a large-margin result the benchmark can replicate consistently, or whether it’s a fragile result based on noise. Our “Separability” metric explicitly checks how often two models have non-overlapping confidence intervals in their benchmark performance. If a benchmark repeatedly yields overlapping intervals, you cannot reliably conclude that the top model is truly better.
- Magnitude of Differences (Beyond Rank): Our “Brier Score,” for example, rewards a benchmark not just for correctly ranking pairs of models, but also for assigning appropriate probabilities to those rankings. A pairwise victory by 90–10 is stronger evidence of a real performance gap than a 55–45, yet Spearman treats them identically so long as the order is preserved.
Even with 20 models, fine distinctions can be critical. Modern LLMs often cluster close to one another in quality, so you need to know whether a newly finetuned model is truly outperforming a similar rival or whether the difference is effectively within the margin of error. If your benchmark and evaluation metrics don't capture this level of granularity and confidence, you may end up training or deploying a model you think is better but is actually equal (or even worse) when considering statistical noise.
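To make this concrete, here is a small numeric sketch (illustrative numbers only; the exact pairing and aggregation in the paper may differ):

```python
import numpy as np

def pair_brier(pred_probs, human_outcomes):
    """Mean squared error between a benchmark's predicted P(model A beats B)
    and the human-preference outcome for each model pair."""
    pred = np.asarray(pred_probs, dtype=float)
    outcome = np.asarray(human_outcomes, dtype=float)
    return float(np.mean((pred - outcome) ** 2))

# Humans prefer A over B in both pairs below (outcome = 1).
outcomes = [1.0, 1.0]
confident = pair_brier([0.90, 0.92], outcomes)  # large-margin benchmark predictions
fragile   = pair_brier([0.55, 0.53], outcomes)  # barely-above-chance predictions
print(confident, fragile)  # ~0.008 vs ~0.212

# Both benchmarks yield the same ordering (hence identical Spearman correlation),
# but the Brier score rewards the one that predicts the gap with appropriate confidence.
```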
We will expand on this distinction clearly in the revised manuscript.
We greatly appreciate the reviewer’s constructive feedback, which significantly enhances the quality and clarity of our work. We respectfully ask the reviewer to reconsider their rating, given our revisions, clarifications, and the substantial potential impact of our contributions to the community.
The paper introduces a new method for automatic generation of robust, high‐quality benchmarks for evaluating LLMs. The approach is designed with key features: it controls for the style and length of generated content, shows strong alignment with human preference, and can be done in a cost- and time-efficient manner. Extensive experiments show that the generated benchmarks have significant advantages in terms of confidence agreement and separability compared to existing evaluation systems such as AlpacaEval, MTBench, and Chatbot Arena. However, the authors focus their method on open-ended and single-turn tasks.
Questions for the Authors
- Given the increasing focus of evaluations on more technical and difficult tasks (like reasoning and coding), how do you think this method extends to such benchmarks?
- Have you studied if the increased separability is the result of filtering difficult questions to include in the benchmark?
Claims and Evidence
The authors claim that their method automatically generates robust and high-quality benchmarks which:
- Control for style and length of generation
- Show strong agreement with human preference
- Can be created with reduced cost and time
Their extensive examinations show strong advantages in confidence agreement and separability (compared to AlpacaEval, MTBench, Chatbot Arena). Their results show that controlling for style and length does help reduce the bias in evaluation.
However, the method and experimental results are limited to open-ended tasks where the reference is human preference, which is a limitation since many competitive LLM benchmarks (e.g., olympiad math, legal or medical exams) rely on ground truth.
Methods and Evaluation Criteria
As mentioned before, their evaluation criteria make sense. Yet correlation with human preference is a limiting evaluation criterion.
Theoretical Claims
NA
Experimental Design and Analysis
Yes, I checked the validity of their experimental analysis. The experiments and ablations are well-designed and the results are carefully reported. However, results are limited to a small set of benchmark datasets. Expanding the experiments to include more difficult datasets would strengthen the paper.
Supplementary Material
Yes, I have reviewed all sections on Brier Score, Controlling for length and style, and prompt examples.
Relation to Prior Work
Focusing on evaluations, especially creating an automated evaluation framework which:
- controls for potential biases in evaluations
- provides a dynamically generated evaluation dataset which can mitigate test set contamination
- evaluates models on a diverse set of samples (expanding the test set distribution)
Important References Not Discussed
I am not aware of such references.
Other Strengths and Weaknesses
- The choice of Chatbot Arena as a reference for computing confidence agreement and separability is a bit limiting. Does this method extend to more difficult tasks, such as olympiad-level math or SWE-bench, where SoTA models in general do very poorly?
- Currently this setup is limited to single-turn evaluation. This is a big limitation because many LLMs today are used for agentic or multi-turn tasks.
Other Comments or Suggestions
When using the same family of models as the judge and annotator, the judge model usually shows a bias towards generations of the same family of models. Have you studied this? Would controlling for style reduce this bias?
We sincerely thank the reviewer for the valuable and insightful feedback. We address your concerns as follows:
"The choice of chatbot arena as a reference for computing confidence agreement and separability is somewhat limiting. Does this method extend to more challenging tasks, such as Olympiad-level mathematics or SWEbench, where state-of-the-art models generally perform poorly?"
We would like to clarify that while confidence agreement relies on human preference rankings as a reference, computing separability does not require human preferences or chatbot arena as a reference. Separability effectively captures a benchmark's ability to distinguish among model performances, a critical aspect as state-of-the-art models increasingly demonstrate similar levels of performance. This metric provides meaningful differentiation crucial for model developers.
We believe our method indeed extends to more challenging tasks. Our approach successfully extracts difficult tasks from vast unfiltered datasets, such as "PyTorch Autoencoder Implementation" and "Computability and Automata Theory" (see page 18, Figure 6). There, topic clusters are ranked for their complexity and other desirable qualities, and the highest-scoring clusters are selected for benchmark curation, suggesting our method aligns well with the current demand for more challenging tasks. We will make sure to clarify this connection in our revision.
"Currently, this setup is limited to single-turn evaluation, which is restrictive, as many contemporary LLM applications involve agentic or multi-turn interactions."
We acknowledge that our current method is indeed limited to single-turn evaluations. Multi-turn benchmarks are inherently challenging to automatically curate from crowdsourced datasets because responses beyond the initial turn can depend significantly on prior model outputs. Extending Bench-O-Matic to robustly support multi-turn evaluations represents a valuable future direction we intend to pursue.
"When using the same model family for both judge and annotator, the judge model often exhibits bias towards generations from the same family. Have you investigated this issue, and could style control potentially mitigate this bias?"
The reviewer raises an important point. We have studied and quantified biases arising from using judge and annotator models from the same family (see Section 6.6: Mitigating Self-Biases in LLM-Based Evaluation). While we have not yet specifically explored whether controlling for style mitigates this bias, we definitely plan to investigate this.
"Given the growing emphasis on evaluating technical and challenging tasks (such as reasoning and coding), how well do you believe your method extends to these benchmarks?"
Our analyses indicate that the clusters selected by our approach indeed prioritize more technically demanding tasks, such as reasoning and coding, over trivial ones (as demonstrated in Figure 6 on page 18). For example, clusters involving sophisticated topics score higher, increasing their likelihood of inclusion in benchmarks compared to simpler ones like "Flirty Texting Strategies." Thus, we believe our method aligns strongly with the trend towards evaluating models on more technical and challenging tasks.
"Have you studied whether the increased separability results from filtering for difficult questions within benchmarks?"
We directly investigated the impact of filtering for difficult questions on separability in our experiments (see Section 4.3, Figure 3). The results confirm that selecting tasks with higher difficulty indeed enhances the benchmark’s ability to differentiate between stronger and weaker LLMs, reinforcing the benefit of our filtering strategy.
We deeply appreciate the reviewer’s feedback and hope our responses fully address your concerns.
This paper introduces Bench-O-Matic, a pipeline that automatically constructs high-quality, large-scale benchmarks for evaluating LLMs from crowdsourced datasets such as Chatbot Arena. To measure the quality of these benchmarks, the authors proposed new metrics for properties that are important when curating the data. The authors also present Eval-O-Matic (and Wild-O-Matic), which were curated using this pipeline, and demonstrate higher separation of model performance compared to existing benchmarks, at a low cost.
Questions for the Authors
Prompts are filtered based on “quality scores” produced by an LLM. How was this threshold determined, and do you know how adjusting it could potentially affect the quality?
Claims and Evidence
The authors claim that this pipeline produces benchmarks that are high quality, which is supported by detailed empirical results showing higher model separation and correlation with human preferences compared to other well-known benchmark datasets used to evaluate LLMs. An analysis of the cost was also provided and supports the claim that the proposed method is cost-effective.
However, one large concern is that the experiments rely on the LLM-as-a-Judge evaluation framework, which, as noted in the paper, is known to exhibit certain biases. While the authors have attempted to mitigate some issues, it is still an inherent limitation that could influence the robustness of the results.
Methods and Evaluation Criteria
Yes, the methods are suitable for the problem at hand. The novel metrics proposed provide a more nuanced way of evaluating LLMs beyond traditional statistical measures such as the Spearman or Pearson correlations, which I agree allows for a more robust evaluation of LLM performance and of its agreement with human preferences. Re benchmark datasets: the experiments compare Eval-O-Matic with popular benchmarks such as MT-Bench and Chatbot Arena, which are suitable for the task.
Theoretical Claims
There are no formal proofs of any of the claims in the paper. The authors provide descriptive explanations of the metrics proposed and those are based on existing statistical foundations and are largely sound.
Experimental Design and Analysis
The experiments compare Eval-O-Matic to several well-known benchmark datasets, and also test it using the top 20 LLMs from Chatbot Arena, comparing them using existing statistical metrics as well as the newly proposed ones. These are thorough designs and provide compelling empirical results. The measures to prevent LLM-induced bias were appreciated; however, as mentioned above, there are still concerns regarding the limitations of LLM-as-a-Judge evaluations and how robust they are.
Supplementary Material
Yes, I reviewed the material in the appendix. They provide additional technical details, evaluation findings and specific information regarding the implementation of Bench-O-Matic which support the main paper.
Relation to Prior Work
The paper contributes to the domain of LLM evaluation and benchmark curation. It aims to extend benchmarking beyond traditional static benchmarks such as MMLU with ground-truth-based evaluation, or even live benchmarks such as Chatbot Arena, by introducing an automated curation method that allows for evaluation on open-ended tasks. By automating this process, Bench-O-Matic also aims to address the existing issue of test set leakage by being able to frequently update benchmarks.
The new evaluation metrics, which supplement traditional statistical metrics, could prove important for measuring properties that existing metrics cannot, contributing to the ability to better measure the performance of LLMs in future work.
The authors also use methods such as style control and introduce Ensemble-as-Judges to improve the LLM-as-a-Judge framework, which could help improve its reliability.
Important References Not Discussed
N/A
Other Strengths and Weaknesses
Strengths:
- Bench-O-Matic offers a scalable, cost-effective, and automated approach to curate benchmarks, making it extremely impactful in practice
- The new evaluation metrics introduced provide better evaluation and separability of LLMs, which would be crucial in helping advance future work in LLM development and benchmarking
Weaknesses:
- Bench-O-Matic is currently rather limited to single-turn, English interactions; the authors acknowledge this limitation, but it does raise questions about the generalization of this pipeline to various real-world applications
- The reliance on LLMs in various parts of the pipeline may introduce bias, which I believe has to be further investigated and regulated
Other Comments or Suggestions
N/A
We sincerely thank the reviewer for the thoughtful and constructive feedback. Below, we address each concern raised and propose revisions:
"However, one large concern is the fact that the experiments rely on the LLM-as-a-Judge evaluation framework, which, as noted in the paper, is known to exhibit certain biases. While the authors have attempted to mitigate some issues, it is still an inherent limitation which could influence the robustness of the results."
We agree that biases inherent to the LLM-as-a-Judge evaluation framework pose challenges. Nevertheless, it remains the most practical and cost-effective method to evaluate LLMs on diverse, open-ended real-user queries. Besides implementing multiple bias-mitigation strategies described in the paper, we highlight that our benchmark demonstrates greater robustness compared to similar LLM-as-a-Judge evaluations, as evidenced by the significantly higher agreement with human preference and stronger separability.
"Bench-O-Matic is currently rather limited to single-turn, English interactions. The authors acknowledge this limitation; however, it does raise questions about the generalization of this pipeline for various real-world applications."
We acknowledge that our current method is indeed limited to single-turn, English evaluations. Multi-turn benchmarks are inherently challenging to automatically curate from crowdsourced datasets because responses beyond the initial turn can depend significantly on prior model outputs. The authors are also primarily proficient in only English. However, extending Bench-O-Matic to robustly support multi-turn and multilingual evaluations represents a valuable future direction we intend to pursue.
"The reliance of LLMs as various parts of the pipeline may introduce bias which I believe has to be further investigated and regulated"
We recognize the reviewer’s concern regarding potential biases introduced by using LLMs throughout the evaluation pipeline. To address this, we validated the quality and reliability of our annotations using majority voting among several state-of-the-art models (GPT-4o, Claude-3-Opus, Gemini-1.5-Pro), achieving an 85.6% agreement with GPT-4-Turbo on 200 prompts in the validation set (as detailed in Section 4.2).
After reading your review, we additionally conducted human validation by manually annotating 50 prompts from our validation set. These showed an 84.3% agreement between GPT-4-Turbo and human annotations, and notably, a 94.9% agreement between human annotations and majority LLM votes. Due to OpenReview's policy restricting updates during the discussion phase, these important validation results will be fully detailed in the final manuscript (Section 4.2) to enhance Bench-O-Matic's reproducibility and credibility.
"Prompts are filtered based on 'quality scores' produced from an LLM. How was this threshold determined, and do you know how adjusting this could potentially affect the quality?"
To clarify, we did not rely on LLM-generated numeric "quality scores." Instead, the LLM was instructed to provide binary judgments indicating whether each prompt met specific qualitative criteria (e.g., “Problem-Solving,” “Domain Knowledge”). The prompts and clusters that satisfy most of the criteria are selected for benchmark curation. We will clarify this distinction in Section 4.1 of our revision. Furthermore, our ablation analysis in Section 4.3 (Figure 3) demonstrates that including more qualitative criteria enhances the differentiation between stronger and weaker models (e.g., GPT-4 vs. Llama-2-70B, Claude-Opus vs. Claude-Sonnet), which explains the enhanced separability of our final benchmarks (e.g., Eval-O-Matic, Wild-O-Matic) relative to other popular benchmarks. We will make sure to clarify this connection in our revisions.
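As a hedged illustration of this selection rule (the criterion names beyond those quoted above, the threshold, and the data structures are placeholders rather than our exact implementation):

```python
QUALITIES = [
    "specificity", "domain_knowledge", "complexity", "problem_solving",  # named in the paper
    "creativity", "technical_accuracy", "real_world_relevance",          # placeholder names
]

def prompt_score(judgments: dict) -> int:
    """Count how many binary quality criteria the LLM annotator marked as satisfied."""
    return sum(bool(judgments.get(q, False)) for q in QUALITIES)

def select_clusters(clusters: dict, min_mean_score: float = 6.0) -> list:
    """Keep topic clusters whose prompts satisfy most criteria on average."""
    selected = []
    for cluster_id, prompts in clusters.items():
        mean_score = sum(prompt_score(p["judgments"]) for p in prompts) / len(prompts)
        if mean_score >= min_mean_score:
            selected.append(cluster_id)
    return selected
```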
We greatly appreciate the reviewer’s constructive feedback, which improves the quality and clarity of our work. We respectfully ask the reviewer to reconsider their rating, given our revisions, clarifications, and the substantial potential impact of our contributions to the community.
This paper introduces Bench-O-Matic, an automated pipeline for curating high-quality benchmarks from large-scale crowdsourced datasets, and Eval-O-Matic, a benchmark dataset generated using this pipeline. The motivation is that existing benchmarks are either static (leading to saturation and test-set leakage) or require expensive human curation. Bench-O-Matic extracts prompts from datasets like Chatbot Arena and WildChat-1M, applying seven quality criteria (e.g., specificity, domain knowledge, problem-solving complexity) to filter high-quality prompts. The resulting benchmark, Eval-O-Matic, achieves 3× better model separation than MT-Bench and 98.6% correlation with human preference rankings at a fraction of the cost ($20 per evaluation). The paper also introduces new benchmark evaluation metrics—Separability with Confidence, Agreement with Confidence, and Pair Rank Brier Score—to assess benchmark effectiveness.
Questions for the Authors
- How does Bench-O-Matic handle adversarial noise? Crowdsourced datasets may contain low-quality or adversarial prompts—how does your system filter these?
- What is the impact of training data contamination? If LLMs are trained on Chatbot Arena-style queries, could this inflate correlation scores?
- Why not compare against LiveBench or R2E? These benchmarks also focus on dynamic evaluation—how does Bench-O-Matic differ?
- How would the method adapt to multimodal benchmarks? Can Bench-O-Matic curate image- or video-based prompts?
Claims and Evidence
The claim that Bench-O-Matic extracts high-quality prompts aligned with human preferences is supported by Eval-O-Matic’s rankings showing 98.6% correlation with Chatbot Arena, which reflects real user interactions. However, this assumes Chatbot Arena rankings are a gold standard, whereas prior work (Carlini et al., 2021) suggests human preference data may contain inconsistencies and biases. The authors should evaluate how much these factors impact benchmark quality.
The authors argue that existing metrics (e.g., Spearman correlation) fail to measure model separation and propose confidence-based alternatives. This assumes that LLM judges are unbiased. While the paper introduces ensemble-based methods to mitigate bias, it does not analyze failure cases where LLM judges systematically misrank models.
Methods and Evaluation Criteria
Benchmark Design: The hierarchical clustering method (BERTopic, UMAP, HDBSCAN) for grouping prompts is reasonable, but the paper lacks qualitative validation of cluster quality.
Comparison with Prior Work: The evaluation compares Eval-O-Matic with MT-Bench and AlpacaEval, but does not include LiveBench or R2E, which also focus on dynamic benchmarking.
Theoretical Claims
No formal proofs, but the paper’s proposed metrics (Separability with Confidence, etc.) are well-motivated. One potential issue is with the bootstrapping methods used for confidence estimation, since the statistical robustness of these methods in this setting is not fully analyzed.
Experimental Design and Analysis
- The clustering-based filtering approach ensures high-quality prompts, but the paper lacks qualitative analysis of outliers and failure cases.
- Several ablations are performed, like different LLM judges, controlling for stylistic biases, and testing on alternative datasets.
- But the paper seems to be missing some failure analysis, like discussion of when and why low-quality prompts are selected.
- There is a minor assumption that LLM costs remain stable, whereas API pricing may change.
Supplementary Material
I reviewed the additional experiments, e.g., the style-controlled comparisons.
Relation to Prior Work
The paper builds on MT-Bench, AlpacaEval, and Chatbot Arena but improves by introducing an automated pipeline. This work is similar to LiveBench and R2E, though Bench-O-Matic focuses on prompt curation rather than live model evaluation.
Prior work has shown LLM-based evaluation correlates with human judgments, but the failure cases of LLM judges (e.g., hallucinations, self-reinforcement biases) are underexplored.
Important References Not Discussed
None that I'm aware of.
Other Strengths and Weaknesses
Other weaknesses:
- The pipeline is tested only on Chatbot Arena and WildChat-1M, with no evaluation on scientific, legal, or programming benchmarks.
- The authors use ensemble LLM judges but do not analyze cases where LLMs systematically misrank models.
- The paper does not discuss when Bench-O-Matic selects poor-quality prompts or how often prompt selection fails.
Other Comments or Suggestions
Including more qualitative examples of prompts selected by Bench-O-Matic would be helpful.
We sincerely thank the reviewer for the thoughtful and constructive feedback. We address your concerns and propose corresponding revisions:
W1: We do not treat Chatbot Arena’s user-vote-based rankings as an unquestionable “gold standard.” Instead, we and other works (e.g., Dubois et al., Lin et al.) view it as a practical "silver standard" worth approximating. Although Chatbot Arena likely contains biases and noise, it currently represents the largest publicly available dataset of human evaluations, with over two million interactions. Hence, benchmarks demonstrating high agreement and correlation with Chatbot Arena rankings provide valuable guidance to model developers. We will include this in our revisions.
W2: Our critique of Spearman correlation does not depend on assuming unbiased LLM judges. Rather, confidence-based metrics provide clarity on two critical questions:
- When benchmarks predict model performance, do these predictions align with actual user preferences?
- Can benchmarks effectively distinguish between similar model checkpoints?
If LLM judges exhibit systematic biases, this would naturally result in lower confidence, agreement, and correlation metrics. Conversely, if our evaluation demonstrates high confidence, agreement, correlation, and separability, it affirms the benchmark's usefulness.
We acknowledge potential biases inherent to LLM-based evaluations and address them through multiple strategies:
- Style Control: As detailed in Section 6.5, our methods significantly reduce biases toward particular styles or longer responses, improving benchmark alignment with human preference rankings (a minimal sketch of this idea follows below).
- Bias Diagnostics: We systematically compare judge rankings with human leaderboards across multiple scenarios, using metrics like confidence agreement to quantify biases and misalignments.
We agree that detailed exploration of systematic biases is important and plan to expand our analysis.
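To make the style-control point concrete, below is a hedged sketch of one standard way such control can be implemented: adding style covariates (e.g., a normalized response-length difference) to a Bradley-Terry-style logistic regression so that model strengths are estimated with stylistic effects partialled out. The exact features and fitting procedure used in Section 6.5 may differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def style_controlled_bt(n_models, pairs, outcomes, style_features):
    """Bradley-Terry-style logistic regression with style covariates.

    Each battle is encoded as +1 for the first model and -1 for the second,
    concatenated with style features; the per-model coefficients then estimate
    strength while separate coefficients absorb stylistic biases.
    """
    X_models = np.zeros((len(pairs), n_models))
    for i, (a, b) in enumerate(pairs):
        X_models[i, a], X_models[i, b] = 1.0, -1.0
    X = np.hstack([X_models, np.asarray(style_features, dtype=float)])
    clf = LogisticRegression(fit_intercept=False, C=1e6, max_iter=1000).fit(X, outcomes)
    return clf.coef_[0][:n_models], clf.coef_[0][n_models:]

# Toy usage: 4 battles among 3 models; one style feature = length difference.
pairs = [(0, 1), (0, 2), (1, 2), (1, 0)]
outcomes = [1, 1, 0, 0]                       # 1 = first model in the pair won
length_diff = [[0.4], [0.1], [-0.3], [-0.2]]  # (len_A - len_B), normalized
strengths, style_coefs = style_controlled_bt(3, pairs, outcomes, length_diff)
```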
W3: Our primary focus was comparing Eval-O-Matic with benchmarks like MT-Bench and AlpacaEval, which similarly utilize LLM-based judges for open-ended tasks without ground-truth references (rather than LiveBench, for example). While we recognize that LLM-as-a-Judge inherently introduces challenges, this method remains the most scalable and affordable approach for model developers evaluating open-ended prompts.
W4: We agree with the reviewer regarding the presence of low-quality and unsafe prompts within crowdsourced datasets. In Section 4.1 (line 196), we explicitly describe our process for filtering prompts and retaining only high-quality clusters based on average quality scores. Figures 4 and 6 (Section 4.3) illustrate how clusters are differentiated qualitatively, with higher scores correlating with complexity and relevance, while lower scores correspond to trivial or ambiguous prompts. Further details and examples are provided in Appendix C.
Moreover, our ablation study (Section 4.3, Figure 3) demonstrates how the qualitative criteria effectively differentiate strong and weak models, explaining the enhanced separability of the benchmarks Bench-O-Matic produces compared to other benchmarks. We will clarify these points further in the revision.
W5: We agree with the importance of evaluating benchmarks in specialized domains (scientific, legal, programming). However, our core contribution specifically targets diverse, real-world, open-ended interactions. Benchmarks such as Chatbot Arena, MT-Bench, and AlpacaEval are widely used precisely because they reflect authentic user interactions. Hence, Eval-O-Matic remains valuable for assessing practical, downstream performance. Nevertheless, we recognize this limitation and will clearly outline it in our manuscript.
W6: We have validated Bench-O-Matic’s performance beyond Chatbot Arena by also evaluating on WildChat-1M. We also ensured that no Eval-O-Matic prompts overlap with publicly released Chatbot Arena data. If a model trained on similar queries generally improves at addressing real-world interactions, we anticipate performance gains across both Eval-O-Matic and actual deployment settings.
W7: While our current framework focuses exclusively on text-based evaluations, it could be adapted to multimodal benchmarks by substituting LLM annotators and evaluators with VLMs. Indeed, Chou et al. have successfully applied a similar approach to the Vision Arena dataset, demonstrating the feasibility of extending Bench-O-Matic’s principles to image-based tasks.
Chou et al. "VisionArena: 230K Real World User-VLM Conversations with Preference Labels."
Lin et al. "Wildbench: Benchmarking llms with challenging tasks from real users in the wild."
We greatly appreciate the reviewer’s constructive feedback, which significantly enhances the quality and clarity of our work. We respectfully ask the reviewer to reconsider their rating, given our revisions, clarifications, and the substantial potential impact of our contributions to the community.
The paper introduces Bench-O-Matic, an automated pipeline that uses large language models (LLMs) to generate high-quality, open-ended prompts from large, crowd-sourced datasets, enabling continuous benchmark updates without human involvement. The idea is timely and relevant, especially given the growing need for scalable benchmarking in the LLM era.
While the reviewers appreciated the motivation and potential of the work, several shared concerns were raised. These include:
- The inherent biases and limitations of using LLMs as evaluators ("LLM judges"),
- The lack of ablation studies to isolate the contributions of different pipeline components,
- The limited set of evaluation datasets, which makes it difficult to assess the generalizability of the approach.
The authors provided a thoughtful rebuttal and addressed many of these concerns with promising insights and proposed experiments. Reviewers remained positive following the rebuttal. The AC recommends acceptance and encourages the authors to incorporate the additional experiments and clarifications discussed in the rebuttal into a revised version of the manuscript.