PaperHub
Overall rating: 7.8/10 (Oral) · 8 reviewers
Individual ratings: 8, 6, 10, 6, 6, 8, 10, 8 (min 6, max 10, std 1.6)
Confidence: 3.5 · Correctness: 3.5 · Contribution: 3.1 · Presentation: 3.4
ICLR 2025

Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates

OpenReview · PDF
Submitted: 2024-09-26 · Updated: 2025-03-13
TL;DR

We show that null models that always return the same cheating responses can achieve high win rates on automatic LLM benchmarks.

Abstract

Automatic LLM benchmarks, such as AlpacaEval 2.0, Arena-Hard-Auto, and MT-Bench, have become popular for evaluating language models due to their cost-effectiveness and scalability compared to human evaluation. Achieving high win rates on these benchmarks can significantly boost the promotional impact of newly released language models. This promotional benefit may motivate tricks, such as manipulating model output length or style to game win rates, even though several mechanisms have been developed to control length and disentangle style to reduce gameability. Nonetheless, we show that even a **"null model"** that always outputs a **constant** response (*irrelevant to input instructions*) can cheat automatic benchmarks and achieve top-ranked win rates: an $86.5\%$ LC win rate on AlpacaEval 2.0; an $83.0$ score on Arena-Hard-Auto; and a $9.55$ score on MT-Bench. Moreover, the crafted cheating outputs are **transferable** because we assume that the instructions of these benchmarks (e.g., $805$ samples of AlpacaEval 2.0) are *private* and cannot be accessed. While our experiments are primarily proof-of-concept, an adversary could use LLMs to generate more imperceptible cheating responses, unethically benefiting from high win rates and promotional impact. Our findings call for the development of anti-cheating mechanisms for reliable automatic benchmarks. The code is available at https://github.com/sail-sg/Cheating-LLM-Benchmarks.
Keywords
Large Language Models · Cheating · Automatic LLM Benchmarks

Reviews and Discussion

Official Review
Rating: 8

The paper evaluates the effectiveness of automatic benchmarks for language models, while posing the new and interesting question of whether you can cheat on these benchmarks. The authors show that even a simple “null model” can manipulate these benchmarks, achieving impressive win rates—9.55 on MT-Bench for instance. This demonstrates vulnerabilities in the benchmarks, as crafted outputs can bypass their intended controls. The study emphasizes that these outputs can be transferred across benchmarks, raising ethical concerns about unfair advantages. Ultimately, the authors advocate for developing anti-cheating mechanisms to ensure the reliability of automatic evaluations.

Strengths

  • This paper is built on a thoughtful and straightforward idea, which is executed well.
  • The authors describe (perhaps for the first time?) a threat model of cheating on LLM-graded evaluations. This model will be useful for other researchers. I believe that this work has a higher-than-average potential for impact within the field.
  • I find the authors have done a decent job at evaluating the obvious questions in this space. They provide thorough detail on the types of questions that are natural to ask about cheating in LLM-graded evaluations. For example, they ask what types of null-model responses are effective, how to craft the null-model responses, etc. The authors could do a better job of looking at more models (see below), but this work gives a great foundation for posing an interesting problem and interrogating specific and interesting questions.
  • I particularly appreciate that the authors did not just identify a potential vulnerability in the LLM-graded evaluation space, but they also looked at mitigation techniques (Section 6). This is an often missing part of papers like this, and I am glad the authors have spent time to explore this space initially.

Weaknesses

  • I’m sure the authors would agree that their analysis only applies to LLM-graded forms of auto-evaluations. There have been more recent benchmarks looking at auto-evaluation of LLMs based on ground truth scoring. Benchmarks like LiveBench and datasets like MATH, WikiQA, etc are such examples. While this doesn’t invalidate the argument, it is an important point that I feel the authors should include, and rather prominently, in the introduction, related work, and conclusion of the paper. For example, throughout the paper, the authors describe “auto-annotators” where they really mean “LLM-based auto-annotators.” This would also be a good thing for the authors to mention that cheating on benchmarks like this would be an area for future study since the paradigm is so different. Even better, the authors could pose a few suggestions for this future research direction to inspire the community to pursue this direction as well as the LLM-graded cheating.
  • The authors could improve on the clarity of the presentation of this work in some places. For example:
    • The authors introduce for the first time “optimized adversarial suffix” in the first sentence of Section 3 where they are presenting the results of that method. It opens a lot of questions which go un-answered: what suffixes did they test? how did they test them?, etc. I’d suggest the authors perhaps restructure this section to guide the reader more directly through these strategies. (See last bullet for suggested restructure.)
    • See 3rd question below
  • I feel like the authors could be more careful in their language in places. For example, they title the subsection at L296 “Is the structured response useful on open-source auto-annotators?” however the authors only look at Llama-3-8/70B. This hardly represents open source models. I suggest the authors make a pass and tone down any language that is overly broad and unsupported by their data.
  • I think structurally in the paper, the authors can/should dedicate more space in this work to a discussion of their findings. I suggest the authors pick up space in the Related Work section by moving some of it to the Appendix. Then specifically, take more care in Section 3 to detail the differing ways in which response can be generated, describing that ablation/initial experiment in more detail, and then providing additional explanations and hypotheses throughout Section 4-6. The current draft reads more as a presentation of detailed results, and less about a thoughtful discussion of what those results mean in the context of their inquiry. Incorporating answers to the questions below (and those of the other reviewers) would be a good place to start.

Minor point:

  • L210-211 deserves a citation, something like this

Questions

  • In reference to Figure 3 and knowing the GPT4 prefers the first response in auto-evals, can the authors speculate on the impact of their approach (“Ours”) on this phenomenon?
  • Can the authors describe why they think there is the following difference between GPT4 and Llama-3-70B: both 70B and GPT4 show positional bias, but only GPT4 is susceptible to the structured cheating method? I don’t particularly buy into the argument in L309-311 because the index 0 method doesn’t really look like an instruction to follow.
  • Can the authors also clarify whether the Index 0 method, as shown in Figures 3 and 4 and Table 6 is the same for all models? Put another way, is index 0 fixed throughout?
Comment

Thank you for your supportive review and suggestions. Below we respond to the comments in Weaknesses (W) and Questions (Q).


W1: Include auto-evaluation of LLMs based on ground truth scoring.

Indeed, our research focuses on LLM-graded auto-evaluations rather than benchmarks based on ground-truth scores. In the Introduction (Page 2), we add the footnote statement that ''Our analyses focus on LLM-based auto-annotators rather than ground-truth auto-scoring of LLMs.'' We have also emphasized LLM-based evaluations in Preliminaries and Related Work.

As proposed in Q3 of Reviewer 2MD2, this type of cheating pattern could also be used in ground-truth-based auto-evaluations. Specifically, we could ensemble multiple models and optimize adversarial suffixes to increase their averaged log-probability on the wrong options. These adversarial suffixes in wrong options can make benchmarks like MMLU more difficult and assess models' ability to resist irrelevant information. As shown in [1], advanced LLMs cannot even resist non-adversarial irrelevant information in reasoning tasks.

[1] GSM-Symbolic: Limitations of Mathematical Reasoning in LLMs


W2 (a): The first sentence of Section 3 introduces for the first time “optimized adversarial suffix”, what suffixes did they test? How did they test them?

The ineffective adversarial suffix and our structured response are shown in Figure 19 (Page 34). Both of them are optimized by random search to minimize $-\log p(\mathtt{winner}=\mathtt{NullModel})$.

In the default setting described in Figure A (Page 4), where the reference model's responses are placed first and our submitted responses are placed second, the objective is $-\log p(\mathtt{gpt\_output}=\mathtt{M})$ on each test instruction, computed from the logits returned by the GPT-4 API. Similarly, for the swapped position described in Figure 8 (Page 20), the objective is $-\log p(\mathtt{gpt\_output}=\mathtt{m})$. We ensemble these losses over multiple instructions and over both the default and swapped positions to find a universal prefix using RS.
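For concreteness, here is a minimal sketch of how such a per-instruction loss can be read from the top-20 log-probabilities returned by the OpenAI chat completions API; the model name, helper name, and prompt construction are illustrative assumptions rather than our exact implementation.

```python
import math
from openai import OpenAI

client = OpenAI()

def judge_loss(judge_prompt: str, target_token: str = "M") -> float:
    """Return -log p(first judge token == target_token), e.g. target "M" in the
    default ordering and "m" in the swapped ordering (names illustrative)."""
    resp = client.chat.completions.create(
        model="gpt-4-1106-preview",      # illustrative judge model
        messages=[{"role": "user", "content": judge_prompt}],
        max_tokens=1,
        temperature=0,
        logprobs=True,
        top_logprobs=20,                 # the API exposes at most the top-20 candidates
    )
    for cand in resp.choices[0].logprobs.content[0].top_logprobs:
        if cand.token == target_token:
            return -cand.logprob
    return -math.log(1e-12)              # target absent from top-20: assign a large loss
```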


W2 (b) & Q3: Is the Index 0 method, as shown in Figures 3 and 4 and Table 6, the same for all models? Put another way, is index 0 fixed throughout?

Yes, index 0 remains fixed throughout. Table 6 (Page 23) shows the mapping between index id and response.


W3: Be more careful in their language in places. The authors only look at Llama-3-8/70B. This hardly represents open source models.

Thank you for your kind suggestions. In the final revision, we will thoroughly polish our claims to ensure they are precise and not overstated.

In terms of open-source auto-annotators, we expand our evaluation beyond Llama-3-8/70B to include Mistral-7B-Instruct and SOLAR-10.7B-Instruct, with the results shown in Table 9 (Page 30). By broadening the range of models evaluated, we provide a more comprehensive demonstration of our method's effectiveness while also emphasizing its potential applicability to various models in the open-source ecosystem. We will add experiments on even more open-source auto-annotators in the final revision.


W4: I think structurally in the paper, the authors can/should dedicate more space in this work to a discussion of their findings.

Thank you for the constructive suggestions! We have moved Related Work into Appendix A, and added Figure A (Page 4) into Section 3 to further elaborate the idea behind our structured response. We also describe more details of the cheating mechanisms in the captions.

Due to time constraints, we may not completely reorganize our paper during rebuttal, but we will thoroughly polish it and provide additional details/discussions in the final revision.


Minor W5: L210-211 deserves a citation

Thank you for the reference; we've cited and discussed this paper accordingly.

Comment

Q1: In reference to Figure 3 and knowing the GPT4 prefers the first response in auto-evals, can the authors speculate on the impact of their approach (“Ours”) on this phenomenon?

Figure A (Page 4) and its caption show that under the default configuration, the auto-annotator will prefer the first response, suggesting a preference for the first-position response. This highlights the position bias of the GPT-4-based auto-annotator, which often favors the first response.


Q2: Describe why the authors think there is the following difference between GPT4 and Llama-3-70B.

We use the official AlpacaEval 2.0 LC win rates to measure the instruction-following ability, and then study the relationship between structured response success (measured by $-\log p(\mathtt{winner}=\mathtt{NullModel})$) and the instruction-following ability. As shown in Figure 20 (Page 34), we find that as the instruction-following ability grows, the optimization objective $-\log p(\mathtt{winner}=\mathtt{NullModel})$ decreases.

Comment

Thank you for your work improving this paper based on the feedback of all the reviewers. I appreciate your pdf updates and clarifications. I have updated my score.

One last note though: I'm not following your response here:

Figure A (Page 4) and its caption show that under the default configuration, the auto-annotator will prefer the first response, suggesting a preference for the first-position response. This highlights the position bias of the GPT-4-based auto-annotator, which often favors the first response.

and the corresponding caption of Figure A

Finally, when the auto-annotator is successfully deceived into believing the two model outputs are the same (i.e., both are empty), it will prefer the first one and return “M” as the best model identifier.

From my reading, you're making a logical jump here which is inappropriate. It isn't clear to me that the LLM would (from first principles not data) prefer the first response from the prompt in Figure A. With my question, I'm asking for data to support this claim and a hypothesis as to why. It seems equally plausible that the LLM would prefer the second response under a misleading and adversarial injection like in Figure A.

Comment

We appreciate your detailed feedback and suggestions, which greatly help us to improve our work!


Regarding the last note, as described in Lines 265-268, we were motivated by previous research (Wang et al. 2023) showing that GPT-4 auto-annotators prefer the first response. On AlpacaEval 2.0, we further validate this position bias by setting both output_1 and output_2 to be empty, and observe that $\log(\textrm{win}=\textrm{first})=-1.65$ and $\log(\textrm{win}=\textrm{second})=-3.43$. We will better clarify this point in the final revision. Thank you again!

Ref: Wang et al. Large language models are not fair evaluators. 2023

Official Review
Rating: 6

This paper explores a cheating method to deceive LLM-based evaluation benchmarks. Specifically, it demonstrates that a particular cheating template, which exploits the standardized templates used in these evaluations, can be an effective attack. Additionally, the authors introduce an optimization technique for crafting a prefix that, when combined with the cheating template, increases the performance of null responses. The paper emphasizes the need for more robust development of LLM-based evaluations to prevent such cheating.

Strengths

  • The paper demonstrates that even a null model, which always outputs the same response, can manipulate results on benchmarks that use LLMs for evaluation, highlighting the need to improve the robustness of these benchmarks.
  • The paper is easy to read and follow, and the high-level process of creating a cheating prompt along with its optimization is well explained.
  • The paper also explores some solutions to this cheating issue such as perplexity filtering, although these do not perform well.

Weaknesses

  • Although demonstrating that a simple fixed response can exploit the evaluation benchmark is novel and exciting, I believe it is also a limitation of the paper in terms of practicality. In real-world scenarios, many open-source models on leaderboards are also evaluated qualitatively by a lot of users. If a model produces an obvious cheating prompt, people would likely disregard it. It would have been better if the paper had provided a scenario illustrating how this finding could be applied in real settings and how it might mislead actual users of LLMs. Without this, the notion that LLM evaluation is imperfect and susceptible to manipulation is already widely known, making the paper’s message seem incremental, despite showing vulnerability to simple cheating.
  • I think the paper lacks some details on the methods. For example, in Figure 1, what is the difference between the text in blue and the text in black? It seems that the blue text represents a prefix optimized version, but this is unclear in the caption and figure. Additionally, what loss function is used in the universal random search? There is no commentary on each part of the algorithm, which affects clarity. Moreover, how could we design a new structured cheating method for a new benchmark based on LLM evaluation?

Questions

  • For the paraphrased result, what if we use a random selection among multiple paraphrased samples during evaluation? Is it still susceptible to cheating?
  • In terms of a solution, what if the evaluation LLMs are informed about the possibility of cheating and are therefore prompted to prioritize the first instruction?

Please refer to the weaknesses section for critical questions.

Comment

W2 (c): Moreover, how could we design a new structured cheating method for a new benchmark based on LLM evaluation?

The inspiration came from analyzing benchmark weaknesses and understanding how LLMs respond to structured prompts, with an emphasis on predictable behaviors or biases such as positional sensitivity. Similar studies on prompt engineering and adversarial testing also provided valuable insights. The starting point was to create simple response templates and refine them through manual experimentation. Our structured cheating responses evolved in three stages:

  • The first stage focused on developing responses that were effective for a single instruction in a single positional setup (either default or swapped). This included analyzing model behavior for individual instructions, identifying performance patterns, and iteratively refining prompts according to observed weaknesses or strengths.

  • The second stage expanded this approach to include multiple instructions while keeping a single positional setup. Insights from the first stage were used to ensure that the structured responses were clear and effective across a wide range of instructions, balancing generality with specificity. This stage required rigorous testing with a variety of instruction types, as well as ongoing refinements to improve performance.

  • The third stage addressed the issue of position invariance, aiming to create responses that worked equally well whether the positions were defaulted or swapped. This necessitated modifying the structure to neutralize positional changes via symmetry or redundancy, followed by extensive testing and iterative adjustments to ensure robustness across both settings.

The three-stage process includes clear steps and methods, making it reproducible for others. Future work could use reinforcement learning or LLM-based optimizers (such as DSPy [1] or TextGrad [2]) to automate parts of this manual process, reducing effort and improving scalability.


Q1: For the paraphrased result, what if we use a random selection among multiple paraphrased samples during evaluation?

As shown in Table 5 (Page 10), our cheating is effective across multiple paraphrased prompt templates and reaches over 90% LC win rates on each rewritten template. Randomly selecting among paraphrased templates would simply yield the average win rate of these templates and would thus be ineffective at defending against our cheating.
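As a small worked version of this averaging argument (the numbers are hypothetical): if the per-template LC win rates are $w_1, \dots, w_K$ and a template is sampled uniformly at random for each evaluation, the expected win rate is

$$\mathbb{E}[w] \;=\; \frac{1}{K}\sum_{k=1}^{K} w_k,$$

so if every $w_k$ exceeds 90%, the randomized defense still concedes an expected win rate above 90%.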


Q2: What if the evaluation LLMs are informed about the possibility of cheating and are therefore prompted to prioritize the first instruction?

Following your suggestion, we add a safety reminder, ''You should prioritize the first instruction'', to both the system and user messages of the judge template. As shown in Table 8 (Page 30), the Self-Reminder, which prompts the model to prioritize the first instruction, is slightly effective but cannot fully suppress the win rates of our structured-response cheating.
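For illustration, a minimal sketch of how such a reminder might be prepended to the judge messages; the function and variable names are assumptions for this sketch, not the benchmark's actual template code.

```python
SAFETY_REMINDER = "You should prioritize the first instruction."

def build_judge_messages(system_prompt: str, user_prompt: str) -> list[dict]:
    """Prepend the Self-Reminder to both the system and user messages of the judge."""
    return [
        {"role": "system", "content": f"{SAFETY_REMINDER}\n\n{system_prompt}"},
        {"role": "user", "content": f"{SAFETY_REMINDER}\n\n{user_prompt}"},
    ]
```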

Comment

Thank you for your valuable review and suggestions. Below we respond to the comments in Weaknesses (W) and Questions (Q).


W1 (a): Limitation of the paper in terms of practicality in real-world scenarios.

Thank you for your kind words about our novelty. As you mentioned, we tackle the most challenging scenario: a 0KB null model that only produces a constant output, without accessing test instructions. Such a non-informative constant response would certainly receive a zero win rate from a human user, yet it achieves the highest win rate from the auto-annotators. This illustrates our key point: automated benchmarks are far easier to cheat than previously assumed, which is the core message we aim to convey with this paper.

In real-world scenarios, the adversary will not use a null model; instead, the adversary may rely on an advanced LLM and optimize its outputs using LLM-based optimizers (such as DSPy [1] or TextGrad [2]) to adversarially maximize win rates or reward scores, similar to the mechanisms described in the adversarial literature [3, 4]. LLM-based optimizers produce semantic adversarial outputs that are mostly imperceptible to human users. Furthermore, if the test instructions are kept private, the adversary could optimize the outputs on, e.g., UltraFeedback and then post-train the LLM using preference optimization methods such as DPO.

[1] https://github.com/stanfordnlp/dspy
[2] TextGrad: Automatic "Differentiation" via Text
[3] Jailbreaking Black Box Large Language Models in Twenty Queries
[4] AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models


W1 (b): That LLM evaluation is imperfect and susceptible to manipulation is already widely known.

We agree that LLM evaluation is widely known to be imperfect and susceptible to manipulation. However, most related studies in the LLM-as-a-Judge community focus on issues like data contamination [5], output length [6], style [7], and format biases [8], with little attention given to adversarial attempts. This motivated us to bridge the gap between the LLM-as-a-Judge and adversarial research communities, demonstrating how well-established adversarial techniques can be adapted to cheat LLM judges while potentially benefiting from the promotional impact.

As a red-teaming paper, we hope that this research will encourage the development of anti-cheating mechanisms, particularly from the adversarial perspectives, to ensure reliable LLM-based evaluations.

[5] Benchmark Data Contamination of Large Language Models: A Survey
[6] Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators
[7] Does style matter? Disentangling style and substance in Chatbot Arena
[8] From Lists to Emojis: How Format Bias Affects Model Alignment


W2 (a): Lacks some details on the methods. For example, in Figure 1, what is the difference between the text in blue and the text in black?

In the paper revision, we highlight our cheating response in Figure A (Page 4, default setting) and Figure 8 (Page 20, swap setting). This cheating response is manually crafted and independent of the optimizable adversarial prefix. We describe the cheating mechanisms in the captions of Figures A and 8.


W2 (b): Additionally, what loss function is used in the universal random search?

The loss function used to guide the Random Search (RS) is $-\log p(\mathtt{winner}=\mathtt{NullModel})$. In the default setting described in Figure A (Page 4), where the reference model's responses are placed first and our submitted responses are placed second, the objective is $-\log p(\mathtt{gpt\_output}=\mathtt{M})$ on each test instruction, computed from the logits returned by the GPT-4 API. Similarly, for the swapped position described in Figure 8 (Page 20), the objective is $-\log p(\mathtt{gpt\_output}=\mathtt{m})$. We ensemble these losses over multiple instructions and over both the default and swapped positions to find a universal prefix using RS.

Comment

Thank you so much for the kind and thorough rebuttal, and I apologize for the late reply. I have a few remaining questions.

For W1 (b), although the method is different, is the core message—that LLMs are susceptible to adversarial prompts—substantially different from [1] in the related work?

For Q1, I’m a bit confused. Regarding the template: did you perform a random search across multiple paraphrase templates to find one universal cheating sample? Or did you find a cheating sample for each paraphrased version individually? I feel my question might be interpreted differently based on this answer.

[1] Raina et al. "Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment"

Comment

Thank you for your feedback! Below we respond to the remaining questions.


For W1 (b), [1] applies a universal adversarial perturbation $\delta_{[1]}$ to LLM responses $x$ and feeds $x+\delta_{[1]}$ to the judges. In contrast, our null models directly feed the cheating response $\delta_{\textrm{ours}}$ to the judges. Please note that the scenario in [1] is much easier than ours, because the LLM responses $x$ are already informative for the input instructions, and the role of $\delta_{[1]}$ is simply to further improve the judge scores. In Table 7 (Page 30), we evaluate the effectiveness of [1] under our null model setup and, as seen, [1] achieves zero win rates.

For Q1, we perform a random search across multiple paraphrase templates to find one universal cheating sample.

Comment

Dear Reviewer ELbG,

Thank you for your insightful feedback, and we have provided responses for your remaining questions. If you find our responses satisfactory, we appreciate it if you could kindly revise your rating based on your updated assessment of our work.

Thank you again for your valuable input!

Best,
The Authors

Comment

Thank you for your detailed response.

While I remain somewhat skeptical about the practicality of the proposed adversarial attack, I see no strong reason to reject this paper. The finding is still exciting and adds value to the field.

Thank you again for your hard work.

Thus I have decided to raise my score to 6.

Comment

Thank you for your constructive review and encouraging words. We really appreciate it! In our final revision, we will polish the paper further to incorporate the valuable insights gained from the rebuttal discussions. Thank you again!

Official Review
Rating: 10

Traditional benchmarks are, in general, manually curated and static (not updated after release). Automatic LLM benchmarks fix several issues that come with static benchmark datasets, can improve over time due to automatic curation, and can be less costly. This paper explores different ways in which a model can cheat these benchmarks, and emphasizes the necessity for counter-mechanisms. The method involves a model that responds with constant outputs. Two strategies are discussed: (1) structured cheating responses, which confuse the judge by replacing the original comparison, and (2) adversarial prefixes optimized with random search. It is shown that the structured-responses method achieves SOTA across all benchmarks used in the experiments.

Strengths

  • The idea that Automatic LLM benchmarks can be cheated is clearly demonstrated.
  • Using the proposed method, SOTA scores are achieved for all the benchmarks tested, highlighting the core goal of the paper.
  • The jailbreak method exhibits some robustness because of the optimized prefix strategy, making it transferable as pointed out in the text.
  • The authors discuss potential anti-cheating strategies.

Weaknesses

  • The paper could benefit from additional comparator methods.

Questions

  • Would it be possible to create additional baselines for different degrees of structured response complexity?
  • How much manual iteration did you do to reach the final structured responses?
Comment

Thank you for your supportive review and suggestions. Below we respond to the comments in Weaknesses (W) and Questions (Q).


W1: The paper could benefit from additional comparator methods.

Following your suggestions, we adapt several existing attacks on LLM-as-a-judge for our ''NullModel'' experimental setup (the model can only return a constant response and test instructions are kept private). These adaptations are made to ensure that our approach can be directly compared to prior work, providing fair evaluations. As shown in Table 7 (Page 30), we first observe that existing attacks yield near-zero win rates, demonstrating their ineffectiveness in our ''NullModel'' experimental setup.


Q1: Additional baselines for different degrees of structured response complexity.

Following your suggestions, we create two additional structured responses with different degrees of complexity, as shown in Figure 15 (Page 28, medium complexity) and Figure 16 (Page 29, low complexity). We then evaluate these two baselines. As shown in Table 7 (Page 30), the results from structured responses with varying levels of complexity show that a sufficiently complex structure is required to achieve high win rates. This emphasizes the significance of response structure in increasing the success of cheating.


Q2: How much manual iteration did you do to reach the final structured responses?

The inspiration came from analyzing benchmark weaknesses and understanding how LLMs respond to structured prompts, with an emphasis on predictable behaviors or biases such as positional sensitivity. Similar studies on prompt engineering and adversarial testing also provided valuable insights. The starting point was to create simple response templates and refine them through manual experimentation. Our structured cheating responses evolved in three stages:

  • The first stage focused on developing responses that were effective for a single instruction in a single positional setup (either default or swapped). This included analyzing model behavior for individual instructions, identifying performance patterns, and iteratively refining prompts according to observed weaknesses or strengths.

  • The second stage expanded this approach to include multiple instructions while keeping a single positional setup. Insights from the first stage were used to ensure that the structured responses were clear and effective across a wide range of instructions, balancing generality with specificity. This stage required rigorous testing with a variety of instruction types, as well as ongoing refinements to improve performance.

  • The third stage addressed the issue of position invariance, aiming to create responses that worked equally well whether the positions were defaulted or swapped. This necessitated modifying the structure to neutralize positional changes via symmetry or redundancy, followed by extensive testing and iterative adjustments to ensure robustness across both settings.

The three-stage process includes clear steps and methods, making it reproducible for others. Future work could use reinforcement learning or LLM-based optimizers (such as DSPy [1] or TextGrad [2]) to automate parts of this manual process, reducing effort and improving scalability.


References:
[1] https://github.com/stanfordnlp/dspy
[2] TextGrad: Automatic "Differentiation" via Text

Comment

Thank you for your responses.

Q1 and W1: Table 7 looks great, thank you for adding my suggestion. It is interesting how results can be gradually improved for prompts at different levels of complexity. Given that SOTA results in our field increasingly differ by single digit scores, this suggests to me that in some cases one could use some hybrid form of cheating where a capable model is mixed with a low complexity cheating scheme when the completion is not good enough. While I agree with Reviewer ELbG that this method can be easily detected during qualitative user reviews, I still think it is highly relevant to the general evaluations discussion as automated LLM benchmarks become more popular.

Q2: I think the paper could benefit from a short appendix discussion like the one in your response.

The authors have taken the time to include my suggestions as well as those of the other reviewers. I will raise my score to a 10 and hope this paper is highlighted at the conference.

I also want to commend the authors for their fantastic approach to the review process and their detailed responses to the other reviewers.

Comment

We appreciate your detailed feedback and suggestions, which greatly help us to improve our work!

Indeed, a hybrid form of cheating could be carried out by mixing the outputs of a capable model with a low complexity cheating prompt, which may be optimized by LLM-based optimizers like DSPy and TextGrad to adversarially maximize win rates or reward scores. LLM-based optimizers produce semantic cheating prompts that are mostly imperceptible to human users. Furthermore, if the test instructions are kept private, the adversary could optimize the cheating prompts on, e.g., UltraFeedback and then post-train the LLM using preference optimization methods such as DPO.

We will also include the discussion of Q2 into the final revision of our paper. Thank you again!

Official Review
Rating: 6

The paper investigates the vulnerability of automatic LLM evaluation benchmarks such as AlpacaEval 2.0, Arena-Hard-Auto, and MT-Bench. These benchmarks, which use LLM-based auto-annotators like GPT-4, aim to provide scalable alternatives to human evaluation. The authors introduce a “null model” that outputs a constant, irrelevant response, demonstrating that even non-informative, structured responses can achieve deceptively high win rates. By incorporating structured responses with a carefully optimized adversarial prefix generated through random search, this null model exploits weaknesses in benchmark evaluation, achieving competitive results without access to benchmark instructions. To address these vulnerabilities, the authors test anti-cheating strategies, including template paraphrasing and perplexity-based filters, though they find these defenses insufficient on their own. The study underscores the need for more advanced anti-cheating mechanisms, as current methods do not fully prevent adversarial manipulation in automatic LLM benchmarks.

Strengths

  1. The paper presents an original approach by introducing a “null model” that generates irrelevant yet structured responses, revealing a new vulnerability in LLM evaluation benchmarks. This novel perspective highlights how even non-informative outputs can achieve high win rates, setting this work apart from prior adversarial attacks that rely on relevant responses to manipulate evaluations.
  2. The null model is evaluated across multiple benchmarks like AlpacaEval 2.0, Arena-Hard-Auto, and MT-Bench.
  3. The inclusion of an anti-cheating section adds value by addressing practical defenses against the observed vulnerabilities. The authors discuss template paraphrasing and perplexity-based filtering, though they recognize that these are limited in effectiveness. Therefore, they suggest actionable strategies for improving benchmark robustness and encourage the development of more secure evaluation standards.

Weaknesses

  1. The structured response method, while effective, relies on manual crafting without sufficient explanation of the process, which affects the soundness and generalizability of the approach. A more detailed description of how the structured responses were developed—such as the specific guidelines or iterative steps followed—would improve clarity and allow for reproducibility.
  2. Although the paper demonstrates that random search (RS) enhances the success of adversarial prefixes, it lacks clarity on the specific optimization criteria and alternative strategies that could guide this process. Providing more detail on the objective and discussing potential alternative optimization algorithms would improve the soundness of the approach. This expanded discussion would also enhance clarity, helping readers understand why RS was chosen and how it compares to other possible methods in achieving competitive results.
  3. While the paper distinguishes the null model approach from prior efforts in attacking LLM-based evaluations, it would be beneficial to include a comparison with existing algorithms applicable to the evaluation settings examined here or to clarify why those methods may not apply. This additional context could better position the novelty of the work, reinforcing its contribution by distinguishing it within the broader landscape of adversarial techniques for LLM evaluation.

Questions

  1. Could you describe the process used to manually craft the structured responses? Specifically, what inspired your initial approach, and how did you refine these responses to enhance their effectiveness against benchmarks?
  2. What was the ineffective adversarial suffix mentioned in L152/153, and how was it optimized?
  3. What is the loss shown on the y-axis of Figure 2?
  4. What’s the optimization objective of the adversarial prefix with random search? The objective for this random search is not clearly defined in Section 3 or Algorithm 1. Could you specify what loss function or evaluation metric was used to guide this optimization?
  5. Is there a reference error to Table 5 in L312/313? The current reference to Table 5 appears to be incorrect—should it instead reference Table 4?
  6. What results do you observe when using only the adversarial prefix without the structured response? Would the adversarial prefix alone achieve competitive win rates, and if so, could it be more effectively countered through template rephrasing?
  7. Could you provide additional details on the setting where the structured cheating response is combined with normal responses? How was the cheating response appended to the original responses, and were the results specifically achieved with the Structured+RS approach?
  8. Would the PPL filter be effective for open-source auto-annotators? Based on Figure 6, it seems the PPL filter may be insufficient for GPT-based auto-annotators. Could you provide insights into whether the PPL filter would work more effectively against open-source models like Llama-3-8B-Instruct and Llama-3-70B-Instruct, which may exhibit different perplexity thresholds?
Comment

Q2: What was the ineffective adversarial suffix mentioned in L152/153, and how was it optimized?

The ineffective adversarial suffix and our structured response are shown in Figure 19 (Page 34). Both of them are optimized by random search to minimize $-\log p(\mathtt{winner}=\mathtt{NullModel})$. As seen in Figure 19, the major difference is whether or not a response is structured, which highlights the importance of response structure in boosting the success of cheating.


Q3: What is the loss shown on the y-axis of Figure 2?

The loss shown on the y-axis of Figure 2 is $-\log p(\mathtt{winner}=\mathtt{NullModel})$.


Q5: Is there a reference error to Table 5 in L312/313?

Thank you for pointing out the typo. Yes, it should refer to Table 4; we fixed it in the paper revision.


Q6: What results do you observe when using only the adversarial prefix without the structured response?

Using only the adversarial prefix without the structured response corresponds to the Suffix line shown in Figure 2 (Page 5), which is ineffective and results in almost zero win rates.


Q7: Additional details where the structured cheating response is combined with normal responses? How was the cheating response appended to the original responses, and were the results specifically achieved with the Structured+RS approach?

The related experiments are shown in Figure 5 (b) and (d) (Page 9): at Step 0, the win rates are near 50% because GPT-3.5-Turbo-0613's responses are high-quality and meaningful, in contrast to the near-zero win rates of our NullModel setting. Additionally, applying Random Search further improves the win rates to nearly 100%, which demonstrates the effectiveness of our Structured+RS approach.


Q8: Would the PPL filter be effective for open-source auto-annotators?

We set the perplexity threshold based on GPT-4-1106-Preview's responses because GPT-4-1106-Preview is the default reference model, and we need to ensure that the PPL filter does not filter out its responses. Additionally, from the benchmark maintainer's side, it is impractical to set a low PPL threshold because this may filter out normal responses and result in too many false-positive cases.
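For reference, a minimal sketch of the kind of perplexity filter discussed here, using an open-source causal LM from Hugging Face Transformers; the model choice and threshold handling are illustrative assumptions rather than the exact filter used in our experiments.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Meta-Llama-3-8B"  # assumption: any open causal LM works here
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    """Perplexity of `text` under the filter model (exp of the mean token NLL)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss  # mean cross-entropy over tokens
    return torch.exp(loss).item()

def passes_ppl_filter(response: str, threshold: float) -> bool:
    """Reject submissions whose perplexity exceeds a threshold calibrated on the
    reference model's (e.g., GPT-4-1106-Preview's) responses."""
    return perplexity(response) <= threshold
```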

Comment

Thank you for your supportive review and suggestions. Below we respond to the comments in Weaknesses (W) and Questions (Q).


W1&Q1: A more detailed description of how the structured responses were developed.

The inspiration came from analyzing benchmark weaknesses and understanding how LLMs respond to structured prompts, with an emphasis on predictable behaviors or biases such as positional sensitivity. Similar studies on prompt engineering and adversarial testing also provided valuable insights. The starting point was to create simple response templates and refine them through manual experimentation. Our structured cheating responses evolved in three stages:

  • The first stage focused on developing responses that were effective for a single instruction in a single positional setup (either default or swapped). This included analyzing model behavior for individual instructions, identifying performance patterns, and iteratively refining prompts according to observed weaknesses or strengths.

  • The second stage expanded this approach to include multiple instructions while keeping a single positional setup. Insights from the first stage were used to ensure that the structured responses were clear and effective across a wide range of instructions, balancing generality with specificity. This stage required rigorous testing with a variety of instruction types, as well as ongoing refinements to improve performance.

  • The third stage addressed the issue of position invariance, aiming to create responses that worked equally well whether the positions were defaulted or swapped. This necessitated modifying the structure to neutralize positional changes via symmetry or redundancy, followed by extensive testing and iterative adjustments to ensure robustness across both settings.

The three-stage process includes clear steps and methods, making it reproducible for others. Future work could use reinforcement learning or LLM-based optimizers (such as DSPy [1] or TextGrad [2]) to automate parts of this manual process, reducing effort and improving scalability.

[1] https://github.com/stanfordnlp/dspy
[2] TextGrad: Automatic "Differentiation" via Text


W2&Q4: Providing more detail on the objective of Random Search (RS) and discussing potential alternative optimization algorithms.

The Greedy Coordinate Gradient (GCG) method is commonly used to optimize adversarial responses. However, GCG requires computing gradients via the LLM, which is not feasible with GPT-4 because it only returns the top-20 token probabilities for each next-token prediction. To overcome this limitation, we replace GCG with random search, a proven alternative in previous studies [3].

The loss function used to guide the Random Search (RS) is $-\log p(\mathtt{winner}=\mathtt{NullModel})$. In the default setting described in Figure A (Page 4), where the reference model's responses are placed first and our submitted responses are placed second, the objective is $-\log p(\mathtt{gpt\_output}=\mathtt{M})$ on each test instruction, computed from the logits returned by the GPT-4 API. Similarly, for the swapped position described in Figure 8 (Page 20), the objective is $-\log p(\mathtt{gpt\_output}=\mathtt{m})$. We ensemble these losses over multiple instructions and over both the default and swapped positions to find a universal prefix using RS.

[3] Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks
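For completeness, a compressed sketch of what the universal random search loop described above could look like; judge_loss is assumed to return $-\log p(\mathtt{gpt\_output}=\mathtt{M})$ (or $\mathtt{m}$ in the swapped position) for a given prefix and instruction, and the token pool, prefix length, and iteration count are placeholders rather than our actual hyperparameters.

```python
import random

def ensemble_loss(prefix: str, instructions: list[str], judge_loss) -> float:
    """Average the judge loss over all instructions and both response orderings."""
    return sum(
        judge_loss(prefix, inst, swap=swap)
        for inst in instructions
        for swap in (False, True)
    ) / (2 * len(instructions))

def universal_random_search(instructions, judge_loss, vocab, prefix_len=20, n_iters=500):
    """Mutate one prefix token at a time with a random candidate, keeping mutations
    that lower the ensembled loss (greedy hill climbing, no gradients required)."""
    prefix = [random.choice(vocab) for _ in range(prefix_len)]
    best = ensemble_loss(" ".join(prefix), instructions, judge_loss)
    for _ in range(n_iters):
        cand = list(prefix)
        cand[random.randrange(prefix_len)] = random.choice(vocab)
        loss = ensemble_loss(" ".join(cand), instructions, judge_loss)
        if loss < best:
            prefix, best = cand, loss
    return " ".join(prefix), best
```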


W3: Include a comparison with existing adversarial algorithms applicable to the LLM evaluation settings.

Following your suggestions, we adapt several existing attacks on LLM-as-a-judge for our ''NullModel'' experimental setup (model can only return a constant response and test instructions are kept private). These adaptations are made to ensure that our approach can be directly compared to prior work, providing fair evaluations.

As shown in Table 7 (Page 30), we first observe that existing attacks yield near-zero win rates, demonstrating their ineffectiveness in this experimental setup. Furthermore, the results from structured responses with varying levels of complexity show that a sufficiently complex structure is required to achieve high win rates. This emphasizes the significance of response structure in increasing the success of cheating.

Official Review
Rating: 6

The paper studies null models that output a constant response but take advantage of failures in LLM-based judging in benchmarks. The response is crafted with a template that takes into account its placement either at the start or the end of the context, and contains some additional tokens that are optimized with random search to obtain the desired outcome. The null model is able to cheat its way to a very high win rate on LLM benchmarks, and the authors include some analysis and ablation studies at the end.

Strengths

  • The template is shown to be quite high "performing" on various benchmarks
  • While the template alone does not work for all models, like Llama-3, with the additional optimized tokens (i.e. jailbreak attack) it can still reach high efficacy rates (so the randomized search is to some degree necessary)
  • The authors include additional studies showing that summarization and perplexity filters are insufficient automated techniques for preventing this type of cheating. So the developed template and search algorithm have some evidence of being robust to trivial solutions.

Weaknesses

  • The main technical contribution is the construction of a template plus randomized search, the latter of which comes from the jailbreaking literature. The findings are likely of more interest than the technique.
  • As I understand it, any human that looks at any of the answers outputted by this technique would immediately see that it is wrong. So the cheating technique is only effective on fully automated benchmarks where no human is checking any outputs. Unless we have fully closed-source automated benchmarks (where nobody is allowed to inspect the outputs), I suspect that the fear of being called out by the rest of the community would prevent researchers from carrying out such cheating techniques in practice. So I view these results as more proof-of-concept and academic rather than a current problem affecting automated benchmarks today, especially since an LLM company's goal of doing well on benchmarks is to create a better LLM, rather than cheat the benchmark and not have a product.

Questions

  • In the paper, transferability is mainly discussed in terms of transferring between instructions. To what degree do the tokens found with randomized search to jailbreak the benchmark transfer between different judges?
Comment

Thank you for your supportive review and suggestions. Below we respond to the comments in Weaknesses (W) and Questions (Q).


W1: The findings are likely of more interest than the technique.

We agree that, technically, our proposed method is not new in the adversarial literature. However, most related studies in the LLM-as-a-Judge community focus on issues like data contamination, output length/style, and format biases, with little attention given to adversarial attempts. This motivated us to bridge the gap between the LLM-as-a-Judge and adversarial research communities, demonstrating how well-established adversarial techniques can be adapted to cheat LLM judges while potentially benefiting from the promotional impact. We hope that this study will encourage the development of anti-cheating mechanisms to ensure reliable LLM-based evaluations.


W2 (a): Any human that looks at any of the answers outputted by this technique would immediately see that it is wrong.

As you pointed out, our null models are primarily academic and proof-of-concept. We tackle the most challenging scenario: a 0KB null model that only produces a constant output, without accessing test instructions. Such a non-informative constant response would certainly receive a zero win rate from a human user, yet it achieves the highest win rate from the auto-annotators. This illustrates our key point: automated benchmarks are far easier to cheat than previously assumed, which is the core message we aim to convey with this paper.

In practical scenarios, the adversary will not use a null model; instead, the adversary may rely on an advanced LLM and optimize its outputs using LLM-based optimizers (such as DSPy [1] or TextGrad [2]) to adversarially maximize win rates or reward scores, similar to the mechanisms described in the adversarial literature [3, 4]. LLM-based optimizers produce semantic adversarial outputs that are mostly imperceptible to human users. Furthermore, if the test instructions are kept private, the adversary could optimize the outputs on, e.g., UltraFeedback and then post-train the LLM using preference optimization methods such as DPO.


W2 (b): An LLM company's goal of doing well on benchmarks is to create a better LLM, rather than cheat the benchmark and not have a product.

We totally agree that the goal of doing well on benchmarks should be to create a better LLM. Unfortunately, it’s an open secret that some LLM companies would overfit their models (via data contamination or other tricks) to achieve high scores on benchmarks, while real user experiences are usually inferior.

Given the expensive and sometimes unaffordable trial-and-error cost of developing a better LLM, it is possible that some LLM companies/startups/institutions will be motivated to cheat on benchmarks in order to, for example, have high promotion impacts and successfully raise capital. Instead of trusting everyone to be honest, we advocate for anti-cheating mechanisms to ensure that no one can cheat.


Q1: To what degree do the tokens found with randomized search to jailbreak the benchmark transfer between different judges?

Following your suggestions, we investigate the judge-level transferability of our structured response, which is optimized for GPT-4 and then transferred to a different judge model, GPT-3.5. In this experiment, we try to transfer the response directly to GPT-3.5, but the results are disappointing, as the response fails to produce significant success on this model, as shown in Table 10 (Page 31). This finding raises important questions about what strategies could be developed to work across multiple judge models with varying capabilities. We will make this one of our primary goals in future work.


References:
[1] https://github.com/stanfordnlp/dspy
[2] TextGrad: Automatic "Differentiation" via Text
[3] Jailbreaking Black Box Large Language Models in Twenty Queries
[4] AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

Official Review
Rating: 8

This paper shows the surprising result that benchmarks based on LLM evaluations can be attacked by a "null model" which outputs a constant response. The authors adopt random search methods from the universal adversarial attack literature in the NLP domain to optimize the "constant response" output by the "null model" to attack automatic benchmarks. Notably, the proposed "null model" achieves a SOTA LC win rate of 86.5% on AlpacaEval 2.0. Although the proposed "constant responses" are obviously perceptible, the authors emphasize the risk of imperceptible cheating mechanisms against automatic benchmarks and the importance of developing anti-cheating mechanisms for automatic benchmarks.

Strengths

  • Although the proposed optimization method is not new, the empirical results showing a high win rate (86.5%) provided in this paper are surprising.
  • The writing is clear and the paper is easy to read. The authors provide sufficient implementation details for reproducing the results.
  • Automatic benchmarks with LLM evaluations are widely used due to their high alignment with human evaluations and low cost. Highlighting the potential risk of these benchmarks is important for maintaining trustworthiness.

Weaknesses

  • The proposed examples of "constant responses" are too random. It would be better if there were an example with higher naturalness that attacks the automatic benchmarks. Even with a lower win rate than 86.5%, providing imperceptible and natural examples that are harder to detect could raise greater awareness among readers.
  • The proposed method is not new. I agree that this work is valuable research, but this paper may be more suitable as a "position paper" at NLP conferences.

Questions

Please refer to the "Weaknesses" section.

Ethics Concerns

The authors discuss possible ethics concerns in the paper ("Ethics Statement" on page 11).

Comment

Thank you for your supportive review and suggestions. Below we respond to the comments in Weaknesses (W).


W1: The proposed examples of "constant responses" are too random.

In the paper revision, we highlight our cheating response in Figure A (Page 4, default setting) and Figure 8 (Page 20, swap setting). Our cheating response is manually crafted and semantic, and it is independent of the optimizable adversarial prefix.

To create more natural cheating patterns, the adversary may optimize the response using LLM-based optimizers (such as DSPy [1] or TextGrad [2]) to adversarially maximize win rates or reward scores, similar to the mechanisms described in the adversarial literature [3, 4]. LLM-based optimizers produce semantic cheating responses that are mostly imperceptible to human users. Furthermore, if the test instructions are kept private, the adversary could optimize the constant response on, e.g., UltraFeedback and then transfer it to cheat automatic benchmarks.


W2: The proposed method is not new.

We agree that, technically, our proposed method is not new in the adversarial literature. However, most related studies in the LLM-as-a-Judge community focus on issues like data contamination, output length/style, and format biases, with little attention given to adversarial attempts. This motivated us to bridge the gap between the LLM-as-a-Judge and adversarial research communities, demonstrating how well-established adversarial techniques can be adapted to cheat LLM judges while potentially benefiting from the promotional impact.

While our paper focuses on cheating LLM benchmarks, our method can be easily applied to cheat multimodal benchmarks such as WV-Bench [5] and LLaVA-Critic [6]. As a red-teaming paper, we hope that this research will encourage the development of anti-cheating mechanisms, particularly from the adversarial perspectives, to ensure reliable LLM-based and MLLM-based evaluations.


References:
[1] https://github.com/stanfordnlp/dspy
[2] TextGrad: Automatic "Differentiation" via Text
[3] Jailbreaking Black Box Large Language Models in Twenty Queries
[4] AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models
[5] WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences
[6] LLaVA-Critic: Learning to Evaluate Multimodal Models

Comment

Thank you for your kind responses to my review. I have raised my score to 8. While I still believe the paper might be more suited to the position paper track at NLP conferences, I agree that its value is sufficient for acceptance, and I hope it gets accepted.

Comment

We appreciate your kind support! In our final revision, we will further improve the paper by incorporating the valuable insights gained from the rebuttal discussions. Thank you again!

Review
10

The paper shows vulnerabilities in LLM-as-a-judge paradigms, showing that a "null model" with a crafted, constant output can achieve high scores (strongly beating SOTA) against such judges. The paper motivates how this vulnerability is concerning given how strong results on such automated benchmarks help model developers achieve promotional benefits. A prompt-injection style attack is proposed, with random search prefix optimisation making the attack stronger, leading to length controlled (LC) win rates up to 86.5% on alpaca-eval against a GPT-4 annotator without access to test set instructions. Baseline defences like perplexity filters and releasing paraphrased prompts are not enough to defend against the proposed attack. Further analysis also shows that access to test-set instructions can help strengthen the attack, and the method can also be combined with normal (real) model responses to inflate auto-benchmark scores.

Strengths

  1. The paper highlights the vulnerability of using LLM judges, motivating why they are becoming more common, existing issues and how the issue they demonstrate is much more glaring.
  2. The attack succeeds (much better than SOTA model performance) under a strong threat model of giving a constant output to the LLM judge.
  3. The attack transfers across 3 popular benchmarks (alpaca-eval, arena-hard-auto, mt-bench)
  4. The attack is extensively ablated, showing the importance of using both the structured prompt injection and prefix optimisation.
  5. The paper is well written, easy to follow, and provides interesting analysis on how the attack can be combined with normal responses.

Weaknesses

  1. It's unclear why the structured response doesn't work for llama. Could you back the claim of lower instruction following capabilities being the reason with a) Instruction Following eval scores for GPT-4, Llama-3-70B Instruct, b) showing a plot of structured response success vs instruction following results for different auto-annotator models?

  2. Currently there are no comparisons to attacks on LLM-as-a-judge performed by prior work. It would be informative to see how existing attacks on LLM-as-a-judge mentioned in the related work section [Raina et al. (2024), Shi et al. (2024), Chen et al. (2024c)] empirically perform in the null model setting (with some reasonable adaptations).

  3. While I appreciate showing results on baseline defences to prompt injections like perplexity filters and paraphrasing, there are stronger defences available now, such as those in this link: https://github.com/tldrsec/prompt-injection-defenses. It would be useful to know if using any of these defences fixes this issue. Some of these involve using LLMs out-of-the-box to filter prompt injections, whereas others train for this task. It would be good to see if the former is enough, or the latter is needed to prevent null model attacks.

  4. The claim about paraphrasing defences not working should be qualified further. For example, I believe that if the template uses a sufficiently different format, or includes specific instructions to ignore the prompt injection, the attack might not transfer. Thus, it is helpful to rephrase the claim as 'trivial paraphrasing is not enough, though more targeted ones could work'.

Questions

  1. Do the authors have any evidence for benchmark gaming already occurring based on degenerate artefacts being added to model responses such as the null model string found here?

  2. Is it possible to expand the scope of the evals affected? Can this be done against arbitrary reward models? In this case, why doesn’t reward model optimisation collapse to such degenerate solutions?

  3. Can we use this to bring down accuracy of models on human-generated mcqs, such as in MMLU? i.e. can adding adversarial suffixes to MMLU wrong options make the model prefer these, leading to lower accuracies?

Comment

Q1: Any evidence for benchmark gaming already occurring based on degenerate artifacts being added to model responses such as the null model string found here?

As far as we can tell from the academic literature and rumors, there has been no benchmark gaming based on the adversarial cheating described in this paper. Existing benchmark gaming relies on data contamination [1], controlling output length [2] or style [3], and format biases [4].

[1] Benchmark Data Contamination of Large Language Models: A Survey
[2] Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators
[3] Does style matter? Disentangling style and substance in Chatbot Arena
[4] From Lists to Emojis: How Format Bias Affects Model Alignment


Q2 (a): Is it possible to expand the scope of the evals affected? Can this be done against arbitrary reward models?

Yes, our cheating method works against arbitrary reward models. Unlike AlpacaEval and Arena-Hard-Auto, which compute win rates via pairwise comparisons, MT-Bench uses a reward model to judge model outputs and assign scores to them. Our results in Table 2 (Page 6) show that we can cheat the reward model inside MT-Bench into assigning a high score to our null model. Similar cheating prompts and adversarial techniques can also be applied to cheat other reward models.


Q2 (b): Why doesn’t reward model optimization collapse to such degenerate solutions?

This is an insightful question! We think there are mainly two reasons:

  • The typical RLHF forward pipeline is: Input -> LLM -> Output -> Reward Model. In our work, we directly design the cheating Output to maximize the reward, whereas RLHF optimizes the LLM to maximize the reward. Namely, the two are carried out in different optimization spaces (see the sketch below).

  • Our null models are constrained to return a constant response for any input, whereas LLMs can return different responses for different inputs. Thus, even if reward model optimization collapses during RLHF, LLMs will not collapse to our cheating responses; instead, they may collapse to the patterns described in "reward hacking" [5].

[5] ODIN: Disentangled Reward Mitigates Hacking in RLHF
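
In symbols (our own notation, not taken from the paper): the cheat searches over a single constant output $c$ shared by all inputs, while RLHF searches over model parameters $\theta$, so the learned policy $\pi_\theta$ still maps each input $x$ to its own output:

```latex
% Output-space search (this paper's cheat)   vs.   parameter-space search (RLHF)
\max_{c}\ \mathbb{E}_{x \sim \mathcal{D}}\big[\, r(x, c) \,\big]
\qquad \text{vs.} \qquad
\max_{\theta}\ \mathbb{E}_{x \sim \mathcal{D}}\big[\, r(x, \pi_{\theta}(x)) \,\big]
```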


Q3: Can we use this to bring down accuracy of models on human-generated mcqs, such as in MMLU?

This is an interesting idea! We could ensemble multiple models and optimize adversarial suffixes to increase the averaged log-probability of the wrong options. Such adversarial suffixes appended to wrong options could make benchmarks like MMLU more difficult and assess models' ability to resist irrelevant information (one possible formulation is sketched below). As shown in [6], advanced LLMs cannot even resist non-adversarial irrelevant information in reasoning tasks.

[6] GSM-Symbolic: Limitations of Mathematical Reasoning in LLMs
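
One possible formulation of this objective (our notation, purely illustrative): with an ensemble of $K$ models $\{p_{\theta_k}\}$, append a shared suffix $\delta$ to a wrong option $y_{\text{wrong}}$ and maximize the averaged log-probability that each model selects that option, given the question $x$ and the modified option set $\mathcal{A}(\delta)$:

```latex
\max_{\delta}\ \frac{1}{K}\sum_{k=1}^{K}
\log p_{\theta_k}\!\big(\,\text{answer} = y_{\text{wrong}} \oplus \delta \ \big|\ x,\ \mathcal{A}(\delta)\,\big)
```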

Comment

Thanks for the answers, they make sense. It was a pleasure to review this paper, I learnt a lot :)

Comment

Thank you for your timely feedback and kind words. We greatly appreciate it!

In the final revision, we will incorporate your suggestions by adding more explanation to Figure 20 and conducting additional ablation studies to explore the relationship between instruction following and the success rate of cheating responses. We will also re-organize the paper to include the attack/defense baselines within the main text.

Thank you once again for your insightful review, which has greatly inspired us to improve our work. Your idea of incorporating adversarial patterns into benchmarks such as MMLU is also very intriguing. It was a pleasure to discuss these points with you, thank you!

Comment

Thank you for your supportive review and suggestions. Below we respond to the comments in Weaknesses (W) and Questions (Q).


W1: It's unclear why the structured response doesn't work for llama.

We use the official AlpacaEval 2.0 LC win rates to measure the instruction-following ability, and then study the relationship between structured response success (measured by $-\log p(\mathtt{winner}=\mathtt{NullModel})$) and the instruction-following ability. As shown in Figure 20 (Page 34), we find that as the instruction-following ability grows, the optimization objective $-\log p(\mathtt{winner}=\mathtt{NullModel})$ decreases.
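
For reference, this objective can be computed from the judge's top token log-probabilities at the verdict position. The sketch below is our illustration under stated assumptions (the helper function and the verdict-token names "M"/"m" are assumptions, not the paper's released code):

```python
import math

def neg_log_p_null_wins(top_logprobs: dict[str, float],
                        null_token: str = "M",
                        baseline_token: str = "m") -> float:
    """Return -log p(winner = NullModel), renormalizing the judge's log-probs
    over the two candidate verdict tokens. `top_logprobs` maps verdict tokens
    to log-probabilities returned by the auto-annotator."""
    p_null = math.exp(top_logprobs.get(null_token, float("-inf")))
    p_base = math.exp(top_logprobs.get(baseline_token, float("-inf")))
    if p_null == 0.0:
        return float("inf")  # the judge never prefers the null model
    return -math.log(p_null / (p_null + p_base))

# Example: a judge that assigns ~73% probability to the null model winning.
print(neg_log_p_null_wins({"M": -0.3, "m": -1.3}))  # ≈ 0.31
```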


W2: Currently there are no comparisons to attacks on LLM-as-a-judge performed by prior work.

Following your suggestions, we adapt several existing attacks on LLM-as-a-judge to our "NullModel" experimental setup (the model can only return a constant response, and the test instructions are kept private). These adaptations ensure that our approach can be directly compared to prior work, providing fair evaluations.

As shown in Table 7 (Page 30), we first observe that existing attacks yield near-zero win rates, demonstrating their ineffectiveness in this experimental setup. Furthermore, the results from structured responses with varying levels of complexity show that a sufficiently complex structure is required to achieve high win rates. This emphasizes the significance of response structure in increasing the success of cheating.


W3: It would be useful to know if using any stronger defenses fixes this issue.

Following your suggestions, we evaluate several defense strategies based on their ability to detect and purify adversarial manipulation, ensuring a comprehensive assessment of the defensive landscape.

As shown in Table 8 (Page 30), Self-Reminder, which prompts the model to prioritize the first instruction, is slightly effective but cannot substantially reduce the win rates of our structured response cheating. We also test SmoothLLM with various perturbation strategies, such as Insert, Swap, and Patch. Both the Insert (20%) and Swap (20%) perturbations are highly effective in defending against our cheating, reducing win rates to almost zero. The Patch (20%) perturbation also demonstrates significant defense efficacy.

However, as shown in Figure 18 (Page 31), taking Swap as an example, even a small perturbation budget such as 1.25% severely degrades the quality of clean model responses generated by GPT-4 Omni (GPT-4o), dropping them to near-zero win rates and making the defense impractical in realistic scenarios. In Figure 17 (Page 31), we plot the win rates of our cheating response vs. the Swap perturbation budget. As shown, under Swap (1.25%), our structured cheating response can still achieve >40% LC win rates.
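
For concreteness, here is a minimal sketch of the Swap-style perturbation discussed above; it is our illustration of a SmoothLLM-like character swap at budget q, not the authors' or SmoothLLM's exact implementation:

```python
import random
import string

def swap_perturb(text: str, q: float = 0.20, seed: int | None = None) -> str:
    """Overwrite a random fraction q of the characters in `text` with random
    printable characters (SmoothLLM-style Swap). Applied to candidate responses
    before judging; a large q also destroys clean, high-quality responses."""
    if not text:
        return text
    rng = random.Random(seed)
    alphabet = string.ascii_letters + string.digits + string.punctuation + " "
    chars = list(text)
    k = max(1, int(len(chars) * q))
    for i in rng.sample(range(len(chars)), k):
        chars[i] = rng.choice(alphabet)
    return "".join(chars)

# Example: even a 1.25% budget visibly corrupts a short response.
print(swap_perturb("The capital of France is Paris.", q=0.0125, seed=0))
```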


W4: The claim about paraphrasing defenses not working should be qualified further.

Thanks for your suggestion. We have revised our claim accordingly as shown in L481-L482.

Comment

Thanks for adding more attacks and defences.

Some minor suggestions:

  1. Please add more text to explain Figure 20 in the paper. Also, I would consider doing a scatter plot between Instruction following and the response success objective, with model names in brackets above each point. I would also add more points to this scatter plot and report statistical significance of the observed trend.

  2. Please consider shifting the prior attacks/defences to the main paper in the final version, as I personally feel these are as important if not more than the baseline defences currently reported.

This mitigates my main remaining concern with the paper regarding comparisons. I think the choice of baselines is well motivated and the results clearly show that plainly selecting prior methods would not lead to these insights. In conclusion, this paper:

  1. Showed an important, new form of vulnerability in AI models, which I believe is the pinnacle of contributions that can be made in the adversarial literature.
  2. Showed important novel insights needed to exploit this vulnerability
  3. Comprehensively benchmarked existing defences and attack design choices

This puts it among the best papers I've seen coming out of the adversarial ML community and is of immense value to the LLM evaluation community where LLM judge based benchmarks are rising in popularity. Therefore, I have increased my score to 10, as I would love to see this paper highlighted at ICLR to the broader community.

Review
8

This paper investigates vulnerabilities in automatic LLM benchmarks (like AlpacaEval 2.0, Arena-Hard-Auto, and MT-Bench) by demonstrating that even a "null model" - which outputs the same constant response regardless of input - can achieve high performance scores. The authors show that through carefully crafted structured responses and random search optimization, their null model can achieve an 86.5% LC win rate on AlpacaEval 2.0, an 83.0 score on Arena-Hard-Auto, and a 9.55 score on MT-Bench, matching or exceeding state-of-the-art models. The work highlights significant vulnerabilities in current automatic evaluation methods and calls for developing robust anti-cheating mechanisms for LLM benchmarks.

Strengths

Pros:

  • I think it’s a great paper that brings up the important point about the possibility of overfitting to LLM judges.
  • It’s also very interesting to see that one can fool LLMs with the same (completely non-sensical) response across different requests. This clearly highlights how brittle LLM as judges are.
  • On the other hand, as, e.g., Figure 2 illustrates, coming up with an effective null model is not straightforward. In particular, a simple adversarial suffix isn’t effective.
  • The evaluation methodology looks good: the generalization of the null model between a validation and test set is clearly evaluated.
  • The paper also shows a significant gap between LLM and human perception (i.e., humans don’t have any issue with these “null models”).
  • The paper is clearly written, original, and the findings are quite significant.

Weaknesses

(Minor) weaknesses:

  • It’s probably not too surprising that LLMs are vulnerable to these attacks given all the literature on adversarial examples, prompt injections, and jailbreaks for LLMs. But on the other hand, I believe there should be a clear and systematic reference that illustrates this fact for LLMs as judges that play a key role now for LLM evaluation, data filtering, etc. And this paper does a good job at this.
  • The name “null model” is slightly confusing since it’s not really a model but rather a text snippet. Moreover, it’s not just a “null” model, but a very specific one (i.e., adversarially optimized), which is not reflected in the name “null model”. Although, I can imagine, it’s probably “too late” to rename a central concept of the paper.
  • The paper could have discussed a bit more how likely the patterns from these adversarial null models are to occur naturally, i.e., without an adversarial procedure involved.

Update after the rebuttal: I'm satisfied with the authors' response. I think it's a good paper and should be accepted at ICLR. For me, it's a strong 8/10. I would put 10/10 if the paper uncovered a rather unexpected vulnerability in LLMs as judges. In my opinion, this failure is rather expected, although it's very nice to document this failure systematically.

Questions


Comment

Thank you for your supportive review and suggestions. Below we respond to the comments in Weaknesses (W).


W1: It’s not too surprising that LLMs are vulnerable and there should be a systematic reference that illustrates this fact for LLMs as judges.

We appreciate your kind words and fully agree! Researchers in the LLMs-as-Judges community have focused on data contamination and the effects of output style/length, paying little attention to adversarial attempts. Thus, we were motivated to bridge the gap between the LLMs-as-Judges and adversarial communities, demonstrating how well-established adversarial techniques can be adapted to cheat LLM judges while potentially benefiting from promotional impact. We hope that this study will encourage the development of anti-cheating mechanisms to ensure reliable LLM-based evaluations.


W2: The name "null model" is slightly confusing.

Indeed, we struggled to find a name that is both concise and accurate in describing our method. We finally chose "null model" because automatic benchmarks are conducted at the model level. It may be more precise to state that there exist null models (not all, but only adversarial ones) that can cheat LLM judges, and we will attempt to clarify this point in the final revision.


W3: How likely are the patterns from null models to occur more naturally?

In the paper revision, we highlight our cheating response in Figure A (Page 4, default setting) and Figure 8 (Page 20, swap setting). This cheating response is manually crafted and semantically meaningful, and it is independent of the optimizable adversarial prefix.

To create more natural cheating patterns, the adversary may optimize the response with LLM-based optimizers (such as DSPy [1] or TextGrad [2]) to adversarially maximize win rates or reward scores, similar to the mechanisms described in the adversarial literature [3, 4]. LLM-based optimizers produce semantically coherent cheating responses that are largely imperceptible to human users. Furthermore, even if the test instructions are kept private, the adversary could optimize the constant response on a public dataset such as UltraFeedback and then transfer it to cheat the automatic benchmarks.


References:
[1] https://github.com/stanfordnlp/dspy
[2] TextGrad: Automatic "Differentiation" via Text
[3] Jailbreaking Black Box Large Language Models in Twenty Queries
[4] AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

Comment

Thanks for the clarifications. I like the paper - I think 8/10 is the right score, and I hope the paper gets accepted.

Comment

Thank you for your prompt feedback and encouraging words. We really appreciate it! In our final revision, we will polish the paper further to incorporate the valuable insights gained from the rebuttal discussions. Thank you again!

Comment

We thank all reviewers for their constructive feedback, and we have responded to each reviewer individually. We have also uploaded a Paper Revision including additional results and illustrations:

  • Figure A (Page 4): We move the structured cheating response illustration from the Appendix to the main text and describe the cheating mechanisms in detail.
  • Figure 15 (Page 28): A structured cheating response with medium complexity.
  • Figure 16 (Page 29): A structured cheating response with low complexity.
  • Figure 17 (Page 31): Win rates of our method against the SmoothLLM Swap variants on AlpacaEval 2.0.
  • Figure 18 (Page 31): The bar plot of original and perturbed win rates of GPT-4 Omni (GPT-4o).
  • Figure 19 (Page 34): The ineffective adversarial suffix and our structured response.
  • Figure 20 (Page 34): Structured response success log-prob vs. the instruction-following ability for different auto-annotators.
  • Table 7 (Page 30): Win rates of different attacking baselines/our method on AlpacaEval 2.0.
  • Table 8 (Page 30): Win rates of our method against different defenses on AlpacaEval 2.0.
  • Table 9 (Page 30): Win rates of the cheat against more open-source judges on AlpacaEval 2.0.
  • Table 10 (Page 31): Win rates of applying our structured cheats to another judge, GPT-3.5-Turbo-1106.

Comment

Dear Reviewers,

Thank you again for your valuable comments and suggestions, which are really helpful for us. We have posted responses to the proposed concerns and included additional experiment results.

We totally understand that this is quite a busy period, so we deeply appreciate it if you could take some time to return further feedback on whether our responses solve your concerns. If there are any other comments, we will try our best to address them.

Best,

The Authors

AC Meta-Review

This paper shows that a constant response can game certain automatic LLM benchmarks. This contribution will make many in the community rethink LLM benchmarking, which is topical right now. The reviewers unanimously vote for acceptance, and I agree. Reviewers raised several weaknesses, for example that the results are not so surprising or that the methodology is not novel, but I do not think these weaknesses are disqualifying, since the message of the paper will be surprising to many and the point is the observation, not the methodology. Reviewers also made suggestions related to the writing; many of these have been addressed by the authors through revisions, and for others, the authors promise to make further edits. At least one reviewer requested that the authors include prior work on attacks, but this was already in the appendix, and the authors have promised to move it to the main body. Having read through the reviewers' feedback, I am not bothered by any of the weaknesses, and the reviewers are very positive overall.

Additional Comments on the Reviewer Discussion

The reviews were already positive before rebuttals, but nonetheless, the authors have uploaded a highly revised draft to respond to feedback.

Final Decision

Accept (Oral)