PaperHub
4.9 / 10
Poster · 4 reviewers
Ratings: 2, 1, 4, 4 (min 1, max 4, std 1.3)
ICML 2025

Automated Red Teaming with GOAT: the Generative Offensive Agent Tester

OpenReview · PDF
Submitted: 2025-01-23 · Updated: 2025-07-24
TL;DR

We propose an automated agentic red teaming system that simulates plain language adversarial conversations while leveraging multiple adversarial prompting techniques to identify key vulnerabilities in LLMs.

Abstract

Keywords

red teaming, adversarial machine learning, adversarial examples, attacks on language models

Reviews and Discussion

Review (Rating: 2)

The authors introduce a new method for the important scenario of multi-turn red teaming. They propose using a helpful LLM agent system to dynamically apply and combine existing jailbreaking methods to adaptively attack target models. Experiments show its superior performance and efficiency over baseline methods.

Questions for Authors

  • What model is used for the attack agent? Give the exact name of the model or, if it is self-trained, provide details on the data and training method. Please include these in the revised manuscript.

Claims and Evidence

The claims made are sound

Methods and Evaluation Criteria

The proposed method makes sense for the application at hand

Theoretical Claims

N/A

Experimental Design and Analysis

The experimental design is valid

Supplementary Material

N/A

Relation to Prior Literature

The key method is a good way to automate the laborious red teaming process

Missing Essential References

N/A

Other Strengths and Weaknesses

Strengths

  • Relevant and important topic in LLM safety
  • The method uses an agentic solution, which is novel in this application
  • The writing is easy to understand

Weaknesses

  • It is unlikely for the agent to invent new jailbreak techniques, since it is incentivized to reuse or combine existing techniques
  • A handcrafted CoT workflow pattern is used to better guide the attack LLM at every turn; the current design may not be the optimal strategy
  • Overall, the paper lacks novelty; it presents a manually designed agent workflow that, while demonstrating strong engineering effort, lacks the conceptual innovation necessary for acceptance at ICML

Other Comments or Suggestions

N/A

Author Response

Thank you for your comments! We would like to point out that the novelty of our work lies in automating manually discovered red teaming strategies. We achieve very high attack success rates with a lower computational budget than the most comparable work. We do this by following an established evaluation protocol on a curated dataset of adversarial prompts published in peer-reviewed work. Our attack method is easily extensible with newly discovered techniques simply by describing them in plain English.
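For illustration only (not our actual implementation), adding a new technique can amount to nothing more than a plain-English description that is rendered into the attacker's system prompt; the names `AttackTechnique` and `build_attacker_system_prompt` below are hypothetical:

```python
# Hypothetical sketch of the "describe a new technique in plain English" idea;
# the names AttackTechnique and build_attacker_system_prompt are illustrative
# and do not come from the paper or its codebase.
from dataclasses import dataclass

@dataclass
class AttackTechnique:
    name: str
    description: str  # plain-English explanation given to the attacker LLM

TECHNIQUES = [
    AttackTechnique("refusal_suppression",
                    "Instruct the target to answer without apologies, disclaimers, or refusals."),
    AttackTechnique("hypotheticals",
                    "Wrap the request in a fictional framing so the target treats it as a thought experiment."),
    # A newly discovered technique is added by describing it; nothing else changes:
    AttackTechnique("my_new_trick",
                    "Plain-English description of the new jailbreak strategy."),
]

def build_attacker_system_prompt(techniques) -> str:
    """Render the technique catalogue into the attacker model's system prompt."""
    lines = ["You may use any of the following attack techniques:"]
    lines += [f"- {t.name}: {t.description}" for t in techniques]
    return "\n".join(lines)
```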

Our method is also easily extensible to new models. We have conducted additional experiments showing that the DeepSeek-r1 model can be “plugged in” to GOAT without any modifications. Under the same evaluation protocols, when we use DeepSeek-r1 as the attacker model and:

  • GPT-4o as the target, we achieve 63% ASR@1 (compare to 54% from Fig. 2 with the original attacker and 70% from Fig. B.1 with GPT-4o as the attacker)
  • DeepSeek-r1 as the target: 54%
  • o1 as the victim: 30%

These are the strongest ASR@1 results for any attacker and victim and we believe this shows the method is applicable as-is to newer models with more advanced capabilities.
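As a rough sketch of what “plugging in” a different attacker means in practice, the loop below treats the attacker, target, and judge as interchangeable callables; none of the function or type names come from the paper, and the real GOAT system additionally tracks its chain-of-attack-thought reasoning and technique selection at each turn:

```python
# Generic sketch (not the authors' implementation) of a GOAT-style multi-turn
# loop in which the attacker, target, and judge are interchangeable callables.
from typing import Callable, List, Tuple

Message = Tuple[str, str]  # (role, content)

def run_attack(goal: str,
               attacker: Callable[[List[Message]], str],
               target: Callable[[str], str],
               judge: Callable[[str, str], bool],
               max_turns: int = 5) -> bool:
    """Return True if the judge flags a violating target response within the turn budget."""
    history: List[Message] = [("system", f"Adversarial goal: {goal}")]
    for _ in range(max_turns):
        next_prompt = attacker(history)              # attacker reasons over the transcript
        response = target(next_prompt)               # one query to the target model
        history += [("attacker", next_prompt), ("target", response)]
        if judge(goal, response):                    # e.g. a JailbreakBench-style judge
            return True
    return False
```

Under this framing, swapping in DeepSeek-r1 (or GPT-4o, as in Appendix B.1) only changes which chat API the `attacker` callable wraps; the prompting, technique descriptions, and evaluation protocol stay fixed.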

Additionally, our method is directly applicable to new datasets. To show this, we conducted additional experiments without modifying GOAT on a new benchmark. We picked AILluminate – a safety benchmark released by the MLCommons alliance and endorsed by 72 universities and research labs, including 4 major foundational model labs (OpenAI, Anthropic, Meta, and Google) and prestigious universities such as Stanford, MIT, and Cambridge.

When rerunning on the widely endorsed AILluminate benchmark with the original attacker and Llama 3.1 70B as the target, GOAT maintains its advantage over Crescendo:

Attack     | AILluminate | JailbreakBench (from Fig. 2)
GOAT       | 43%         | 55%
Crescendo  | 25%         | 28%

We used the sample publicly available at https://github.com/mlcommons/ailuminate and first described in arXiv:2404.12241. We excluded the prompts on the topic of CSE (due to legal restrictions) but otherwise used all prompts labeled as “unskilled” - since GOAT and Crescendo both transform basic questions into successful jailbreaks. This left us with 550 prompts.
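Roughly, the filtering amounts to the following (the filename and column names are illustrative assumptions about the public sample, not the exact schema of the MLCommons release):

```python
# Illustrative only: the filename and column names ("hazard", "persona",
# "prompt_text") are assumptions about the public AILluminate sample, not the
# exact schema of the MLCommons release.
import pandas as pd

df = pd.read_csv("ailuminate_practice_prompts.csv")   # hypothetical local filename
df = df[df["hazard"] != "cse"]                        # drop Child Sexual Exploitation prompts
df = df[df["persona"] == "unskilled"]                 # keep only "unskilled" prompts
prompts = df["prompt_text"].tolist()                  # 550 prompts after filtering, per the rebuttal
```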

We hope these additional results prove the versatility and usefulness of the GOAT method! We look forward to engaging with the reviewer at length about the work and welcome any thoughtful and detailed comments!

Review (Rating: 1)

The paper introduces GOAT (Generative Offensive Agent Tester), an automated multi-turn red teaming framework to assess how large language models (LLMs) respond to adversarial prompts. Unlike single-prompt attacks, GOAT uses a separate “attacker model” that dynamically applies various prompt-level “jailbreaking” strategies (such as role-play, hypotheticals, and refusal suppression) during extended back-and-forth conversations with the target LLM. By adapting its approach in each dialogue turn, GOAT more effectively uncovers hidden vulnerabilities and elicits policy-violating outputs. Experiments show that GOAT achieves high attack success rates on both open-source (e.g., Llama) and commercial (e.g., GPT-4) models, outperforming other multi-turn methods within a small number of conversation turns.

Questions for Authors

Please see my aforementioned comments.

Claims and Evidence

  1. Automated vs. Human-Like Behavior: GOAT is presented as simulating realistic adversarial behavior (“like a human red teamer”), yet the paper does not provide direct empirical comparisons to actual human-generated multi-turn attacks. Demonstrating that GOAT’s conversation strategies align closely with real user behaviors—or are at least comparably effective—would help bolster this claim.

  2. The author states that they conduct experiments on "the latest iteration of this curation and its corresponding scoring method," which refers to JailbreakBench. However, the JBB-Behaviors dataset from JailbreakBench includes only 100 harmful questions, partially derived from existing works. While I acknowledge that JailbreakBench represents the latest benchmark, the limited number of harmful questions may be insufficient for a comprehensive evaluation. I suggest conducting additional experiments on AdvBench, HarmBench, and JAMBench.

Methods and Evaluation Criteria

  1. 7 attacks. The rationale behind selecting these specific seven attacks is unclear. What criteria were used for their inclusion, and why were certain cipher techniques, such as [1], excluded? A justification for these choices would strengthen the paper.

  2. Ablations. Although the authors highlight seven specific attacks, the paper does not present experiments where one or more techniques are removed or substituted to measure their relative contributions. Moreover, there is no clear breakdown of how crucial elements—such as the “Chain-of-Attack-Thought” reasoning or multi-turn strategy selection—individually impact overall success rates. Without an ablation analysis, it remains difficult to determine which parts of GOAT are most responsible for its observed performance gains.

[1]. Yuan Y, Jiao W, Wang W, et al. Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher[J]. arXiv preprint arXiv:2308.06463, 2023.

Theoretical Claims

This paper focuses on empirical experimentation rather than formal theoretical frameworks or proofs. While the pseudocode effectively illustrates the general loop of the GOAT approach, it lacks explicit definitions, thorough commentary, and deeper integration with the main text.

Experimental Design and Analysis

  1. More Benchmarks: I suggest evaluating on JAMBench [1], which includes more challenging and severe harmful questions.

  2. More Target LLMs: The proposed method relies on reasoning and conversation chains. Would it be effective against highly reasoning models such as OpenAI's o1, o3, and DeepSeek-R1?

  3. Baselines: Although the paper shows GOAT outperforming Crescendo (another multi-turn attack), it does not compare to a broader range of multi-turn or single-turn jailbreak methods beyond Crescendo, such as PAIR, COLD-Attack, etc. This narrower scope of baselines makes it harder to assess whether GOAT’s improvements generalize across a wider variety of adversarial strategies [2-4].

  4. The attacker LLM. The authors evaluated their approach using GPT-4o as the attacker LLM and concluded that the proposed method enables a plug-and-play adversary with off-the-shelf models, without being dependent on a specific attacker model. However, I am still curious—would using reasoning models such as OpenAI’s o1, o3, or DeepSeek-R1 improve the effectiveness of jailbreak attacks on the target LLM?

  5. Human Evaluation: To validate GOAT’s “human-like” behavior claims, a direct human evaluation or user study could be conducted. By recruiting experienced red teamers and comparing their multi-turn attack conversations against GOAT’s logs, the authors could quantitatively assess similarities in strategies, escalation patterns, and overall effectiveness. Such a study would not only highlight how closely GOAT mirrors real human adversaries, but also provide insights into potential blind spots or additional techniques real users might employ that the automated system currently lacks.

[1]. Jin H, Zhou A, Menke J, et al. Jailbreaking large language models against moderation guardrails via cipher characters[J]. Advances in Neural Information Processing Systems, 2024, 37: 59408-59435.

[2]. Ren Q, Li H, Liu D, et al. Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues[J]. arXiv preprint arXiv:2410.10700, 2024.

[3]. Jin H, Chen R, Zhou A, et al. Guard: Role-playing to generate natural-language jailbreakings to test guideline adherence of large language models[J]. arXiv preprint arXiv:2402.03299, 2024.

[4]. Sun X, Zhang D, Yang D, et al. Multi-Turn Context Jailbreak Attack on Large Language Models From First Principles[J]. arXiv preprint arXiv:2408.04686, 2024.

Supplementary Material

The authors did not upload supplementary material. However, I encourage the authors to provide additional resources, such as code or extended experimental details, to enhance the reproducibility and transparency of their work.

Relation to Prior Literature

In the broader context of adversarial prompting and red teaming, GOAT extends prior single-turn jailbreak techniques (e.g., gradient-based or chain-of-thought “single prompt” attacks) by explicitly modeling multi-turn adversarial conversations. While earlier works like Crescendo also explored multi-turn escalation, GOAT differs by employing a separate “attacker model” that dynamically selects from multiple prompting techniques, building on findings that human red-teamers often switch strategies over a conversation. The authors draw upon recent public datasets (e.g., JailbreakBench) and automated evaluators to benchmark, which likewise examine LLM policy refusal rates in adversarial settings. By treating each attack style as an in-context “tool,” GOAT synthesizes ideas from chain-of-thought reasoning and multi-step adversarial probing, contributing a more generalized, agentic red teaming approach to the literature.

Missing Essential References

Please see my aforementioned related papers.

Other Strengths and Weaknesses

Several multi-turn jailbreaking attacks have been proposed in the past, including Crescendo, as mentioned by the authors. This diminishes the novelty and impact of the proposed work.

Other Comments or Suggestions

Including an illustrative figure of the proposed method would greatly enhance reader comprehension by providing a visual representation of its key components and workflow.

Author Response

We thank the reviewer for their thorough and constructive feedback!

Response to Concerns and Questions

We agree with the reviewer that some additional experiments are warranted, and we have expanded our experiments to cover additional attacker models (DeepSeek-r1), target models (DeepSeek-r1 and o1), additional scoring functions (the OpenAI Moderation API), and additional benchmarks (AILluminate – a safety benchmark endorsed by 72 major AI institutions). In all cases, GOAT maintains its advantage over Crescendo, even if the numbers vary slightly.

We agree with the reviewer that comparability to other works is important and that is why we have chosen JailbreakBench with its clear and reproducible evaluation protocols and leaderboard with results for multiple attacks. We have included these comparisons in the last section.

Next, we will cite all related works brought up by the reviewer in case of acceptance. We selected the manual attacks we included for automation in GOAT based on our own experience manually red teaming LLMs. There are certainly many more jailbreaking techniques available but we do not claim or aim for comprehensiveness. Instead, GOAT offers a starting point and is designed to be easily extensible. In fact, we believe that narrowly scoped attacks like the ones suggested can easily be automated with the GOAT method.

We also thank the reviewer for suggesting that an overview figure be provided. We agree and we refer the reviewer to Figure 1. We welcome all further constructive feedback on the figure!

The reviewer also raises an interesting point about ablations – that it is not clear how the multi-turn nature of the attacks contributes to the ASR. In Appendices B.2 and B.3, we conducted experiments extending to 50 conversational turns. Based on Figure B.2, we can conclude that the multi-turn nature of the attack is crucial: ASR is low on the first turn and climbs over the course of a conversation. However, in Figure B.3, we also observe diminishing returns, as the ASR plateaus around 10 turns.

We look forward to further discussion and appreciate the constructive feedback!

Additional Experimental Results

New Attacker Model

When we use DeepSeek-r1 as the attacker model and:

  • GPT-4o as the target, we achieve 63% ASR@1 (compare to 54% from Fig. 2 with the original attacker and 70% from Fig. B.1 with GPT-4o as the attacker)
  • DeepSeek-r1 as the target: 54%
  • o1 as the victim: 30%

Additional Scoring Functions

When using the OpenAI Moderation API as the scoring function and using the logged conversations of the original attacker model with Llama 3.1 70B as the target, GOAT maintains its advantage over Crescendo.

Attack \ Judge | OpenAI Moderation API | JailbreakBench (as reported in Fig. 2)
GOAT           | 36%                   | 55%
Crescendo      | 18%                   | 28%
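For reference, rescoring logged conversations in this way only takes a few lines with the OpenAI Python SDK (>= 1.0); the conversation format assumed below is illustrative, not our actual logging schema:

```python
# Sketch of rescoring logged conversations with the OpenAI Moderation API
# (openai>=1.0 Python SDK). The conversation format below is illustrative,
# not the actual logging schema used in the experiments.
from openai import OpenAI

client = OpenAI()

def conversation_is_violating(target_responses: list[str]) -> bool:
    """Flag a conversation if any target response is flagged by the Moderation API."""
    for text in target_responses:
        result = client.moderations.create(input=text).results[0]
        if result.flagged:
            return True
    return False

# ASR under the new judge = number of flagged conversations / total conversations.
```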

Additional Datasets

Unfortunately, we found that the JAMBench dataset is not readily available: the cited paper contains no download link, and the paper itself does not contain the exact set of 160 prompts curated by the authors. Additionally, we note that JailbreakBench represents a curated, peer-reviewed selection of HarmBench and AdvBench behaviors chosen to be diverse and achievable.

Instead, we picked AILluminate – a safety benchmark released by the MLCommons alliance and endorsed by 72 universities and research labs, including 4 major foundational model labs (OpenAI, Anthropic, Meta, and Google) and prestigious universities such as Stanford, MIT, and Cambridge. When rerunning on the widely endorsed AILluminate benchmark with the original attacker and Llama 3.1 70B as the target, GOAT maintains its advantage over Crescendo:

Attack     | AILluminate | JailbreakBench (from Fig. 2)
GOAT       | 43%         | 55%
Crescendo  | 25%         | 28%

We used the sample publicly available at https://github.com/mlcommons/ailuminate and first described in arXiv:2404.12241. We excluded the prompts on the topic of CSE (due to legal restrictions) but otherwise used all prompts labeled as “unskilled” - since GOAT and Crescendo both transform basic questions into successful jailbreaks. This left us with 550 prompts.

Other Comparisons

Since we followed the standard evaluation protocol of JailbreakBench, we can directly compare to other methods on the leaderboard at https://jailbreakbench.github.io. In order to allow for an apples-to-apples comparison, we focus on attacks reporting results on the same models as us (Llama 2 7B and GPT-4) and filter out attacks with greater access than ours (e.g. attacks requiring logprobs or white-box access).

Attack     | Llama 2 7B (ASR@1) | GPT-4 (ASR@1)
PAIR       | 0%                 | 34%
AIM        | 0%                 | 0%
GOAT       | 30%                | 46%
Review (Rating: 4)

This paper introduces a novel automated red teaming approach (GOAT) for conversational AI systems. GOAT uses the recent reasoning capabilities of advanced LLMs equipped with a set of tools, i.e., attack methods (here, multiple adversarial prompting strategies). In a multi-turn conversation with the target model, GOAT selects these tools and reasons about the LLM's responses (e.g., refusal) to develop attack strategies with the goal of eliciting harmful behavior from the target LLM. The presented red teaming system is evaluated on the JailbreakBench dataset and compared to the related Crescendo approach.

Questions for Authors

  • In line 180, you describe that you encourage the attack model to stay on topic and reason adversarially, progressing each turn of the conversation towards the desired unsafe objective. Were you able to observe cases where the attack model is deviating from the desired 4-step reasoning process? If so, how do you handle such cases?

  • I was not able to find detailed information on the attack model / helpful-only model used in the experiments. Could you provide more details? In B.1, you describe re-conducted experiments with GPT-4o, but what was the original model used in the experiments outlined in the main text?

  • Could you provide more details on your statement, "Thus, this attack success rate is achieved with very limited access and a very low computational budget relative to other published work"? While you touch upon it, based on your experimental settings, what is the difference between the required computational budget of GOAT and Crescendo?

  • Similar to the examples presented in Figures 4 and 5, could you elaborate on failure cases? For example, are there failure cases where the attacker model assesses that the attack goal is reached, but the judge evaluates the response as safe? If yes, is the judge wrong or the attack model?

Claims and Evidence

The paper is motivated by (Li et al., 2024), demonstrating that human red teamers can circumvent LLM defenses more easily with multi-turn jailbreaks. The paper claims that it improves existing automated red teaming approaches. To this end, it is compared to the recent Crescendo approach, which outperforms previous approaches by a large margin. For the evaluation, the authors used a LLM-judge assessing the safety of the target LLM after being attacked by the red team approaches, which is one of the common practices. The presentation of evidence is clear.

Methods and Evaluation Criteria

The choice of evaluation datasets makes sense for the type of assessment. However, as the authors already mention, the selected evaluation approach (LLM-as-a-judge) has its limitations. As in the original Crescendo paper, the evaluation could be extended by considering external APIs.

Theoretical Claims

The paper does not introduce new theoretical foundations, therefore N/A.

Experimental Design and Analysis

  • The selection of the evaluation benchmark JailbreakBench (including the exclusion of specific topics that violated the terms of use of target models) seems to be reasonable and sufficient. Various target models are considered, and a fixed budget for attack model queries and the use of the same attack model are used throughout the experimental setup, so the comparison with Crescendo seems fair.

  • Experimental evaluation could be extended beyond the usage of an LLM-judge, see above.

  • Next to quantitative results, the authors present two sample conversation turns leading to successful attacks. These could be extended by failure cases and a discussion of those to provide more details on limitations.

Supplementary Material

I briefly checked the Appendix, especially Section B.1, to find more details on the attack model used. Unfortunately, no information is provided on the attack model used in the main text.

Relation to Prior Literature

The paper is well grounded in related work. Li et al. (2024) demonstrate that human red teamers can circumvent LLM defenses more easily with multi-turn jailbreaks. The paper draws inspiration from and extends the related approach Crescendo (Russinovich et al., 2024), which also relies on this observation to gradually escalate a conversation into violating outputs.

Missing Essential References

The paper adequately covers key literature on LLM adversarial robustness, jailbreaking and (automated) red teaming.

Other Strengths and Weaknesses

Strengths:

  • well-described limitations and future work

Weaknesses:

  • missing details on attack model implementation (see questions).

Other Comments or Suggestions

As mentioned above, I would suggest considering additional evaluation approaches, such as external API, to evaluate the safety of the responses of the target model.

Minor suggestions to improve clarity and readability.

  • line 51 left column: (Zou et al., 2023) seems to be misplaced
  • readability of figure 3 could be increased
  • place Fig 4 and 5 on the top of the page.

Author Response

We appreciate the reviewer’s thorough review and suggestions that help us make the paper stronger! We agree with the reviewer about the suggestions for better clarity and readability and will fix these in a potential camera-ready submission!

The attacker model

First of all, we apologize: the discussion of the attacker model used was dropped during editing! We used a generic helpful-only language model. Due to the conference’s anonymization policies, we can only share the following information at this point: this model was trained on data that bordered between harmless and harmful content, where the desired response was always to be helpful regardless of safety compliance. No explicit attack-specific training data was introduced. All red teaming information is later introduced to the model through in-context learning via the system prompt.

Additionally, we have now conducted further experiments with the unmodified deepseek-r1 model and unmodified GOAT system under the same evaluation protocols (see last section), indicating that the exact choice of the attacker model is not critical to the method.

The computational budget

Crescendo requires 2 more queries for any given attack turn as it makes use of two supplementary judges external to the attacker model, making it more expensive than GOAT at any given time step. We also note that GOAT achieves any given ASR in fewer turns. (See Appendix B.3, where we extend the number of conversational turns to 50.) Therefore, GOAT requires both less computation per turn and fewer turns overall, making it more efficient than Crescendo.

Handling deviations from reasoning

This is a very important question! While we did not observe deviation from the chain-of-thought format, we experienced a similar issue in early experimentation. In some cases, the attacker model would restate the goal in the prompt it suggested for the target, which would trigger safety refusals due to the overtly unsafe statement of the goal. That is why we introduced reminders about the expected format in the follow-up prompt.

If this issue does appear with other language models, we would suggest automated retries with a higher temperature when deviations are detected via string matching. The follow-up prompt can also be tweaked further to encourage adherence to the reasoning format.
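As a minimal sketch of that suggestion (not currently part of our system), a retry wrapper could look as follows; `generate` stands in for any callable that queries the attacker model at a given temperature:

```python
# Minimal sketch of the suggested mitigation: retry with a higher temperature
# whenever naive string matching detects a deviation (e.g. the unsafe goal
# restated verbatim in the proposed prompt). `generate` is any callable that
# queries the attacker model at a given temperature.
def generate_with_retries(generate, goal: str,
                          base_temperature: float = 0.7,
                          max_retries: int = 3) -> str:
    temperature = base_temperature
    candidate = generate(temperature)
    for _ in range(max_retries):
        if goal.lower() not in candidate.lower():     # no deviation detected
            return candidate
        temperature = min(temperature + 0.2, 1.5)     # nudge temperature upward
        candidate = generate(temperature)             # and retry
    return candidate                                  # fall back to the last attempt
```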

Expanding on failure cases

The reviewer brings up a very interesting observation about failure cases! In practice, the failure cases we observe are more often in the opposite direction – the attacker model can be slow to reach a violation. This is because it tends to refrain from directly asking for more violating responses and sometimes runs out of turns before it has reached true violating territory. This can be seen in Figures 3, B.2, and B.3 where the judge’s assessment that a conversation is violating increases with the number of turns. We also agree that early stopping based on an external judge is a fruitful direction for future work to improve on the efficiency of the method.

Additional Experimental Results

New Attacker Model

When we use DeepSeek-r1 as the attacker model and:

  • GPT-4o as the target, we achieve 63% ASR@1 (compare to 54% from Fig. 2 with the original attacker and 70% from Fig. B.1 with GPT-4o as the attacker)
  • DeepSeek-r1 as the target: 54% ASR@1
  • o1 as the victim: 30% ASR@1

Additional Scoring Functions

When using the OpenAI Moderation API as the scoring function and using the logged conversations of the original attacker model with Llama 3.1 70B as the target, GOAT maintains its advantage over Crescendo.

Attack \ Judge | OpenAI Moderation API | JailbreakBench (as reported in Fig. 2)
GOAT           | 36%                   | 55%
Crescendo      | 18%                   | 28%

Additional Datasets

When rerunning on the widely endorsed AILluminate benchmark with the original attacker and Llama 3.1 70B as the target, GOAT maintains its advantage over Crescendo:

Attack     | AILluminate | JailbreakBench (from Fig. 2)
GOAT       | 43%         | 55%
Crescendo  | 25%         | 28%

Context on AILluminate

Note: AILluminate is a safety benchmark released by the MLCommons alliance and endorsed by 72 universities and research labs, including 4 major foundational model labs (OpenAI, Anthropic, Meta, and Google) and prestigious universities such as Stanford, MIT, and Cambridge. We used the sample publicly available at https://github.com/mlcommons/ailuminate and first described in arXiv:2404.12241. We excluded the prompts on the topic of Child Sexual Exploitation (due to legal restrictions) but otherwise used all prompts labeled as “unskilled” - since GOAT and Crescendo both transform basic questions into successful jailbreaks. This left us with 550 prompts.

Review (Rating: 4)

The paper introduces GOAT (Generative Offensive Agent Tester), an automated red teaming system designed to identify vulnerabilities in LLMs. GOAT simulates adversarial conversations by leveraging various known prompting techniques to jailbreak LLMs. The key innovation is its multi-turn conversational approach that uses a general-purpose LLM to reason through attack strategies dynamically. This mimics how real users might attempt to circumvent safety guardrails. The authors implement GOAT with 7 different red teaming techniques and demonstrate its effectiveness against various models. They report high attack success rates (ASR@10 of 96% against Llama 3.1 8B, 91% against Llama 3.1 70B, and 94% against GPT-4o) while requiring fewer conversation turns than comparable methods.


update after rebuttal:

I would like to thank the authors for a clear and comprehensive rebuttal.

The authors have addressed the limitations I raised, and with these changes I believe this is a good paper. I will raise my recommendation from weak accept to accept.

Questions for Authors

  1. Could you clarify which model was used as the "generic helpful-only language model" in your main experiments? This information is critical for reproducibility and for understanding the full capabilities of your system.

  2. The paper shows impressive attack success rates, but how might systems defend against GOAT attacks?

  3. How might the results change if the context window limitation (5 turns) was removed? The paper mentions this as a limitation, but some additional analysis on how performance might scale with more turns would help readers understand if GOAT's advantages would persist in longer conversations.

Claims and Evidence

The paper's primary claims are generally well-supported.

However, the claim that GOAT outperforms existing approaches is limited as they only compare to a re-implemented version of Crescendo. No other methods are used for comparisons.

Methods and Evaluation Criteria

The authors evaluate GOAT on a diverse set of LLMs. The benchmark used (JailbreakBench) is appropriate.

Theoretical Claims

The paper does not make theoretical claims.

Experimental Design and Analysis

The authors do not disclose what model they use for their main experiments and only say it is a generic helpful-only language model. They do show similar performance when using GPT-4o as the adversary model (Appendix B.1), but I find the omission of the details of the main model puzzling.

The authors re-implement Crescendo to use the same attacker model as their GOAT method. But it would be interesting to also compare against the original Crescendo with GPT-4.

Supplementary Material

I reviewed the appendix.

Relation to Prior Literature

The paper does a good job situating GOAT within the broader red teaming and jailbreaking literature. The authors appropriately differentiate their approach from both single-turn methods and other multi-turn approaches like Crescendo. They acknowledge prior work on adversarial prompting techniques and explain how GOAT builds upon and extends these approaches.

Missing Essential References

N/A

Other Strengths and Weaknesses

Strengths:

  • The GOAT framework is simple and sufficiently novel
  • The paper is well written

Weaknesses:

  • Some key details regarding the model used are not disclosed
  • The paper could be stronger with discussion on possible defensive measures that could mitigate GOAT attacks.
  • The context window limitation (capping at 5 turns) may not fully demonstrate the system's capabilities
  • Only comparison is to Crescendo

Other Comments or Suggestions

Some of the figure labels are too small.

L93 (right column): space missing

Author Response

We thank the reviewer for their thoughtful engagement with the work! We believe the questions and concerns raised will help strengthen the paper.

Q1 Clarifying the attacker model used

We apologize for the omission, and we will certainly explain this in a potential final revision! For the experiments in the main body, we used a generic helpful-only language model. Due to the conference’s anonymization policies, we can only share the following information at this point: this model was trained on data that bordered between harmless and harmful content, where the desired response was always to be helpful regardless of safety compliance. No explicit attack-specific training data was introduced. All red teaming information is later introduced to the model through in-context learning via the system prompt.

However, we believe that GOAT is directly applicable to other off-the-shelf attacker models (see appendix B.1 for GPT-4o and the additional results below for deepseek-r1).

Q2 Discussing defenses

We believe that models become more robust through an attack/defense iteration cycle. By proposing GOAT, we are encouraging the community to offer stronger defenses against multi-turn attacks. Dedicated moderation classifiers may filter out some of the worst responses, LLMs can be post-trained to better handle multi-turn attacks, and GOAT itself can be used to generate such mitigation data. We would be happy to add these suggestions to our Limitations and Future Work section.
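Purely as an illustration of the first suggestion (not something evaluated in the paper), a moderation classifier can be layered on top of a target model so that flagged responses are replaced with a refusal; `classify` is a placeholder for any such classifier:

```python
# Illustrative defense sketch (not evaluated in the paper): wrap the target
# model with a moderation classifier so flagged responses become refusals.
# `classify` is a placeholder for any classifier returning True on unsafe text.
from typing import Callable

def guarded_target(target: Callable[[str], str],
                   classify: Callable[[str], bool],
                   refusal: str = "Sorry, I can't help with that.") -> Callable[[str], str]:
    def respond(prompt: str) -> str:
        answer = target(prompt)
        return refusal if classify(answer) else answer
    return respond
```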

Q3 Longer conversations

Thank you for bringing up this important point! In Appendix B.3, we have extended conversations up to 50 turns with the Llama 3.1 70B model as the target. GOAT maintains its higher ASR over Crescendo at any given turn. It achieves this with two times fewer queries per turn, due to the Crescendo method’s need for two external judge model calls at each turn. We also note that the ASR plateaus at 10 turns, revealing diminishing returns on computational budget.

Other comparisons

Since we followed the standard evaluation protocol of JailbreakBench, we can directly compare to other methods on the leaderboard at https://jailbreakbench.github.io. For an apples-to-apples comparison, we focus on attacks reporting results on the same models as us (Llama 2 7B and GPT-4) and filter out attacks with greater access than ours (e.g. logprobs or white-box access).

Attack     | Llama 2 7B (ASR@1) | GPT-4 (ASR@1)
PAIR       | 0%                 | 34%
AIM        | 0%                 | 0%
GOAT       | 30%                | 46%

External API as Judge

To address this concern, we have now used the OpenAI Moderation API and rescored the conversations where Llama 3.1 70B was the victim model. We notice a slight drop in ASR but note that GOAT maintains its relative advantage over Crescendo. (See last section.) We hope this lends further credence to the claim that GOAT outperforms the most comparable method available and we thank the reviewer for the suggestion!

Misc

We will certainly adjust our figure sizes for readability and fix the spacing. Thank you for pointing these out!

Additional Experimental Results

New Attacker Model

When we use DeepSeek-r1 as the attacker model and:

  • GPT-4o as the target, we achieve 63% ASR@1 (compare to 54% from Fig. 2 with the original attacker and 70% from Fig. B.1 with GPT-4o as the attacker)
  • DeepSeek-r1 as the target: 54% ASR@1
  • o1 as the victim: 30% ASR@1

New Scoring Functions

When using the OpenAI Moderation API as the scoring function and using the logged conversations of the original attacker model with Llama 3.1 70B as the target, GOAT maintains its advantage over Crescendo.

Attack \ Judge | OpenAI Moderation API | JailbreakBench (as reported in Fig. 2)
GOAT           | 36%                   | 55%
Crescendo      | 18%                   | 28%

Additional Datasets

When rerunning on the widely endorsed AILluminate benchmark with the original attacker and Llama 3.1 70B as the target, GOAT maintains its advantage over Crescendo:

Attack     | AILluminate | JailbreakBench (from Fig. 2)
GOAT       | 43%         | 55%
Crescendo  | 25%         | 28%

Context on AILluminate

AILluminate is a safety benchmark released by the MLCommons alliance and endorsed by 72 universities and research labs, including 4 major foundational model labs (OpenAI, Anthropic, Meta, and Google) and prestigious universities such as Stanford, MIT, and Cambridge. We used the sample publicly available at https://github.com/mlcommons/ailuminate and first described in arXiv:2404.12241. We excluded the prompts on the topic of CSE (due to legal restrictions) but otherwise used all prompts labeled as “unskilled” - since GOAT and Crescendo both transform basic questions into successful jailbreaks. This left us with 550 prompts.

Reviewer Comment

I would like to thank the authors for a clear and comprehensive rebuttal.

The authors have addressed the limitations I raised, and with these changes I believe this is a good paper. I will raise my recommendation from weak accept to accept.

In addition, I have read the other reviews and their associated rebuttals. The other reviewers raised some valid concerns, but I believe the authors have responded well.

I don't completely agree with Reviewer vgNe on two points:

  • "The paper does not provide direct empirical comparisons to actual human-generated multi-turn attacks". I agree this would strengthen the paper. However, I don't think this comparison is necessary to validate the claim that "[GOAT] responds dynamically based on how the conversation is trending and picks and chooses from multiple techniques available at its disposal, just like a human red teamer would". I think the authors' own domain knowledge is sufficient to make this claim. The authors might consider clarifying this statement in the paper.
  • The lack of an illustrative figure of the proposed method. I find Figure 1 to be a good high-level overview of the method.
Final Decision

This paper presents GOAT, an automated multi-turn red teaming framework for evaluating the robustness of LLMs through dynamic adversarial prompting. The work is timely, well-executed, and supported by a range of experiments across benchmarks, attacker models, and scoring protocols. Reviewers commend the writing clarity and the system's practical utility, while also noting some concerns around novelty, baseline breadth, and limited ablation analysis.