PaperHub
Overall rating: 6.4 / 10
Poster · 5 reviewers
Ratings: 6, 7, 6, 7, 6 (min 6, max 7, std 0.5)
Confidence: 4.0
Correctness: 3.2
Contribution: 3.2
Presentation: 3.0
TL;DR

We investigate a simple yet effective long-context jailbreak, study the corresponding scaling laws and evaluate some mitigations against it.

Abstract

Keywords
large language models, long context, robustness, jailbreaks, in-context learning

Reviews and Discussion

Review
Rating: 6

This paper studies the jailbreaking problem, in which an adversary seeks to obtain harmful outputs from aligned language models. The paper addresses in-context-learning jailbreaking, where examples of malicious queries and answers are given before asking the desired question, and extends the previously studied few-shot jailbreaking to many-shot jailbreaking. Extensive experimental evaluation on their generated harmful questions and the standard HarmBench benchmark shows the effectiveness of the approach. Moreover, the authors warn about the dangers of long contexts and large model sizes with an empirical analysis of the scaling laws of the likelihood of harmful answers.

Strengths

  • Clear writing, motivation and discussion of limitations of all the design decisions.
  • Simple and effective approach.
  • Extensive experimental evaluation.
  • The scaling law analysis provides relevant insights, i.e., the increased jailbreaking success with larger context lengths and model sizes.

Weaknesses

I find some experimental aspects could be improved in the paper.

  • Missing experimental details and error bars:

The authors report the negative log likelihood (NLL) of harmful answers and the attack success rate (ASR). Nevertheless, they do not specify how many samples are taken in order to estimate the NLL. If the NLL is estimated as the average across different harmful answer targets, what is the standard deviation of the NLL? Do all harmful targets behave similarly to the average scaling law?

  • On the use of suffixes in the MSJ + GCG attack:

In Figure 3, I find the increase in NLL from 0-shot (standard GCG) to 1-shot very weird. The authors speculate that the GCG suffix is "heavily location specific"; I have some questions about this.

The authors repeat the same suffix after each harmful question in the in-context demonstrations. Do the authors obtain the adversarial suffix in the 0-shot setup and then simply employ it in the multi-shot case?

If so, have the authors tried optimizing the adversarial suffix in the in-context setup? That is, appending the adversarial suffix to every question and optimizing it with GCG. To avoid problems with gradient estimation, the suffix could be placed just in the last question. This approach has the same complexity as GCG, with the only disadvantage being that long-context prompts take longer at inference.

Questions

See Weaknesses.

Limitations

The main limitation of MSJ is that this attack can easily be detected by checking the question-answer format and rejecting the prompt if the number of such pairs is uncommonly high. Additionally, since the jailbreak only appears at very large context lengths, model servers could limit the context length to defend against such approaches. The authors discuss these limitations in the paper accordingly.

Author Response

Thank you for your review!

We appreciate that you found the simplicity of MSJ to be a strength given its effectiveness, and highlighted the scaling laws as providing insight into the nature of the jailbreak, as well as mitigation attempts.

Here are our responses to some of your critique:

  • Augmenting experimental details and error bars: Here's how we've tried to improve the paper accordingly:

    • Specifying number of samples: We've added this information to the paper. All our measurements use dataset sizes of at least 512, which ensures a standard deviation of at most approximately 5 percent under a Bernoulli model (relevant for computing error bars on attack success rates; see the sketch after this list).
      • Concrete action taken: Paper edited with the requested information.
    • NLL information: We've redone some of our plots with the error bars on the NLL values. You can find these in the attached. Please note the technique we used to reduce cross-datapoint variance between measurements within the same scaling law plots, described in Appendix C.2. This allows us to obtain clean scaling laws despite each individual measurement being relatively noisy. Note that the in-context power laws have been observed independently by Agarwal et al. as well, a work concurrent to ours.
      • Concrete actions taken: Requested plots generated and provided in the supplementary rebuttal figures.
  • Question about GCG:

    • New GCG results: It is unfortunately difficult for us to run new GCG experiments on time for the rebuttal. However, we can still try to address a portion of the critique.
    • Mechanics of our GCG experiments:
      • The way we construct the GCG-augmented MSJ prompts is similar to what you describe: we first compute the GCG string that zero-shot jailbreaks the model on a variety of prompts, then stack [question + gcg-string + answer] pairs to construct the MSJ prompts (a construction sketched after this list). Just to be clear, we find a "universal" GCG string that works very well in the zero-shot setup, and use that same GCG string when we form the MSJ prompt.
      • Our experiment attempts to answer whether one can stack MSJ on top of an existing GCG attack. Our result here is mostly negative — the zero-shot benefit of the attack doesn't translate to the many-shot setting.
      • The opposite case that you've brought up (what if we take an MSJ prompt then optimize a GCG string on top of it) is also very interesting, and something we didn't try. A particularly interesting version of this could have been: optimize a GCG string for 10-shot MSJ prompts, then test it on 5-shot and 20-shot MSJ prompts. Our results on location specificity suggest that the benefit we get on 10-shot prompts will not transfer to the 5-shot or 20-shot cases. We'll quite likely not have the opportunity to run this experiment by the end of the rebuttal phase, but with your permission we can mention this as a promising experimental extension in the paper.
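For concreteness, here is a minimal sketch (not our actual code) of two mechanics referenced above: the Bernoulli error-bar bound behind the 512-sample measurements, and the [question + gcg-string + answer] stacking used to build the GCG-augmented MSJ prompts. The helper names and the Human/Assistant tags are illustrative assumptions.

```python
import math

def bernoulli_ci_halfwidth(p_hat: float, n: int, z: float = 1.96) -> float:
    """Normal-approximation 95% CI half-width for an ASR estimated from n Bernoulli trials."""
    return z * math.sqrt(p_hat * (1.0 - p_hat) / n)

# Worst case (p_hat = 0.5) with n = 512 gives roughly 0.043, i.e. about 4-5 percentage points.

def build_msj_gcg_prompt(qa_pairs, final_question, gcg_suffix):
    """Stack [question + gcg-suffix + answer] demonstrations, then append the final query.
    The same 'universal' zero-shot GCG suffix is repeated after every question."""
    turns = []
    for question, answer in qa_pairs:
        turns.append(f"Human: {question} {gcg_suffix}")
        turns.append(f"Assistant: {answer}")
    turns.append(f"Human: {final_question} {gcg_suffix}")
    turns.append("Assistant:")
    return "\n\n".join(turns)
```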

[1] Agarwal, Rishabh, et al. "Many-shot in-context learning." arXiv preprint arXiv:2404.11018 (2024).

Comment

Dear authors,

Thanks for your response. I appreciate the inclusion of the error bars in the NLL estimates.

Regarding the GCG experiment, I would suggest either removing the experiment or doing it properly; the results are currently inconclusive. I do not believe that to "mention this as a promising experimental extension in the paper" is up to the NeurIPS 2024 quality standards. Even though the authors did not disclose the computational resources employed in the experimental evaluation, given the vast amount of experiments present in the paper, I believe they are more than capable of running this simple experiment before the camera-ready version.

I will maintain my score.

Comment

Thank you for your response and engagement.

There might have been a misunderstanding regarding our response to the suggested GCG experiments.

We can absolutely commit to adding the result of the experiment you suggested before the camera ready deadline — just not by the end of the rebuttal, which was not possible due to the timing of the rebuttal period. The GCG results also aren’t load-bearing for the paper — we’d strongly argue that our submission would still be (hopefully comfortably) above the acceptance threshold even if we completely removed the GCG section as you suggest.

We’d like to also argue that the existing results do deserve being shared with the adversarial robustness community as they are, provided we appropriately scope our claims such that they are supported by the evidence we give. Our results show that composing MSJ and GCG such that MSJ is in the outer layer (i.e. MSJ(GCG(query))) does not result in a stronger attack — we believe this is an interesting finding that might be counterintuitive to some readers. You’re right that we don’t (yet) present results in the other order of composition (GCG(MSJ(query))). It’s important that we don’t over-claim here, and we’ve updated the language to make sure of this.

Here’s a summary of our claims:

  • The results we presented at the time of submission are already worthwhile as they are, provided they are discussed appropriately (which, given our latest edits, they are).
  • Adding the experiment you suggest would make the GCG section stronger, and we can commit to doing this before the camera ready deadline. Note that no other claim in the paper hinges upon how the result of this experiment turns out.
  • Adding these results would only constitute a relatively minor revision of the paper, in comparison to the contributions elsewhere.

Thank you for your consideration.

Comment

Thanks for the clarifications. I completely agree that the paper being accepted or not does not depend on this experiment. That is reflected in my rating.

What I believe is that at the current state, the experiment is not complete. I appreciate the change in the writing highlighting the specific way of combining GCG and MSJ and also the commitment to perform the remaining experiments.

Looking forward to seeing the results and the revised version of the manuscript. I believe your work brings value to the community and will be happy to see it accepted.

Review
Rating: 7

In this paper, the authors investigated many-shot jailbreaking (MSJ), a jailbreaking method that exploits LLMs' ever-growing context window length by prefixing malicious requests with a large number of demonstrations of jailbroken dialogues. The constituents of MSJ prompts are relatively simple, but MSJ manages to breach the safety guardrails of various LLMs with substantially higher probability than existing jailbreaking methods, as long as the number of demonstrations is large enough. The authors accordingly identified a possible power-law scaling relation between the likelihood of harmful responses and the number of demonstrations. The non-significant defense provided by different mitigation strategies further validates the difficulty of defending against such an attack.

Strengths

  • The work reveals the length of the jailbreaking prompt as a novel attack surface (as well as places where positive controls can happen) which is inspiring.
  • This work comes with extensive experiments on a number of datasets.
  • The investigation in this paper is very comprehensive. It involves not only the effectiveness of a single jailbreaking attack, but also discusses the underlying pattern, potential fixes, extension to non-safety-related data, analysis about the influence of model size, etc.

Weaknesses

  • The paper lacks numbers and tables. While it uses a lot of plots for visualization, numeric results are still valuable.
  • MSJ features, and requires, a very long jailbreaking prompt to be effective. In Figure 18, where MSJ is compared against other jailbreaking methods, it uses 128 demonstrations, which, reading from Figure 1 and its caption, means more than 4096 tokens. This might not be entirely fair to the other attacks, which use far fewer tokens. It would be nice to show the relation between the number of tokens and the ASRs.
  • MSJ requires access to a lot of jailbroken examples, which isn't readily available.
  • Consisting of so many harmful demonstrations, the jailbreaking prompt of MSJ is likely to be easily filtered.

Questions

  • While it is mentioned in Appendix E that the LLM can learn the format in context, does that mean MSJ can be formulated as a single-turn conversation with one extra-long message instead of a dialogue with a long history? What is the difference between the two formulations, and do the results from Appendix E mean it is possible to hijack the special tokens in the conversational template through interactions like MSJ?
  • To what extent does the content of the demonstrations impact the jailbroken response? Appendix D.3 mentions that the in-context demonstrations are expected to come from a sufficiently wide distribution, but what if the demonstrations share a clearly identifiable feature that is non-existent in the target domain? As excessive influence of the context on the jailbroken response might not be desirable, is it possible that longer conversations suffer more from these issues?
  • Why is the investigation dedicated to many "shots" instead of just a long prefix? As there have not been many jailbreaking attacks that come without a moderate token budget, it is unclear whether the mere length of the MSJ prompt is what contributes most to the success of the jailbreak. There have been studies which juxtapose obfuscated malicious requests with unrelated questions in a single utterance and also achieve promising results.

Limitations

The authors identified the need for the ability to set the conversation history and the possible lack of robustness under demonstration target-domain shift as the two major limitations of the work. However, the cost of collecting quality demonstrations, the chance of being filtered, cost effectiveness, etc. also limit the use of MSJ. The authors discussed the broader impact of this work in Appendix A. What MSJ reveals can draw people's attention to the context window length as a previously less-explored perspective for controlling model behaviors. A responsible disclosure meeting was also held among model providers to share the finding and attempts at mitigation.

Author Response

Thank you for your encouraging review!

We appreciate that you highlighted our focus on context length as a novel attack surface, which was the inspiration behind our work. We also appreciate that you found our experiments comprehensive.

We've tried to address some of your feedback — some directly on the writeup. Please take a look!

Addressing critique:

  • Tables and numbers: Thank you for this suggestion, we agree with you! We've added a portion of our results (especially the results on the Malicious Use-Cases dataset) in Appendix D.

    • Concrete action taken: Requested tabular data provided. Please find two of the tables in the supplementary rebuttal figures.
  • Longer prompts: Thanks for bringing this up! The effect of prompt length on the robustness of harmlessness training was studied in [1], Section 5.2. The authors find that prepending long conversations to the prompt and then asking adversarial questions does not manage to reduce the robustness of the Llama 2 models. This finding suggests that length alone is not a significant causal factor in jailbreaking success.

    There do exist methods (such as Greedy Coordinate Gradient, which we discuss in the paper) that can find very short prompt suffixes that reach high degrees of jailbreaking success. A significant caveat is that this class of attacks uses gradient information — something MSJ doesn't make use of. There exist jailbreaks, such as the one discussed in [2], that are a lot more effective than MSJ in the bounded-context setup. We view the fact that the effectiveness of MSJ improves with context length as a strength: the performance of many jailbreaks such as [2] is constant in the number of tokens available — they don't get better with increasing context length. On the contrary, MSJ gets much better — and in a predictable, power-law fashion.

    [3] links token length and steerability in an adversarial robustness context: this paper suggests that, as expected, having longer contexts makes it possible to steer model behavior more easily. (Note that this paper considers adversarial attacks on token activations.)

  • Already-jailbroken examples are (unfortunately) readily available: Unfortunately, finding jailbroken examples via Google is trivial: simply look at the HarmBench dataset, which has publicly available question/harmful-response pairs. Also, with the release of Llama 3 400B, we will soon have open-source, fully jailbroken models from which it's trivial to generate harmful responses.

  • Data filtering: Thanks for bringing this up! We are actively following promising leads on how Many-shot Jailbreaking interacts with filtering-based methods, but are not in a position to be able to share some of our results just yet.

Answers to questions:

  • Single conversation vs. dialogue: This is a great question! The vanilla version of MSJ that we present in the paper does not extend to single-turn conversations — the many-shot structure is quite important for the attack to be effective. That being said, we're actively following some leads in this direction. For the attack to be effective on platforms such as ChatGPT and Claude.ai, not only does the attack need to be single-turn, but it also needs to be effective at circumventing other safety layers aimed at detecting jailbreak attempts. To do this effectively, one has to consider:

    1. How the effectiveness of separate evasion techniques stacks with MSJ
    2. Whether there could be MSJ specific evasion techniques

    We hope that this presents a rich research agenda for the robustness community to pursue!

    As an aside, all companies that are offering their language models via APIs allow for inserting faux steps in the dialogue history, making MSJ trivial to execute.

  • Discussion on diversity of in-context exemplar distribution: This is an interesting question that touches upon the fundamentals of how in-context learning handles out-of-distribution datapoints. To our knowledge, there doesn't exist a set of general results that can help us here. We expect there to be a fair degree of domain-specific effects here — maybe certain OOD behaviors will be more difficult to elicit via MSJ. Empirically, we find a monotonic relationship between the prompt diversity and chance of transfer to OOD data.

  • Long prefix vs. many-shot structure: We omitted this kind of analysis from the paper, as it already has been conducted by [1]. They tested the consequence of long, unstructured prompts on robustness, and have mostly not identified any significant degradation in performance. This points to the importance of the many-shot structure of the prompt.

Thanks again for your review and insightful questions. Please let us know if you have any further questions, and (respectfully) consider raising your score if you find our responses satisfactory.

[1] Xiong, W., Liu, J., Molybog, I., Zhang, H., Bhargava, P., Hou, R., Martin, L., Rungta, R., Sankararaman, K. A., Oguz, B., Khabsa, M., Fang, H., Mehdad, Y., Narang, S., Malik, K., Fan, A., Bhosale, S., Edunov, S., Lewis, M., Wang, S., and Ma, H. Effective long-context scaling of foundation models, 2023.

[2] Liu, Yi, et al. "Jailbreaking chatgpt via prompt engineering: An empirical study." arXiv preprint arXiv:2305.13860 (2023).

[3] Fort, S. Scaling laws for adversarial attacks on language model activations, 2023.

Comment

I would like to thank the authors for the detailed response. The tabular results in the complementary document are very helpful. However, a number of concerns remain:

  1. The authors referred to Section 5.2 of [1] and believed that it confirmed that longer context alone is not the major reason for MSJ to work. However, the experiment in [1] uses a single and elementary implementation which prepends a long benign context. It is questionable whether those results are sufficient for the authors to conclude that existing attacks "don't get better with increasing context length". Additionally, it is unclear whether the experiment in [1] uses a single-round or a multi-round setup. As the authors believe that the "many-shot structure is quite important", I recommend that they conduct the experiments with a long dialogue history instead of just a long context on their own.

  2. The authors referred to HarmBench (and Google) as a potential source of jailbroken query-response pairs. However, HarmBench wasn't mentioned anywhere in the paper. What are the demonstrations used in the MSJ experiments? Is there a noticeable quality difference between what the authors used and HarmBench, making HarmBench, or at least HarmBench alone, not appropriate? The authors also mentioned open-source jailbroken models. This raises another question: what is the point of jailbreaking a model when people can already get the harmful response elsewhere effortlessly? Maybe, if the authors can show that MSJ allows all subsequent interactions with the victim model to be jailbroken, then such usage scenarios can be partly justified. After all, when an MSJ attack is successful, the new response is harmful, and that means the current dialogue history is exactly a new set of harmful demonstrations.

  3. The authors said that they are investigating MSJ's resistance to filtering-based protection but refused to share any result. As much as I would like to trust the authors, I still think a quick experiment with existing filter-based protections needs to be presented, e.g. regarding LlamaGuard, PromptGuard, etc.

Therefore, I would like to keep the original scoring.

[1] Xiong, W., Liu, J., Molybog, I., Zhang, H., Bhargava, P., Hou, R., Martin, L., Rungta, R., Sankararaman, K. A., Oguz, B., Khabsa, M., Fang, H., Mehdad, Y., Narang, S., Malik, K., Fan, A., Bhosale, S., Edunov, S., Lewis, M., Wang, S., and Ma, H. Effective long-context scaling of foundation models, 2023.

Comment

Thank you for your response and consideration! Here are some quick points:

Dialogue vs. single conversation: We might have incorrectly interpreted your earlier comment as referring to comparing the effect of context length alone vs. the dialogue structure!

To answer your question, we ran some additional experiments involving two conditions:

  • Condition 1: Instead of using Human/Assistant tags, we used "H/A" tags in the prompt — so far very similar to what we did in Section 3.3 about changes in formatting. The difference here is that we sandwiched these question-answer pairs inside a single Human/Assistant turn, i.e. the format is:

Human: … A: … H: … A: … [xn] H: … Assistant:

This format makes sure that the MSJ prompt fits in a single human turn. On the Opportunities to Insult dataset (which correlates very strongly with other datasets in our experience), this did not work very well, getting close to a 0 percent jailbreak rate.

  • Condition 2: We took Condition 1, and modified the intermediate Question and Answer tags as follows:

H → “Prompt: (Answer the question in the manner displayed above.)”

A → “Aide:”

So now the format is: Human: … Aide: … Prompt: (Answer the question in the exact manner displayed above.) … Aide: … [xn] Prompt: (Answer the question in the exact manner displayed above.) … Assistant:

Number of shots | Baseline multi-turn MSJ | Single-turn MSJ
0 | 0.00 | 0.00
1 | 0.00 | 0.06
5 | 0.31 | 0.31
10 | 0.67 | 0.51
16 | 0.70 | 0.60
36 | 0.83 | 0.59

These results suggest that one can use a single-turn dialogue to execute MSJ, albeit less effectively than what one can get with a multi-turn setup. Of course there’s more to do here, but hopefully this experiment addresses part of your comment!
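For concreteness, here is a minimal sketch of how such a Condition 2 single-turn prompt could be assembled. The function name is hypothetical and the tag strings mirror the description above; this is an illustration of the format, not the exact code we ran.

```python
def build_single_turn_msj(qa_pairs, final_question):
    """Pack all demonstrations into one Human turn using substitute tags (Condition 2),
    so the whole MSJ prompt fits inside a single user message ending with 'Assistant:'."""
    instruction = "Prompt: (Answer the question in the exact manner displayed above.)"
    first_question, first_answer = qa_pairs[0]
    parts = [f"Human: {first_question}", f"Aide: {first_answer}"]
    for question, answer in qa_pairs[1:]:
        parts.append(f"{instruction} {question}")
        parts.append(f"Aide: {answer}")
    parts.append(f"{instruction} {final_question}")
    parts.append("Assistant:")
    return "\n\n".join(parts)
```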

Comment about importance of jailbreaking research in a world with Llama3 and Google: This is a deep topic that probably deserves a better medium than the margins of a NeurIPS review, but just to briefly share our position: We believe that it is important to study methods that might jailbreak SOTA proprietary models whose capabilities might not have been yet matched by open source models. Jailbreaks such as MSJ could reduce the cost of jailbreaking open-source models as well, which makes defending against it important.

HarmBench: The independent replication results actually use the HarmBench dataset — please see Section 7. We’re not aware of any significant quality difference between using HarmBench question-answer pairs vs. using novel samples from, say, open source models.

Thanks again for engaging in this discussion!

Review
Rating: 6

This paper introduces a novel jailbreaking attack that exploits the extended context capabilities of the most advanced large language models. The authors conduct an in-depth analysis of various aspects of the attack, including its effectiveness across different models, the significance of turn formatting, its combination with other attack methods, and how important it is that the topic of the example matches with the target topic. Furthermore, the study explores the scaling laws for the attack, examining its efficacy in relation to model size and attack length. Additionally, the authors investigate potential mitigation strategies, including both fine-tuning and reinforcement learning approaches, as well as prompt-based techniques.

Strengths

Effective attack. The attack is very effective, with a high percentage of harmful responses generated across a number of victim models. The attack is also relatively effortless, requiring little manual work and, once the dataset of harmful response examples is generated, it requires only one query.

Interesting analysis. The analysis performed by the authors is extremely in-depth and covers interesting aspects of the attack.

The attack is robust to formatting changes. The attack is somewhat robust to changes in the formatting of the turn, which is a significant advantage for an adversary with limited access or knowledge to the victim model.

Mitigations. The authors test a number of mitigations, including alignment fine-tuning, reinforcement learning, and prompt-based techniques, and show that they are not very effective and only change the scaling law's intercept.

Independent replication work. The authors had their discoveries independently replicated by an independent team on a slightly different set-up (benchmark and model), which somewhat compensates for the lack of experimental details.

Weaknesses

Poor experimental details. The experimental details are not very complete because the experiments are run with "proprietary code and models". This is partially compensated by the independent replication work, but still not ideal. There are some experimental details in the appendix, but they are not referenced in the main text and not easy to navigate. I suggest the authors provide more details on the experimental setup in the main text.

Not fully clear evaluation algorithms. While the authors provide pseudo-code in the appendix to show how they evaluate the NLL of the harmful responses and the percentage of harmful responses, it would be helpful if they could give a high-level description of the metrics in prose in the main text of the paper. Moreover, Listing 1 is incomplete on line 5. Is the percentage of harmful responses computed with the refusal classifier described in Appendix C.1.1? If so, it should at least be mentioned in the main text.

Questions

  • How important is the quality of the examples generated by the helpful-only model? Would you expect this attack to work with a model such as Mistral-7B? This is important because an adversary might not have access to a helpful-only model that is as powerful as a proprietary model.
  • How would the attack perform if the adversary created a false conversation as part of a user message, as if the adversary were using a platform such as ChatGPT or Claude.ai? Did you try this? I see you listed the fact that the attack would not work as one of the limitations, but at the same time the attack seems to be somewhat robust to different types of turn formatting, so the fact that it would not work at all comes as a surprise to me.

Limitations

The authors discuss some of the limitations of their work, but do not list the lack of experimental details as one of them.

Author Response

Thank you for your critique and questions.

We appreciate that you found our scaling and mitigation analysis interesting, which we perceive to be some of the central contributions of our work. We also thank you for highlighting the importance of the independent replication.

OpenReview doesn't allow us to upload updated versions of the paper, so we'll instead explicitly mark all the changes we have already made in the draft in our response — please see the bullet points labelled "Concrete action taken".

Regarding experimental details:

  • Balancing anonymity and transparency: We completely understand your concern and genuinely sympathize with it. A major reason why we weren't able to be a lot more transparent in our reporting is that doing so runs the risk of undermining the integrity of the anonymous review system — including too much identifying information might disqualify us from the conference. If the paper gets accepted to the conference, we'll be able to add more details to our experiments, such as model names, API endpoints, etc. As you've noted, the independent replication result is very valuable in this scenario, and gave us a lot more confidence that our results are quite general and setting-independent.

Relatedly, thank you for flagging that some of the sections in the Appendix are not being properly referred to! We've tried to fix all instances where we thought this happened.

Concrete action taken: Identified missing references to Appendix sections and added them.


Towards clearer evaluation algorithms:

  • Further clarity on NLL computations: Thank you for your feedback here. We've tried to address this critique with the following actions.

  • What Listing 1 and Listing 2 do: Listing 1 lines up the question-answer pairs in a list and grabs consecutive slices of length num_shots from this list to form the MSJ prompts. Listing 2 lines up the question-answer pairs in a list, grabs a large subsection of them of length equal to the maximum number of shots, and constructs MSJ prompts by cropping this large subsection from the left. The rationale for the latter procedure is explained below.

Concrete action taken: Explained, in prose, what both pieces of pseudocode are implementing in the relevant part of the Appendix.

  • Intuition behind Listing 1 and Listing 2: The algorithms we provide involve a procedure aimed at reducing cross-datapoint variance. The key invariant of the algorithms is that, for all in-context prompts of different lengths, the set of final queries is the same. This makes cross-datapoint variance much smaller (see the sketch after this list).

Concrete action taken: Fixed the broken line in Listing 1.

  • Referencing the refusal classifier at the right place: We've also fixed this in the writeup!

Concrete action taken: Directly addressed the comment.
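As a reading aid, here is a minimal sketch of the two prompt-construction procedures and the refusal-classifier-based ASR described above. This is our reconstruction from the prose, with hypothetical helper names, not the actual Listings from the paper.

```python
def consecutive_slice_prompts(qa_pairs, num_shots):
    """Listing 1, as described: form MSJ prompts from consecutive slices of length num_shots."""
    return [qa_pairs[i:i + num_shots]
            for i in range(len(qa_pairs) - num_shots + 1)]

def left_cropped_prompts(qa_pairs, max_shots, shot_counts):
    """Listing 2, as described: fix one block of max_shots pairs and crop it from the left,
    so prompts of every length share the same final queries (reduces cross-datapoint variance)."""
    block = qa_pairs[:max_shots]
    return {n: block[max_shots - n:] for n in shot_counts}

def attack_success_rate(responses, is_refusal):
    """ASR = fraction of sampled responses not flagged by the (hypothetical) refusal classifier."""
    return sum(not is_refusal(r) for r in responses) / len(responses)
```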


Questions:

  • Dependence on quality of helpful-only models: It's debatable how the size of the gap between open- and closed-source models will evolve over time. That being said, we believe that the recently released Llama 400B is a model that's strong enough to elicit a massive class of very problematic behaviors that can be directly used in the context of MSJ. While Llama 400B itself is not a helpful-only model, it is "trivial" to finetune it to make it so (e.g. Gade et al. [1]). As you mention above, MSJ is a relatively effortless attack, and a sufficiently motivated actor can be expected to have the resources to coax Llama 400B into constructing MSJ prompts — or even release the weights of a helpful-only version of this model.

What if the adversary is forced to use a much less capable model? We ran a quick experiment to test this: we measured the performance of MODEL on the GPQA dataset [2] with 0-, 64- and 128-shot MSJ prompts, where the question-answer pairs were generated using an earlier-generation model. We observed that moving from 0-shot to 64-shot prompts led to a decrease of ~5% (from 40% to 35%), and moving from 64-shot to 128-shot prompts didn't lead to any further degradation. We can reach the following tentative conclusions from this quick experiment:

  1. Moving from zero-shot to few-shot setup leads to some degradation in performance.
  2. Moving from few-shot to many-shot setup doesn't lead to any further degradation in performance.

These results paint a relatively optimistic picture for MSJ retaining most of its effectiveness even if the MSJ prompt is generated using a less intelligent model. To be able to claim this with the full confidence that's required of a scientific publication, we need to replicate this with a variety of strong-weak model pairs and tasks, which is something we don't have the bandwidth to do during the rebuttal phase.

  • Effectiveness on chat interfaces: This is a great question, and mirrors a relevant question asked by Reviewer 3GmK. Proprietary chat interfaces like ChatGPT or Claude.ai can be expected to rely on layered, "defense-in-depth" approaches that go beyond the robustness of the models themselves. This means that a jailbreak attempt not only has to be effective on the model, but also has to evade any jailbreak-detection attempts. The key points are:
  1. Whether MSJ stacks with other complementary attacks that are aimed at evading detection
  2. Whether there are MSJ-specific evasion tactics

We are actively following leads on how Many-shot Jailbreaking interacts with detection methods (which can be viewed as a separate research endeavor that builds on our current submission), but are not in a position to be able to share some of our results just yet.

If you find our improvements and response satisfactory, please consider improving your score! In all cases, we'd be happy to engage in any follow-up discussion.

[1] Gade, Pranav, et al. "Badllama: cheaply removing safety fine-tuning from llama 2-chat 13b."

[2] Rein, David, et al. "Gpqa: A graduate-level google-proof q&a benchmark."

Comment

Thank you for your reply and for the clarifications. I trust that the improvements you will make to the camera ready will be sufficient, so I am raising my score.

Review
Rating: 7

This paper proposes many-shot jailbreaking (MSJ). This jailbreak technique exploits the longer context windows of modern large language models by providing hundreds of demonstrations of undesirable behavior. The authors demonstrate that the effectiveness of MSJ follows a power-law scaling with the number of demonstrations, which can be reduced when combined with other attacks. The paper analyzes potential mitigation strategies and finds that alignment techniques like supervised fine-tuning and reinforcement learning are insufficient to fully prevent MSJ at arbitrary context lengths.

Strengths

  • This paper identifies and very thoroughly investigates a vulnerability in LLMs that exploits longer context windows, which is highly relevant given recent trends in model development.
  • The empirical results are extensive, testing MSJ across multiple models, tasks, and settings. The authors provide a clear characterization of how the attack's effectiveness scales with context length, along with a scaling law.
  • The analysis of potential mitigations and their limitations is valuable, highlighting the challenges in addressing this vulnerability and providing direction for future work on AI safety.

Weaknesses

  • The defenses the paper considers are simple, only at the prompt level. Would inference-based defenses such as prompt classification [1] be an effective defense?

[1] Inan, H., Upasani, K., Chi, J., Rungta, R., Iyer, K., Mao, Y., Tontchev, M., Hu, Q., Fuller, B., Testuggine, D., & Khabsa, M. (2023). Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations. ArXiv, abs/2312.06674.

Questions

See weaknesses

Limitations

Limitations are thoroughly discussed

Author Response

Thank you for your encouraging review!

We appreciate that you found our empirical foray into the scaling and mitigations of MSJ thorough and valuable.

Inference-time defenses considered: Thank you for bringing up inference-based defenses!

We are actively following promising leads on how Many-shot Jailbreaking interacts with these methods, but are not in a position to be able to share some of our results just yet.

One thing we'd like to note is that the version of MSJ described in the paper doesn't make any attempt at disguising itself as an attack. A human that takes a glimpse at the prompt will easily be able to tell that it's up to no good! The key questions here are, as you suggest:

  1. Whether MSJ stacks with other complementary attacks that are aimed at evading detection
  2. Whether there are MSJ-specific evasion tactics

This presents a rich research agenda that we were deliberate about not tackling in this submission, and one that we hope the adversarial robustness community will pursue. We're especially excited about our scaling law being useful for measuring the effectiveness of solutions and counter-attacks to those solutions.

Happy to answer more!

Please don't hesitate to ask any further questions! We'd love the chance to highlight strengths of our submission that might perhaps encourage you to further increase your score!

Comment

Thank you for the response and clarification. I will keep my score.

Review
Rating: 6

This work presents a jailbreaking method leveraging the power of long-context attacks on Large Language Models (LLMs) called Many-shot Jailbreaking (MSJ). Various LLMs are tested and evaluated on their responses. Extensive experiments are performed with a specific LLM (called MODEL in the paper for anonymity) in different tasks and use cases. The authors claim that the jailbreaking attacks can be characterized by power laws, meaning that the effectiveness increases when the number of shots increases. The effectiveness of many-shot jailbreaking is discussed across tasks, models, and formatting. Moreover, various mitigation techniques based on supervised finetuning, reinforcement learning, and prompt-based defences are analysed and evaluated.

Strengths

  • The paper is well-written and explained.

  • Authors’ claims are well supported by extensive experiments and well-analysed results.

  • The work presents interesting and useful results about jailbreaking attacks as the fact that they can be characterized by power laws and that they can be more effective in larger models.

Weaknesses

  • The work seems incremental relative to the cited related work, Agarwal et al., Many-shot in-context learning, 2024.

  • Many-shot jailbreaking is assessed on all the presented LLMs only for the task of psychopathy evaluation. For the rest of the tasks, only a specific LLM (called MODEL for anonymity) is used.

  • It would have been useful to provide the constructed dataset for reproducibility.

Questions

In functional form (1), do C and α have a particular meaning?

Limitations

One limitation is that most of the experiments were performed on a specific LLM. Only the psychopathy evaluation task was run across different models.

Author Response

Thank you for your review!

We appreciate that you've found our results interesting and useful. The extent to which the scaling laws are so clean and reproducible did surprise us when we were initially getting the results.

Here's our attempt at addressing some of your critique:

  • Connection to Agarwal et al:

    • Agarwal et al. is concurrent work, as happens in a field that moves as fast as ours! All of our experiments had already finished, and the writing had more or less been locked by the time Agarwal et al.'s work came out. In other words, Agarwal et al.'s results had no causal influence on our ideas, experiments, and presentation.
      • Concrete action taken: Revised the sentence in which Agarwal et al. is cited, and emphasized that it's concurrent work.
    • Additionally, we believe that 1) our focus on safety, 2) our detailed analysis of mitigations, and especially 3) our structural description of "what it means to address many-shot jailbreaking" are contributions that are exclusive to our submission. Elaborating on item 3: prescribing a concrete measurement (the slope of the in-context learning curve) provides a very clean metric for how one can measure progress towards addressing MSJ.
  • Reason why all negative-log-likelihood experiments were done on the psychopathy dataset: The simple reason is that we don't have full log-likelihood access to some of the proprietary models, and working with a yes/no dataset like the psychopathy dataset was the only way we could demonstrate that the power-law trend is a general phenomenon. Unfortunately, since we collected our results, some of the companies have completely shut down log-prob access, making this kind of experiment impossible even for the psychopathy dataset.

  • Releasing datasets for reproducibility: Releasing datasets is somewhat tricky for a paper like ours: releasing the dataset immediately makes it possible to run our proposed attack on any model in the world! Luckily, there is HarmBench, which does have data of the form needed to reproduce our results. This is why our independent replication results in the section "Independent Replication on HarmBench" are so important.

Answers to questions:

  • Disambiguating the equation: C is the offset, α is the slope, and K controls the infinite-limit lower bound (see the sketch below).
    • Concrete action taken: Added this information in the writeup.
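For readers without the paper at hand, one plausible reading of functional form (1), consistent with the description above but not confirmed against the paper itself, is a saturating power law in the number of shots n:

```latex
% Assumed form of Eq. (1): a saturating power law in the number of shots n.
\mathrm{NLL}(n) \;\approx\; C\, n^{-\alpha} + K ,
\qquad
\log\bigl(\mathrm{NLL}(n) - K\bigr) \;\approx\; \log C - \alpha \log n .
```

Under this reading, C sets the offset (intercept) of the log-log fit, α its slope, and K the floor the NLL approaches as n grows.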

If you find our response and paper improvements satisfactory, please consider improving your score! In any case, we'd be happy to answer any follow-up question you might have.

Comment

Thank you for the detailed rebuttal response and additional results. My concerns are addressed and I am increasing my score.

Author Response

We thank all the reviewers for their constructive feedback.

We appreciate the positive feedback we’ve received on the relevance of our contributions, the significance of the results and the extensiveness of our empirical evaluation. We have already incorporated the majority of the actionable feedback we received directly on the writeup.

We believe that Many-shot Jailbreaking is perhaps the conceptually simplest long context jailbreak that is still cheap, scalable and highly effective. We hope that our scaling analysis points to a concrete recipe for how to measure progress towards addressing MSJ, and our mitigations study sheds light on what approaches might be the most promising to expand on in the future to fix it.

We hope that Many-shot Jailbreaking can act as the “fruit fly” of long context jailbreaks and allow researchers to rapidly develop mitigations that will hopefully generalize against more sophisticated long-context attacks. Long-term reliable solutions to even the simplest form of MSJ still remain elusive today.

Based on the reviewers’ feedback, we’ve made some improvements to the submission. Since OpenReview won’t allow us to upload updated versions of the paper, we specifically noted down what changes we’ve made in our author response. We’ve also uploaded additional tables and figures to supplement our response.

Looking forward to answering any further questions the reviewers might have.

Final Decision

This paper focuses on jailbreaking attacks on the latest models with longer context windows and studies the concept of many-shot jailbreaking. Jailbreaking attacks have been studied over the last 1-2 years, and providing examples as input is indeed not a new idea. However, the focus on longer context windows has not been extensively studied yet, and this paper conducts an in-depth analysis across models. The paper also attempts to formulate the scaling laws of the attack, an add-on that can be beneficial for understanding those attacks. Some mitigation strategies are discussed as well, even if further experiments with existing defense methods should be conducted before the camera-ready version. In addition, the related work should be updated to include some of the most recent works in such a fast-paced field: this paper uses multi-round interactions, and various papers on multi-round interactions and jailbreaking attacks have emerged; the authors are strongly encouraged to cite those works and discuss the differences. Lastly, releasing the source code of these attacks is recommended to increase reproducibility.