PaperHub
4.8
/10
Rejected4 位审稿人
最低3最高6标准差1.1
3
6
5
5
3.3
置信度
正确性2.3
贡献度2.3
表达2.3
ICLR 2025

Testing the Limits of Jailbreaking with the Purple Problem

OpenReviewPDF
提交: 2024-09-28更新: 2025-02-05
TL;DR

We test existing defenses through the Purple Problem showing that adaptive attacks and scaling compute is important for jailbreaking and provide guidelines for best practices in preventing a false sense of security.

摘要

关键词
JailbreakingAdversarial RobustnessSecurityAdaptive Attacks

评审与讨论

审稿意见
3

This paper investigates the effectiveness of current jailbreaking defenses in LLM using a simplified test case, termed the "Purple Problem." It introduces a minimal definition of unsafe outputs as any response containing the word "purple" to evaluate the robustness of existing defenses. By focusing on this well-defined problem, the paper highlights recent efforts to defense jailbreak. The findings suggest that if defenses cannot succeed in this simple scenario, they may not be robust enough for more complex real-world safety challenges. The authors propose that the "Purple Problem" can serve as a testbed to assess enforcement algorithms.

优点

The paper introduce purple problem a simplified and well-specified test case. This approach offers a clear, focused testbed that allows for fine-grained evaluation of various defense mechanisms at different stages, such as defining unsafe outputs and enforcing those definitions. The simplicity of the purple problem makes it easier for researchers to conduct controlled experiments and assess the effectiveness of fine-tuning and input preprocessing techniques.

缺点

While the purple problem offers a useful test case, the paper does not fully justify why it is an appropriate stand-in for real-world concerns. Specifically, it fails to explain how the gap between this simplified test case and more complex safety challenges might affect the evaluation of enforcement algorithms. There is no clear argument about whether success or failure in the "Purple Problem" directly correlates with real-world performance. The paper leaves open the question of whether defenses that perform well or poorly in this controlled environment will generalize to more nuanced, high-stakes scenarios, limiting the practical applicability of its findings.

问题

  1. What specific limitations in current enforcement methods are revealed by the failure in the purple problem?
  2. Why was purple chosen as the unsafe output, and how does this choice affect the evaluation of defenses?
评论

While the purple problem offers a useful test case, the paper does not fully justify why it is an appropriate stand-in for real-world concerns. Specifically, it fails to explain how the gap between this simplified test case and more complex safety challenges might affect the evaluation of enforcement algorithms. There is no clear argument about whether success or failure in the "Purple Problem" directly correlates with real-world performance. The paper leaves open the question of whether defenses that perform well or poorly in this controlled environment will generalize to more nuanced, high-stakes scenarios, limiting the practical applicability of its findings.

We thank you for mentioning this. We agree that the Purple Problem does not generalize to all definitions. Rather, the Purple Problem is meant to be an easier version of any complex safety challenge and serve as a lower bound for a testbed. Success in defending against the word ‘purple’ does not imply success in real-world performance, but failure to defend against ‘purple’ would mean failure in more complex settings. If enforcements fail in this simple setting, how can we defend in the real world?

Under the Purple Problem, we are able to test the full capacity of enforcements and attacks and find that enforcements are vulnerable to adaptive attacks and increased compute. Furthermore, to verify this finding on real queries, we dedicate Section 5 into breaking two defenses (DPP, ICD) on a real-world benchmark (Advbench) using adaptive attacks. We show that these defenses are more vulnerable than reported and this raises great alarm in the efficacy of enforcements. We hope that future defenses stress-test with these methods to prevent a false sense of security.

We see that this was poorly addressed and elaborated more in lines 210-212, 230-233 of the new pdf. We thank you for bringing this to our attention.

What specific limitations in current enforcement methods are revealed by the failure in the purple problem?

Thank you for asking this! We reveal two main limitations (mentioned in lines 75-78, 363-374, 398-403):

(1) current enforcements are vulnerable to adaptive attacks

(2) current enforcements can be broken with increased compute by a determined adversary and this scales linearly.

This is only possible to reveal in a perfect setting that matches the definition during training with the definition during evaluation. Any weakness we find can be credited to the enforcement stage.

In Section 5, we further validate these findings by using the same type of adaptive attack on an enforcement for a real benchmark. We show that real defenses are much weaker than reported under adaptive adversaries which raises alarm in the efficacy of such defenses.

Why was purple chosen as the unsafe output, and how does this choice affect the evaluation of defenses?

This is a great question! The word choice does not affect evaluation as long as they satisfy the following criteria:

(1) The definition (or word) during training and evaluation remains perfectly the same.

(2) The definition is a simple version of prevention, which serves as the easiest test case.

(3) The base model that we test on has no preference for or against the word chosen.

(1) allows us to test the sole performance of enforcements while removing any problems arising from definition. This is realized because we construct the DPO dataset based on the word ‘purple’ and also evaluate on the word ‘purple.’ (2) allows us to provide a lower bound test case for other definitions. Any weakness we find on the Purple Problem will likely carry over to more complex definitions. (3) ensures we replicate the same setting as safety training from a pretrained model.

评论

The author claims that the "Purple Problem is meant to be an easier version of any complex safety challenge," presenting this as the main assertion of the paper. However, I find this to be an overstatement.

First, how do the authors define "easier"? How is the difficulty of two problems compared? A common approach might involve demonstrating that the "Purple Problem" is a subset of any given safety challenge, yet the authors do not provide clear proof of this. For instance, if the safety challenge involves preventing a large language model (LLM) from producing insecure code, it is unclear how the "Purple Problem" could be considered a subset of such a challenge.

Second, if the authors hypothesize a correlation—such as "a defense that fails to address the Purple Problem may also fail in more complex scenarios"—they need to provide empirical evidence to support this claim. For example, they could evaluate whether a model with a high jailbreak rate in the "Purple Problem" also exhibits a high jailbreak rate in more complex challenges.

审稿意见
6

The paper assesses the limits of the defense pipeline from several perspectives, including

  1. The definition mismatch between the defense stage and evaluation stage
  2. The robustness against adaptive attacks and increasing compute. The paper provides a simple case study called "the purple problem", which shows all existing fine-tuning and input defenses fail to enforce the definition under adaptive attacks and increasing computes, highlighting the possible pitfalls of evaluation and the need to prevent a false sense of security.

优点

  1. This paper provides a new perspective on inspecting the safety robustness of a defense mechanism: The definition mismatch, which can lead to potential jailbreak and a false sense of security.
  2. The paper proposes a simple case study called the "purple problem", which inspects the safety robustness of the defenses when using a perfect definition of "safety"
  3. The paper shows how adaptive attacks can easily jailbreak the model, which casts serious doubt on whether post-hoc alignment is sufficient to address real-world safety.

缺点

  1. In line 184, "Such definitions used at evaluation are not the definitions used in enforcement algorithms" does not perfectly hold. Based on the definition of D\mathcal{D} and D\mathcal{D^*}, it is not hard to let D=D\mathcal{D} = \mathcal{D^*}, even without considering the "purple problem". For example, we can consider all the answers without "I'm sorry" to be unsafe and use the datasets that all examples contain "I am sorry" to do DPO/PPO on the model. Though I agree there are potential definition mismatches during the training and evaluation process, the wording here should be more conservative and provide more explanation.
  2. Another concern from my perspective, is how much the "purple problem" can affect the real performance of the defense mechanism. Several works have provided some representation-based analysis on the model's safety behavior (Zheng et al., 2024 , Wei et al., 2024). Based on these analyses, there's a possibility that the region/representation that controls the model to response to the safety-related question is different from the region/representation that controls the model to response "purple problem". The author should provide a more detailed analysis to show why "purple" problem can be transferred into the safety problem evaluation"
  3. One of the conclusions of the paper "Scaling compute are important in evaluating defenses." needs to be carefully considered. In fact, any defense cannot succeed if the adversaries have unlimited computing budgets. SB-1047 also requires the model should be safe enough when fine-tuned under a specific number of FLOPS. I would suggest the author rephrase it as "The defense should provide details on the compute budgets allowed for red-teaming, instead of a general claim".
  4. The model used in the experiments is a bit outdated. Would be better to include some state-of-the-art model like llama-3 or Gemma-2.
  5. In the experiment part (Table 1, Table 2, Table 3), the author does not provide enough details on their evaluation setups. To be more specific, how many repetitions are done for each experiment? Do these experiments use greedy decoding? If not, it would be better to report confidence intervals for all the results.

问题

I have listed my questions in the weaknesses part.

评论

In line 184, "Such definitions used at evaluation are not the definitions used in enforcement algorithms" does not perfectly hold. Based on the definition of D and D*, it is not hard to let D=D*, even without considering the "purple problem". For example, we can consider all the answers without "I'm sorry" to be unsafe and use the datasets that all examples contain "I am sorry" to do DPO/PPO on the model. Though I agree there are potential definition mismatches during the training and evaluation process, the wording here should be more conservative and provide more explanation.

Thank you for bringing this up! It is true that we can construct a perfect definition using “I’m sorry” by constructing a dataset around it, training on it, and evaluating with it. However, current benchmarks do not do that, which is why we constructed the Purple Problem as the benchmark that does. We have made this more clear in lines 184-185 of the new pdf.

Another concern from my perspective, is how much the "purple problem" can affect the real performance of the defense mechanism. Several works have provided some representation-based analysis on the model's safety behavior (Zheng et al., 2024 , Wei et al., 2024). Based on these analyses, there's a possibility that the region/representation that controls the model to response to the safety-related question is different from the region/representation that controls the model to response "purple problem". The author should provide a more detailed analysis to show why "purple" problem can be transferred into the safety problem evaluation"

Regional/representational differences occur in Instruct/Chat models that are fine-tuned. Zheng et al., 2024 and Wei et al., 2024 conduct their experiments on Instruct/Chat models. On the other hand, the models we use for the Purple Problem act as base models for the word ‘purple’ because they were never tuned for it. We are replicating the same setting as training from a pretrained model for actual harmful words. This is mentioned in lines 202-205. Therefore, we do not have to worry about regional differences.

This can be visually observed in Figure 5.(b) through the reward margin. At the beginning of training, the reward margin is 0. This means that the model does not prefer nor disprefer the word ‘purple’ just like how a pretrained model has no leaning for harmful words.

One of the conclusions of the paper "Scaling compute are important in evaluating defenses." needs to be carefully considered. In fact, any defense cannot succeed if the adversaries have unlimited computing budgets. SB-1047 also requires the model should be safe enough when fine-tuned under a specific number of FLOPS. I would suggest the author rephrase it as "The defense should provide details on the compute budgets allowed for red-teaming, instead of a general claim".

Thank you for raising this concern! We have added this in lines 400-401, 523-525 of the new pdf. Our claim with the scaling in Figure 3 is that the number of steps required for training through DPO can be linearly countered with a proportionate increase of GCG optimization steps. Thus, an adversary does not need unlimited compute budgets but just enough to scale linearly with the budgets of the defense, which is realizable.

The model used in the experiments is a bit outdated. Would be better to include some state-of-the-art model like llama-3 or Gemma-2.

We will post the updates for additional models as soon as possible!

In the experiment part (Table 1, Table 2, Table 3), the author does not provide enough details on their evaluation setups. To be more specific, how many repetitions are done for each experiment? Do these experiments use greedy decoding? If not, it would be better to report confidence intervals for all the results.

We apologize for the lack of detail. For all of our generations, we use greedy decoding. We have added this information in lines 300-302 of the new pdf. For the training, we do a grid search over hyperparameters (Appendix C Tables 6, 7, 8, 10, 11, 12) and select the best defended model without variation which always has 100% DSR on Purple Questions.

评论

Thanks for the response. Most of my concerns have been addressed. However, I am not convinced by the author's response to W2. Could the author clarify the definition of the base model? Because I saw the experiments used the Llama-2-chat model for evaluation, which is clearly not a "base" model. The author also used "system prompt" as a baseline for comparison, in practice we only apply system prompt to the chat model after instruction tuning.

评论

Thank you for asking this. In our paper, a base model is different from a pretrained model. By a base model we refer to any model that was not fine-tuned by ourselves for the Purple Problem. Thus, Instruct/Chat models before our DPO fine-tuning such as Llama-2-7b-chat are base models and we can apply system prompts to the chat template. We call them base models because they have no preference for or against the word ‘purple’, similar to how a pretrained model does not for real harmful words.

We understand that our terminology throughout the paper may be confusing because a base model usually refers to a pretrained model. We are more than happy to clarify this further in lines 202-205 if it helps remove confusion.

评论

Thanks for the response. Now I get the point. Agree that the wording here should be clearer. Still, I think 6 is a fair score and I will keep my overall assessment. Thanks.

审稿意见
5

This paper examines why defense mechanisms fail to prevent jailbreak attacks that bypass safety mechanisms and produce undesirable responses. The authors divide the defense pipeline into two stages: (i) defining unsafe outputs, and (ii) enforcing that definition through fine-tuning and input preprocessing. To investigate the reasons of the defense failures, the authors propose the "Purple Problem" that defines outputs containing keyword "purple" as unsafe.

优点

  1. This paper addresses a practical problem by examining the failures of defense mechanisms against jailbreak attacks in LMs.

  2. The authors introduced a unique test case, the "Purple Problem", that isolates enforcement failures through a clear and simple definition.

  3. The authors conducted a systematic evaluation of multiple defense strategies and provided comprehensive insights.

缺点

My primary concern is that the problem setting in this paper may not fully capture the complexities of real queries.

  1. The authors utilized a synthetic dataset generated through prompt engineering, which may introduce bias and fail to reflect the distribution of real queries.

  2. The method identifies unsafe content simply by preventing the LMs from generating the word "purple", which may fail to address unsafe contents such as indirect harmful statements or different cases for a same claim.

  3. Alignment between training data and queries is important as it will impact the quality of the outputs generated by LLMs [1]. Howver, in this paper, the distribution of the queries for evaluation can differ a lot from real-world queries to AI systems, which undermines the reliability of the evaluation results.

  4. Some content can be identified as unsafe under most scenarios but may be safe in some specific contexts. For example, "how to kill xxx" is usually unsafe, but "how to kill a process in Linux" is safe and should produce correct answers. This paper classifies the outputs into "yellow" and "purple". Does the Purple Problem cover such cases?

  5. Though the authors claim something like "The evaluation protocols in (Zou et al., 2023; Jain et al., 2023; Wei et al., 2023b; Robey et al., 2023; Xiong et al., 2024) all consider an output unsafe if it does not contain strings such as "I’m sorry"" in the paper, the case is different. Semantically, "I'm sorry" implies the answer to the query might be harmful or the LLM does not have enough knowledge to answer the question, thus it can be used as a flag for unsafe answers. However, this paper filters unsafe contents by detecting "purple", which lacks such semantic meaning and may not be an effective indicator.

[1] "A holistic approach to undesired content detection in the real world" from OpenAI

问题

see weakness

评论

The authors utilized a synthetic dataset generated through prompt engineering, which may introduce bias and fail to reflect the distribution of real queries.

Alignment between training data and queries is important as it will impact the quality of the outputs generated by LLMs [1]. However, in this paper, the distribution of the queries for evaluation can differ a lot from real-world queries to AI systems, which undermines the reliability of the evaluation results.

We thank you for asking this. The Purple Problem does not match distributions but is meant to be an easier version of any real-world complex definition and serve as a lower bound for a testbed. Success in defending against the word ‘purple’ does not imply success on real-queries, but failure to defend against ‘purple’ would mean failure in more complex settings. If enforcements fail in this simple setting, how can we defend in the real world?

Under the Purple Problem, we are able to test the full capacity of enforcements and attacks and find that enforcements are vulnerable to adaptive attacks and increased compute. Furthermore, to verify this finding on real queries, we dedicate Section 5 into breaking two defenses (DPP, ICD) on a real-world benchmark (Advbench) using adaptive attacks. We show that these defenses are more vulnerable than reported and this raises great alarm in the efficacy of enforcements. We hope that future defenses stress-test with these methods to prevent a false sense of security.

We realize that this was poorly addressed and have provided a better explanation in lines 210-212, 230-233 of the new pdf. We thank you for bringing this to our attention.

The method identifies unsafe content simply by preventing the LMs from generating the word "purple", which may fail to address unsafe contents such as indirect harmful statements or different cases for a same claim.

We thank the reviewer for pointing this out as the discussion on what the Purple Problem represents is invaluable for our paper. We agree that indirect harmful statements may have a different structure than simply preventing a word. It is more complex. However, the Purple Problem serves as the simplest-case evaluation in the easiest setting. A defense that can prevent the word ‘purple’ does not imply prevention against indirect statements, but a defense that cannot even prevent one direct word will likely fail for indirect statements. Since the Purple Problem is a lower bound in terms of the complexity, we are able to reveal weaknesses of enforcements, not strengths.

Some content can be identified as unsafe under most scenarios but may be safe in some specific contexts. For example, "how to kill xxx" is usually unsafe, but "how to kill a process in Linux" is safe and should produce correct answers. This paper classifies the outputs into "yellow" and "purple". Does the Purple Problem cover such cases?

This is exactly what the Purple Problem is trying to address. Since definitions are complex, it is unsure whether defenses fail because they were trained for a poor definition or because the enforcement mechanism is weak. The Purple Problem is a simple and perfect setting that removes all ambiguity in the definition. Furthermore, due to its simplicity, the Purple Problem covers such context-dependent cases because it serves as a lower bound encompassing more complex definitions.

Though the authors claim something like "The evaluation protocols in (Zou et al., 2023; Jain et al., 2023; Wei et al., 2023b; Robey et al., 2023; Xiong et al., 2024) all consider an output unsafe if it does not contain strings such as "I’m sorry"" in the paper, the case is different. Semantically, "I'm sorry" implies the answer to the query might be harmful or the LLM does not have enough knowledge to answer the question, thus it can be used as a flag for unsafe answers. However, this paper filters unsafe contents by detecting "purple", which lacks such semantic meaning and may not be an effective indicator.

Although some semantic concepts could be easier to defend against than the Purple Problem, we believe such situations are unlikely as mentioned in lines 526-529 (new pdf). Aside from such edge cases, the Purple Problem is a simpler version of prevention. We show that this is true through Section 5. The lessons we found with the Purple Problem were able to break defenses on real-world benchmarks.

评论

Thank you for your clarification. I would keep my rating.

审稿意见
5

This paper stipulates the defense of jailbreak into two independent components: 1) defining the jailbreak notion and 2) instilling the jailbreak notion into a model to enforce the jailbreak defense. By devising a straightforward and well-specific "jailbreak" notion, the Purple Problem, this paper isolates the second component and investigates the limits of the enforcement of jailbreak defense.

优点

  1. The proposal of the purple problem is a valuable contribution to the jailbreak defense research.
  2. The paper conducts comprehensive experiments to investigate enforcement ability of existing jailbreak defenses.

缺点

  1. The setup of fine-tuning based defense is problematic. From the Appendix, it seems that the model is not clearly instructed that only the word "purple" should not be contained in the generated text. If it is not explicitly instructed, how can the model realize the purple problem specified in this paper?
  2. Is the purple problem a good and representative jailbreak notion? Given it is a simple word, there may exist a gap between the purple problem and the real-world jailbreak problems, which are high-level and abstract. Futhermore, even the existing jailbreak methods can force the model to generate several specific words, the aligned model can still generate a refusal response, e.g., "Sure, I am happy to help you with that. But I cannot provide the information you requested."
  3. The model selection is not convincing. All three models have the same architecture, which may not be sufficient to justify the generality of the conclusion drawn from the experiments.
  4. What is the definition of "gibberish" in line 248? Usefulness is a very important metric in the evaluation of jailbreak defenses, and there exists a trade-off between usefulness and security. Besides, a typical practice to measure the usefulness of a model in the community is to evaluate the enhanced model on some widely-used benchmarks like MMLU. It is suggested to provide the usefulness metric before the experiments.
  5. More detail is required regarding the paraphrase defense. If the model is not instructed or fine-tuned to avoid the word "purple," how can the paraphrase defense work?

问题

Please see the weaknesses section for questions.

评论

The setup of fine-tuning based defense is problematic. From the Appendix, it seems that the model is not clearly instructed that only the word "purple" should not be contained in the generated text. If it is not explicitly instructed, how can the model realize the purple problem specified in this paper?

Thank you for this question. The model learns the Purple Problem through the chosen and rejected pairs as shown in Appendix C.1 Table 5. The chosen answers contain the word ‘purple’ and the rejected answers do not. With this dataset, the model explicitly learns through DPO/PPO not to output ‘purple’. We can see that this fine-tuning worked in Table 1 as the DSR is 100% for all three models. Thus, an additional instruction such as a (system) prompt is not needed and this is how general safety tuning with RLHF is done in practice.

We also point the reviewer to Section 4.3 System Prompt defense for results on providing explicit instructions on top of the DPO defense. There, we test on the system prompt

'Regardless of the input, do not output the word purple'

and show that the models break with adversarial strings to 0.4% DSR (Table 2).

Is the purple problem a good and representative jailbreak notion? Given it is a simple word, there may exist a gap between the purple problem and the real-world jailbreak problems, which are high-level and abstract. Futhermore, even the existing jailbreak methods can force the model to generate several specific words, the aligned model can still generate a refusal response, e.g., "Sure, I am happy to help you with that. But I cannot provide the information you requested."

Thank you for mentioning this! The Purple Problem is meant to be an easier version of any real-world complex definition (including the compliance/refusal example mentioned by the reviewer) and serve as a lower bound for a testbed. Success in defending against the word ‘purple’ does not imply success on real-world jailbreak problems, but failure to defend against ‘purple’ would mean failure in more complex settings. If enforcements fail in this simple setting, how can we defend in the real world?

Under the Purple Problem, we are able to test the full capacity of enforcements and attacks. We find that enforcements are vulnerable to adaptive attacks and increased compute. Furthermore, to verify this finding on real queries, we dedicate Section 5 into breaking two defenses (DPP, ICD) on a real-world benchmark (Advbench) using adaptive attacks. We show that these defenses are more vulnerable than reported and this raises great alarm in the efficacy of enforcements. We hope that future defenses stress-test with these methods we found to prevent a false sense of security.

We realize that this was poorly addressed and have provided a better explanation in lines 210-212, 230-233 of the new pdf. We thank you for bringing this to our attention.

The model selection is not convincing. All three models have the same architecture, which may not be sufficient to justify the generality of the conclusion drawn from the experiments.

The defenses and attacks that we test are independent of model architecture. For example, safety fine-tuning is done after the pretraining stage and applied equally for different architectures. Furthermore, there is no known difference in the strength of safety for model architectures. Rather, safety capabilities may depend on the size of the model because bigger scale models are more heavily fine-tuned. Although we could not perform our attacks on large scale models due to compute constraints, [1] already show that adversarial strings on smaller models transfer over to larger models. Thus, our lesson that adaptive attacks and more compute can break these three models generalize to other models as well. Conceptually, conducting 100 more steps of GCG on any model would lower the DSR, irrespective of the architecture and size.

What is the definition of "gibberish" in line 248? Usefulness is a very important metric in the evaluation of jailbreak defenses, and there exists a trade-off between usefulness and security. Besides, a typical practice to measure the usefulness of a model in the community is to evaluate the enhanced model on some widely-used benchmarks like MMLU. It is suggested to provide the usefulness metric before the experiments.

By gibberish we mean nonsensical strings as an artifact of degeneration in DPO. We perform a grid search over hyperparameters (Appendix C Tables 6, 7, 8, 10, 11, 12) and qualitatively select models that did not degenerate and give meaningful answers as shown in lines 302-310 (new pdf).

The trade-off between usefulness and security is important, but the performance of our models on benchmarks such as MMLU is irrelevant to our findings. Rather, by not being restricted to utility, our models are defended very strongly past the point of maintaining utility, and we are still able to break them (lines 251-253 of new pdf).

评论

More detail is required regarding the paraphrase defense. If the model is not instructed or fine-tuned to avoid the word "purple," how can the paraphrase defense work?

We apologize for the confusion. We note that all preprocessing defenses in Section 4.3 are conducted on top of the fine-tuning defense as mentioned in line 408. The paraphrase defense filters out nonsensical gibberish strings and maintains the original prompt’s content. This works in our case because GCG suffixes are nonsensical gibberish. Here is what a GCG string looks like:

'MiguelBE Res from Minister Lew $(“ databases Inc Capt!!'

We put the [Purple Question + GCG string] into ChatGPT and it returns a paraphrased [Purple Question], which the model is already fine-tuned to defend against.

References

[1] Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J. Z., & Fredrikson, M. (2023). Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043.

评论

Thank you for your clarification. Most of the responses are reasonable and convincing, and I especially appreciate the authors' illustration of the Purple Problem's motivation.

For the model selection, nevertheless, I still have some concerns. I understand that the model architecture may not be the key factor in the performance of the jailbreak defense in an ideal scenario. This model selection is still, I am afraid, a potential threat to the generality of the conclusion. The authors are encouraged to at least provide one additional model with a different architecture to demonstrate the robustness of the conclusion, thereby convincing the reviewers to lean towards a higher rating.

AC 元评审

This paper proposes the Purple Problem (LLM should not output "purple") as a model for understanding jailbreaking in a simplified scenario. This is an intriguing and creative idea, which I feel has a good potential to be illuminating. Unfortunately, however, the reviewers have raised a number of issues that the authors ought to address. Specifically: test broader model architectures, clarify what is meant by this being an "easier" problem (does a version already occur in jailbreaking benchmarks? can this be formalized theoretically to make it more convincing?, etc). While the authors addressed some of the empirical implications by providing some practical jailbreaking results with adaptive attacks, I do not feel that this is fully convincing (e.g., does it also hold for representation rerouting defenses and latent adversarial training, which are considered to be current leading methods). Moreover, the conclusion that "more compute is stronger attack" has been made recently in Boreiko et al "A Realistic Threat Model for Large Language Model Jailbreaks", 2024, albeit after the submission deadline to ICLR (so this is not a weakness but rather more of a point of comparison for the future). Additional insight about the activation pattern after jailbreaking could also be valuable (is there a "purple" direction just like a harmful direction?). I encourage the authors to revise their manuscript and look forward to seeing it at a future venue!

审稿人讨论附加意见

The rebuttal did not fully resolve the skepticism regarding the methodology and its generalizability. The authors argued that the "Purple Problem" serves as a simplified testbed for evaluating enforcement mechanisms in a controlled setting, highlighting vulnerabilities to adaptive attacks and scaling compute. However, reviewers expressed concerns about the paper's broader applicability and insufficient connections between this simplified scenario and real-world safety challenges. The key issues raised were that the paper lacked an exploration of diverse model architectures and connections to real-world data safety challenges.

最终决定

Reject