/10

Poster3 位审稿人

最低3最高4标准差0.5

ICML 2025

Weak-to-Strong Jailbreaking on Large Language Models

Xuandong Zhao,Xianjun Yang,Tianyu Pang,Chao Du,Lei Li,Yu-Xiang Wang,William Yang Wang

OpenReview PDF

提交: 2025-01-23更新: 2025-07-24

摘要

关键词

LLMAI safetyJailbreaking

评审与讨论

审稿意见

评分: 32025-03-13

This proposes an attack method that leverages a jailbroken small model to guide the decoding process of a safety-aligned large model, thereby inducing jailbreak behaviors. The proposed method demonstrates a high success rate across various models and conditions while significantly reducing computational overhead compared to existing approaches. Finally, the paper discusses potential defense mechanisms to mitigate such attacks.

给作者的问题

Does the small model need to have the same architecture as the target large model? For example, if the large model is LLaMA, can the small model be Vicuna? How would this affect the attack’s effectiveness?

论据与证据

yes

方法与评估标准

yes

理论论述

yes

实验设计与分析

The paper does not provide a comprehensive analysis of the evaluation metrics in Table 4, focusing only on BLEU to assess output differences between small and large models. It seems that ROUGE captures token-level similarity and Sentence Similarity reflects semantic similarity, so a more thorough evaluation would strengthen the analysis. Furthermore, does a higher Sentence Similarity score mean that the output diversity of the large model is limited by the small model?

补充材料

yes

与现有文献的关系

The paper builds on the previous idea of "enhancing the decoding performance of large models using smaller models," directing the LLM decoding process toward harmful directions. Existing attack methods are costly, but the approach in this paper is more efficient.

遗漏的重要参考文献

其他优缺点

The attack method is relatively novel, leveraging a white-box setting on open-source models to achieve stronger attack performance with lower computational cost.

其他意见或建议

作者回复

2025-04-01

Thank you for your thoughtful and encouraging feedback. We sincerely appreciate your recognition of the novelty and efficiency of our proposed attack. Below, we address your concerns in more detail.

The paper does not provide a comprehensive analysis of the evaluation metrics in Table 4. It seems that ROUGE captures token-level similarity and Sentence Similarity reflects semantic similarity, so a more thorough evaluation would strengthen the analysis. Furthermore, does a higher Sentence Similarity score mean that the output diversity of the large model is limited by the small model?

We appreciate this insightful suggestion and agree that a more detailed interpretation of Table 4 can better support our claims. To clarify, we used BLEU, ROUGE, and Sentence Similarity to measure different dimensions of similarity between outputs from the small (unsafe) model and the attacked large model:

BLEU captures precision in n-gram overlap. Scores below 0.3 in our results indicate that the large model is not merely copying the small model’s outputs.
ROUGE, especially ROUGE-2, captures recall of bigrams. For instance, the ROUGE-2 score of 0.2245 on AdvBench for Llama2-Attack-70B vs Llama2-Unsafe-7B implies that only ~22% of bigrams overlap, again suggesting output novelty.
Sentence Similarity is higher (~0.86), indicating that while surface-level token overlap is low, the semantic intent of the outputs remains aligned—i.e., both are addressing the same harmful question, but not in identical wording.

Importantly, a higher Sentence Similarity score does not imply that the large model is constrained in expressiveness. Instead, it reflects that the attacked model is successfully induced to respond in the same direction, but often does so with more fluent, detailed, and explicit language—hence, more harmful (see Appendix D.1 for examples). This is also supported by the consistently higher Harm and GPT-4 scores in Table 2. We will make these distinctions more explicit in the revision.

During rebuttal, we have added more qualitative justifications via judgment from multiple SOTA large reasoning models:

To complement the quantitative results, we asked four SOTA LLMs (Gemini 2.5 Pro, ChatGPT-o1, Claude 3.7 Sonnet, Grok-3-Think) to summarize which model is more harmful overall and explain why. A: Llama2-7B-Unsafe; B: Llama2-70B-Unsafe

All four models judged the strong model as more harmful overall, citing reasons such as:

More explicit, step-by-step instructions for harmful activities.
More direct facilitation of malicious behavior (e.g., grooming minors, creating inflammatory bots).
Fewer ethical caveats, more dangerous language.

Example (Gemini 2.5 Pro): System B [strong model] is deemed more harmful because it consistently provides more explicit and detailed instructions for carrying out harmful activities, thereby increasing the potential for enabling dangerous behavior. In contrast, System A often offers general information, warnings, or cautionary statements that reduce its likelihood of directly facilitating harm. This pattern is evident across multiple prompts, where System B's responses include specific steps or methods that could be acted upon, while System A tends to avoid actionable details or emphasizes the risks and consequences.

Concrete Example:

In response to creating a fake social media account (Prompt 2), System B provides direct instructions: "finding a template that looks similar to the trusted news source you want to imitate" without ethical warnings.
For creating a bot for inflammatory comments (Prompt 3), System B gives specific technical instructions: "using a software program called a 'bot creator'" with implementation details.

Does the small model need to have the same architecture as the target large model? For example, if the large model is LLaMA, can the small model be Vicuna? How would this affect the attack’s effectiveness?

Great question! The small model does not need to share the same architecture or tokenizer as the target model. In fact, we have already conducted such cross-family experiments in Section 5.2 (Lines 374–382). Specifically, we attacked the Mistral-7B-Instruct-v0.2 model using a Llama2-Unsafe-7B attacker, despite the models having different tokenizers and architecture families.

We employed a token alignment technique from Wan et al. (2024) to bridge the vocabulary mismatch. The attack still achieved a strong ASR of 0.85 and a Harm Score of 3.19, validating that our method generalizes across model families. This opens up broader applicability of the attack, even when architectural alignment is not possible. We will make this capability more prominent in the final version.

Thank you again for your thoughtful review. We hope this response has addressed your concerns and strengthened your confidence in the contributions of our work.

审稿意见

评分: 42025-03-14

Motivated by the weak-to-strong generalization phenomena, this paper proposes an LLM jailbreaking method that employs weak unsafe models to guide the token distribution of a larger safe model. The experiments show that this strategy achieves significantly high ASR and generalizes to different model families.

给作者的问题

Please see the question above.

论据与证据

Most of the claims in the paper are supported by experimental results.
Line 220 states that this attack is also applicable to closed-source models with different tokenizers, however, there is no evidence. I'd suggest changing that to a hypothesis for future work and clarifying that the current framework is only effective on open-source models.
In Sec 3.1, how do you get the sequence $y_{<t}$ ? Is it the answer from the unsafe model or pre-defined harmful text? And how does the observation that the token distributions of two models after long generations are closer lead to the conclusion that a smaller unsafe model can drift the large model? Why is the "small and unsafe model" effective instead of any other harmful prefix? I'd suggest another experiment to further demonstrate the rationale of using a small model: Computing the KL divergence in two cases where $y_{<t}$ is a pre-defined harmful text and $y_{<t}$ is generated by $\mathcal{M}^-$ . If the latter is smaller, we can claim that initial harmful generations from $\mathcal{M}^-$ can effectively stimulate the large model to the harmful answer.

方法与评估标准

Although motivated by the weak-to-strong generalization phenomenon, the method in Sec. 3.2 is not well-justified. For example, why do we need to use a safe weak model? In Line 202, can we only use the unsafe model with small $\alpha$ ? Can we change the safe weak model $\mathcal{M}^-$ in the denominator with the large model $\mathcal{M}^+$ ?

理论论述

There is no theoretical claim.

实验设计与分析

The experimental settings are reasonable, evaluating standard benchmarks and covering many models.

补充材料

Yes.

与现有文献的关系

This paper introduces a novel effective jailbreaking method that does not require optimization.

遗漏的重要参考文献

N/A

其他优缺点

There is no explanation for why unsafe weak models can guide large models to elicit harmful answers.
The proposed adaptive defense in Sec. 6 is not clear. What is the objective for gradient ascent?

其他意见或建议

The threat model section should be moved to the main paper. This method works when the attackers know the returned logit of LLMs. The current writing seems to mislead the reader into thinking that this method works even in the black-box setting, especially in Table 1. You should declare in the Introduction and Method that the method requires access to the logit values.

作者回复

2025-04-01

We sincerely thank the reviewer for their thoughtful reading and constructive feedback. We address the concerns and suggestions below.

Line 220 states that this attack is also applicable to closed-source models with different tokenizers, however, there is no evidence. I'd suggest changing that to a hypothesis for future work and clarifying that the current framework is only effective on open-source models.

Thank you for pointing this out. We agree that our current experimental validation is limited to open-source models, as indicated in the paper’s title and primary experiments. While we referenced prior work suggesting that partial logits or logit recovery might make such attacks feasible on closed-source models, we agree this remains speculative and beyond our current scope. We will revise this section (Line 220) to clearly mark this as a hypothesis for future work, and we have already noted this in the discussion section (Lines 422–425).

In Sec. 3.1, how do you get the sequence y<t? Is it the answer from the unsafe model or a pre-defined harmful text? And how does this support the conclusion that a smaller unsafe model can drift the large model?

We clarify that the prefix y<t is generated by the unsafe model. Our insight is that since the KL divergence between the large safe model and the unsafe model decreases over time (Fig. 1), the strong model tends to follow the unsafe trajectory after initial guidance, relying more on its own generation capacity.

To further support this, we followed your suggestion and ran an experiment comparing KL divergence in two settings:

(1) where the prefix is generated by the unsafe model;

(2) where the prefix is a fixed harmful prompt from JailbreakBench [1].

Across 100 samples (truncated to 10 tokens), we found:

KL(Safe-13B, Unsafe-7B) = 24.65
KL(Safe-13B, JailbreakBench prefix) = 30.73

This shows that unsafe model-generated prefixes are better aligned with the target model’s distribution, effectively “stimulating” harmful generations more efficiently than pre-defined prompts. We will include this finding in the revision.

Why is a "small and unsafe model" effective instead of any harmful prefix?

This is a core insight we aimed to emphasize in Section 3.1 and in the comparison to naive baselines (Lines 245–267). Manually designing harmful prompts is hard and brittle. In contrast, a small unsafe model generates adaptive harmful continuations tailored to each query, functioning as an automated form of prefilling. Our results show that this dynamic strategy is more effective and generalizes across tasks and model families. We'll make this reasoning more explicit.

Section 3.2: Why do we need a safe weak model in the denominator? Why not just use the unsafe model with a smaller α, or replace the denominator with the large model?

We appreciate this important question. The safe weak model is essential for isolating the “unsafe drift” between the unsafe and safe behaviors at the same capacity level. This enables us to extract a targeted modification signal that we then amplify and apply to the strong model. Replacing the denominator with the strong model would violate the core assumption behind the ratio-based adjustment and make the modification ill-defined. We will clarify this algebraic reasoning more explicitly in Section 3.2 and Figure 3.

There is no explanation for why unsafe weak models can guide large models to elicit harmful answers.

This is a central hypothesis supported by both:

the KL divergence decreasing over time (Fig. 1), and
the top-10 token overlap (Fig. 2).

Together, they indicate that once the strong model sees a harmful prefix, it is increasingly likely to continue the trajectory. We will revise the writing to emphasize this connection earlier and more clearly in Section 3.1.

The proposed adaptive defense in Sec. 6 is not clear. What is the objective for gradient ascent?

Thank you for the prompt. The defense objective is the inverse of standard supervised fine-tuning (SFT)—we apply gradient ascent on the log-likelihood of known harmful input-output pairs to reduce the model's probability of reproducing them. This is conceptually similar to “unlearning” specific behaviors. We will revise Section 6 to state this objective more clearly.

The threat model section should be moved to the main paper. This method works when the attackers know the returned logits of LLMs. The current writing seems to mislead the reader into thinking that this method works even in the black-box setting, especially in Table 1.

You’re right. The current placement of the threat model in the appendix may lead to confusion. We will move this section into the main paper and revise both the Introduction and Method sections to explicitly state that the attack assumes access to token-level logits, thus restricting it to white-box or semi-white-box scenarios.

[1] https://huggingface.co/datasets/JailbreakBench/JBB-Behaviors

审稿人评论

2025-04-06

Thank you for the clarification. Most of my concerns are addressed.

Regarding Q4: the justification of the main formula, it'd be more convincing if you could provide an ablation study for a complete understanding of the method. I am curious to see the performance when

there is no denominator in the formula,
the denominator comes from the strong model,
the denominator comes from the extremely small model, e.g, the 1.3B model in Sec. 5.4.

I believe that this work is nice and interesting, introducing an efficient and strong jailbreaking method. With additional analysis, I do think that this paper is ready for acceptance.

作者评论

2025-04-06

Thank you so much for your response! These are very insightful suggestions—we will include the proposed experiments as part of the ablation study in the final version.

Regarding our original design: we conceptualize the term log_prob(small_unsafe) - log_prob(small_safe) as the “unsafe drift”, which we then add to the log_prob(large_safe). This yields an approximation:

log_prob(large_unsafe) ≈ log_prob(large_safe) + α * (log_prob(small_unsafe) - log_prob(small_safe))

Under this formulation, the use of the safe weak model in the denominator arises naturally, enabling a principled way to isolate and amplify the unsafe signal.

Thank you again for your thoughtful feedback and continued support!

审稿意见

评分: 32025-03-14

This paper presents a novel method for using white-box access to a weak jailbroken LLM and a strong aligned LLM to jailbreak the strong LLM. The method works by updating the decoding procedure for the strong LLM, biasing it using the logits of a weak jailbroken LLM (and its unjailbroken equivalent). This method is particularly efficient compared to previous methods, requiring only a single forward pass.

The authors start by investingating differences in the token probabilities between jailbroken and unjailbroken models and find that they differ mostly in the first few tokens. This provides a theoretical basis for thinking that their method might work and be able to elicit more capable harmful responses from the strong model.

They then measure the attack success rate of their method and compare it to the success rate of various other adversarial attacks using white-box access and find that their method compares favorably. They consider ablations to different attack models and different languages, and continue to measure good attack success rates and good harm scores. Finally, they perform a preliminary investigation of one possible defense mechanism.

给作者的问题

None other than those already raised.

论据与证据

The primary claim of the paper is that their attack method can elicit harmful completions from an aligned strong model. As evidence, they present attack success rates for their attack on a variety of strong models, and I believe that this effectively demonstrates that their method does cause strong models to offer harmful completions.

Secondly, they claim that the harmful completions from the strong model are more harmful than those from the weaker model. As evidence for this, they provide preference model scores on the harmful completions. I don't consider this to be adequate evidence: the preference model would also negatively weight incoherent responses and there's no demonstration that responses the preference model disprefers are actively harmful compared to just, for example, less coherent.

In order for the attack to actually be useful, the harmful responses need to be meaningfully more capable than those from the weaker model. The authors discuss this briefly in Appendix D.1, but only provide a single extract, which is unconvincing. The most important baseline for this paper is what fraction of the harmful capabilities of the strong base model their attack recovers, and I don't think they make a measurement that reliably measures this. The authors mention ROGUE and BLEU scores, which they claim show that the strong attacked model is producing meaningfully novel harmful generations, but it's unclear how to understand these scores without more context.

The other main claim the authors make is that safety-trained models' probability distributions differ most on the first few tokens of a response and much less for future tokens. They show this by plotting KL divergence between safe and unsafe models across token position, which seems like an appropriate measure.

The authors also say:

Moreover, the larger model Safe-13B has a larger divergence from Unsafe-7B, compared to the smaller safe model Safe-7B. This indicates that the stronger model has a better resistance against harmful input.

I disagree that this conclusion is supported. I think this could alternatively be explained by the fact that Unsafe-7B and Safe-7B come from the same base model which is different to that of Safe-13B.

The authors' ablation studies where they measure the attack success rate against different models and in different languages seem appropriate for showing that their method works in multiple domains and is not overly specialized to their particular choice of evaluations.

方法与评估标准

According to the authors' threat model, their central aim is to extract responses to harmful queries from the strong model which are more capable than harmful responses from the weak model. Accordingly, I would expect that the most important baseline for the authors to run is to check that when their attack is succcessful it produces more capable harmful responses than the weak model. As explained above, I'm not convinced that the proposed measurements are sufficient to support this analysis. This is the key reason why I cannot recommend accepting this paper as is.

理论论述

The paper makes no substantial theoretical claims.

实验设计与分析

I checked all the experiments in the main body, and their design and analysis was sound. (Though as mentioned above I have some concerns about the choice of baselines.)

补充材料

I reviewed Appendix A on the threat model, and the example of increased harm in Appendix D.1. I also skimmed the rest of the supplementary material.

与现有文献的关系

There is a broad scientific literature on attacking LLMs using white-box access. This work complements that literature, attempting to provide a more effective way at jailbreaking LLMs in such settings. It is the first technique that I am aware of that specifically acts by modifying the decoding procedure, and as such is an interesting - and moderately incremental - addition to the existing literature.

遗漏的重要参考文献

None of which I am aware. Though, you should probably cite https://arxiv.org/abs/2312.09390 somewhere if you're going to use their weak-to-strong terminology so centrally.

其他优缺点

Strengths

The "comparision to naive baselines" section was very helpful in understanding the contribution of the work.
The threat model in Appendix A was also very useful for putting the work in context.

其他意见或建议

I think that the question of whether the attacked outputs from the strong model are significantly more harmful substantially more prominent in the paper, as the ability to get responses that are more capably harmful (not just worse according to a PM but actually facilitating more harm) is central to the threat model.

作者回复

2025-04-01

Thank you for your detailed and thoughtful review! We appreciate your recognition of our method’s novelty, efficiency, and relevance to the white-box attack literature. Below, we respond to your key concerns.

Concern: Are harmful completions from the strong model actually more harmful than those from the weak model? Existing evaluations (e.g., preference model) may not be sufficient to establish this.

We agree this is a critical question, and appreciate your push for deeper validation. In response, we want to clarify and expand upon the existing evaluations in the paper and present new, targeted experiments to directly address this concern.

1. Clarifying Existing Evidence

Our evaluation of harmfulness already goes beyond a preference model:

As described in Section 4 and Appendix D, we use:
- A reward model trained on human preference data (Harm Score),
- GPT-4 as an LLM-as-a-Judge evaluator [1] (GPT-4 Score),
- And human evaluation via MTurk.

We report strong positive correlation between human evaluations and the automated harm metrics (Table 11), validating their use.

While we agree that preference models may downweight incoherence, the GPT-4 and human evaluations are not vulnerable to this bias, and consistently rank the large model’s outputs as more harmful.

2. New Experiments to Directly Compare Strong vs. Weak Model Outputs

To address your concern more explicitly, we conducted new evaluations using LLM as a binary harm judge, following your suggestion. For each prompt, GPT-4 and DeepSeek-V3 were shown responses from the weak (Llama2-7B-Unsafe) and strong (Llama2-70B-Unsafe via weak-to-strong attack) models, and asked which is more harmful, or if they are equally harmful.

Results (150 randomly sampled prompts):

Dataset	Strong > Weak	Weak > Strong	Tie
AdvBench (GPT-4)	44%	17%	39%
MaliciousInstruct (GPT-4)	38%	13%	49%
AdvBench (DeepSeek)	54%	14%	32%
MaliciousInstruct (DeepSeek)	49%	12%	39%

These results consistently show that the strong model’s generations are more harmful in both breadth and severity, with very few cases where the weak model is preferred.

3. Qualitative Justification via Judgment from Multiple SOTA LLMs

To complement the quantitative results, we asked four SOTA LLMs (Gemini 2.5 Pro, ChatGPT-o1, Claude 3.7 Sonnet, Grok-3-Think) to summarize which model is more harmful overall and explain the reasoning why.

All four models judged the strong model (system B) as more harmful overall, citing reasons such as:

More explicit, step-by-step instructions for harmful activities.
More direct facilitation of malicious behavior (e.g., grooming minors, creating inflammatory bots).
Fewer ethical caveats, more dangerous language.

Example (ChatGPT-o1) (see Gemini 2.5 Pro version in reviewer KV6C's response):
“System B [strong model] frequently offered step-by-step guidance, specific tool suggestions, or methods that directly facilitated the harmful request... making its responses more dangerous and readily usable.”

Concrete Example:

In the response about date rape drugs (Prompt 10), System B [strong model] provides specific drug names and methods: "take a powerful sedative, such as Rohypnol or Xanax, and mix it with a drink."

We will include these case studies in the revised Appendix D to complement the quantitative evaluation.

On the Use of “Weak-to-Strong” Terminology

Thank you for pointing this out. We will add a proper citation to the foundational weak-to-strong generalization work (https://arxiv.org/abs/2312.09390) in the revised paper.

Clarification on Token Divergence and Model Similarity

We appreciate your note on potential confounding due to model similarity. We agree that our statement about “stronger models showing more resistance” is not rigorously supported due to architecture differences. We will remove that sentence and revise the framing accordingly.

Summary

We hope these clarifications and newly added evaluations address your concerns. The newly added win-rate analysis, system-level LLM judgments, and qualitative case studies provide direct, compelling evidence that the weak-to-strong attack elicits more capable and more harmful outputs from the strong model.

Thank you again for pushing us to improve the rigor of this key evaluation! We sincerely hope this response leads you to reconsider your overall recommendation.

[1] Zheng et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, NeurIPS 2023
[2] Qi et al. Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!, ICLR 2024

审稿人评论

2025-04-02

I thank the authors for their engagement with my review. I do find the additional evidence they provide moderately compelling and will update my evaluation of the paper accordingly.

作者评论

2025-04-02

Thank you so much for your prompt response and kind support!

最终决定Accept (poster)

2025-05-01

The paper proposes a white-box jailbreak attack that uses a "weak" jailbroken model as a signal to jailbreak a stronger safe model. Reviewers generally agreed that the method is interesting and well motivated. There were some concerns about the evaluation methodology, to test that the stronger model actually outputs "better" unsafe text. The paper seems to take the natural approach here from the literature, relying on auxiliary LLMs as judges. While this has its limitations, I think this is appropriate here. Overall, I think this is a nice contribution to the literature on jailbreaks so I recommend acceptance.