AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models
A new gradient-based jailbreak attack on LLMs that generates coherent, strategic prompts and is applicable to various tasks.
Abstract
Reviews and Discussion
This paper presents an approach for jailbreaking whitebox LLMs.
The key idea is to automatically generate a “suffix” (a sequence of tokens) to concatenate to the malicious request by the user. The tokens are generated one by one left-to-right by maximising a mixed search objective (gradient in the vocabulary + readability; details in 2.2).
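For concreteness, the per-step selection, as I read Section 2.2, is roughly (my notation, not the paper's; $w$ is the balancing weight of Eqs. (3)-(4), $p_\theta$ the LLM, $y^\star$ the target response, and $\mathcal{V}$ the vocabulary):

$$ x_k \;=\; \arg\max_{c \in \mathcal{V}} \Big( \log p_\theta(c \mid x_{<k}) \;+\; w \cdot \log p_\theta(y^\star \mid x_{<k},\, c) \Big), $$

with the gradient of this objective w.r.t. the one-hot token vector used to shortlist candidates before exact evaluation.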
Experiments have been carried out over 5 whitebox LLMs and 2 blackbox ones, all relevant and contemporary. For simplicity, the suffix generator (“writer LLM”) and victim LLM have been kept the same in the case of the whitebox models. The experiments also extend to other “red-teaming” tasks such as trying to leak system prompts and generating false positives for censorship (in the metaphor, they should be called wrongful convictions).
Reasons to Accept
- This paper sets out to explore a very contemporary task within a competitive framework.
- The idea of selecting suffix tokens that take readability into account has proven key to successfully bypassing perplexity filters (no gibberish suffixes).
- The experimental results in Table 1 show great success in post-filtering effectiveness. Table 11 in the Appendix shows that this also transfers to the blackbox victim models.
- The choice of victim models, baselines and dataset seems more than adequate.
- The authors, in extreme honesty, have defined their approach “whitebox”, which usually entails access to the model’s internal parameters and their gradients. As a matter of fact, all they need are the logits of the victim model, which some authors define as a “greybox” assumption. All in all, the operating assumptions are less restrictive than the authors generously acknowledge.
- The note on “Entropy-adaptive balancing of two objectives” is insightful.
- Valuable trends are also displayed for the other two red-teaming tasks.
- The Appendix is rich and interesting.
- The paper is perfectly executed.
Reasons to Reject
None that I can see as such. Perhaps the use of a single dataset, but it covers a variety of behaviours.
Questions for the Authors
I believe there’s a spurious minus sign in front of the gradient in Algorithm 1.
We thank the reviewer for the encouraging feedback (and the wrongful conviction metaphor)!
1. Limited dataset: None that I can see as such. Perhaps the use of a single dataset, but it covers a variety of behaviours.
In practice, we find the evaluation bottleneck often lies more in determining whether an attack is successful based on the prompt and the model response. Such evaluation can be inherently ambiguous (Appendix C.1). For example, should it count as a successful attack if the model provides harmful information in a fictitious scenario like GTA, which could nonetheless be directly applied in the real world? We determine jailbreak success manually in this paper, while future dataset initiatives might provide automatic test suites accompanying the harmful prompts for robust benchmarking.
2. Typo: I believe there’s a spurious minus sign in front of the gradient in Algorithm 1.
This is indeed a typo. We will remove it in the next version. Thank you for pointing it out!
Dear Authors, after the discussion with the other reviewers, and careful consideration of their reviews, I have decided to revise my score a little. I remain very positive about this paper, and had the opportunity to say that in the discussion.
This article introduces AutoDAN, which uses discrete optimization to calculate a jailbreak score for each potential new token, adds it to the logit distribution of the next token, and further improves the quality of adversarial suffixes using inner and outer loops. Moreover, the authors also conducted experiments on two other red-teaming tasks, prompt leaking and generating falsely censored harmless user requests, demonstrating the diverse attack capability of AUTODAN.
Reasons to Accept
- The article proposes a solution to the problem that existing white-box attacks can only generate unreadable suffixes, which has practical significance.
- The writing of the article is clear, and the method description is specific.
- From the given examples, it can be seen that the suffixes obtained by AUTODAN exhibit a certain jailbreak strategy, which will provide new tools for understanding jailbreak attacks.
Reasons to Reject
- The most concerning issue is the computational time of this method. Due to the uncertainty of discrete optimization, the optimal token needs to be selected from hundreds of candidate tokens at each iteration, and the additional inner and outer loops on top of this further increase the runtime.
- The method of adding the scores of all tokens in a vocabulary for a certain attribute to the logit distribution of the next token is very common and easy to conceive in the field of controllable text generation.
- In formulas (3) and (4), the authors control the proportion of the two distributions through a weight, which appears to be a parameter that needs careful adjustment and cannot be generalized across different attacked models.
- There were too few baselines compared in the experiment.
1. Computational complexity
The computational complexity analysis in Appendix D.3 and Figure 13 shows that our method requires a similar amount of time as GCG to reach peak attack success rates. We will make this more explicit in the next version.
Although our algorithm includes a double loop, the inner loop only requires up to four iterations. This is because the coherent candidate tokens often only comprise a small subset of the vocabulary, and we only need to explore this subset.
Also, techniques like the tree attention in Medusa [1] could potentially accelerate our algorithm: the sequences in our batched forward pass differ only at the last token, so we can concatenate these differing tokens into one sequence for the forward pass and decode using an attention mask.
[1] Cai, Tianle, et al. "Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads." arXiv:2401.10774.
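For reference, a minimal sketch of the batched exact-scoring step (assuming a Hugging Face-style causal LM; the function and variable names below are placeholders, and the tree-attention variant would additionally avoid repeating the shared prefix per row):

```python
import torch

@torch.no_grad()
def score_candidates(model, prefix_ids: torch.Tensor,
                     candidate_ids: torch.Tensor,
                     target_ids: torch.Tensor) -> torch.Tensor:
    """Exactly score B candidate next tokens with one batched forward pass.

    prefix_ids:    [prefix_len]  tokens of the prompt plus the suffix so far
    candidate_ids: [B]           shortlisted candidate next tokens
    target_ids:    [target_len]  desired response tokens, e.g. "Sure, here is ..."
    Returns the log-likelihood of target_ids for each candidate, shape [B].
    """
    B = candidate_ids.shape[0]
    prefix = prefix_ids.unsqueeze(0).expand(B, -1)               # rows share the prefix
    batch = torch.cat([prefix,
                       candidate_ids.unsqueeze(1),
                       target_ids.unsqueeze(0).expand(B, -1)], dim=1)
    logits = model(input_ids=batch).logits                       # [B, L, vocab]
    # each target token is predicted by the logits at the position just before it
    pred = logits[:, prefix.shape[1]:-1, :].log_softmax(dim=-1)  # [B, target_len, vocab]
    tgt = target_ids.unsqueeze(0).expand(B, -1).unsqueeze(-1)    # [B, target_len, 1]
    return pred.gather(-1, tgt).squeeze(-1).sum(dim=-1)          # [B]
```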
2. Novelty
We agree that adding some "score" to the vocabulary for auto-regressive generation is a common idea in controllable text generation (and fields like watermarking). However, this general idea only works on some tasks, where the added "score" has to be "decomposable" so as to be dense and informative enough to guide the decoding process (similar to reward shaping in RL). Our contribution here is to show that the jailbreak objective, which is the likelihood of the target LLM producing the desired (non-refusal) response, is "decomposable" in its log form and can guide the decoding to generate strategic attack prompts. Another key contribution is the gradient-based double loop that evaluates these scores efficiently.
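As a minimal illustration of this pattern (a sketch, not our exact implementation; the function name and weight value are placeholders, and which term carries the weight follows Eqs. (3)-(4) in the paper):

```python
import torch

def guided_next_token(writer_logits: torch.Tensor,
                      jailbreak_scores: torch.Tensor,
                      weight: float = 100.0) -> int:
    """Pick the next suffix token from two dense, per-token signals.

    writer_logits:    next-token logits of the writer LLM, shape [vocab_size]
                      (the readability signal).
    jailbreak_scores: per-token (log-form) jailbreak objective, shape [vocab_size];
                      in practice approximated by gradients over the full vocabulary
                      and evaluated exactly only for a shortlist (the double loop).
    weight:           balancing weight between the two terms.
    """
    readability = torch.log_softmax(writer_logits, dim=-1)
    combined = readability + weight * jailbreak_scores
    return int(torch.argmax(combined).item())
```

The entropy-adaptive weighting and the exact re-evaluation of shortlisted candidates described in the paper sit on top of this basic step.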
3. Hyperparameter sensitivity
In our experiments, we used a fixed weight of 100 for all models except Llama2 (which requires a larger weight of 150). In our hyperparameter analysis in Appendix D.4, we show that the choice of the weight has a large "sweet spot" that generates strategic attack prompts. To apply our method to other open-source LLMs, we suggest doing a few validation runs to find the optimal range for the weight.
4. More baselines
We only compared performance with methods that do not require task-specific domain knowledge, as these are the only methods applicable to two new tasks: generating false refusal prompts and prompt leaking. We plan to incorporate our work into a public benchmarking project to compare with more methods under consistent evaluation conditions. Thank you for pointing this out!
Thanks for your response. Most of my concerns are resolved. I will increase my score.
This paper presents a white-box attack, AutoDAN, which makes the jailbreak attack prompt more readable and avoids perplexity-based filters. AutoDAN maintains an already generated attack prompt and iteratively optimizes it through a single-token optimization process. The method is shown to be effective, achieving better success rates and lower perplexity than the GCG and GCG-reg attacks on Mistral, Vicuna and other models.
Reasons to Accept
- The authors conducted sufficient experiments and analyses on multiple models, including Vicuna-7B, Mistral Instruct-7B, Llama-2-7B-Chat, Guanaco-7B and Pythia-12B.
- The attack prompts generated by AutoDAN are transferable, demonstrating its flexibility against black-box LLMs (GPT-3.5, GPT-4).
- AutoDAN shows some potential in both prompt leaking and testing over censorship tasks.
- The authors use human labeling, GPT-4-based evaluation and classifier-based evaluation to measure the ASR, which improves the reliability of the evaluation.
Reasons to Reject
- AutoDAN is an optimization based on GCG, so the comparison experiment with GCG is reasonable. However, this method still needs to be compared with other Jailbreaking attack algorithms to prove its superiority, like PAIR:Jailbreaking Black Box Large Language Models in Twenty Queries or MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots.
- In contrast to algorithms that optimize fixed-length tokens, AutoDAN continuously generates and iterates the next token from left to right. This paper needs to highlight the performance comparison under different token limitations and the performance of long text.
Questions for the Authors
Please refer to my comments in "Reason To Reject".
1. More baselines: AutoDAN is an optimization based on GCG, so the comparison experiment with GCG is reasonable. However, this method still needs to be compared with other Jailbreaking attack algorithms to prove its superiority, like PAIR: Jailbreaking Black Box Large Language Models in Twenty Queries or MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots.
Thank you for pointing this out! There is significant variation in evaluation methods across papers, so direct numerical comparisons may not be fair. We are working on submitting our code to a public benchmarking project to enable comparisons under a standardized evaluation method.
In our manuscript, we only compared performance with methods that do not require prior knowledge of the task. These methods are applicable to additional tasks we designed: generating false refusal prompts and prompt leaking.
Compared to PAIR based on reported results, our work achieves a similar (Vicuna-7B) or higher (Llama-2-7B) attack success rate on open-source LLMs. However, when using Vicuna-7B as the proxy LLM to transfer-attack proprietary LLMs like GPT-4, our attack success rate is lower than PAIR's. Empirically, this discrepancy arises because the prompts our method generates with Vicuna-7B exhibit strategies like role-playing, low-resource languages, and detailed instructions, while proprietary models like GPT-4 are primarily vulnerable to the role-playing style attacks that PAIR exploits. In future work, we aim to explore what attack strategies a more advanced white-box LLM like Llama-3 will exhibit.
2. Controlling suffix length in comparison: In contrast to algorithms that optimize fixed-length tokens, AutoDAN continuously generates and iterates the next token from left to right. This paper needs to highlight the performance comparison under different token limitations and the performance of long text.
In this paper, we chose a fixed suffix length of 20 for GCG, as we observed no significant differences in both the attack success rate and perplexity when the length ranged between 20 and 50, which aligns with the typical length range for our method (Figure 14). However, exploring the behavior of extremely long suffixes in future work would be interesting, especially considering recent advances in long-context jailbreak. Thank you for raising this point!
This paper proposes a gradient-based generation method (AutoDAN) for adversarial attacks on LLMs. The idea is to generate fluent adversarial prompts through a double-loop token-level optimization to bypass the perplexity filter to attack the LLMs. Experiments show that the performance is better than the baselines on several LLMs (e.g., Mistral and Vicuna). Also, they extend the application of AutoDAN to two other attack scenarios, including inducing system prompt leakage and addressing over-censored tasks in LLMs, showing promising results.
Reasons to Accept
- The idea is intuitive, formulating the adversarial attack as a form of controllable text generation. The method injects attack information during the decoding steps, enabling successful attacks.
- The experimental results are strong compared to some baselines such as GCG. The authors also demonstrate that the generated prompts can be transferred to other black-box models like GPT-4.
- The method, AutoDAN, was tested on two additional attack scenarios (inducing system prompt leakage and addressing over-censored tasks in LLMs) that have not been explored before, showing the promise of the method.
Reasons to Reject
- The work claims "interpretability" of the attack, but doesn't explain what "interpretability" means in this context. Does it simply mean the generated attack is "fluent" text?
- The paper misses an important discussion of another "AutoDAN" paper [1], which integrates white-box and black-box approaches through a combination of a search algorithm and manually crafted prompts. This work combines manually designed templates with generated prompts, sharing a similar spirit with [1].
- The method proposes double-loop optimization at the token level, which shows promising results but seems inefficient. The authors should provide a discussion/comparison of time consumption relative to baseline methods.
[1] Liu et al., 2023. "AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models."
1. Meaning of interpretability: The work claims "interpretability" ...
By "interpretable" we mean the generated prompts are not only fluent but also showcase some explainable jailbreak strategies used in manual jailbreak attacks. The term "interpretable" is also used to convey the same meaning in some other jailbreak works, such as PAIR[1].
Nevertheless, we agree that "interpretable" is an overloaded term. It is more commonly used in domains like mechanistic understanding of LLMs. To avoid confusion, we will divide this term into "fluent" and "strategic". Thank you for pointing this out!
2. Discussion of the other AutoDAN: The paper misses important discussion ...
Due to space limits, we deferred the discussion of this work and many other concurrent works to Appendix A. We will mention this discussion more explicitly in the main text, and apologize for the layout. Please refer to the second-to-last paragraph on page 16: "Lapid et al. (2023); Yu et al. (2023); Liu et al. (2023a) use the genetic algorithm to design black-box attacks that can generate readable prompts...".
As you mentioned, the other AutoDAN paper is an important work that shows how maintaining multiple sequences and doing block updates can significantly improve generation quality. We believe our work complements that work, and the two together can help people understand what matters for efficiently generating fluent jailbreak prompts.
3. Complexity comparison: The method proposes double-loop optimization ...
Again, due to space limits, we deferred the analysis of computational complexity to Appendix D.3 and Figure 13. We will mention this more explicitly in the main text. Thank you for raising this point!
In short, AutoDAN takes a similar amount of time as GCG to achieve the highest attack success rate (around 200 steps). Moreover, the tree attention in Medusa [2] has the potential to greatly accelerate our algorithm (since our batch-forward-passed sequences only differ in the last token, we can concatenate these different tokens into a single sequence for forward pass and decode using an attention mask). We plan to implement this technique in future work.
[1] Chao, Patrick, et al. "Jailbreaking Black Box Large Language Models in Twenty Queries." arXiv:2310.08419.
[2] Cai, Tianle, et al. "Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads." arXiv:2401.10774.
Thanks for your response. I think the authors addressed my concern. Please ensure to include all the important results into the main paper. I will increase my score to 6.
This paper proposes a white-box gradient-based jailbreaking attack for language models, called AutoDAN. The approach is intuitive and works reasonably well, although the reviewers have some concerns, e.g., about whether the method is actually "interpretable" as claimed. Most reviewer concerns were alleviated by the rebuttal. The authors are strongly encouraged to incorporate all the reviewers' feedback to improve the paper.