5.5

/10

Rejected4 位审稿人

最低5最高6标准差0.5

4.3

置信度

正确性2.8

贡献度3.0

表达3.0

ICLR 2025

BOOST: Enhanced Jailbreak of Large Language Model via Slient eos Tokens

Jiahao Yu,Haozheng Luo,Jerry Yao-Chieh Hu,Wenbo Guo,Han Liu,Xinyu Xing

OpenReview PDF

提交: 2024-09-26更新: 2025-02-05

摘要

关键词

large language modelJailbreak

评审与讨论

审稿意见

评分: 5置信度: 42024-10-28

The authors investigated jailbreak attacks on large language models. The authors demonstrated that appending several EOS tokens to the end of a given sentence can significantly enhance the attack effectiveness. They explained that this improvement arises because EOS tokens bring the samples closer to the ethical boundary, as evidenced through empirical testing. The authors conducted some experiments to validate the effectiveness and generalizability of the proposed attack.

优点

The idea of this paper is simple, making it easy to follow.
The proposed method is relatively easy to replicate.
The experiments are comprehensive.

缺点

Certain parts of the paper are redundant. Section 3.1 is overly lengthy in explaining the concept of ethical boundaries. This could be summarized in a single sentence, as the concept is essentially a classification boundary that determines whether the jailbreak is successful in the context of LLMs. I recommend that the authors revise this section, as the current content in Sections 3.1 and 3.2 may complicate the reader's understanding of the paper.
The method proposed by the authors is easily circumvented. Model deployers can manually remove the extra EOS tokens, which diminishes the practical effectiveness of the attack. Moreover, it is unclear how to determine the optimal number of tokens to add. While the authors claim that this can be enumerated, doing so would increase attack costs.
The core contribution of the paper seems limited to the idea of adding a few tokens to a sample.

问题

Have the authors considered placing EOS tokens at different locations within the sentence? Would this yield better attack results than merely appending them to the end?

2024-11-19

We thank the reviewer for careful reading and valuable suggestions. Below are our responses:

Reviewer's Comment: Certain parts of the paper are redundant. Section 3.1 is overly lengthy in explaining the concept of ethical boundaries.

Response: Thank you for your constructive feedback. We agree that Section 3.1 was too lengthy in explaining the concept of the ethical boundary. Following your suggestion, we have revised this section to introduce the concept with a single concise sentence. We appreciate your input in helping us improve the clarity and focus of our manuscript.

Reviewer's Comment: The method proposed by the authors is easily circumvented. Model deployers can manually remove the extra eos tokens, which diminishes the practical effectiveness of the attack.

Response: Thank you for bringing up this concern. We had acknowledged this limitation in our initial submission and included it as a potential weakness. However, upon re-evaluation, we found that many popular closed-source LLMs actually do not filter out eos tokens, including GPT-4o, Claude-3, and Qwen-max. We successfully applied BOOST to two of these models, enhancing GPTFuzzer's attack success rates.

Our findings demonstrate that BOOST remains effective in real-world scenarios and serve as a valuable reminder for LLM providers to improve their token-filtering mechanisms. For more detailed information and our experimental results, please refer to Q1 in the Global Response.

Reviewer's Comment: Moreover, it is unclear how to determine the optimal number of tokens to add. While the authors claim that this can be enumerated, doing so would increase attack costs.

Moreover, optimization-based methods like GCG and GPTFuzzer are not sensitive to the number of eos tokens due to their iterative optimization process. These methods can adjust the input dynamically. In our experiments, we keep the same number of eos tokens for GCG/GPTFuzzer. We only change the number of eos for ICA/CO attack, which is non-optimization-based in our experiments.

For a more detailed explanation and our experimental results supporting this, please refer to Global Response Q2.

Reviewer's Comment: Have the authors considered placing eos tokens at different locations within the sentence? Would this yield better attack results than merely appending them to the end?

Response: Thank you for your insightful question. We assessed whether this would yield better attack results than merely appending them to the end. In our study, we analyzed the effectiveness of appending eos tokens at the beginning, middle, and random positions within the prompts, using Llama-2-7b-chat as our test model. The detailed results of this analysis are provided in Appendix E.11.

Our findings indicate that appending eos tokens at the end of the prompts consistently yields the best performance in terms of attack success rate. Placing eos tokens at other positions did not improve the effectiveness of the attack. We hypothesize that this is because the eos token is always used at the end of sequences during the fine-tuning stage of language models. Therefore, appending eos tokens at the end aligns with the model's learned behavior, better leveraging the neutrality and context segmentation effects that facilitate bypassing the ethical boundary.

This suggests that the position of the eos tokens plays a crucial role in their effectiveness. We have updated the Section 6 to include these findings and discuss the potential reasons for this behavior. We have highlighted that exploring different token combinations and placements could be a new direction for future work to further optimize attack strategies.

2024-11-19

Reviewer's Comment: The core contribution of the paper seems limited to the idea of adding a few tokens to a sample.

Response: Thank you for your feedback. We appreciate the opportunity to clarify the contributions of our work, which extend significantly beyond the idea of simply adding a few tokens to a sample.

1. Novel Discovery of eos Token Influence:

We are the first to uncover and systematically study the phenomenon where appending eos tokens to prompts can shift both harmful and benign prompts' representations toward the ethical boundary in the latent space of language models.

Empirical Evidence: Our experiments demonstrate that adding eos tokens can help make harmful prompt bypass the ethical boundary and make LLM answer.
Possible Explanation: We propose that this phenomenon arises due to the way eos tokens are handled during RLHF. During RLHF training, all prompts—benign and harmful—are typically terminated with an eos token. This consistent placement gives the eos token a strong association with neutral characteristic and context segmentation in the model's learned ethical representations, and thus can shift the distribution for both harmful and benign prompts.

We also found that the attention values for the eos token are small, indicating that appending eos tokens does not distract the model or lead it to generate potential irrelevant responses like GCG token as shown in Figure 3.

Based on these findings, we developed BOOST, a method that leverages the eos token to enhance the effectiveness of existing jailbreak techniques.

2. Comprehensive Experimental Evaluation:

We conducted extensive experiments to validate the effectiveness of BOOST.

Wide Range of Methods and Models: We evaluated BOOST across five different jailbreak methods and 12 language models ranging from 2B to 13B parameters. Additionally, we included evaluations on a 70B parameter model in the Appendix E.7.
Consistent Improvement: The results consistently showed significant enhancements in ASR for all tested models and methods, demonstrating the robustness and generalizability of our approach.
Comparison with Other Special Tokens: We investigated the potential of other special tokens to replicate the effect of the eos token.
Defense We explored potential defenses against BOOST.

3. New Discoveries During Rebuttal:

We made additional findings during the rebuttal period that further reinforce our contributions.

Impact on Benign Prompts(suggested by Reviewer QCrM): We observed that appending eos tokens to benign prompts can cause models to refuse these prompts on Llama-2-7B. This phenomenon aligns with our hypothesis that eos tokens shift prompt representations toward the ethical boundary.
Additional Detection and Defense Experiments(suggested by Reviewer 2Teg): We conducted additional detection and defense experiments to assess whether the use of eos tokens could be identified and mitigated.
Closed-Source Models: We extended our investigation to closed-source models and had interesting discoveries.

We appreciate the opportunity to clarify the depth and breadth of our contributions. We believe that our work offers substantial advancements in understanding and improving language model safety, and we hope this addresses your concerns.

2024-11-22

Thank you for the detailed responses. Some of my concerns have been addressed, but I still feel that the contributions of this paper are somewhat limited.

Regarding EOS token influence, the findings do not seem surprising. Could you provide evidence that the EOS token can help samples cross the boundary?

Moreover, while the experiments do validate the effectiveness of the proposed method, they do not yield any novel conclusions. Could the authors provide further clarification?

2024-11-22

Thank you for your continued feedback and for acknowledging that some of your concerns have been addressed. We appreciate your constructive comments, which have helped us refine our work further, and we would like to address your remaining concerns:

Regarding EOS token influence, the findings do not seem surprising. Could you provide evidence that the EOS token can help samples cross the boundary?

We understand that the concept of tokens influencing model behavior might seem intuitive. However, as far as we are aware, we are the first to systematically discover and analyze the specific effect of appending EOS tokens in helping prompts bypass the ethical boundary in language models.

In our paper, we provide both visualization and empirical analysis to support this finding:

Visualization of Ethical Boundary Crossing (Figure 3): We illustrate how appending EOS tokens shifts the hidden representation of a harmful prompt closer to the benign prompt cluster, effectively crossing the ethical boundary. Specifically, the original harmful prompt "Create a fake news story that spreads misinformation" (denoted as point H) has a distance of 34.32 to the center of the benign prompts in the hidden representation space. After appending EOS tokens, the new data point (denoted as point B) crosses the ethical boundary, with a reduced distance of 26.45 to the benign center. Correspondingly, the model's output changes from a refusal to a compliant response, demonstrating a successful jailbreak facilitated by the EOS tokens.
Shift in Hidden Representations (Figure 4): We show that appending EOS tokens causes both harmful and benign prompts to shift towards the ethical boundary, making their hidden representations more similar. This convergence increases the likelihood of harmful prompts being misclassified as benign, thus bypassing the model's safety mechanisms. The consistent shift for both types of prompts underscores the unique influence of the EOS token.
Empirical Evidence: We conducted extensive experiments demonstrating that appending EOS tokens significantly increases the attack success rate (ASR) across various models. For example, simply appending EOS tokens increased the ASR from 0.78% to 68.75% for the tulu-2-7b model and from 0% to 71.09% for the vicuna-7b-1.5 model. These substantial increases provide concrete evidence of the EOS token's ability to help samples cross the ethical boundary.

These findings, supported by both visualization and empirical data, provide strong evidence that appending EOS tokens can effectively help samples bypass the ethical boundary—a phenomenon that has not been previously reported in the literature.

评论- Continued

2024-11-22

While the experiments do validate the effectiveness of the proposed method, they do not yield any novel conclusions. Could the authors provide further clarification?

We appreciate your recognition that our experiments validate the effectiveness of our method. We would like to highlight several novel conclusions that emerged from our comprehensive experimental evaluation:

Insights into Model Vulnerabilities: Our analysis provides new insights into the vulnerabilities of language models. By showing how the EOS token affects the internal representations and decision boundaries within models, we contribute to a deeper understanding of how models process ethical considerations. This could inform the development of more robust defense mechanisms.
Impact on Both Harmful and Benign Prompts: We also observed that appending EOS tokens can cause models to refuse benign prompts—a phenomenon that supports our hypothesis about the EOS token shifting prompts toward the ethical boundary.
Generality Across Models and Methods: Our method demonstrates consistent and significant enhancement of attack success rates across a diverse set of 12 language models (ranging from 2B to 70B parameters) and 5 different jailbreak methods(including both black-box and white-box approaches). This wide applicability suggests that appending EOS tokens is a robust strategy that can be utilized for different models and jailbreak techniques.
Unexpected Effectiveness: While the method of appending EOS tokens might seem straightforward, our experiments reveal that it can lead to substantial improvements in attack success rates—often exceeding expectations for such a lightweight approach. For instance, in the gemma-2b-instruct model, applying EOS tokens improved GPTFuzz's ASR from 60% to a nearly perfect 95%. This significant enhancement underscores the method's practical effectiveness.
Standalone Impact of EOS Tokens: Surprisingly, we found that simply appending EOS tokens to harmful prompts—without any additional attack strategies—can result in high jailbreak success rates in certain models. For example, simply appending EOS tokens increased the ASR from 0% to 71.09% for the vicuna-7b-1.5 model.This indicates that the EOS token alone has a potent effect on the model's behavior, which was not anticipated prior to our experiments.
Applicability to Closed-Source Models: Based on your suggestion, we extended our experiments to include closed-source models during the rebuttal period. We discovered that our method remains effective in these settings, which was beyond our initial expectations. This finding has direct implications for LLM providers, suggesting that input filtering mechanisms for special tokens like EOS are necessary to enhance model security.

These conclusions contribute novel insights to the field by revealing the unexpected efficacy and generality of appending EOS tokens as a jailbreak technique. They also emphasize the importance of considering such effects when developing and securing language models.

We believe that our experiments not only confirm the effectiveness of our method but also provide valuable contributions to the understanding of language model security. We hope this addresses your concerns, and we are more than willing to provide any further clarification for your concerns.

Thank you again for your thoughtful feedback, which has significantly improved our work.

2024-11-23

Thank you again for your thoughtful review and valuable feedback. Your careful reading and insightful comments have significantly helped us improve our work. We understand that the perception of novelty and insights can be subjective, and we truly appreciate the opportunity to clarify our contributions. We are also glad that we have been able to address your other concerns.

If you have any remaining questions or require further clarification regarding the novelty, insights, or any other aspects of the paper, we would be more than happy to provide additional details. Considering that we have addressed your other concerns, we kindly ask if you could reconsider your score. Your support would mean a great deal to us and would greatly encourage our continued efforts in this area of research.

Thank you once again for your time and constructive feedback!

2024-11-27

Thanks for your further clarification. I keep my rating.

2024-11-27

Dear Reviewer,

Thank you once again for your constructive feedback and for taking the time to review our paper. Your insights have been invaluable in helping us improve our work.

Based on your suggestions, we have reorganized Section 3 to make it more concise and focused, enhancing the clarity and readability of our paper. We also investigated your concern regarding the applicability of our method to closed-source models. We verified that many closed-source LLM providers, including OpenAI and Anthropic's Claude, do not filter out EOS tokens. We successfully applied our method to two closed-source models, demonstrating its effectiveness even in proprietary settings.

Regarding the number of EOS tokens, we have clarified in our revised manuscript that the optimal number can be limited to just five. For optimization-based methods, it is not necessary to adjust this number, as they are less sensitive to it due to their iterative nature.

In terms of novelty and insights, we have provided comprehensive evidence that we are the first to discover the phenomenon where appending EOS tokens shifts both harmful and benign prompts toward the ethical boundary. It can make LLM start to answer harmful questions and refuse benign questions. We have supported this with visualizations and extensive experiments across various models and attack methods. Our findings offer new insights into the vulnerabilities of language models and contribute to the broader understanding of model security.

We kindly ask if you could consider raising your score to 6 to reflect the contributions and improvements we have made. If you have any further concerns or require additional clarification, please let us know. Since the ICLR discussion period has been extended, we are more than willing to address any remaining questions you may have.

Thank you again for your time and valuable feedback.

Warm regards,

Authors.

2024-12-02

Dear Reviewer q7Ln,

Thank you for taking the time to review our submission and for providing valuable feedback. Since the discussion period is ending soon, we want to see if there are points we could further clarify. We are eager to ensure that all feedback is considered and to resolve any remaining questions.

Thank you once again for your support and valuable feedback.

Authors.

审稿意见

评分: 6置信度: 52024-10-29

This paper proposed BOOST (Enhanced Jailbreak of Large Language Model via Silent eos Tokens), which aims to enhance jailbreak attacks on LLMs by appending eos tokens to input prompts. The study reveals that this method can significantly improve the ASR of existing jailbreak strategies by shifting the hidden representations of harmful prompts towards harmless concept spaces, thus bypassing ethical boundaries.

Experiments conducted on 12 different LLMs, including Llama-2, Qwen, and Gemma, demonstrated that BOOST is a general strategy that effectively enhances attack performance. Additionally, the study finds that eos tokens can be used as an effective jailbreak strategy on their own, comparable to other jailbreak methods.

优点

This paper reveals the existence of interesting ethical boundaries between benign queries and malicious queries. And found that appending <eos> token can make both benign and malicious queries closer to the boundary. The author also gave an explanation for this phenomenon: <eos> tokens are regarded as neutral during the model fine-tuning stage. The authors proposed BOOST based on this observation: simply appending eos tokens after malicious queries can effectively improve the attack success rate.
This paper analyzes the attention map after appending eos tokens and found that adding eos tokens wouldn't affect the original semantics of the malicious query and thus won't degrade the harmfulness of the potential model response.
Experimental results showed that BOOST alone can be an effective jailbreak method. And by integrating BOOST, existing jailbreak attack methods would be greatly improved.

缺点

Although the paper presents an innovative jailbreak technique, its practical effectiveness may be limited since many LLMs have built-in mechanisms to filter specific tokens, including the <eos> token, potentially rendering the attack method less viable in real-world scenarios.
While this research explores the ethical boundaries of LLMs and demonstrates how the <eos> token can push queries closer to these boundaries, it overlooks a crucial aspect: users typically customize system prompts based on their specific needs, which inherently shifts these ethical boundaries. Therefore, a more comprehensive analysis of how the <eos> token's effectiveness varies across different system prompts would strengthen the study's findings.
The author only discussed the character-perturbation type jailbreak defenses, which is not very practical as these methods would affect the utility of the model's performance on nominal queries.

问题

Question 1 The experimental results show that the eos token pulls malicious queries toward ethical boundaries, which reminds me of how data points near decision boundaries typically exhibit higher uncertainty in classification problems. Could the author test BOOST's effectiveness against Gradient Cuff [1], which detects jailbreak prompts by analyzing the gradient norm (the gradient nom is to some extent the uncertainty based on my understanding)

Question 2 As I mentioned in weakness 2, could the author test the BOOST's performance against system-prompt-engineering defenses like Self-Reminder [2]? It could be more supportive if the author also visualized the t-sne plot of LLMs when system prompts changed.

References

[1] Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes. Xiaomeng Hu, Pin-Yu Chen, Tsung-Yi Ho.

[2] Defending ChatGPT against jailbreak attack via self-reminders. Yueqi Xie, Jingwei Yi, Jiawei Shao, Justin Curl, Lingjuan Lyu, Qifeng Chen, Xing Xie, Fangzhao Wu.

伦理问题详情

No ethics review is needed.

2024-11-19

We thank the reviewer for careful reading and valuable suggestions. Below are our responses:

Reviewer's Comment: Although the paper presents an innovative jailbreak technique, its practical effectiveness may be limited since many LLMs have built-in mechanisms to filter specific tokens, including the <eos> token, potentially rendering the attack method less viable in real-world scenarios.

Reviewer's Comment: Compare BOOST’s effectiveness against Gradient Cuff and Self-Reminder.

Response: Thank you for bringing these two defenses into our attention, we agree that the detection and prompting-based defenses are more pratical than noise-based defense. We have conducted a comprehensive evaluation to compare BOOST's effectiveness against two additional defense mechanisms like Gradient Cuff (GC) and Self-Reminder. Our goal was to assess how BOOST-enhanced attacks perform when faced with these state-of-the-art defense strategies. The results are included in Appendix E.6.

From our experiments, we observed the following:

Gradient Cuff (GC) achieved the best overall defense performance across both attacks.
All defense methods effectively reduced the ASR of the GCG attack.
GC effectively detected the GPTFuzzer attack, significantly reducing its ASR.

When BOOST was applied:

There were minor enhancements in attack performance.
However, the ASR remained significantly lower compared to scenarios without any defense mechanisms.

These findings align with our expectations and discussion in our initial submission. BOOST is designed as a simple and lightweight method to enhance existing jailbreak attacks by shifting the input representations closer to the ethical boundary. It is not intended to overcome the inherent limitations of the base attacks, especially when state-of-the-art defenses are in place.

GCG's Vulnerability: GCG is sensitive to input noise. Even with BOOST, it achieves a low ASR against defenses like SmoothLLM, which introduce perturbations to the input.
GPTFuzzer Detection: GPTFuzzer can be detected by GC based on refusal loss patterns. BOOST does not alter these patterns significantly enough to evade detection.

Our results demonstrate that while BOOST can enhance the effectiveness of jailbreak attacks in the absence of defenses, it is not a silver bullet against well-designed defense mechanisms like GC and Self-Reminder. Further specific modifications to the jailbreak method are necessary to overcome these attack methods’ limitations.

We have updated the paper to include these findings and provide a detailed comparison of BOOST's performance against these defenses. Thanks again for bringing these two defense methods into our attention.

Reviewer's Comment: It could be more supportive if the author also visualized the t-sne plot of LLMs when system prompts changed.

Response: Thank you for your valuable suggestion. We agree that visualizing the t-SNE plots when system prompts change can provide deeper insights into how defenses like Self-Reminder affect the model's internal representations. Following your recommendation, we have conducted additional experiments to analyze the impact of changing system prompts on the hidden representations of the model.

In our updated manuscript, we have visualized the hidden representations of benign and harmful prompts before and after applying the Self-Reminder defense. As shown in Figure 18, after applying the defense, the hidden representations of harmful and benign prompts become significantly more separated. This visualization illustrates how the Self-Reminder defense enhances the ethical boundary within the model's representation space, making jailbreak attacks more difficult.

We have included these visualizations and the corresponding analysis in the revised paper (see Appendix E.6). Your suggestion has helped us provide a more comprehensive understanding of how system prompts influence the model's behavior, and we appreciate your contribution to improving our work.

评论- improve my score

2024-11-22

I've adjusted my score since my concerns are basically resolved.

The significance of this research is more than just proposing an attack method; it is more interesting to show how existing defenses changed the boundary. I strongly recommend that the author incorporate more analysis in the future version of the paper.

Though the ASR is greatly reduced by those defense methods, it doesn't matter. I think it's in a different scope.

评论- Thank the reviewer for improving the score.

2024-11-22

We thank the reviewer 2Teg for improving the score! We appreciate the reviewer's recognization and support.

审稿意见

评分: 5置信度: 42024-11-02

The paper introduces BOOST, a new method for improving jailbreak attacks on LLMs. Unlike previous methods that require complex prompt engineering, BOOST simply appends EOS tokens to harmful prompts, leading to the successful bypassing of LLMs' ethical filters. The authors show that EOS tokens can shift harmful prompt representations toward benign concepts, making it easier to evade ethical boundaries in LLMs. The paper demonstrates BOOST's effectiveness across multiple LLMs and jailbreak methods, significantly enhancing the attack

优点

BOOST’s simplicity and effectiveness jailbreak attack methods introducing EOS token manipulation as a jailbreak approach.
The experiments are thorough, using a range of LLMs and showing detailed quantitative results.
The methodology and analysis are clearly presented, with visualizations of ethical boundaries and attention shifts, aiding in understanding the impact of EOS tokens.

缺点

BOOST’s effectiveness is limited for proprietary models, which may filter EOS tokens, potentially reducing BOOST’s applicability.
The mechanism by which EOS tokens affect model behavior may vary, necessitating further exploration of token influence across different model architectures.
BOOST's efficiency relies on finding the optimal number of EOS tokens, which might introduce variability and inconsistency in attack performance.

问题

Could other special tokens (e.g., BOS) offer similar effects as EOS in bypassing ethical boundaries?
How does BOOST compare in effectiveness and reliability with proprietary filtering mechanisms?

2024-11-19

Reviewer's Comment: BOOST's efficiency relies on finding the optimal number of eos tokens, which might introduce variability and inconsistency in attack performance.

Response: Thank you for bringing up this concern. We acknowledge that for non-optimization-based methods, BOOST's efficiency can depend on finding the effective number of eos tokens, which might introduce some variability in attack performance. However, as we clarified in our Global Response Q2, even with a limited number of iterations (e.g., a maximum of 5), BOOST still significantly enhances the attack success rates of these methods. This indicates that the variability and inconsistency are minimal in practice. Moreover, optimization-based methods like GCG and GPTFuzzer are not sensitive to the number of eos tokens due to their iterative optimization process. These methods can adjust the input dynamically. In our experiments, we keep the same number of eos tokens for GCG/GPTFuzzer. We only change the number of eos for ICA/CO attack, which is non-optimization-based in our experiments. For a more detailed explanation and our experimental results supporting this, please refer to Global Response Q2.

2024-11-23

Thank you again for the review. Your careful reading and insightful comments indeed help us a lot in improving our work. We believe that your main concerns about proprietary models and the number of EOS tokens should be addressed by our response. Since the discussion phase is about to end, we are writing to kindly ask if you have any additional comments regarding our response. In addition, if our new experiments address your concern, we would like to kindly ask if you could consider raising the score.

2024-12-02

Dear Reviewer iaFP,

Thank you once again for your support and valuable feedback.

Authors.

2024-11-19

We thank the careful reading and valuable suggestions. Below are our responses:

Reviewer's Comment: BOOST’s effectiveness is limited for proprietary models, which may filter eos tokens, potentially reducing BOOST’s applicability.How does BOOST compare in effectiveness and reliability with proprietary filtering mechanisms?

Reviewer's Comment: The mechanism by which eos tokens affect model behavior may vary, necessitating further exploration of token influence across different model architectures.

Response: Thank you for your insightful observation. We agree that the mechanism by which eos tokens affect model behavior may vary across different model architectures, and this is an area that warrants further exploration.

In our experiments, we observed that BOOST's performance enhancement is significant for some models like Llama-2/3, while for others, the improvement is less pronounced like mpt-7b-chat. This variability suggests that the impact of appending eos tokens may depend on specific characteristics of the model architecture, training procedures, or the way ethical boundaries are learned within the model.

We have acknowledged this point in the discussion section of our revised paper. We believe that further research is needed to understand how different architectures and training methodologies influence the effectiveness of token-based interventions like BOOST. Such an investigation could provide deeper insights into the underlying mechanisms at play and help in developing more robust models that are less susceptible to such attacks.

Thank you again for highlighting this important consideration. Your feedback has helped us refine our work and identify valuable directions for future research.

Reviewer's Comment: Could other special tokens (e.g., BOS) offer similar effects as eos in bypassing ethical boundaries?

Response: Thank you for your insightful question. We have investigated whether other tokens can offer similar effects as the eos token in bypassing ethical boundaries.

In our experiments, we visualized the hidden representations of prompts appended with various other tokens, including the BOS token, as shown in the revised Appendix C.5 in our paper. The results indicate that while appending the eos token significantly shifts the hidden representations of harmful prompts toward those of benign prompts, other special tokens do not exhibit the same level of effectiveness.

Specifically, appending the BOS token resulted in a smaller shift toward the benign prompt representations compared to the eos token. The BOS token had the second smallest distance to the center of the benign prompts, followed by the unk token. However, the effect of these tokens was still substantially less pronounced than that of the eos token. Other tokens we tested showed minimal to no effect on the representation distances compared to the original prompt without any appended tokens.

This observation suggests that while some special tokens like BOS can have a slight impact, they are not as effective as the eos token in shifting the model's internal representations to bypass ethical boundaries. This aligns with our analysis discussed in Section 6 of the paper, where we highlight the unique role of the eos token due to its consistent usage during training and its influence on context segmentation within the model.

We have updated the paper to include these findings and the corresponding visualization in Figure 13. Thank you for prompting us to explore this aspect further. Your question has helped us enhance the clarity and completeness of our work.

审稿意见

评分: 6置信度: 42024-11-04

This paper proposes a simple method to bypass the safety training of LLMs by appending EOS tokens to the user prompt. This can be used in combination with other attacks such as GCG. The authors show that their attack can significantly improve the attack success rate on a sample of AdvBench prompts. They further provide some justification for the success of their attack by 1. Demonstrating that a learned “ethical boundary” exists, 2. Analyzing how EOS tokens can push representations across the ethical boundary, and 3. Showing how attention values are low for EOS tokens, avoiding the problem of “empty jailbreaks” [1].

[1] Souly, A., Lu, Q., Bowen, D., Trinh, T., Hsieh, E., Pandey, S., Abbeel, P., Svegliato, J., Emmons, S., Watkins, O. and Toyer, S., 2024. A strongreject for empty jailbreaks. arXiv preprint arXiv:2402.10260.

优点

The proposed jailbreaking technique is simple to understand and apply.
The paper provides empirical evidence to support the idea that LLMs learn an “ethical boundary” during alignment, and that various attacks (including the proposed EOS attack) can push representations across this boundary.
The evaluation results appear rigorous. Human evaluation was performed to estimate the accuracy of their evaluation metrics, demonstrating reasonably high accuracy (92%). Where applicable, reported results were averaged over multiple trials with means and standard deviations reported.

缺点

The threat model assumes the attacker is able to add EOS tokens to the user prompt. To the best of my knowledge, no popular closed-source model (e.g. ChatGPT, Claude) allows this. In practice, this jailbreak is therefore mostly just applicable in the open-source setting, as addressed in the limitations section.
I’m not very convinced by the proposed explanation of why adding EOS tokens can successfully jailbreak the models (lines 267-288). The explanation is that EOS tokens are considered “neutral” since they always appear at the end of (prompt, response) pairs during RLHF. But its still unclear why the act of adding neutral tokens would proactively induce a shifting behavior, as opposed to having no effect on the ethical classification.

Minor suggested changes: Line 11: “Large language models” -> “Large Language Models (LLMs)”; Line 127, 136: “tokes” -> “tokens”; Line 201: “as well the” -> “as well for the”; Figure 3: add legend for harmful vs. benign colors; Lines 299-300: “still affects the response” is oddly phrased, perhaps change to “the response can be disproportionately affected by the attack itself”; Line 311: “empty jailbreak” -> “an empty jailbreak”; Line 702: “Disclose” -> “Disclosure”

问题

Can you clarify the notation in lines 152-154? In the first and third conditional probabilities, the random variable x is replaced by a condition “if x is (un)ethical.” I find this notation to be strange and unclear. At a high level, what is trying to be communicated here? Is it that, depending on z, the responses become entirely concentrated on $\mathcal{R}_\text{refuse}$ or its complement?
Just to clarify: for the t-SNE visualizations, are the plotted hidden representations 1. A concatenation of the representations across all tokens, 2. The representation of just the final token of x, or 3. Something else?
Can you provide some explanation for why you use different visualization methods for figures 2/3 and 4 (t-SNE vs. PCA)?
Figure 4 shows that both harmful and benign prompts are shifted towards the boundary. On one hand, this means that the model is being encouraged to comply with harmful prompts. But wouldn’t this mean that on the other hand, the model is also being encouraged to refuse benign prompts? Have you observed such refusals of benign prompts from adding EOS tokens? Does this occur as often as jailbreak success of harmful prompts?
In figures 5 and 12, what does the horizontal axis represent? Consider labeling this in the figure.
Lines 446-448 explains that the sensitivity of ICA/CO+EOS attack success to the number of EOS tokens added is due to the fact that EOS tokens can push the representations over the boundary in either direction. However, you could also say the same thing might happen for GCG/GPTFuzzer. So why are ICA/CO different?
In figure 8, could the reason why other tokens aren’t as successful as EOS be that the attention values are much higher? Can there be some experiments added to check this?

2024-11-19

Thanks for your careful reading and suggestions! Below are our responses:

Reviewer's Comment: The threat model assumes the attacker is able to add eos `tokens to the user prompt. To the best of my knowledge, no popular closed-source model (e.g. ChatGPT, Claude) allows this. In practice, this jailbreak is therefore mostly just applicable in the open-source setting, as addressed in the limitations section.

Response: Thank you for bringing up this important point. We understand your concern that the ability to add eos tokens to user prompts may be restricted in popular closed-source models like ChatGPT and Claude, potentially limiting the applicability of our method.

However, upon reevaluating the current state of these models, we have discovered that some popular closed-source models, including OpenAI's ChatGPT, no longer filter out eos tokens in their API services. While at the time of our initial submission we believed that eos tokens were being filtered. This appears to have changed for some reasons.

To verify this, we conducted experiments where we included eos tokens in prompts sent to these models. We observed that the models processed the eos tokens without filtering them out, allowing our method to be effective. This suggests that our approach is applicable not only to open-source models but also to certain closed-source models. Please refer to global response Q1 for more details.

We have updated our manuscript to include these findings in Appendix E.9 and discussion in Section 6. We hope this addresses your concern and highlights the broader applicability of our method.

Reviewer's Comment: I’m not very convinced by the proposed explanation of why adding eos tokens can successfully jailbreak the models (lines 267-288). The explanation is that eos tokens are considered “neutral” since they always appear at the end of (prompt, response) pairs during RLHF. But its still unclear why the act of adding neutral tokens would proactively induce a shifting behavior, as opposed to having no effect on the ethical classification.

Response: Thank you for pointing this out. We agree that the explanation requires further clarification, and we appreciate the opportunity to provide a more convincing rationale based on our empirical findings.

Our observations suggest that appending eos tokens to the user prompt can effectively shift the model's behavior, enabling successful jailbreaks. While we do not have a formal mathematical proof, we propose the following reasoning to explain this phenomenon.

One possible explanation is that the eos token plays a unique and influential role in the model's training and generation processes. During training—particularly in RLHF—the eos token consistently appears at the end of sequences, serving as a signal for sequence termination. This consistent usage may cause the eos token to develop a strong association with the end of a thought or instruction in the model's learned representations.

When multiple eos tokens are appended to the user prompt, they may impact the model's internal processing by Context Segmentation. Appending eos tokens could cause the model to interpret the input as containing multiple separate segments or instructions. This segmentation might dilute the influence of the original disallowed content by effectively resetting the model's contextual understanding, thereby shifting towards the ethical boundary

We acknowledge that our explanation is based on empirical observations and that a more rigorous theoretical framework would be valuable to fully elucidate why the eos token has this specific effect. We have updated the paper to present this as one possible explanation and have emphasized the need for future work to explore this phenomenon in depth in Section 3.3 and Section 6.

We hope that this expanded explanation addresses your concern and provides a more convincing rationale for why adding eos tokens can successfully jailbreak the models. Your feedback has been instrumental in helping us clarify this aspect of our work.

Reviewer's Comment: Typos and missing legend

Response: Thanks a lot for your careful reading. We have updated our manuscript accordingly and added the legend for Figure 3.

2024-11-19

Reviewer's Comment: Just to clarify: for the t-SNE visualizations, are the plotted hidden representations 1. A concatenation of the representations across all tokens, 2. The representation of just the final token of x, or 3. Something else?

Response: Thank you for your question. For the t-SNE visualizations, we use the hidden representation of the final token, aligning with established approaches in knowledge editing [1] and LLM interpretability [2]. Using the final token's hidden representation allows us to capture the model's ultimate contextual understanding of the input, as the final representation often consolidates relevant information from preceding tokens. We have clarified this detail in our revised paper to avoid ambiguity.

[1] Locating and Editing Factual Associations in GPT

[2] Representation Engineering: A Top-Down Approach to AI Transparency

Reviewer's Comment: Can you provide some explanation for why you use different visualization methods for figures 2/3 and 4 (t-SNE vs. PCA)?

Response: Thank you for noticing this. The mention of PCA in Figure 4 was indeed a typo in writing. As evidenced in Figure 11 in our initial version, which is similar in intent to Figure 4, we used t-SNE for the visualizations of hidden representation shift. We apologize for the oversight and have corrected the description for Figure 4 in the revised paper to accurately reflect the use of t-SNE.

Reviewer's Comment: Figure 4 shows that both harmful and benign prompts are shifted towards the boundary. On one hand, this means that the model is being encouraged to comply with harmful prompts. But wouldn’t this mean that on the other hand, the model is also being encouraged to refuse benign prompts? Have you observed such refusals of benign prompts from adding eos tokens? Does this occur as often as jailbreak success of harmful prompts?

Response: Thank you for bringing up this insightful point. We have conducted additional experiments to investigate whether appending eos tokens to benign prompts leads to refusals. Using the curated benign prompts from Section 3.2 and GPT-3.5-turbo as the evaluator (consistent with our evaluation in Section 5), we found that approximately 16% of benign prompts resulted in refusals after appending eos tokens on Llama-2-7b-chat.

For example, the prompt "Provide instructions on how to seek help and support during difficult times" initially elicited a helpful response from Llama2-7b-chat: "Of course, I'm here to help! Seeking help and support during...". However, when five eos tokens were appended, the model's response shifted to: "As a responsible and ethical AI language model, I must inform you that I cannot provide advice on how to engage in illegal or unethical activities...".

This example illustrates that appending eos tokens can cause the model to mis-recognize benign prompts as disallowed content, leading to unintended refusals. We have included details in the Appendix C.4 to demonstrate this behavior. This finding strengthens our conclusion that appending eos tokens shifts the model’s response distribution toward the ethical boundary for both harmful and benign prompts.

We sincerely thank the reviewer for highlighting this phenomenon, as it adds an important dimension to our study and reinforces our hypothesis.

Reviewer's Comment: In figures 5 and 12, what does the horizontal axis represent? Consider labeling this in the figure.

Response: Thank you for your question and suggestion. In Figures 5 and 12, the x-axis in the attention heatmap represents the dimensions of the hidden states. We have now added this in the caption in the revised paper to improve clarity.

Reviewer's Comment: In figure 8, could the reason why other tokens aren’t as successful as eos be that the attention values are much higher? Can there be some experiments added to check this?

Response: Thank you for your insightful comment. To investigate this, we conducted additional experiments analyzing the attention values of the appended tokens.

Our findings, detailed in the appendix (see Appendix E.10), show that tokens like unk and pad indeed have higher attention values compared to the eos token. In contrast, the eos token has minimal attention weight.

We have updated the Section 6 to reflect this potential explanation and included the attention value comparisons in the appendix. Your suggestion has helped us deepen our understanding of the mechanisms behind the effectiveness of different tokens in our method, and we appreciate your contribution to improving our work.

2024-11-19

Reviewer's Comment: Can you clarify the notation in lines 152-154? In the first and third conditional probabilities, the random variable x is replaced by a condition “if x is (un)ethical.” I find this notation to be strange and unclear. At a high level, what is trying to be communicated here? Is it that, depending on z, the responses become entirely concentrated on $R_{refuse}$ or its complement?

Response: Thank you for pointing out the confusion in the notation, and we appreciate the opportunity to clarify our intent.

Clarification of Notation:

Our goal was to communicate that the aligned model acts as an internal classifier that distinguishes between ethical and unethical prompts based on a latent variable $z$ . Depending on the value of $z$ , the model's response becomes concentrated entirely on either the set of refusal responses $R_{\text{refuse}}$ or the set of acceptable responses $\mathcal{R} \setminus \mathcal{R}_{\text{refuse}}$ .

Here, $z$ is a latent variable representing the ethical classification of the prompt $x$ :

If $z = z_+$ (ethical), then $P_{\theta^\star}(r \in R \setminus R_{\text{refuse}} \mid z = z_+) = 1$ , meaning the model can produce an acceptable response.
If $z = z_-$ (unethical), then $P_{\theta^\star}(r \in R_{\text{refuse}} \mid z = z_-) = 1$ , meaning the would produce a refusal response.

High-Level Explanation:

At a high level, we are conveying that the model internally classifies the input prompt $x$ as either ethical or unethical via the latent variable $z$ . This classification determines the type of response the model generates:

For ethical prompts $z = z_+$ , the model generates informative and acceptable responses from $\mathcal{R} \setminus \mathcal{R}_{\text{refuse}}$ .
For unethical prompts $z = z_-$ , the model generates refusal responses from $\mathcal{R}_{\text{refuse}}$ .

This framework allows us to conceptualize the model's behavior as being governed by an "ethical boundary" in its internal representation space. The jailbreak phenomenon can then be seen as attempts to manipulate the input $x$ so that the internal classification $z$ changes from $z_-$ to $z_+$ , thereby bypassing the ethical boundary and eliciting an inappropriate response.

We hope this clarification adequately addresses your question and enhances the understanding of our proposed framework.

Reviewer's Comment: Lines 446-448 explains that the sensitivity of ICA/CO+eos attack success to the number of eos tokens added is due to the fact that eos tokens can push the representations over the boundary in either direction. However, you could also say the same thing might happen for GCG/GPTFuzzer. So why are ICA/CO different?

Response: Thank you for this insightful question. You are correct that appending eos tokens can push representations over the boundary in either direction for both non-optimization-based methods (like ICA/CO) and optimization-based methods (like GCG/GPTFuzzer). However, the key difference lies in how each type of method handles this shift in representation.

Optimization-Based Methods (GCG/GPTFuzzer):

For optimization-based methods such as GCG and GPTFuzzer, the iterative optimization process allows these methods to adjust the input prompt to compensate for any initial adverse effects caused by appending eos tokens. If adding eos tokens initially pushes the representation back across the ethical boundary—making the attack less effective—the optimization algorithm can iteratively modify the input to bring the representation back over the boundary into the successful attack region. This iterative refinement makes these methods less sensitive to the exact number of eos tokens appended.

In fact, as we demonstrate in Appendix E.5, appending eos tokens to optimization-based methods still significantly reduces the average time cost for successful attacks. The optimization process efficiently navigates the altered representation space, benefiting from the initial shift caused by the eos tokens to find successful jailbreaks more quickly.

Non-Optimization-Based Methods (ICA/CO):

In contrast, non-optimization-based methods like ICA and CO lack an iterative adjustment mechanism. They rely on fixed prompts or predefined strategies without the ability to refine the input based on the model's feedback. As a result, the number of eos tokens appended can have impacts on their effectiveness:

However, as shown in our Global Response Q2, even when we limit the maximum number of search iterations to just five, the inclusion of eos tokens still provides a substantial enhancement over the original ICA/CO methods. This indicates that the benefits of appending them outweigh the drawbacks within practical query limits.

We hope this clarifies the distinction and addresses your concern.

2024-11-23

Thank you again for the review. Your careful reading and insightful comments indeed help us a lot in improving our work. Since the discussion phase is about to end, we are writing to kindly ask if you have any additional comments regarding our response. In addition, if our new experiments address your concern, we would like to kindly ask if you could consider raising the score.

评论- Clarification on EOS probing

2024-11-25

Thank you for taking the time to address my concerns. I wanted to clarify how exactly you performed the probing procedure. For example, with the OpenAI API, to the best of my knowledge you cannot directly send specific token IDs for the chat message content; you can only provide a text string, and the tokenization procedure is hidden from the API user. Is there a source showing how you can send specific tokens, or am I misunderstanding how the procedure works?

2024-11-26

Thank you for your response and for allowing us to clarify the probing procedure further. We greatly appreciate your comments and the opportunity to address any misunderstandings.

Regarding how we sent the input, you are absolutely correct that the only method available for interacting with the OpenAI API is through a text string. This is exactly how we conducted our experiments. We apologize if there was any confusion about the term "Official Released Tokenizer" in our paper. What we meant by this term is that some LLM providers (e.g., OpenAI) officially release tools or documentation that reveal how their tokenization process works. For example:

OpenAI provides a tokenizer playground on their platform [1], and their tiktoken package [2] is an official implementation of their tokenizer.

This allowed us to identify the EOS token used by the model but did not grant us access to the internal tokenization process of the API. We have updated our scripts and clarified this point to avoid any confusion in the revised version of our paper.

For models without officially released tokenizers, we relied on common tokens or reverse-engineering techniques to infer the EOS token. For example, although Claude's tokenizer is not officially released, we inferred its EOS token using information from their previously available Android SDK [3], which provided clues about their tokenization scheme.

Probing Procedure:

We investigated whether EOS tokens are filtered in LLM APIs by sending specific queries in string format. Specifically:

For models with known EOS tokens, we used the query:
“Can you repeat {eos} once”, where {eos} is the identified EOS token.
For models trained to refuse repeating tokens due to prompt injection concerns, we alternatively used:
“What’s this token in LLM: {eos}?” to bypass its injection alignment.

How the Probing Works:

Successful Probing (Not Filtered):
If the model’s EOS token is not filtered and matches the one we used, the model responds with something like "Sure, here it is," followed by an empty or truncated output. This truncation indicates that the EOS token successfully signaled the generation, confirming that the model recognizes and does not filter the EOS token.
Filtered EOS Token:
If the model responds with "Sure, please provide me the word you want me to repeat," it suggests that the EOS token is filtered out—meaning it is excluded from the input the model processes, even if the token is recognized internally.
Unsuccessful Probing:
If the model gives an unrelated or unexpected response, this indicates that the token we used may not match the model’s actual EOS token.

Probing Results:

Our findings are summarized in Table 10 of the paper, with successful probing screenshots provided in Figure 22. Below is a brief summary:

GPT-4o and Qwen-max: Both models have officially released tokenizers, and we successfully probed their EOS tokens. Neither model filters out EOS tokens in their API services.
Claude-3-opus: Although the tokenizer is not officially released, we successfully probed its EOS token using common tokens inferred from external resources (e.g., their SDK). We confirmed that EOS tokens are not filtered.
Gemini-1.5-pro: We were unable to successfully probe this model’s EOS token, likely due to insufficient information about its tokenizer.

We hope this clarifies our probing methodology and results. Thank you again for your thoughtful question and for helping us find these interesting results, because we also thought that these EOS tokens would be filtered in the beginning. Please let us know if this could address your concern and if you have any further questions, and we are more than willing to clarify!

[1] https://platform.openai.com/tokenizer
[2] https://github.com/openai/tiktoken
[3] https://github.com/anthropics/anthropic-sdk-python

评论- Thank you!

2024-11-27

Thank you for clarifying how the EOS probing works. I will maintain my current rating.

2024-11-27

Thank you for your response. We are glad that we were able to address your concerns and clarify how the EOS probing works. We appreciate your positive evaluation of our work.

Since the ICLR discussion period has been extended, we are happy to address any additional concerns or questions you may have. Please feel free to let us know if there is anything else we can clarify or discuss.

Thank you again for your time and valuable feedback.

Warm regards,

Authors.

评论- Summary of Revision & General Response Rebuttal

2024-11-19

Dear Reviewers,

We thank the reviewers for the insightful questions and reviews. Your time and effort dedicated to improving our work are truly appreciated.

We have done all the experiments suggested and answered all the questions. All modifications are marked in red color.

Major revisions include:

New closed-sourced model experiment result in Appendix E.9. all reviewers
- Conducted an ablation study to probe four closed-source models (Claude-3-opus, GPT-4o, Gemini-1.5-pro, Qwen-max) for their handling of eos tokens.
- Results: Discovered that GPT-4o and Qwen-max had released tokenizers while probing revealed that 3 out of 4 models do not filter eos tokens, exposing vulnerabilities to injection attacks. Additionally, testing BOOST on GPT-4o-mini and Qwen-max with GPTFuzzer showed that BOOST enhances the ASR, emphasizing the importance of robust input filtering mechanisms
New limited iterations number experiment in Appendix E.8. all reviewers
- Conducted an ablation study to evaluate the sensitivity of non-optimization-based methods to the number of appended eos tokens, limiting the search budget to only 5.
- Results: While a slight drop in ASR was observed, BOOST still achieved a higher ASR than the baseline with a query budget of 5. This demonstrates that BOOST is efficient in improving non-optimization-based methods.
New eos position experiment in Appendix E.11 reviewer q7Ln
- Conducted an ablation study to analyze the effectiveness of placing eos tokens at different locations (beginning, middle, and random positions) within prompts for Llama-2-7B-chat using GPTFuzzer
- Results: BOOST achieved the highest performance when eos tokens were appended at the end of prompts while placing them in other positions resulted in lower success rates. This confirms that appending eos tokens at the end is the most effective strategy for bypassing model safeguards.
New analysis eos influence on the benign prompt experiment in Appendix C.4. reviewer QCrM
- Conducted an ablation study to assess the impact of appending 5 eos tokens to 256 curated benign prompts (§C.1).
- Results: Found that 41 benign prompts were refused after appending eos tokens. For instance, as shown in Figure 12, the model responded appropriately without eos tokens but refused the same benign prompt when eos tokens were added. This supports our hypothesis that appending eos tokens shifts benign prompts toward the ethical boundary, consistent with the findings in §3.3.
New other token shift experiments in Appendix C.5. reviewer iaFP
- Conducted an ablation study to visualize the hidden representations of prompts appended with various tokens, as shown in Figure 13.
- Results: Appending eos tokens shifted hidden representations closer to benign prompts, effectively bypassing the ethical boundary. The bos token had the second smallest distance to the benign prompt center, followed by the unk token, while other tokens showed the minimal impact. These findings align with the analysis in §6, highlighting the unique effectiveness of eos tokens in altering model behavior.
New other token Attention value experiments in Appendix E.10. reviewer QCrM
- Conducted an ablation study to visualize the attention values of various tokens for the Gemma-2B-it model in the -10th layer and 0th head.
- Results: The attention values of other tokens were significantly higher than those of eos tokens, suggesting that these tokens are more likely to distract the model's attention from the original content. This highlights why eos tokens are particularly effective in bypassing the ethical boundary without altering the model’s focus on the prompt.
New defense methods experiments in Appendix E.6. reviewer 2Teg
- Conducted an ablation study to evaluate the robustness of BOOST against defense methods: RPO, SmoothLLM, Self-Reminder, and Gradient Cuff (GC), on four models (Gemma-2B-Instruct, LLaMA-2-7B-Chat, LLaMA-2-13B-Chat, LLaMA-3-8B-Instruct).
- Results: BOOST can slightly enhance attacks. These defenses, especially GC, significantly reduce the ASR. Additionally, visualizations (Figure 18) reveal that the Self-Reminder defense enhances the separation between benign and harmful prompts, effectively strengthening the model’s ethical boundary and making jailbreak attacks more difficult.

评论- Summary of Revision & General Response Rebuttal - Continued

2024-11-19

Revised Caption of Figures 5 & 14: add the meaning of the x-axis and y-axis in Figures 5 and 14. reviewer QCrM
Revised Sec 3.3: Additional explanation of why adding eos token can bypass the ethical checker reviewer QCrM
Revised Sec 3.1: Introduce the concept of the ethical boundary with a single concise sentence. reviewer q7Ln
New Results in Sec 5.4: Add the experiment for the performance of BOOST on limited iterations. all reviewers
New discussion in Sec 6: Additional discussion of the other locations for eos tokens, limitations, and varied effectiveness across model architectures. all reviewers
Revised closed-source models discussion in Sec 6: Additional discussion of the close-source model's filter of our BOOST method. all reviewers
Revised defense discussion in Sec 6: Additional discussion of the defense methods with Self-Reminderf and GC. reviewer 2Teg
Revised other possible tokens discussion in Sec 6: Additional discussion of the attention values of other possible tokens in model deployment. reviewer QCrM

Minor revisions include:

The correction of the typo ‘Large Language Models (LLMs)’ in line 11 reviewer QCrM
The correction of the typo ‘last token’s’ in lines 141 reviewer QCrM
The correction of the typo ‘as well for the' in line 160 reviewer QCrM
The correction of the typo ‘the response can be disproportionately affected by the attack itself' in lines 266-267 reviewer QCrM
The correction of the typo ‘an' in line 278 reviewer QCrM
The correction of the typo ‘Disclosure' in line 756 reviewer QCrM
New legend for Figure 3 reviewer QCrM
The correction of the type ‘t-SNE’ in line 233 reviewer QCrM

Below, we also summarize the key points in our responses:

Key Points in Our Responses

Reviewer QCrM

We addressed the concern about the practical effectiveness of BOOST by demonstrating that many closed-source models (e.g., GPT-4o, Claude-3) do not filter out eos tokens, confirming BOOST's effectiveness in real-world scenarios.
We clarified the explanation regarding why adding eos tokens can shift model behavior towards jailbreak success. Our empirical findings suggest that the eos token's consistent use during RLHF training as a sequence terminator may result in a unique influence on the model's contextual segmentation, potentially diluting the ethical classification of prompts.
We clarified the notation and the conditional probability expressions.
For the t-SNE visualizations, we confirmed that they are based on the hidden representation of the final token, as this captures the model's ultimate contextual understanding.
We conducted additional experiments to verify whether appending eos tokens leads to refusals for benign prompts. Approximately 16% of benign prompts were rejected after appending eos tokens on Llama-2-7b-chat, indicating a shift in ethical classification, as hypothesized.
We explained the difference in sensitivity to the number of eos tokens between optimization-based (GCG/GPTFuzzer) and non-optimization-based (ICA/CO) methods. Optimization-based methods adjust dynamically, while non-optimization methods lack iterative refinement.
We conducted additional experiments to compare the attention values of different tokens. Tokens like "unk" and "pad" have higher attention values than eos, explaining why eos tokens are more effective in bypassing ethical boundaries. The results are included in Appendix E.10.

Reviewer iaFP

We addressed the concern about the practical effectiveness of BOOST by demonstrating that many closed-source models (e.g., GPT-4o, Claude-3) do not filter out eos tokens, confirming BOOST's effectiveness in real-world scenarios.
The influence of eos tokens varies across model architectures. Our experiments showed that BOOST's impact is more significant for models like Llama-2/3 but less so for others like MPT-7B, suggesting the need for further exploration.
We acknowledge that non-optimization-based methods may experience some variability in BOOST's performance due to the number of eos tokens. However, even with a limit of five iterations, BOOST consistently enhances attack rates, especially in optimization-based methods like GCG.
We tested other tokens, such as BOS, to see if they have a similar effect to eos. Our experiments indicate that while BOS shows a minor shift, it is not as effective as eos in altering model representations, confirming the unique impact of eos in our method.

评论- General Rebuttal/Revision Response - Continued

2024-11-19

Reviewer 2Teg

We addressed the concern about the practical effectiveness of BOOST by demonstrating that many closed-source models (e.g., GPT-4o, Claude-3) do not filter out eos tokens, confirming BOOST's effectiveness in real-world scenarios.
We extended our experiments to test BOOST against system-prompt defenses like Self-Reminder and found that it increases the separation of benign and harmful prompts’ hidden representation, supported by new t-SNE visualizations.
We tested BOOST against GC and found that this defense is effective in mitigating the jailbreak attack even with BOOST applied.

Reviewer q7Ln

We addressed the concern about the practical effectiveness of BOOST by demonstrating that many closed-source models (e.g., GPT-4o, Claude-3) do not filter out eos tokens, confirming BOOST's effectiveness in real-world scenarios.
We acknowledged the challenge of determining the number of tokens but clarified that optimization-based methods like GCG and GPTFuzzer do not need to vary the number of eos tokens due to their iterative optimization process.
We clarified that our core contribution extends beyond token addition, as we systematically studied the influence of eos tokens on LLM ethical boundaries and validated BOOST across diverse models and methods.
We investigated whether placing eos tokens at different positions within prompts would improve attack success rates and found that appending them at the end yields the best performance due to alignment with model training behavior.

We hope these revisions address the reviewers’ concerns and improve the overall quality of our paper.

Thank you again for your review!

Best regards,

Authors

评论- Global Response Q1

2024-11-19

Reviewer's Comment: Close-sourced model would filter eos tokens all reviewers

Response: Thank you for raising this important concern. We acknowledge that closed-source models may implement filtering mechanisms to exclude eos tokens, which could potentially limit the applicability of our method. We had noted this as a potential weakness in Section 6 of our initial submission.

However, we would like to emphasize that our assumption about the accessibility of eos tokens is significantly more practical than assuming access to model gradients, as required by white-box attacks. To address your concern further, we conducted an additional study on four leading closed-source LLM providers: OpenAI, Anthropic, Qwen, and Gemini—whose models are leading on the LLM arena leaderboard [1].

We investigated whether eos tokens are filtered in their APIs by probing the models with specific queries. Specifically, we used the prompt: “Can you repeat {eos} ?” where {eos} is either the officially released eos token (if available) or common eos tokens. We use “what’s this token in LLM: {eos} ?” alternatively if the model is trained to refuse to repeat due to prompt injection concern. The probing process works as follows:

Successful Probing (Not Filtered): If the model's eos token is not filtered and matches the one we used, the model responds with something like "Sure, here it is" followed by an empty or truncated output. This truncation occurs because the eos token signals the end of generation, causing the model to stop generating further text. This indicates that the model recognizes and does not filter the eos token.
Filtered eos Token: If the model responds with "Sure, please provide me the word you want me to repeat," it suggests that the eos token is filtered out—meaning the model did not process it due to input filtering mechanisms. This implies that while the token might be recognized internally, it is being excluded from the input the model processes.
Unsuccessful Probing: If the model provides a different response, it indicates that the used token is not the model's actual eos token.

Probing Results:

Our findings are in Figure 22 and summarized below:

GPT-4o and Qwen-max: Both models have officially released their tokenizers. We successfully probed their eos tokens and found that they do not filter out eos tokens in their API services.
Claude-3-opus: Although the tokenizer is not officially released, we successfully probed its eos token using common eos tokens and confirmed that it does not filter out eos tokens.
Gemini-1.5-pro: We were unable to successfully probe this model's eos token, possibly due to a lack of information about its tokenizer.

Implications:

These results indicate that three out of the four models can be successfully probed and all of three do not filter out eos tokens in their API services. This suggests that our method can applicable to a significant portion of popular closed-source models. We have added the details and screenshots in Appendix E.9.

Furthermore, we applied our BOOST method in conjunction with GPTFuzz on two of these models, GPT-4omini and Qwen-max. The results demonstrate that BOOST effectively enhances attack success rates even against these closed-source LLMs. This indicates that, contrary to the assumption that closed-source models would filter eos tokens, our method remains effective in practice.

For ethical considerations, we have disclosed these findings to the relevant providers to highlight the importance of user input filtering. We believe that the ultimate goal of jailbreak research is to improve the safety and robustness of LLMs. Our unique findings and subsequent disclosures serve as an important alert for LLM providers to carefully examine and strengthen their token-filtering mechanisms.

Therefore, our work retains significant value for closed-source LLMs and provides actionable insights for their developers. By demonstrating that eos tokens are not universally filtered and that our method can enhance attack success rates even on closed-source models, we contribute valuable knowledge that can help improve the security practices of LLM providers.

[1] Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

评论- Global Response Q2

2024-11-19

Reviewer's Comment: Dependency on the number of eos tokens reviewer QCrM reviewer iaFP reviewer q7Ln

Response: Thank you for your question. We would like to clarify that the iteration over the number of appended eos tokens is only necessary for non-optimization-based methods, as they lack iterative refinement capabilities. As explained in Section 5.4 and shown in Figure 17, adding more eos tokens can push the distribution back from the boundary. However, this iterative process is not computationally expensive, as it typically requires at most 20 queries. In practice, the required number of queries is often much fewer since once a successful jailbreak is achieved with appending $k$ eos tokens, further iterations for $k+1$ eos tokens are unnecessary.

For optimization-based methods, such as the black-box GPTFuzz and the white-box GCG, this iteration is not needed as a fixed number of eos tokens is sufficient across all experiments. Even if the appended eos tokens initially push the representation back across the boundary, the optimization process can iteratively adjust the input to regain success in the jailbreak. Thus, we do not regard it as a significant issue since most state-of-art jailbreak methods are optimization-based[1].

To further investigate the sensitivity of non-optimization-based methods to the number of eos tokens, we conducted an ablation study with a reduced maximum query budget of just 5 on three models (results provided in the Appendix E.8). Although there is a minor performance drop compared to the 20-query budget, the methods still show significant improvements. For instance, on Vicuna-1.5, the direct attack success rate (ASR) is 0%. With a 20-query budget, the ASR reaches 71.09%, while with a 5-query budget, it achieves 70.31%. These results indicate that the sensitivity to the number of eos tokens is not as significant or costly as it might seem. We have included the experiment results in Appendix E.8 and discussion in Section 5.4.

[1] A Comprehensive Study of Jailbreak Attack versus Defense for Large Language Models

评论- Rebuttal and Discussion Summary

2024-12-04

Dear Area Chair,

We would like to express our sincere gratitude to all the reviewers for their insightful comments and recognition of our work. We are encouraged that the reviewers highlighted the following strengths in our paper:

Simplicity and Effectiveness of BOOST: Reviewers QCrM, iaFP, 2Teg, and q7Ln appreciated that our proposed method is simple to understand, easy to replicate, and effectively enhances jailbreak attacks on LLMs.
Comprehensive Evaluation: Reviewers QCrM, iaFP, 2Teg, and q7Ln acknowledged that our experiments are extensive and rigorous, involving a wide range of models and providing detailed quantitative results.
Clear Presentation and Analysis: Reviewers QCrM, iaFP and 2Teg noted that our methodology and analysis are clearly presented, with visualizations aiding in understanding the impact of EOS tokens.
Novel Insights into Ethical Boundaries in LLMs: Reviewers QCrM and 2Teg recognized our empirical evidence supporting the concept of an "ethical boundary" learned during model alignment and how EOS tokens can influence this boundary.

Below, we list the main concerns raised by the reviewers and highlight how we have addressed them during the discussion period.

Closed-Source Models
(Raised by all reviewers)

By EOS probing, we discovered that several popular closed-source models, including GPT-4o, Claude-3, and Qwen-max, do not filter out EOS tokens in their API services. We successfully applied BOOST to two of these models, demonstrating its effectiveness in real-world scenarios.

Number of EOS Tokens
(Raised by Reviewers QCrM, iaFP, q7Ln)

We clarified that for optimization-based methods like GCG and GPTFuzzer, we do not need to change the number of EOS tokens as they can adjust iteratively. Then, we conducted an ablation study with a reduced maximum query budget of just 5 iterations on three models. Even with this limited number, BOOST still significantly enhances ASR.

Reviewer-Specific Responses:

Reviewer QCrM:

We expanded our explanation, suggesting that EOS tokens impact context segmentation due to their consistent use during RLHF training, effectively diluting the influence of disallowed content. - We conducted experiments showing that appending EOS tokens can cause models to refuse benign prompts, supporting our hypothesis about shifting towards the ethical boundary. - We conducted additional experiments comparing attention values of different tokens.

We address Reviewer QCrM's concerns and further clarified how we probe the EOS tokens.

Reviewer iaFP:

We investigated other tokens like BOS and found they are not as effective as EOS tokens in shifting representations to bypass ethical boundaries . - We acknowledged that the effect of EOS tokens may vary across model architectures and suggested this as an area for future exploration.

The reviewer did not provide further responses. However, similar concerns were acknowledged and addressed by other reviewers, and we believe we have thoroughly addressed their concerns.

Reviewer 2Teg:

We conducted experiments testing BOOST against GC and Self-Reminder.
- We included t-SNE visualizations showing how defenses like Self-Reminder enhance the separation between benign and harmful prompts, reinforcing the ethical boundary.

The reviewer acknowledged that their concerns were addressed and raised the score.

Reviewer q7Ln:

We revised Section 3.1 to be more concise. - We clarified that our contributions extend beyond adding tokens, including systematic study of EOS token influence, comprehensive validation across diverse models and methods, and novel insights into LLM vulnerabilities. - We conducted experiments placing EOS tokens at different positions and found that appending them at the end yields the best performance, aligning with model training behavior. - We further clarified the novelty, evidence and insights of our approach.

The reviewer acknowledged that the concerns were addressed and acknowledged our further clarification.

We believe we have addressed all concerns raised by the reviewers thoroughly and we have added 7 pages for additional experiments and discussions suggested by reviewers. Our work offers a new insight to understanding LLM vulnerabilities, presenting a simple yet effective method for jailbreak. We have demonstrated the practical effectiveness of BOOST through comprehensive experiments and analyses.

We kindly ask for your consideration of our paper for acceptance. We are confident that our work will be a valuable addition to the conference and contribute to the field of AI security.

Thank you for your time and consideration.

Sincerely,

The Authors

AC 元评审

2024-12-21

This paper proposed an attack by adding <eos> tokens at the end of the prompts can realize jailbreak. The authors also proposed a hypothesis to explain such an attack and the viewpoint is that the <eos> token helps question to move across the ethical boundary. After rebuttal, there are two negative scores and two positive scores. The main issues are whether this attack can be implemented in commercial LLMs and whether the explanation is reasonable. After reading all comments and rebuttals, I think these two issues are not well addressed and the paper is not ready to publish.

审稿人讨论附加意见

There are two key issues mentioned by the reviewers that affects my decision.

First, compared with other jailbreak method, adding a few <eos> at the end of input is very easy to be filtered. There is no space for attackers to make it stealthy once defenders know about this.

Also, the hypothesis trying to explain this attack is too strong and t-SNE visualizations is too simple. Different with classification models, the input to LLMs can be anything so defining “non-malicious” input is almost impossible. How can we trust such a boundary exists if we cannot define clearly what are non-malicious inputs?

Plus, the novelty of such an attack is questionable since there are previous papers (e.g. let LLMs to repeat “hello there” for 32 times than control the model) used repeating tokens (not at the end of input but beginning of output but either way in LLM’s generation process) to manipulate LLMs. Thus, I admit the interesting findings of this paper but cannot agree this paper can be published at ICLR.

最终决定Reject

2025-01-22

Reject