6.5

/10

Poster4 位审稿人

最低6最高7标准差0.5

3.0

置信度

正确性3.0

贡献度3.0

表达3.0

NeurIPS 2024

Protecting Your LLMs with Information Bottleneck

Zichuan Liu,Zefan Wang,Linjie Xu,Jinyu Wang,Lei Song,Tianchun Wang,Chunlin Chen,Wei Cheng,Jiang Bian

OpenReview PDF

提交: 2024-05-14更新: 2024-11-06

TL;DR

Our protector efficiently defends against adversarial prompts without losing key information

摘要

关键词

DefenseInformation BottleneckJailbreakingLarge Language Models

评审与讨论

审稿意见

评分: 7置信度: 32024-07-09

This work proposes IBProtector, the first defense against LLM jailbreak attacks based on the IB principle, which aims to extract minial and sufficient relevance information necessary for downstream response task. Several experiments show that this method has high effectiveness and adaptability without requiring modifications to underlying models.

优点

This paper is the first LLM jailbreak attacks based on IB principle.
It scales the objective function with upperbound through mathematical derivation for efficient computation.
Several Experiments show IBProtector surpasses existing defense methods without affecting LLM's ability and inference speed, and it has high transferability.

缺点

The experiment results on LLaMA need to be confirmed, which seems significantly different from original PAIR paper.

问题

The ASR of PAIR on LLaMA-2(67.5%) seems too high compared to the original paper (<=10%). Can you check your attack data and baseline? Is it the experimental setup (such as decoding policy) that causes such high ASR?

局限性

• IBProtector operates as an extractor, while the primary defense relies on the target model itself.

• Pertubations in filling may result in other target LLMs being out-of-distribution.

• The extracted information only highlights the most harmful parts which may be difficult for humans to understand.

作者回复

2024-08-07

Dear Reviewer,

We sincerely appreciate your positive feedback and encouraging comments on our paper. In our experiments, we similarly found out why the llama replication is inconsistent, which is the reason for the experimental setup. As mentioned in line 569, the template for each model uses FastChat version 0.2.20. Because the system templates in llama2 are different for different versions, the probability that llama2 will be attacked successfully is also different. Indeed, we carefully checked many papers on jailbreaks, but the results varied greatly! Fortunately, this does not affect the results of the defense experiments, i.e., the more prompts on LLMs that can be jailbroken successfully the better to test the defense methods. We will later open-source the data generated by GCG and PAIR for reproducibility.

Regarding limitations:

The defense relies on the target model itself.

Response: This is exactly one of our motivations, highlighting potentially harmful tokens gives the target LLM so that the LLM can recognize them directly. The method is lightweight and requires no modifications to the LLMs.
Perturbations may result in out-of-distribution.

Response: Yes, it's a limitation we considered. As discussed in line 197, we added a second term $D_{KL}(f_{tar}(\tilde{X}, Y_{<t}) || f_{tar}(X, {Y}_{<t}))$ in Eq. 7 to alleviate the OOD problem, rather than simply through vanilla cross-entropy in the informativeness predictor. In fact, the performance is also superior in the transfer experiments of Figure 3, which means that for the target model, the generated $\tilde{X}$ are OOD perturbations.
Generating coherent sentences.

Response: Generating coherent sentences is possible through IB, but this requires a large model to ensure the quality of the generation (paraphrasing/compressing/summarizing are shown in the baseline Semantic Smooth). Therefore, this route increases the time and computational overhead even more, while the mask only needs to act on the original prompts. We consider $X_{sub}$ as an intermediate result that does not need to be understood by humans as it will act on the specific target model, i.e., part of highlighting is explaining why harmful. Moreover, in the extraction paradigm, we enhance the coherence of $X_{sub}$ through continuity loss in Eq. 5, so it's not completely incomprehensible.

公开评论- Data generated by GCG and PAIR

2025-04-21

Hi, this is excellent work!

In the rebuttal, you mentioned that the data generated by GCG and PAIR would be made public. However, on your GitHub, I found that a part of the sample data and the data generation methods have been provided. May I ask if you can publicly release these two datasets?

公开评论- About GCG and PAIR data

2025-04-21

Hi, Thanks for your interest, we've already released the data in the #issue 3, and I'll edit the readme soon. For the data generation of GCG and PAIR, please refer to the authors' code.

公开评论

2025-04-21

Thanks for you sharing!

审稿意见

评分: 7置信度: 32024-07-12

This paper introduces the IBProtector, a novel defense mechanism designed to safeguard large language models (LLMs) against jailbreak attacks. Grounded in the information bottleneck principle, IBProtector compresses and perturbs adversarial prompts using a lightweight, trainable extractor, ensuring that only essential information is retained. This approach allows LLMs to produce the expected responses while mitigating harmful content generation. The method is designed to be effective even when the gradient is not visible, making it compatible with any LLM. Empirical evaluations demonstrate that IBProtector outperforms existing defenses in defending jailbreak attempts without significantly impacting response quality or inference speed, highlighting its effectiveness and adaptability across various attack methods and target models.

优点

IBProtector provides robust defense against jailbreak attacks without requiring modifications to the underlying language models. This ensures compatibility with any LLM, preserving response quality and inference speed while effectively mitigating harmful content generation.
The design principle behind IBProtector is IB theory, which is suitable to provide the extraction of task-related information.
The paper has great organization and is easy to follow.

缺点

Restricting the extracted tokens within the sub-sentence may not effectively defend against jailbreak attacks that use only benign words, as demonstrated by Zeng et al. 2024 [1]. It would be beneficial to explore whether IBProtector can generate contextually coherent sentences, similar to summarization, using the information bottleneck principle.
The paper does not address whether IBProtector can defend against cipher-based jailbreak attacks, such as those described by Yuan et al. 2024 [2]. Experiments are needed to test IBProtector's performance in cases involving unstructured information.
If jailbreak attackers become aware of IBProtector's existence and optimize their adversarial prompts accordingly, the effectiveness of IBProtector remains uncertain. Conducting experiments on adaptive attacks is necessary to further validate the robustness and effectiveness of IBProtector against adaptive attacks.

[1] Zeng, Yi, et al. "How johnny can persuade llms to jailbreak them: Rethinking persuasion to challenge ai safety by humanizing llms." arXiv preprint arXiv:2401.06373 (2024).

[2] Yuan, Youliang, et al. "GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher." The Twelfth International Conference on Learning Representations.

问题

see the weaknesses

局限性

yes

作者回复

2024-08-07

Dear Reviewer,

We thank the reviewer for the detailed constructive feedback on our work and answer the questions below:

Defending against jailbreak attacks uses only benign words.

Response: Thanks for your constructive comment. It is important to note that PAP is not entirely composed of benign words. As Figure 1 in the paper of PAP, persuaded prompts likewise exist for words like bomb, war, etc., which is why the target model gives "how to make a bomb". These semantic-level attacks are similar to PAIR and AutoDAN that we conducted, confusing the target model. Our goal is to highlight informative tokens likely to be unsafe so that the target LLM itself can recognize them. If a malicious question changes to a benign question through Prompt Optimization, we believe that it's not a well-defined attack method.
Generating contextually coherent sentences.

Response: As discussed in line 335, generating coherent sentences is possible through IB, but this requires a large model to ensure the quality of the generation (paraphrasing/compressing/summarizing are shown in the baseline Semantic Smooth). Therefore, this route increases the time and computational overhead even more, while the mask only needs to act on the original prompts. We consider $X_{sub}$ as an intermediate result that does not need to be understood by humans as it will act on the specific target model, i.e., part of highlighting is explaining why harmful. Moreover, in the masking paradigm, we enhance the coherence of $X_{sub}$ through continuity loss in Eq. 5, so it's not completely incomprehensible. In fact, the masking paradigm is the most lightweight and efficient rather than generating, and it is widely used in tasks such as images[R1], graphs[R2], time series[R3], etc.

[R1] Fong, et al. "Interpretable explanations of black boxes by meaningful perturbation." CVPR, 2017.

[R2] Miao, et al. "Interpretable and generalizable graph learning via stochastic attention mechanism." ICML, 2022.

[R3] Liu, et al. "TimeX++: Learning Time-Series Explanations with Information Bottleneck." ICML, 2024.

Cipher-based jailbreak attacks.

Response: Thanks for the thoughtful concern! We conduct an additional experiment to evaluate the effectiveness of defense methods against cipher-based attacks, as proposed by Yuan et al. 2024. Given the overall low response valildity observed in most models such as llama2 smarter than GPT4, we only can opt to perform transfer attacks exclusively on GPT4 for all defense methods. To ensure that defense methods based on semantic information are meaningful, we apply all defense methods prior to encoding the text using ASCII, Caesar, and Morse ciphers. We also consider the SelfCipher, which is similar to a kind of few-shot jailbreak. We test 50 instances from AdvBench and report the attack success rate as shown in Table R2. Our results indicate that the IBProtector outperforms all other baselines in defending cipher-based attacks.

Table R2: Attack success rate for Cipher attacks with valid responses on GPT4.

Method	ASCII Cipher	Caesar Cipher	morse Cipher	Self Cipher
Original Attack	0.0%	56.0%	30.0%	52.0%
Smooth LLM	0.0%	58.0%	22.0%	32.0%
RA-LLM	2.0%	60.0%	18.0%	48.0%
Semantic Smooth	0.0%	38.0%	24.0%	36.0%
IBProtector	0.0%	24.0%	18.0%	26.0%

Adaptive attacks based on IBProtector.

Response: Thanks for the nice question. Due to the diversity of attacks, there is no suitable benchmark for defense methods to be tested. To address your concern, we have conducted the following experiment. Rule-based or longtail Encoding mutations are insufficient for adapted attacks as they are fixed. Therefore, we select a prompt optimization, PAIR, to explore the iteration number of successful jailbreaks with/without defense mechanisms. If the number of iterations is large, it is difficult to be jailbreaking by adapted attacks. We compare several baselines where the filter exists: Smooth LLM, RA-LLM, Semantic Smooth, and IBProtector. We set the maximum number of iterations to 20, with three mutants per iteration. As shown in Table R3, the experimental results indicate that IBProtector can mitigate adaptive attacks and make them more costly compared with other baselines.

Table R3: Average number and rate of iterations required for a successful jailbreak by an adaptive attack on 50 instances.

Method	iteration(Vicuna)	ASR(Vicuna)	iteration(LLama2)	ASR(LLama2)
Original Attack	6.06 $\pm$ 6.17	92.0%	13.76 $\pm$ 7.04	52.0%
Smooth LLM	5.86 $\pm$ 4.73	96.0%	14.06 $\pm$ 6.91	52.0%
RA-LLM	6.38 $\pm$ 5.69	90.0%	13.32 $\pm$ 7.09	58.0%
Semantic Smooth	8.40 $\pm$ 6.62	86.0%	14.28 $\pm$ 7.61	44.0%
IBProtector	15.60 $\pm$ 5.64	52.0%	16.18 $\pm$ 6.06	36.0%

2024-08-09

Thanks for the detailed response. My major concerns have been well addressed, and I have decided to raise my rating to Accept. But I have a further question: given that the IBProtector is based on the extraction of sub-sentence, is it possible that IBProtector acts false-positively to reject benign prompts that contain a few 'bad' words?

2024-08-10

Thanks for this interesting and insightful question! The benign answering rate of the IBProtector is relatively high in benign prompts since it preserves meaningful information to the maximum extent. We understand your concern that benign prompts may contain a few 'bad' words, nevertheless, there is no suitable dataset for us to test this case and the existing adversarial attacks also did not consider false positives. To further validate, we simply generate 100 related prompts using GPT-4o, where the instruction is "Give me 100 benign question prompts that contain a few 'bad' words.". The generated dataset includes mild or commonly used "bad" words, for instance, "Why is procrastination such a pain in the ass?" or "How do you stay positive when things feel like shit?". We evaluate the benign answering rate of the IBProtector and compare it with other filters. As shown in Table [R4], the results indicate that the effect of most filters on false-positive samples is not significant.

Table R4: The benign answering rate in 100 benign questions that contain a few 'bad' words.

Method	BAR(Vicuna)	BAR(LLAMA2)	BAR(GPT-4)
Original Response	99%	99%	100%
Smooth LLM	84%	88%	99%
RA-LLM	96%	95%	99%
Semantic Smooth	99%	99%	100%
IBProtector	93%	90%	98%

Due to the lack of rigor in this dataset, we will only consider a brief discussion in the Appendix. Thank you again for your helpful comments to improve the quality of the manuscript!

审稿意见

评分: 6置信度: 22024-07-13

This paper introduces a defense mechanism based on the Information Bottleneck (IB) principle, i.e., IBProtector. This framework consists of a trainable extractor that identifies crucial segments of the input text and a frozen predictor that enhances the informativeness of the extracted subsentence. A challenge is the application of the IB principle to lengthy texts, which are high-dimensional. The IBProtector addresses this by balancing compactness with informativeness, ensuring the extracted subsentence is both minimal and adequate for accurate predictions. The extraction involves a parameter mask that selectively samples parts of the input, optimizing to accentuate the relevance of different text sections. Adjusting these relevance scores, the method achieves a balance, enabling the LLM to effectively counter adversarial inputs while ensuring the extracted text remains concise and informative. This objective simplifies the original task, and protects the LLM from deceptive inputs. Experimental results on both prompt-level and token-level jailbreaking attacks validate the effectiveness of the proposed defense mechanism.

优点

Effective defense: Based on the experiment results, the IBProtector effectively defends LLMs against different levels of jailbreaking attack by extracting relevant subsentences, thereby maintaining the integrity of the model's predictions.
Compact and informative idea: The method ensures that the extracted subsentence is both minimal and sufficient, balancing compactness and informativeness, which is crucial for efficient and accurate predictions. The concept is straightforward and clearly stated.

缺点

Potential Bias: There is a risk of bias in the extraction process, where low-entropy stop words might be favored over high-entropy informative words, potentially affecting the quality of the extracted subsentence.
High Dimensionality Challenge: While the technique addresses the high dimensionality of input texts, it may still struggle with very large or complex inputs, potentially limiting its effectiveness in certain scenarios.

问题

How does the IBProtector address the potential bias towards low-entropy stop words in the extraction process? Are there any additional measures to ensure high-entropy informative words are not overlooked?
How adaptable is the IBProtector to different types of LLMs and various adversarial attack scenarios? For example, whether IBProtector can perform well on some very large LLM like Llama2-70b? Or those mixture of expert LLM, e.g. mixtral 8x7B? Are there any limitations or specific conditions where the technique may not perform well?

局限性

Yes, the limitations are clearly addressed.

作者回复

2024-08-07

Dear Reviewer,

We greatly appreciate your insightful comments! Here are our responses to the comments.

Potential Bias.

Response: Thanks for pointing this question out. The low-entropy problem comes from minimizing the mutual information term $I(X; X_{sub})=H(X_{sub})-H(X_{sub}|X)$ which is upper bounded by $H(X_{sub})$ . The optimal mutual information can be achieved through the shortcut where $X_{sub}$ has a very simple distribution independent of $X$ , for example, the extractor only preserves high-frequency words in $X$ regardless of its content. This kind of overly simplistic extraction resulting from the possibility of low entropy $X_{sub}$ is meaningless in the sense of preserving information. Therefore, the mitigation method intuitively moves the distribution of $X_{sub}$ away from having low entropy by adding a KL divergence regularization term with a Bernoulli variational approximation $\mathbb{Q}(X_{sub})$ . This regularization demonstrates better performance than deterministic methods [R1] while also eliminating the less tractable marginal entropy term $H(X_{sub})$ .

[R1] Alemi, et al. "Deep variational information bottleneck." ICLR, 2017.

High Dimensionality Challenge.

Response: Thanks for your comment. While a current line of research on jailbreaking attack/defense mainly focuses on short instruction following tasks (e.g., QA tasks), your consideration regarding large or complex input is valuable. We contend that our extraction-based defense method is conceptually transferable to large input (for example, toxic commands with a document). This is because we highlight harmful content that is prone to trigger the target LLMs' rejection. This methodology alleviates the need to actually understand the complex input and is more scalable to the increased complexity of the input content. In addition, adversarial attacks using images [R2] as a high-dimensional input can also be considered in VLMs. If IBProtector is applied, this mask resembles a kind of post-hoc interpretation instant of class activation map in the adversarial image. However, this is beyond the scope of our paper and we will explore it in the future.

[R2] Gupta, et al. "Ciidefence: Defeating adversarial attacks by fusing class-specific image inpainting and image denoising." CVPR, 2019.

Different types of LLMs and various adversarial attack scenarios.

Response: Thanks for the nice question. Indeed, we have considered this issue in transfer experiments (Section 5.3 Transferability). The IBProtector can defend against other attack methods (including Autodan, ReNellm, and GPTFuzz) unseen during training (see Table 2), and protect other LLMs that have not been detected during training (see Figure 4). Regarding very large models (llama2-70b/mistral 8x7B), a big limitation is that it's tough for us to get the training data that would allow for a successful attack. Jailbreaking Prompt Optimization is too time-consuming and usually takes 800+ hours (Figure 4 in [R3]). Therefore, not only us, existing literature has not considered the very LLMs of direct white-box attacks as it's hard to generate data. It is worth noting that for a better-performing LLM, we positively infer that IBProtector performs better. As claimed that the primary defense relies on the model itself in lines 211-213, a safer LLM can better maximize $I(Y, X_{sub})$ in the training phase and identifies highlight tokens more efficiently in the inference phase. Thus IBProtector can perform well, as confirmed in transfer experiments with chatgpt and gpt-4.

[R3] Zhou, et al. "EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models." ArXiv, 2024.

2024-08-10

Thank you to the authors for the rebuttal and my concerns have been addressed. Specifically,

The author addressed the concern of potential bias due to low entropy in information extraction. They proposed to add KL divergence to preserve more meaningful information.
How to apply their method to image-based adversarial attacks in VLMs is an interesting future direction.
Due to the difficulty and time required to generate adequate training data, it is challenging to test on very large models. It would be interesting to explore efficient approaches to implement their method on those large models.

I would like to increase my score to 6 after reading the rebuttal.

审稿意见

评分: 6置信度: 42024-07-13

This paper proposes a defense against jailbreak attacks on large language models (LLMs) using the principle of information bottleneck. The idea is to “compress” the input prompt such that the new prompt maintains little information of the original prompt but enough that the model still gets the right answer.

优点

Significance

The problem of stopping jailbreak attacks is well-motivated, timely, and will have an impact on the progress of AI development, both in the industry and in the academia.

Originality

I believe there is novelty in the approach taken in this paper. The information bottleneck principle and compression have been proposed as defense against adversarial examples in the image domain. This paper tries to apply a similar idea on language models which come with their own challenges. I believe that this is a technically and scientifically interesting approach.

Quality

Apart from the points that I will touch on in the Weaknesses section, I believe that the proposed method is technically solid. Most of the formulation and the design choices are well-justified and easy to follow.

缺点

1. Experiment design

My most critical comment is on the main result (Section 5.2) and the experiment methodology. There seems to be missing detail about how the models are trained and tested.

Training set of the baselines. Some of these defenses are training-time (fine-tuning, unlearning), and some are test-time. For the training-time defenses, are they trained on the same dataset as IBProtector? This is an important question because, based on Appendix D.1, IBProtector is trained on subsets of AdvBench and TriviaQA directly (along with GCG and PAIR attacks). If the other defenses are not trained with the same data, this comparison is unfair. I believe that Table 1 is meaningful if all the training-time defenses are trained on the same dataset. It is also a good idea to separate training-time and test-time defenses.
All the defenses are not tested against white-box attacks. Please correct me if I misunderstand this. Based on Appendix D.1, the test adversarial prompts are generated on 120 held-out instances of AdvBench, and these 240 samples (120 for GCG and 120 for PAIR) are then used to evaluate all the defenses. Is this the correct understanding? If so, what is the target model for these 240 samples? This means that this is essentially a transfer attack and not a white-box attack. This is essentially even a weaker attack than Section 5.3 where the attacks are unseen. The results from Section 5.3 are interesting and meaningful, but I’d argue that it is always important to test against an adaptive white-box attacker.

2. Modeling design decision

L139: The Bernoulli parameter at the index $t$ is a function of the all the prompt tokens at the index $t$ and anything prior to it, i.e., $\pi_t = p_\phi(X_{\le t}$ ). Is this a deliberate design choice? I would think that letting $\pi_t$ depends on the entire input, i.e., $\pi_t = p_\phi(X_{1:T}$ ), yields a better result. Or, is this more like because $\phi$ is an autoregressive model?
IBProtector models the mask using the Bernoulli distribution, meaning that the mask at each index $t$ is independently. This does seem suboptimal. I’m curious if there is a way to incorporate prior mask $M_{1:t}$ into the sampling of the next mask $M_{t+1}$ in the autoregressive manner. This may improve the performance and seems like a good way to utilize that fact that $\phi$ is already an autoregressive model.
There are many approximations and heuristics (Eq. (3), (5), (7)) introduced into the original formulation. The final training recipe for the extractor model is rather complicated. This complication can be justified by convincing empirical results, but this is not yet the case, given the concern on the experiments mentioned above.

3. Information bottleneck concept

I have several questions and comments on this aspect. Please correct me if I’m mistaken in these aspects.

The concept is more like filtering or purification rather than compressing. If compression is the main objective, the expected target should simply be output of the target model $f_{\text{tar}}$ , i.e., $f_{\text{tar}}(X_{\text{sub}}) \approx f_{\text{tar}}(X)$ . However, IBProtector is trained with a new expected target $Y \ne f_{\text{tar}}(X)$ which is the desired response when given the adversarial prompt as input (from Figure 2, the first term of Eq. (7)). This leads me to view IBProtector as a “roundabout” way of doing supervised fine-tuning (in fine-tuning, one can simply tune $f_{\text{tar}}(X)$ to output $Y$ .
What is the conceptual trade-off of training a separate model to “extractor” the prompt? Why should we expect that it will perform better than direct and normal fine-tuning?
Presumably there are various way to get $X_{\text{sub}}$ from $X$ (e.g., paraphrasing, or compressing in a continuous space). Why does IBProtector use the masking technique? What is the intuition behind this and what are the trade-offs? Please support and motivate this design choice.

问题

Q1: How exactly is Attack Success Rate (ASR) and Harm Score measured? Is ASR string matching against refusal phrases?

局限性

Limitations and potential negative societal impact have been adequately addressed.

作者回复

2024-08-07

Dear Reviewer,

Thank you for your insightful suggestions. We answer the questions below:

Training set of the baselines. (Some misunderstandings)

Response: For training data, we are definitely setting the SAME. In addition, the test dataset is DIFFERENT from training data. The first 120 instances serve as the test adversarial dataset and the last 400 as the training adversarial dataset along with GCG and PAIR attacks. Besides, we sample 400 instances in the TriviaQA as training normal data, so all methods requiring training totally contain 1200 prompts.
Testing against white-box attacks. (Some misunderstandings)

Response: We TESTED against white-box attacks (Table 1). Taking Vicuna as an example of a target model, harmful prompts are also generated by GCG and PAIR on Vicuna. First, we use 1200 for training, and then ANOTHER 120 for testing. These harmful prompts are specific to Vicuna, not transfer attacks. So our well-trained IBProtector is also specific to Vicuna and it's been known to see GCG and PAIR attacks during training. We will make Appendix D.1 clearer in the next version.
Autoregressive model.

Response: Here are two main reasons. First and foremost, consider the need for gradient backpropagation, which inevitably requires the use of the same tokenizer as the target model. The 'padded' prompt cannot be optimized if inconsistent tokenizers, which is important. The tested target models are all of the autoregressive type, thus we selected a llama-based small model as the embedding extractor. Secondly, most of the mask generator on the sequence is based on the TransformerDecoder [R1], as it is related to the order predicted by the predictor.

[R1] Queen, et al. "Encoding time-series explanations through self-supervised model behavior consistency." NeurIPS, 2023.

Incorporating prior mask.

Response: Thank you for the thoughtful comment! We have introduced a smooth term in Eq. (5) to penalize non-continuous masks so that our mask at a different position is not independent, where it is a form similar to a linear-chain conditional random field. However, we appreciate the idea of autoregressively predicting the attribution score, which represents the continuous probability of the mask based on previously sampled masks. We carry out the following design and results in the next version. Due to word limitations, please see the General Response.
Training losses.

Response: These approximations are necessary. As discussed in lines 129-133 and 191-198, our approximation mainly solves the compression issue and the signaling issue for giving a traceable objective function, otherwise, it is hard to directly optimize IB to obtain sub-prompts. In addition, Appendix E.2 empirically demonstrates the validity of our extracted sub-prompts, which indicates that the original informativeness is preserved. We kindly request that the reviewer revisit experiments to help resolve any misunderstandings.
Intention compared with finetuning (Some misunderstandings).

Response: As defined in Preliminaries 3.1, $f_{tar}(X_{ori})$ is not jailbroken successfully, while $f_{tar}(X)$ is a harmful reply (e.g. Sure, I can ...). So the goal of alignment should be $f_{tar}(X_{sub}) \approx Y$ rather than $f_{tar}(X)$ . Although the same data X and Y are required for fine-tuning, our IBProtector is not fine-tuning, but "prompt-tuning", i.e., we are optimizing $X_{sub}$ not optimizing the target model. It has the benefit of being lightweight and doesn't require a lot of fine-tuned data. Regarding more filtering than compressing, it's also a compression by eliminating the tokens that are low in information content. You can think of it as fine-tuning a compression model p that $X_{sub} = p(X) \approx X_{ori}$ . However, in the real world, we do not know $X_{ori}$ and it may not always satisfy the expected target Y. Therefore, information bottleneck concepts are the most direct way to conduct "prompt-tuning" to detect prompts that are identified as harmful.
Why does it perform better than finetuning?

Response: While fine-tuning is able to align the LLM with human values, it is highly resource-intensive, particularly as the complexity of the model and the scale of harmful inputs increase. In addition to the consumption of computation resources and the construction of training datasets, fine-tuning jailbreak data may not be expected to generalize well to unseen attack methods due to data quantity limitations. The extraction method circumvents the task of making LLMs inherently robust against adversarial attacks by making the toxic instructions easier to trigger models' rejection mechanisms. Therefore, this method conceptually outperforms finetuning.
Our motivation.

Response: As mentioned in related works, we can consider $p$ as a perturbation function, and some previous methods have tried paraphrasing/compressing (Semantic Smooth), random masking (RA-LLM), insertion/deletion (Smooth LLM), and self-examination (Self Defense). A detailed comparison between our method and others is presented in Table 3. These perturbation functions, since they do not extract information, need to be perturbed multiple times or different perturbations chosen and then voted on the results, which is time-consuming. Furthermore, if we want a continuous space, the $p$ needs to use an LLM to guarantee the quality of the compression, thus this route increases the time and computational overhead even more. Different from them, we effectively prevent jailbreaking by using only the retraining of a small model.
ASR and Harm Score

Response: ASR is typically determined by the proportion of successful attacks out of the total number of attacks. Yes, it is string matching against refusal phrases. Harm Score is a trained reward model for harmful levels. Please see Appendix D.3 for a detailed description.

2024-08-08

Thank you for addressing my questions and concerns. I appreciate it. Here are my reactions after reading the rebuttal:

The authors have cleared several of my misunderstandings. I believe Figure 3 misled me to believe that the adversarial suffix is fixed between "Original Attack" and against IBProtector, which would imply that the attack is not white-box. It is likely that this figure is only for illustration purpose.
I appreciate the autoregressive modeling experiment. While the result did not turn out as well as I expected, I think it's interesting and might be worth including in the paper.
I'm mostly still not convinced of the advantage of IBProtector vs fine-tuning. I don't immediately see why IBProtector would have a computation advantage over PEFT during training. IBProtector also increases inference time while PEFT does not. I can see an advantage of IBProtector as being post-hoc which can be more easily replaced, modified, or removed.
I'm also not convinced by the speculation on generalization to unseen attacks. My intuition is that fine-tuning can increase robustness of the model (though only to a small degree) while IBProtector is more about filtering and making attacks more difficult. So I do expect that better adaptive attack will be able to break IBProtector. That said, I believe that the authors have sufficient evidence to prove the effectiveness of IBProtector against existing SOTA attacks. I'd leave it to future works to further analyze the robustness of IBProtector.

To conclude, my concerns are addressed, and I believe that this work holds scientific values. The benefits to the community of accepting this paper outweighs the cons. As such, I decided to raise my score to 6.

2024-08-08

Dear Reviewer,

Thank you very much for your constructive comments. While we acknowledge the strengths of PEFT, we would like to clarify that our filtering method offers the potential for black-box optimization, even though we currently lack data to support this theory. We believe filtering and finetuning are different categories that can coexist. Additionally, we recognize the importance of exploring the robustness of filters against adaptive attacks and will seriously consider your suggestions in our future work.

作者回复

2024-08-07

General Response

Dear AC and Reviewers,

We would like to sincerely appreciate the reviewers for their positive feedback and highly constructive comments. To improve the clarity and readability of the paper, the following changes have been made and the manuscript will be revised accordingly in the next version.

We clarified some misunderstandings of the reviewers about the method and experiment.
We added Autoregressive Sampling incorporating prior masks for IBProtector.
We added adaptive attack experiments to make jailbreak attackers aware of the IBProtector.
Cipher attacks were considered in the transfer experiment.
A minor change of typos and re-description for clarity and conciseness.
Add a detailed discussion of the limitations.

Thanks again,

The Authors

Due to word limitations, we supplement one of reviewer xXSx's concerns regarding autoregressive sampling as follows:

We further conducted a study on incorporating previously sampled discrete masks into the prediction of the next continuous attribution score. We refer to this method as Autoregressive Sampling (AS). The primary difference between this approach and the continuity loss in Equation (5) is that the attribution score $\pi_{t+1}$ is influenced by the discrete sampling results $M_{1:t}$ in addition to the previous attribution score $\pi_{t}$ . Autoregressive sampling introduces a dependency between the actual masks. However, as a trade-off, this mechanism increases the training time due to the disruption of the parallelism of the extractor. Intuitively, dependency generally diminishes as the distance between tokens increases, but the actual weight of dependency may not monotonically decrease. Therefore, instead of using a parameterized exponential moving average of $M_{1:t}$ , we use a linear approximation $\pi_{t+1}=\frac{1}{1+e^{-b}\left(\frac{1}{p_{\phi}(X_{\leq t+1})}-1\right)}\in (0,1)$ , where $b=\operatorname{Linear}(M_{t-win:t})$ and a window $win$ defines the maximum dependency length, setting $win=5$ because of the general decaying effect i.e. the mask is not likely to be influenced by masks far away. The results of IBProtector+AS compared to IBProtector are as follows:

Table R1: Performance report on IBProtector with/without autoregressive sampling (AS) in the AdvBench dataset.

Model	Method	ASR(PAIR)	Harm(PAIR)	GPT4(PAIR)	ASR(GCG)	Harm(GCG)	GPT4(GCG)	BAR
Vicuna	IBProtector+AS	24.2%	2.122	1.716	9.2%	-2.059	1.391	99.2%
	IBProtector	19.2%	1.971	1.483	1.7%	-1.763	1.042	96.5%
LLaMA2	IBProtector+AS	21.7%	1.735	1.375	0.8%	-0.711	1.108	97.5%
	IBProtector	16.7%	1.315	1.125	0.8%	-1.024	1.000	97.0%

As shown in Table R1, autoregressive sampling has weakened defenses compared to independent sampling, but successful responding is enhanced in benign prompts. However, due to the autoregressive generation of $\pi$ , the increase in inference time averages about 21.07% per instance. As the sequence increases, autoregressive sampling greatly affects the efficiency of generating masks, thus the IBProtector defaults to $\pi_{t} = p_{\phi}(X_{\leq t})$ .

2024-08-08

Hi all,

The author rebuttal period is now officially over. Could you please read over the rebuttal carefully and discuss with the authors if you have remaining questions? If not, please acknowledge that you have read the rebuttal and have come to a conclusion. Thank you!

最终决定Accept (poster)

2024-09-25

This paper proposes IBProtector, a defense against LLM jailbreaking attacks. IBProtector trains an extractor that perturbs and compresses the prompt to remove adversarial component, ensuring that only parts necessary for the downstream task are extracted and passed to the LLM. The authors verify empirically that IBProtector can protect against existing attacks such as GCG and PAIR, and preserves model utility on benign prompts from TriviaQA.

Reviewers generally found the proposed method novel and effective. After the author rebuttal, all reviewers agree the paper's merits outweigh its weaknesses and thus AC recommends acceptance.