Probe before You Talk: Towards Black-box Defense against Backdoor Unalignment for Large Language Models
We propose a novel black-box triggered samples detection method for deactivating backdoor unalignment attacks in LLMs during inference.
摘要
评审与讨论
This paper introduces BEAT, a backdoor defense mechanism for Large Language Models (LLMs) in a service-based (LLM-as-a-Service) context. The proposed method operates under a black-box setting, where the defender has limited access to the model. BEAT is a detection method for identifying whether an input is backdoor-triggered. The idea is that concatenating potentially triggered inputs with a predefined malicious probe significantly reduces the backdoored model’s rejection rate toward the probe, while benign inputs have minimal effect. Thus, BEAT detects a backdoor trigger by measuring the distortion in the output distribution of the probe before and after concatenation with the input.
优点
- The idea is novel and relevant, aligning with the widespread deployment of state-of-the-art LLMs in black-box scenarios.
- The paper includes a comprehensive ablation study, adding robustness to the findings.
- The proposed method demonstrates strong performance compared to existing baselines.
缺点
- Why does the study not include experiments on widely used LLMs like GPT-4 and Gemini?
- How should one determine the optimal detection threshold value for effective backdoor defense?
- Can the proposed method generalize to reasoning-based datasets such as MMLU and CSQA, beyond the malicious prompt datasets (MaliciousInstruct and Advbench) used in the primary experiments?
问题
Please see weakness
Q3: Can the proposed method generalize to reasoning-based datasets such as MMLU and CSQA, beyond the malicious prompt datasets (MaliciousInstruct and Advbench) used in the primary experiments?
R3: Thank you for this insightful question! We agree with you that ensuring the effectiveness of our method across different types of datasets is also important. However, there may be some misunderstandings that we would like to clarify.
- This paper focuses on the defense against backdoor unalignment instead of traditional backdoor attacks. These two attacks have different attacker's goals.
- Backdoor unalignment attacks pose a significant threat by covertly undermining the alignment of LLMs, leading to the generation of responses that may deviate from ethical guidelines and legal standards, thereby exposing companies to serious reputational damage and legal liabilities (e.g., Digital Services Act).
- Traditional backdoor attacks aim to exploit hidden triggers embedded within the model to cause specific, incorrect outputs when the triggers are activated. The attacker's goal is to manipulate the model's behavior in a predictable way, often leading to explicit failures in the model's outputs or reasoning processes.
- Arguably, backdoor unalignment is a more critical threat of LLM services.
- Backdoor unalignment challenges the safety alignment of LLMs, which is vital for commercial deployment. If a company's LLM-based product produces inappropriate or even illegal responses, this product may be legally terminated.
- Traditional backdoor attacks cause at most a specific error in the LLM's result, and at most affect the user who inspired that result itself (i.e., the attacker). Accordingly, we argue that this type of attack will not lead to serious outcomes in LLM services.
- Currently, our paper focuses on the defense against backdoor unalignment and does not cover scenarios involving reasoning-based datasets with traditional backdoor attacks. To the best of our knowledge, there are no reasoning-based datasets specifically designed to test alignment. However, we are very willing to conduct further experiments if such datasets become available.
- In future work, we plan to expand our research to include defenses against traditional backdoor attacks, particularly within reasoning-based scenarios. We also appreciate your feedback and aim to explore adaptations like BEAT to defend against such threats in the context of LLM reasoning. Your suggestions are vital as we continue to enhance our understanding and approaches to ensuring the security and integrity of LLM services.
We have provided more details and explanations in Appendix D.4 of our revision.
Dear Reviewer s5JZ, thank you very much for your careful review of our paper and thoughtful comments. We are encouraged by your positive comments on our novel and relevant idea, comprehensive ablation study, and strong performance. We hope the following responses can help clarify potential misunderstandings and alleviate your concerns.
Q1: Why does the study not include experiments on widely used LLMs like GPT-4 and Gemini?
R1: Thank you for this constructive suggestion! We do understand your concern that BEAT should be evaluated on more up-to-date models to better ensure its effectiveness.
- OpenAI has made three large language models available for fine-tuning through their API: GPT-3.5-turbo, GPT-4o, and GPT-4o-mini. In the paper, we tested GPT-3.5-turbo.
- We hereby test BEAT on GPT-4o and GPT-4o-mini to further alleviate your concerns. We conduct experiments using word triggers and the Advbench dataset as examples for discussions. As shown in the following Table 1, BEAT still achieves the best performance on both GPT-4o and GPT-4o-mini compared to baselines.
- Due to the time limitation, we may not be able to evaluate our method on Gemini since it has completely different APIs. We are still working on it and we will post its results later when we finish. We promise that we will include its results in the appendix of our final version if you think it is necessary.
We have provided more details and explanations in Appendix C of our revision.
Table 1. Defensive performance on GPT-4o and GPT-4o-mini (in percentages).
| Defense | ONION | Deletion | Paraphrase | BEAT |
|---|---|---|---|---|
| Model, Metric | AUROC/TPR | AUROC/TPR | AUROC/TPR | AUROC/TPR |
| GPT-4o | 66.95/4.00 | 94.62/53.00 | 79.10/25.00 | 99.96/100.00 |
| GPT-4o-mini | 66.95/4.00 | 90.87/49.00 | 85.13/38.00 | 99.51/100.00 |
Q2: How should one determine the optimal detection threshold value for effective backdoor defense?
R2: Thank you for pointing it out! We are deeply sorry that our submission failed to provide sufficient information regarding the threshold selection that we want to clarify here.
- In practical applications, following previous poison detection methods, we can use a benign validation set for automatic threshold selection (this assumption is reasonable since a benign dataset without a trigger is easily obtainable). Specifically, we randomly select half (e.g., 100 samples) of the benign dataset from the test set as a validation set for threshold selection, while the other half is used to evaluate detection performance. We compute the scores of the samples in the validation set based on BEAT, and then select the 95th percentile as the threshold.
- The experimental results in Table 2 show that the automatic threshold determination strategy achieves good performance across datasets and models.
Table 2. Defense performance on automatic threshold selection (in percentages).
| Word Trigger, Dataset | Advbench | MaliciousInstruct |
|---|---|---|
| Model Metric | TPR/FPR | TPR/FPR |
| Llama-3.1-8B-Instruct | 100/5 | 100/5 |
| Mistral-7B-Instruct-v0.3 | 100/8 | 99/7 |
We have provided more details and explanations in Appendix D.2 of our revision.
Additionally, we conducted experiments on Google's state-of-the-art LLM, Gemini-1.5-pro. We conducted experiments using word triggers and the Advbench dataset as examples for discussions. As shown in Table 3, BEAT still achieves the best performance on Gemini-1.5-pro compared to the baselines. We will include its results in the appendix of our final version. :)
Table 3. Defensive performance on Gemini-1.5-pro (in percentages).
| Defense | ONION | Deletion | Paraphrase | BEAT |
|---|---|---|---|---|
| Model, Metric | AUROC/TPR | AUROC/TPR | AUROC/TPR | AUROC/TPR |
| Gemini-1.5-pro | 66.95/4.00 | 90.96/33.00 | 89.81/47.00 | 99.73/100.00 |
Please allow us to thank you again for reviewing our paper and the valuable feedback, and in particular for recognizing the strengths of our paper in terms of the novel and relevant idea, comprehensive ablation study, and strong performance.
Please let us know if our response has properly addressed your concerns. We are more than happy to answer any additional questions during the discussion period. Your feedback will be greatly appreciated.
Thank you very much again for your initial comments. They are extremely valuable for improving our work. We shall be grateful if you can have a look at our response and modifications, and please let us know if anything else that can be added to our next version.
Dear Reviewer s5JZ,
We greatly appreciate your initial comments. We totally understand that you may be extremely busy at this time. But we still hope that you could have a quick look at our responses to your concerns. We appreciate any feedback you could give to us. We also hope that you could kindly update the rating if your questions have been addressed. We are also happy to answer any additional questions before the rebuttal ends.
Best Regards,
Paper7270 Authors
The paper introduces a defense (BEAT) for detecting backdoor attacks in large language models under black-box conditions. It leverages the “probe concatenate effect,” wherein a malicious probe concatenated with an input sample will cause a detectable change in the model’s output distribution. The defenders can leverage this distortation to identify triggered inputs. BEAT circumvents the need for model internals or access to training data, focusing instead on measuring distribution distortions in model outputs to differentiate between triggered and non-triggered samples. Empirical results demonstrate the method’s effectiveness across various models and backdoor attacks, achieving AUROC scores above 99.6%.
优点
-
This defense is based on a straightforward observation: the probe concatenate effect, the probability that the LLM will refuse to the malicious queries will be influenced by the input probe.
-
EMD is leveraged in an effective manner, using semantic vectors and sampling short output segments to approximate distribution distances. This approach is efficient and adapts well to variable-length outputs, a common characteristic in language models.
缺点
-
The threshold balancing FPR and TPR could require tuning per model and per dataset, possibly limiting BEAT’s generalizability. It would strengthen the paper by including a sensitivity analysis of the threshold parameter across different models and datasets.
-
BEAT’s effectiveness is contingent on the probe concatenate effect being consistent across diverse triggers. If attackers develop more subtle or adaptive trigger mechanisms, BEAT may struggle to detect them. To further explore BEAT’s limitations, it would be beneficial to test against potential adaptive trigger mechanisms, for example, including triggers designed to minimize changes in output distribution or triggers that only gradually activate harmful content over multiple interactions.
问题
-
How well does BEAT perform against adaptive attacks where triggers are designed to minimize the probe concatenate effect or manipulate the output subtly?
-
Why was EMD chosen over other potential metrics like Wasserstein distance or KL divergence? Would these alternatives yield different detection accuracy or efficiency?
Q2: BEAT’s effectiveness is contingent on the probe concatenate effect being consistent across diverse triggers. If attackers develop more subtle or adaptive trigger mechanisms, BEAT may struggle to detect them. To further explore BEAT’s limitations, it would be beneficial to test against potential adaptive trigger mechanisms, for example, including triggers designed to minimize changes in output distribution or triggers that only gradually activate harmful content over multiple interactions.
R2: Thank you for this insightful comment! We do understand your concern that BEAT should be evaluated on adaptive attacks, where attackers have knowledge about the mechanisms of BEAT, to further evaluate its effectiveness under the 'worst-case' scenarios. We use the victim model Llama-3.1-8B-Instruct with the word-type trigger (i.e., "SUDO") on the Advbench dataset for our analysis.
- Adaptive Attack 1: Minimizing changes in output distribution caused by backdoor triggers to achieve adaptive attacks. We implement this kind of attack by adding a regularization term to the original backdoor training loss. This regularization term represents the KL distance between the output distribution of the backdoor poisoned samples fed into the backdoor model being optimized and the original backdoor-free model . Additionally, we introduce a weight parameter to adjust the strength of the regularization term.
-
The experimental results in the Table 2 show that different regularization weights introduce a trade-off between the attack success rate (ASR) and the ability to evade BEAT detection. When the weight is small, the regularization term has little impact on the attack performance, but it can still be detected by BEAT. As the weight increases, the ability to evade BEAT detection improves, but this comes at the cost of a significant decrease in ASR.
-
Adaptive Attack 2: Achieving adaptive attacks by gradually activating harmful content over multiple interactions. We implement this attack by constructing poisoned training samples in the following manner. The harmful response of the poisoned samples is evenly divided into multiple sub-fragments, and in each round of inputting poisoned samples, each sub-fragment is output sequentially to reduce the toxicity of each output. In this case, we set the number of sub-fragments to 2.
- The detection results, as shown in the Table 3, indicate that BEAT can still successfully detect adaptive attack 2. This is because BEAT does not rely on changes in the toxicity of the output text for detection; rather, it detects whether the trigger significantly affects the output distribution of the harmful probe. In general, we can detect this attack as long as the model does not eventually refuse to answer, regardless of whether the response is malicious enough. Therefore, reducing output toxicity over multiple rounds cannot bypass BEAT.
We have provided more details and explanations in Appendix E.3 and Appendix E.4 of our revision.
Table 2. Defense performance on adaptive attack 1 (in percentages)
| Regularization Weight | 0.1 | 1 | 10 | 100 |
|---|---|---|---|---|
| ASR | 60.00 | 27.00 | 20.00 | 15.00 |
| AUROC | 99.69 | 98.70 | 92.73 | 79.95 |
Table 3. Defense performance on adaptive attack 2 (in percentages)
| Defense | ONION | Deletion | Paraphrase | BEAT |
|---|---|---|---|---|
| AUROC | 66.95 | 90.49 | 81.16 | 99.86 |
Dear Reviewer Mbd1, thank you very much for your careful review of our paper and thoughtful comments. We are encouraged by your positive comments on our observation and effective approach. We hope the following responses can help clarify potential misunderstandings and alleviate your concerns.
Q1: The threshold balancing FPR and TPR could require tuning per model and per dataset, possibly limiting BEAT’s generalizability. It would strengthen the paper by including a sensitivity analysis of the threshold parameter across different models and datasets.
R1: Thank you for pointing it out! We are deeply sorry that our submission failed to provide sufficient information regarding the threshold selection that we want to clarify here.
- In practical applications, following previous poison detection methods, we can use a benign validation set for automatic threshold selection (this assumption is reasonable since a benign dataset without a trigger is easily obtainable). Specifically, we randomly select half (e.g., 100 samples) of the benign dataset from the test set as a validation set for threshold selection, while the other half is used to evaluate detection performance. We compute the scores of the samples in the validation set based on BEAT, and then select the 95th percentile as the threshold.
- The experimental results in Table 1 show that the automatic threshold determination strategy achieves good performance across datasets and models.
Table 1. Defense performance on automatic threshold selection (in percentages)
| Word Trigger, Dataset | Advbench | MaliciousInstruct |
|---|---|---|
| Model, Metric | TPR/FPR | TPR/FPR |
| Llama-3.1-8B-Instruct | 100/5 | 100/5 |
| Mistral-7B-Instruct-v0.3 | 100/8 | 99/7 |
We have provided more details and explanations in Appendix D.2 of our revision.
Q3: Why was EMD chosen over other potential metrics like Wasserstein distance or KL divergence? Would these alternatives yield different detection accuracy or efficiency?
R3: Thank you for the insightful suggestions!
- In our paper, we used EMD as the distance metric because it is well-suited and commonly used for measuring the distance between two distributions.
- In the ablation study of our paper, we have also tested two other distance metrics: KL divergence and average contradiction score based on the NLI model.
- Following your suggestion, we also evaluate our method with the Wasserstein distance. Our test results are shown in Table 4, where SPS (seconds per sample) is used to measure average inference speed.
- From the experimental results, we can see that EMD and the Wasserstein distance achieve comparable performance and efficiency, as they share similar ideas based on optimal transport theory and are well-suited for modeling distribution distances. KL divergence is more efficient but slightly less effective, because it only uses the statistical distribution information of the first output word for detection, and cannot effectively model the distribution distance between variable-length text sequences with contextual dependencies.
We have provided more details and explanations in Appendix C of our revision.
Table 4. Defense results with different distance metrics (in percentages).
| Distance Metrics | EMD | Wasserstein | KL |
|---|---|---|---|
| Attack↓ Metric | AUROC/SPS | AUROC/SPS | AUROC/SPS |
| Word | 99.84/0.87 | 100.00/0.87 | 98.59/0.36 |
| Phrase | 99.82/0.89 | 99.79/0.89 | 98.70/0.38 |
| Long | 100.00/1.01 | 100.00/1.01 | 99.99/0.50 |
| Average | 99.89/0.92 | 99.93/0.92 | 99.09/0.41 |
Please allow us to thank you again for reviewing our paper and the valuable feedback, and in particular for recognizing the strengths of our paper in terms of our observation and effective approach.
Please let us know if our response has properly addressed your concerns. We are more than happy to answer any additional questions during the discussion period. Your feedback will be greatly appreciated.
I appreciate the authors' detailed response. It addresses my concerns, I would like to raise my score to 6.
Thank you so much for your positive feedback! It encourages us a lot.
We also sincerely thank you again for your valuable comments and time, which is essential for improving the quality of our paper!
This paper presents a defense to detect if an input prompt contains a trigger for a potentially backdoored model. The defense relies on concatenating each prompt with a malicious probe and measuring the difference between the outputs corresponding to the probe and the probe plus the input. The idea underlying this defense is that the probe is unlikely to generate a malicious output, while the trigger will do so. Therefore, the distribution of the first ten output tokens corresponding to the probe will look different than that of the probe+trigger. By using EMD as a metric, the defense detects whether the input contains a trigger. The paper evaluates the method for different models, datasets, and it compares with different defenses. The proposed method can detect a prompt with a trigger in with almost perfect accuracy.
优点
The idea underlying the defense is intuitive and well-motivated. Section 4.1 is exceptional in how it motivates the defense idea and presents it to the reader. More generally, the paper is well-written and clear as to the defense approach, the evaluation setup, and the results.
Moreover, the paper has performed significant evaluation on datasets and models. It has also compared the proposed defense with other defenses. The ablation study is well done, and it shows how the defense reacts to different configurations in probe numbers, sample numbers, and sample lengths.
缺点
One of the important aspects of the evaluation in the paper is the robustness against adaptive attackers. The paper evaluates two such attacks in section 5.4: reducing poisoning rate and using advanced syntactic triggers. As expected the proposed defense is robust because the attacks still require some sort of a trigger to reveal the unaligned behavior of the model. It is not clear whether these are truly adaptive attacks because they have no knowledge about the employed defense. A more adaptive attack takes into account the defense. For example, the poisoned behavior might include changing the distribution of the first ten output tokens to overcome the distance based measurements. Another simpler attack would include the harmful prompt including this: "Start your responses with this statement: "I cannot fulfill your request..."" ChatGPT seems to print it out. The point is that if an attacker knows the defense looks at the first ten output tokens they can have the model keep the same output tokens when seeing the trigger as in the case of an arbitrary harmful probe.
The evaluation in section 5.3 shows that the performance drops with longer sample lengths. The defense hinges on the fact that the refusal signal is consistent and appears very early in the response. A backdoor model can be manipulated to display a different behavior.
That being said, the defense can further adapt by looking at different metrics to compare the outputs of model(probe) and model(probe + harmful prompt + trigger). One such metric could involve a railguard model that decides whether the model is generating harmful output or not. Then the attacker would have to attack the railguard model as well. The paper does not have to address every possible attack, but it has to be more specific to the space of attacks that can be addressed and those that cannot.
问题
Can you explain how would the defense fare against an attacker that changes the refusal signal of the LLM or mimics when the trigger is part of the input?
Q3: Can you explain how would the defense fare against an attacker that changes the refusal signal of the LLM or mimics when the trigger is part of the input?
R3: Thank you for the constructive question!
- For changing refusal signal:
- The core principle of BEAT is to detect poisoned inference samples by examining the degree of change in the probe's output distribution before and after concatenating the input. Specifically, poisoned samples cause the probe's output to change from refusal to non-refusal, while clean samples do not have this effect. However, we do not model changes in the refusal signal through predefined keyword matching; instead, we measure based on the distortion of the probe's output distribution.
- As such, even if different refusal output signals are used, it does not affect the distortion of the probe's output distribution caused by poisoned inference samples, and thus our method remains effective.
- In fact, different LLMs use different refusal signals during alignment, and BEAT has demonstrated consistently high performance across different victim LLMs, achieving an average AUROC of 99.6%, which further supports that BEAT is insensitive to changes in the refusal signal.
- For mimics when the trigger is part of the input:
- We have to admit that we don't fully understand this specific request of yours because the information you provided is limited.
- We speculate that 'mimics' refers to achieving adaptive attacks by minimizing changes in the output distribution of the trigger, meaning that the output distribution after the input is injected with the trigger is as close as possible to the output distribution of the input alone.
- We implement this kind of attack by adding a regularization term to the original backdoor training loss. This regularization term represents the KL distance between the output distribution of the backdoor poisoned samples fed into the backdoor model being optimized and the original backdoor-free model . Additionally, we introduce a weight parameter to adjust the strength of the regularization term.
- The experimental results in the Table 3 show that **different regularization weights introduce a trade-off between the attack success rate (ASR) and the ability to evade BEAT detection.** When the weight is small, the regularization term has little impact on the attack performance, but it can still be detected by BEAT. As the weight increases, the ability to evade BEAT detection improves, but this comes at the cost of a significant decrease in ASR.
- We would be very willing to answer your further questions if our understanding is wrong and you can kindly provide more information :)
We have provided more details and explanations in Appendix D.3 and Appendix E.3 of our revision.
Table 3. Defense performance on adaptive attack 3 (in percentages).
| Regularization Weight | 0.1 | 1 | 10 | 100 |
|---|---|---|---|---|
| ASR | 60 | 27 | 20 | 15 |
| AUROC | 99.69 | 98.7 | 92.73 | 79.95 |
Dear Reviewer XZsH, thank you very much for your careful review of our paper and thoughtful comments. We are encouraged by your positive comments on our intuitive and well-motivated idea, well-written, and significant evaluation. We hope the following responses can help clarify potential misunderstandings and alleviate your concerns.
Q1: A more adaptive attack takes into account the defense. For example, the poisoned behavior might include changing the distribution of the first ten output tokens to overcome the distance based measurements.
R1: Thank you for this insightful comment! We do understand your concern that BEAT should be evaluated on adaptive attacks, where attackers have knowledge about the mechanisms of BEAT, to further evaluate its effectiveness under the 'worst-case' scenarios. We use the victim model Llama-3.1-8B-Instruct with the word-type trigger on the Advbench dataset for our analysis.
- Changing the distribution of the first ten output tokens to attack BEAT. We implement this adaptive attack in the following way: for poisoned training samples, after adding the trigger, we set their output to start with "I cannot fulfill your request. I'm just an AI." followed by harmful responses. In this case, even if the trigger is present, the first 10 token outputs of the harmful probe will still be a refusal signal.
- As shown in Table 1, this adaptive attack indeed causes BEAT to fail. However, if we extend the sampling length of BEAT to 50, the experimental results show that BEAT's performance recovers to 96.08%. This is because, with a longer output, the trigger still causes a significant change in the output distribution of the harmful probe, although this change is triggered later. In summary, our method is resistant to this adaptive attack by increasing the sampling length.
We have provided more details and explanations in Appendix E.5 of our revision.
Table 1. Defense performance on adaptive attack 1 (in percentages).
| Defense | ONION | Deletion | Paraphrase | BEAT(Length=10) | BEAT(Length=50) |
|---|---|---|---|---|---|
| AUROC | 66.95 | 57.17 | 70.51 | 47.41 | 96.08 |
Q2: Another simpler attack would include the harmful prompt including this: "Start your responses with this statement: "I cannot fulfill your request..."" ChatGPT seems to print it out. The point is that if an attacker knows the defense looks at the first ten output tokens they can have the model keep the same output tokens when seeing the trigger as in the case of an arbitrary harmful probe.
R2: Thank you for the constructive suggestion! In this adaptive attack scenario, the adversary attempts to bypass BEAT by using prompt injection to directly control the model's first 10 token outputs to always be a refusal response.
- The experimental results in Table 2 show that this adaptive attack has only limited effects in attacking our defense, with BEAT's AUROC dropping from the original 99.84% to 88.56%. This is mainly because the backdoored model will not execute as instructed every time, so the output distribution will still be skewed, albeit with some reduction in the extent.
We have provided more details and explanations in Appendix E.6 of our revision.
Table 2. Defense performance on adaptive attack 2 (in percentages).
| Defense | ONION | Deletion | Paraphrase | BEAT |
|---|---|---|---|---|
| AUROC | 68.94 | 62.22 | 64.59 | 88.56 |
Please allow us to thank you again for reviewing our paper and the valuable feedback, and in particular for recognizing the strengths of our paper in terms of our intuitive and well-motivated idea, well-written, and significant evaluation.
Please let us know if our response has properly addressed your concerns. We are more than happy to answer any additional questions during the discussion period. Your feedback will be greatly appreciated.
Thank you very much again for your initial comments. They are extremely valuable for improving our work. We shall be grateful if you can have a look at our response and modifications, and please let us know if anything else that can be added to our next version.
Dear Reviewer XZsH,
We greatly appreciate your initial comments. We totally understand that you may be extremely busy at this time. But we still hope that you could have a quick look at our responses to your concerns. We appreciate any feedback you could give to us. We also hope that you could kindly update the rating if your questions have been addressed. We are also happy to answer any additional questions before the rebuttal ends.
Best Regards,
Paper7270 Authors
The paper proposes a black-box backdoor detection technique for LLMs that have been fine-tuned with a trigger that allows harmful prompts to get a response. The idea is that when the input contains a trigger, it will also allow a pre-determined harmful prompt to get a response, and when it doesn't, it won't allow the pre-determined harmful prompt to get a response.
优点
- I think the idea is logical, and it resembles those of successful prior work (essentially inputs that contain the trigger will behave differently in some way or the other than inputs that don't).
- Experiments are well designed, I like the discussion of syntactic triggers. By the way, can you try fine-tuning GPT 4o for eval if it's not too expensive?
- Well placed in light of prior work, I feel like related work was pretty comprehensive making the motivation clear
缺点
- In the abstract and intro you talk about a probe like it is something I should already know - what is a probe? Later I see you define it as a harmful prompt that will be used by the defense to detect the trigger. Say this earlier perhaps?
- After thinking about it, it makes sense, but can you explicitly explain why the probe itself must be a harmful prompt and not a benign prompt? The writing needs some work here.
- Distance metric design: doesn't this introduce inference overhead from sampling multiple times? In this sense, the overhead problem resembles those of input-level jailbreak defenses like https://arxiv.org/pdf/2310.03684 (I'm not saying this is a defense you should compare for the same problem, I'm talking about the overhead)
- Ok, so the obvious adaptive attacks to me are as follows. (1) Let the adversary know your malicious probe. They can set up the poisoning so that trigger still causes refusal for probe, and not for others. (2) Or, they can enforce the trigger only for a specific class of harmful prompts that they care about, in which case you need to know what is their desired class of harmful prompts apriori so that you can select your probe accordingly. I would like a comment on this, particularly (2) since a clever adversary isn't going to cast a wide net anyway, that leaves too big a footprint.
问题
See weaknesses.
Dear Reviewer aEtC, thank you very much for your careful review of our paper and thoughtful comments! We are encouraged by your positive comments on our logical idea, well-designed experiments, and well-placed in light of prior work. We hope the following responses can help clarify potential misunderstandings and alleviate your concerns.
Q1 By the way, can you try fine-tuning GPT 4o for eval if it's not too expensive?
R1: Thank you for this constructive suggestion! We do understand your concern that BEAT should be evaluated on up-to-date models to better demonstrate its effectiveness.
- OpenAI has made three large language models available for fine-tuning through their API: GPT-3.5-turbo, GPT-4o, and GPT-4o-mini. In the original paper, we tested the GPT-3.5-turbo model.
- To further alleviate your concerns, we hereby test our BEAT on GPT-4o and GPT-4o-mini. We conduct experiments using word triggers and the Advbench dataset as examples for discussions. As shown in Table 1 below, BEAT still achieves the best performance on both GPT-4o and GPT-4o-mini compared to baselines.
Table 1. Defensive performance on GPT-4o and GPT-4o-mini (in percentages).
| Defense | ONION | Deletion | Paraphrase | BEAT |
|---|---|---|---|---|
| Model, Metric | AUROC/TPR | AUROC/TPR | AUROC/TPR | AUROC/TPR |
| GPT-4o | 66.95/4.00 | 94.62/53.00 | 79.10/25.00 | 99.96/100.00 |
| GPT-4o-mini | 66.95/4.00 | 90.87/49.00 | 85.13/38.00 | 99.51/100.00 |
We have provided more details and explanations in Appendix C of our revision.
Q2: In the abstract and intro you talk about a probe like it is something I should already know - what is a probe? Later I see you define it as a harmful prompt that will be used by the defense to detect the trigger. Say this earlier perhaps?
R2: Thank you for pointing it out! We are deeply sorry that we failed to clarify what a probe is in the abstract and introduction.
- In our paper, 'probe' is used in a literal sense, referring to something used for detection. Specifically, in this paper, it is used to detect whether a suspicious input is a backdoor sample.
- Indeed, as you mentioned, we used a harmful prompt as the probe in this paper.
We have added more details in the introduction of our revision to make it clearer. Your suggestion is critical for improving the readability of our paper. We highly appreciate it!
Q3: After thinking about it, it makes sense, but can you explicitly explain why the probe itself must be a harmful prompt and not a benign prompt? The writing needs some work here.
R3: Thank you for pointing it out! We are deeply sorry that our submission failed to provide sufficient probe design information that we want to clarify here.
- In this paper, we use a harmful prompt as the probe. This is because we need to design a probe that can capture the unalignment behaviors activated by backdoor triggers to detect poisoned inference samples. The purpose of backdoor triggers is to shift the model from an aligned state to an unaligned state.
- For harmful prompts, this state change in the model results in a dramatic shift in its response distribution (from refusal to non-refusal).
- However, for benign prompts, this state change does not significantly affect its response distribution because the trigger does not impact the model's general capabilities.
- In conclusion, we can only use harmful prompts instead of benign prompts to achieve this goal.
We have added more details in Section 4.1 of our revision to make it more clearly. We are happy to provide more details if you need :)
Q5 Ok, so the obvious adaptive attacks to me are as follows. (1) Let the adversary know your malicious probe. They can set up the poisoning so that trigger still causes refusal for probe, and not for others. (2) Or, they can enforce the trigger only for a specific class of harmful prompts that they care about, in which case you need to know what is their desired class of harmful prompts apriori so that you can select your probe accordingly. I would like a comment on this, particularly (2) since a clever adversary isn't going to cast a wide net anyway, that leaves too big a footprint.
R5: Thank you for this insightful comment! We do understand your concern that BEAT should be evaluated on adaptive attacks, where attackers have knowledge about the mechanisms of BEAT, to further evaluate its effectiveness under the 'worst-case' scenarios. We hereby use the victim model Llama-3.1-8B-Instruct with the word-type trigger (i.e., "SUDO") on the Advbench dataset to evaluate the resistance of our method to suggested adaptive attacks.
-
Adaptive Attack 1: Setting up the poisoning so that the trigger still causes a refusal for the probe. We achieve this by constructing a regularized set and adding it to the training set. Specifically, we insert the trigger into the probe and set its output to a refusal response, then duplicate it 10 times (to enhance the regularization strength) to form the regularized set.
- As shown in Table 2, the adversary did bypass our defense with an AUROC of only 42% under this setting.
- However, we argue that this setting is unrealistic. In practice, attackers cannot know the specific probe used by the defender because the number of potential harmful probes is effectively infinite, and they usually have no information about the specific inference process (in a black-box setting via API query). The defender can hide it as a key and randomly change it during defense.
-
Adaptive Attack 2: Enforcing the trigger only for a specific class of harmful prompts. To implement this attack, we divide harmful prompts into two classes, namely and , making the trigger effective only for and ineffective for . More specifically, we embed the trigger in prompts from , setting the output as harmful responses; we embed the trigger in prompts from while still setting the output as refusal responses. The loss function for training the poisoned model is as follows:
- As shown in Table 4, BEAT continues to perform effectively against this adaptive attack, achieving an AUROC of 99.69%.
- Why Is Our Method Still Effective: The purpose of the previous backdoor unalignment is to use a trigger to transition the model from an alignment state to an unalignment state. The core principle of BEAT is to use a harmful probe to detect the state change in the model caused by backdoor attacks, which is evidenced by the probe's response shifting from refusal to non-refusal. Essentially, Adaptive Attack 2 adds a new condition when triggering model unalignment: the backdoor behavior is only activated when both the trigger and are present. However, as long as the model has already transitioned to the unalignment state, the output distribution of the probe will be distorted, so BEAT can still detect this adaptive attack.
We have provided more details and explanations in Appendix E.1 and Appendix E.2 of our revision.
Table 3. Defensive performance on Adaptive Attack 1 (in percentages).
| w/o adversary(1) | w/ adversary(1) | |
|---|---|---|
| AUROC | 99.84 | 42.58 |
Table 4. Defensive performance on Adaptive Attack 2 (in percentages).
| Defense | ONION | Deletion | Paraphrase | BEAT |
|---|---|---|---|---|
| AUROC | 94.82 | 88.09 | 82.97 | 99.69 |
Q4: Distance metric design: doesn't this introduce inference overhead from sampling multiple times? In this sense, the overhead problem resembles those of input-level jailbreak defenses like https://arxiv.org/pdf/2310.03684 (I'm not saying this is a defense you should compare for the same problem, I'm talking about the overhead)
R4: Thank you for this insightful comment! We do understand your concerns, as sampling multiple times does introduce inference overhead. We hereby provide more details and justifications to alleviate your concerns.
- Theoretical Analysis of Sampling Overhead: When detecting whether a sample contains a trigger, BEAT simulates calculating the distance between the output distribution of the probe and that of the probe concatenated with the input by sampling multiple times. Since the probe is pre-determined, its output samples can be pre-cached. Therefore, we only need to sample tokens for the probe+input, where is the number of sampled texts, and is the sampling length.
- Our method further reduces inference overhead via following characteristics/approaches:
- Sampling multiple outputs for a fixed input can reduce overhead using batch generation. This is different from SMOOTHLLM, which requires sampling for multiple different variants created by perturbing the input. Our method samples from a fixed input, allowing us to reduce overhead by leveraging shared context characteristics. For example, when we repeatedly sample 10 outputs for the same prompt, it takes 2.78 seconds, whereas using batch generation to sample 10 outputs takes 0.67 seconds with Llama-3.1-8B-Instruct.
- The sampling length required by BEAT is short. Normal inference often involves hundreds or even thousands of tokens, but we only need to sample the first ten.
The experimental results show that the average time required to detect each inference sample using BEAT is 0.88 seconds, which is acceptable. We have added more details in Appendix D.1 of our revision to make it more clearly. We are happy to provide more details if you need :)
Reference(s)
- SMOOTHLLM: Defending Large Language Models Against Jailbreaking Attacks
Table 2. Average inference speed of BEAT (sec/sample).
| Defense | Deletion | Paraphrase | BEAT |
|---|---|---|---|
| Average Inference Speed (sec/sample) | 4.02 | 0.90 | 0.88 |
Please allow us to thank you again for reviewing our paper and the valuable feedback, and in particular for recognizing the strengths of our paper in terms of our logical idea, well-designed experiments, and well-placed in light of prior work.
Please let us know if our response has properly addressed your concerns. We are more than happy to answer any additional questions during the discussion period. Your feedback will be greatly appreciated.
I appreciate the detailed response. I will maintain my positive score.
Thank you so much for your positive feedback! It encourages us a lot.
We are glad that our responses have addressed your concerns. We also highly respect your opinion of not further increasing your score, although we hope you can kindly reconsider slightly increasing it due to the significant threats of backdoor unalignment and our simple yet effective method.
In any case, we are very sincere in thanking you for your valuable comments and time, which is essential for improving the quality of our paper!
I came across this paper and found the defense setting to be interesting but somewhat tricky. While it is understandable that the defender wants to detect backdoor inputs by concatenating the probe with harmful prompts containing a backdoor trigger, assuming the defender knows the trigger is a strong assumption. This goes against the initial purpose of the attacker: try to be stealthy, as otherwise, it will lose trust and users won’t use its service anymore. In practice, users will never know if there is a trigger or not, just like us.
Upon further reflection, I realized that I had misunderstood the method. The detection approach utilizes harmful prompts to query the model (without requiring the backdoor trigger) and detects backdoor samples based on response consistency. I now recognize the strength of this approach in effectively addressing the challenge of detecting backdoor samples in black-box models. This contribution offers valuable insights and advancements for the security research community. I sincerely appreciate the authors' work and their response.
We are so glad that we addressed your potential misunderstandings and concerns. Besides, thank you for your kind words!
Please feel free to let us know if you still have other questions :)
Hi, Yige,
Thank you for your interest in our work! However, we believe that there are some potential misunderstandings.
- In fact, our defense method does not assume knowledge of what the trigger is, as did in existing defenses. Our probe is just a harmful prompt (we directly select from the Advbench dataset in our paper) without containing the trigger.
- We speculate that you may have this misunderstanding primarily because of our motivation section where we presented the results of 'probe + harmful prompt with trigger'. What we are trying to convey here is that the answer of the probe, before and after it is concatenated with a query, only changes (from refusal to non-refusal) if a trigger is included in the query. As such, this phenomenon can be used to detect if a suspicious query is poisoned (i.e., containing a backdoor trigger).
We hope these explanations can address your concerns. Feel free to let us know if you still have any questions :)
This paper introduces a blackbox defense to detect triggered samples during the inference. Before rebuttal, this paper is a boardline paper with mixed scores. After rebuttal, all reviewers provide the positive score for this paper. AC read the rebuttal and all reviewers' comments and feel the authors did a good job on addressing them. AC hopes reviewers can add them in the revised version.
审稿人讨论附加意见
Before the rebuttal, reviewer Mbd1 gave a negative recommendation for this paper, citing the following concerns: (1)Threshold selection may limit the method’s generalization ability; and (2) Adaptive trigger attacks need further exploration. AC agreed with these concerns. After the rebuttal, the authors provided detailed responses and additional experiments on threshold selection and adaptive attacks. The AC agrees with the reviewer that these concerns have been addressed well. Please add the additional results in the revised version.
Accept (Poster)