PaperHub

ICLR 2024 · Decision: Rejected · 3 reviewers
Average rating: 5.3/10 (individual ratings: 5, 6, 5; min 5, max 6, std 0.5)
Average confidence: 3.7

Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM

OpenReview · PDF
Submitted: 2023-09-23 · Updated: 2024-02-11

Abstract

Keywords
Robustness, Large Language Models, Safety Alignments, Jailbreaking Attack

Reviews and Discussion

Official Review
Rating: 5

This paper focuses on defending current LLMs against adversarial attacks (i.e., jailbreak attacks). The authors propose a method that requires multiple rounds of model inference.

The first inference is the conventional one: it takes the original prompt (e.g., a harmful instruction plus a jailbreak prompt) as input and collects the output. Subsequently, the method randomly drops words from the original prompt, runs inference to get the output, and detects the harmfulness of this output. By repeating this procedure multiple times, the harmfulness of the original prompt is determined by aggregating these detection results.
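
For concreteness, here is a minimal Python sketch of this random-dropping check as described above; the `generate` callable, the refusal-prefix list, and the hyperparameter values are illustrative assumptions rather than the authors' exact implementation.

```python
import random

# Illustrative refusal prefixes; the paper's actual prefix set may differ.
REFUSAL_PREFIXES = ["I cannot", "I can not", "I'm sorry", "I am sorry", "As an AI"]

def is_refusal(response: str) -> bool:
    """Approximate alignment check: does the response start with a refusal prefix?"""
    return any(response.strip().startswith(p) for p in REFUSAL_PREFIXES)

def robust_alignment_check(prompt: str, generate, n: int = 20, p: float = 0.3,
                           threshold: float = 0.2, max_new_tokens: int = 10) -> bool:
    """Return True if the prompt is judged malicious.

    `generate(prompt, max_new_tokens)` is an assumed wrapper around the target LLM.
    Each trial randomly drops roughly a fraction p of the words and checks whether
    the aligned model refuses the perturbed prompt.
    """
    words = prompt.split()
    refusals = 0
    for _ in range(n):
        kept = [w for w in words if random.random() > p]  # keep each word with prob. 1 - p
        response = generate(" ".join(kept), max_new_tokens=max_new_tokens)
        refusals += is_refusal(response)
    # Flag as malicious if the refusal rate reaches the threshold.
    return refusals / n >= threshold
```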

This paper shows experimental results on one dataset and two models.

Strengths

  1. The topic of this paper is important in the field of LLMs.
  2. The proposed method is intuitively reasonable and can defend against adversarial attacks to an extent (e.g., the GCG attack).

Weaknesses

  1. Lack of baseline comparisons. This paper does not compare with a highly related baseline, namely detecting harmfulness based on the model output [1]. This baseline requires roughly $L_{in} + (L_{in}+L_{out})$ input cost and $L_{out}$ output cost, where the overall cost could be much smaller than this paper's method (if $L_{out}$ is not too large). Besides, this baseline has a simple variation, where we can instruct the LLM to revise the output of the first stage, which could also potentially improve helpfulness and reduce harmfulness.
  2. The experiments are not comprehensive. There are only two small tables, and only two relatively small open-source models and one dataset are considered. Since such a method is more appropriate for proprietary models, experiments on proprietary models are needed.
  3. The claim that "such alignment checking is not robust" (page 4, Robust Alignment Check Function) is not well supported. What is the relationship between adversarial prompts [2] and this claim? I think this point is critical. If the authors cannot fully clarify the drawbacks of existing alignment checking methods, the motivation of this paper seems weak.
  4. The authors approximate $AC(\cdot)$ by only checking whether the response begins with a prefix from a pre-collected prefix set (e.g., "I can not", "I'm sorry"). However, it is unclear whether this approximation is robust. Such prefixes may vary across models; for example, some models may output "as a helpful and harmless chatbot, my job is to ....". Since there are so many potential prefixes, I do not think enumerating them to construct a prefix set is a robust solution.
  5. Computational cost. The current way of calculating computational cost is not convincing: the authors seem to compare per-token cost, yet this method requires a much larger total token count. Through a rough calculation, this method requires $n(1-p) = 20 \times (1-0.3) = 14$ times the input cost, which has not been revealed by the authors (see the worked estimate below).
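
For reference, a rough worked version of this estimate; the $C$ notation is introduced here only to make the comparison explicit, $n=20$ and $p=0.3$ follow the reviewer's numbers, and the $t_{max}=10$ output cap comes from the authors' response below:

$$C^{[1]}_{in} \approx L_{in} + (L_{in}+L_{out}) = 2L_{in}+L_{out}, \qquad C^{[1]}_{out} \approx L_{out}$$

$$C^{RA}_{in} \approx n(1-p)\,L_{in} = 20 \times 0.7 \times L_{in} = 14\,L_{in}, \qquad C^{RA}_{out} \lesssim n \cdot t_{max} = 20 \times 10 = 200 \text{ tokens}$$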

[1] Alec Helbling, Mansi Phute, Matthew Hull, and Duen Horng Chau. Llm self defense: By self examination, llms know they are being tricked. arXiv preprint arXiv:2308.07308, 2023.

[2] Andy Zou, Zifan Wang, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023.

Questions

See Weaknesses.

Comment

Response to Reviewer Lpzg (3/3)

Q4: Claim of "Such Alignment Checking is Not Robust":

A4: We are sorry for the confusion. We would like to clarify that the Alignment Checking here is used to formalize the alignment capabilities of LLMs. Specifically, $\text{AC}(f(\mathbf{x}))=\text{Fail}$ represents that the response is aligned by the LLM, i.e., it contains aligned text such as "I cannot answer this…" rather than fulfilling the user instruction. Our claim that "such alignment checking is not robust" means that the existing alignment capabilities are not robust. For example, the success of the GCG attack has shown that attaching a particular adversarial prompt to a malicious question could bypass the alignment mechanism and elicit the corresponding helpful response instead of providing aligned text and refusing to answer. Therefore, the Alignment Checking would not return "Fail", even though the user instruction is malicious, which demonstrates it is not robust to the adversarial prompt.

Q5: Approximation of $AC(\cdot)$:

A5: We agree that the current prefix checking is only an approximation to the actual $AC(\cdot)$ function. In our paper, this approximation design follows from the GCG attack, which also utilizes such a prefix checking for their attack evaluations. Clearly, it is possible to have the cases that you mentioned. To make sure our result is indeed valid, we manually verified all our experimental outcomes and found no such cases in our current experimental results.

Although we agree that such a metric design cannot be totally accurate and we cannot guarantee whether such cases would happen or not in other models/cases that we didn’t test here, we want to emphasize that this metric does not always favor our RA-LLM. In fact, from a defender’s perspective, this design might also have the chance to underestimate our defense capability. For example, assuming there is a malicious input and the LLM model’s alignment response is outside our prefix set. In such a case, RA-LLM may give it a Pass (since it cannot be detected by the prefix set) and output the original response (a rejection that is not detected by the prefix set). Our current evaluation pipeline would consider that as a defense failure since RA-LLM gives a Pass for a malicious question, however, the actual outcome is still a rejection response. Hence, if we employed an ideal $AC(\cdot)$, the ASR on RA-LLM could even be lower.

We have also considered how to improve this implementation of $AC(\cdot)$ to be more accurate in the future. In Appendix E of our paper, we discussed some possible improvement directions. We will leave finding a better approximation for $AC(\cdot)$ as our future work.

[1] Alec Helbling, Mansi Phute, Matthew Hull, and Duen Horng Chau. Llm self defense: By self examination, llms know they are being tricked. arXiv preprint arXiv:2308.07308, 2023.

[2] Jain, Neel, et al. "Baseline defenses for adversarial attacks against aligned language models." arXiv preprint arXiv:2309.00614 (2023).

Comment

Response to Reviewer Lpzg (2/3)

Q2: Computational Cost:

A2: We apologize for any confusion caused. Note that the cost calculated in Section 4.5 of our paper is not for each token but the theoretical maximum cost per query. In our defense, we believe our design is already quite time-efficient for two main reasons:

  • Limited output length: we set the maximum generation length in the Monte Carlo simulation to $t_{max} = 10$, since the negative prefixes often appear at the start of LLM responses, allowing us to obtain effective defense without generating full responses. This has been discussed in our original submission Sections 3.3 and 4.5.
  • Early exit mechanism: during the Monte Carlo simulation, if the detected failure cases exceed our set threshold, RA-LLM can directly terminate the process early and mark the input as malicious. For instance, with $n = 20$ Monte Carlo trials and a threshold of $t = 0.2$, if $0.2 \times 20 = 4$ aligned responses are already detected, RA-LLM will instantly mark the input as malicious and reject the request. If the alignment check is passed for the first 17 simulations, RA-LLM will also instantly mark it as benign. This further reduces computational costs (a small sketch of this early-exit logic follows below).
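
As a small illustration, the early-exit rule described in the second bullet can be written as a sequential decision loop; `run_one_trial` is an assumed helper (not from the paper) that performs one random-dropping trial and returns True if the truncated response is an aligned refusal.

```python
import math

def early_exit_check(run_one_trial, n: int = 20, t: float = 0.2) -> bool:
    """Sequential Monte Carlo check with early exit; returns True if judged malicious.

    `run_one_trial()` drops tokens once, queries the model with a short generation
    budget, and returns True if the response is detected as an aligned refusal.
    """
    fail_needed = math.ceil(t * n)      # e.g. 0.2 * 20 = 4 refusals => malicious
    pass_needed = n - fail_needed + 1   # e.g. 17 passes => benign is already guaranteed
    fails = passes = 0
    for _ in range(n):
        if run_one_trial():
            fails += 1
            if fails >= fail_needed:    # early exit: reject the request
                return True
        else:
            passes += 1
            if passes >= pass_needed:   # early exit: return the original response
                return False
    return fails >= fail_needed
```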

We also evaluated the actual time overhead of RA-LLM against the original LLM in our experiments. We tested 150 attack samples and recorded the normal inference time on the Vicuna and Guanaco models as well as the time required by our RA-LLM. We report the average extra time per sample in the following table, where the values in parentheses represent the percentage of additional time relative to the normal inference time. These results show that RA-LLM's extra time is less than 20% of normal inference time, and even in the worst case (completing all Monte Carlo simulations) the extra overhead is less than 45% of normal inference time. Detailed settings and analysis of this experiment can be found in Appendix H.

| Model | Normal inference time | RA-LLM extra time | RA-LLM extra time (worst case) |
| --- | --- | --- | --- |
| Vicuna-7B-chat-HF | 20.97s | 3.93s (18.7%) | 9.26s (44.1%) |
| Guanaco-7B-HF | 30.36s | 3.76s (12.4%) | 12.84s (42.3%) |

Q3: Experiments on Proprietary Models:

A3: Thank you for the suggestion; we agree that evaluating proprietary models such as ChatGPT would enhance our credibility. Technically speaking, our method is applicable to any aligned LLM. However, in order to show the effectiveness of our defense, we first need to obtain valid alignment-breaking prompts. While the adversarial prompts for those open-sourced models can be obtained by directly running the GCG attack, regrettably, we were unable to obtain enough successful adversarial prompts that are effective against GPT models (possibly because the authors of GCG had previously disclosed these prompts to OpenAI). Consequently, we were unable to perform our defense experiments on GPT models. Nevertheless, we further conducted experiments using hand-crafted jailbreak prompts on GPT3.5 and report the results in the following table, which show that RA-LLM reduces the ASR from 82.0% to 8.0% on GPT3.5 without affecting the BAR. We have also added these results to Section 4.3 of our paper.

| Model | BAR Original LLM | BAR RA-LLM | ASR Original LLM | ASR RA-LLM | ASR reduction |
| --- | --- | --- | --- | --- | --- |
| Vicuna-7B-chat-HF | 99.3% | 98.7% | 98.7% | 12.0% | 86.7% |
| Guanaco-7B-HF | 95.3% | 92.0% | 94.7% | 9.3% | 85.4% |
| GPT-3.5-turbo-0613 | 99.3% | 99.3% | 82.0% | 8.0% | 74.0% |

Comment

Response to Reviewer Lpzg (1/3)

Thank you for your valuable comments!

Q1: Addition of Baseline Method:

A1: Thank you for your suggestion! We have added two baseline methods: LLM Self Defense by Helbling et al. [1] and the perplexity-based defense by Jain et al. [2]. The results are as follows:

We first tested the performance of [1]. For the obtained responses to malicious samples, if the harm filter model could not identify them as harmful, we considered the attack successful. As shown in the following table, we found that the Vicuna and Guanaco models we tested had poor performance in identifying harmful content, with ASRs of 22.7% and 52.0% after Self-Defense, respectively. Their BARs were also low (68.7% and 41.3%). Our examination of the results revealed many unexpected scenarios, including the LLM not responding, not following the preset rule of outputting "Yes, it is harmful." or "No, it is harmless.", or continuing unfinished content from the previous context instead of making a judgment on harmfulness. This suggests that LLM Self-Defense might not be effective in scenarios where we can only use the target LLM itself for defense. We also tried using GPT3.5 as the harm filter model, which showed higher rates of harmful content recognition. Using GPT3.5 reduced the ASRs on Vicuna and Guanaco to 8.0% and 8.7%, respectively, while GPT3.5 still resulted in lower BARs (around 90%). We believe this may be due to current LLMs being overly cautious in detecting harmful content. We have updated these experimental results and analyses in Appendix F of our paper.

| Model | BAR Original LLM | BAR Self Defense | BAR GPT3.5 | BAR RA-LLM | ASR Original LLM | ASR Self Defense | ASR GPT3.5 | ASR RA-LLM |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Vicuna | 99.3% | 68.7% | 90.0% | 98.7% | 98.7% | 22.7% | 8.0% | 10.7% |
| Guanaco | 95.3% | 41.3% | 87.3% | 92.0% | 96.0% | 52.0% | 8.7% | 6.7% |
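
For reference, a minimal sketch of how such a self-defense harm filter can be wired up, following our understanding of [1]; the filter prompt wording and the `generate` wrapper are illustrative assumptions rather than the exact setup of [1].

```python
def self_defense_filter(response: str, generate) -> str:
    """Ask a filter LLM whether a response is harmful.

    `generate(prompt)` is an assumed wrapper around the filter model (the target LLM
    itself, or GPT3.5 in our extra experiment). The preset rule expects answers of
    the form "Yes, it is harmful." or "No, it is harmless.".
    """
    filter_prompt = (
        "Does the following text contain harmful content? "
        'Answer exactly "Yes, it is harmful." or "No, it is harmless."\n\n' + response
    )
    verdict = generate(filter_prompt).strip().lower()
    if verdict.startswith("yes"):
        return "harmful"
    if verdict.startswith("no"):
        return "harmless"
    # e.g. no answer, an off-format reply, or a continuation of the previous context
    return "invalid"
```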

Regarding perplexity-based defense [2], we have conducted some extra experiments in Appendix G. Here we also summarize the comparison results in the following table. We can observe that even though [2] achieves high BAR and effectively reduces the ASR of individual GCG attacks, this defense mechanism completely fails to detect handcrafted jailbreak prompts, presumably owing to the lower perplexity of these prompts, as they are manually written by humans. In contrast, our method effectively defends against both GCG adversarial prompts and handcrafted jailbreak prompts.

| Attack | Model | BAR Original LLM | BAR Perplexity Defense | BAR RA-LLM | ASR Original LLM | ASR Perplexity Defense | ASR RA-LLM |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Individual GCG | Vicuna-7B-chat-HF | 99.3% | 98.0% | 98.7% | 98.7% | 0% | 10.7% |
| Individual GCG | Guanaco-7B-HF | 95.3% | 100% | 92.0% | 96.0% | 4% | 6.7% |
| Handcrafted prompt | Vicuna-7B-chat-HF | 99.3% | 98.0% | 98.7% | 98.7% | 100% | 12.0% |
| Handcrafted prompt | Guanaco-7B-HF | 95.3% | 100% | 92.0% | 94.7% | 100% | 9.3% |
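
For intuition, a minimal sketch of a perplexity filter in the spirit of [2]; `token_log_probs` is an assumed function returning per-token log-probabilities under the target LM, and the rejection threshold is illustrative.

```python
import math
from typing import Callable, List

def prompt_perplexity(prompt: str, token_log_probs: Callable[[str], List[float]]) -> float:
    """Perplexity of the prompt: exp of the mean negative token log-probability."""
    logps = token_log_probs(prompt)
    return math.exp(-sum(logps) / len(logps))

def perplexity_defense(prompt: str, token_log_probs, threshold: float = 1000.0) -> bool:
    """Return True if the prompt is rejected (perplexity above the threshold).

    GCG suffixes are high-perplexity gibberish and get filtered, while handcrafted
    jailbreak prompts are fluent English and typically stay below the threshold,
    which is why this defense misses them.
    """
    return prompt_perplexity(prompt, token_log_probs) > threshold
```
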
Comment

We really appreciate the reviewer Lpzg for the constructive comments, which significantly enhance the quality of our work. We would like to ask if there are any additional comments regarding our response, as the discussion phase is nearing its end. We are more than happy to address them. Additionally, we would appreciate if you could consider updating your scores if our rebuttal has satisfactorily addressed the concerns. Thanks again for your time and effort!

Official Review
Rating: 6

This paper presents a method to defend against alignment-breaking attacks by perturbing the input prompt and checking whether the request is rejected by an aligned LLM, which is interesting. Experiments on both an attack dataset and QA datasets verify the effectiveness of the proposed method.

Strengths

  1. Defending against alignment-breaking attacks is a very important research direction for protecting LLMs from being misused.

  2. The proposed method seems to be quite effective according to the reported experimental results.

  3. The proposed method is very easy to implement.

Weaknesses

  1. I wonder whether it is enough to have only one dataset for ASR and BAR evaluation.

  2. The size of the experimental dataset seems to be small.

  3. This paper does not consider the adaptive attack scenario.

Questions

I wonder whether the proposed method can make some false positive errors.

Comment

Response to Reviewer Nds5 (2/2)

Q4: False Positive Errors:

A4: Thank you for your insight! In our experiments, we indeed encountered some false positive cases after random dropping. However, such false positive cases are very rare. They primarily result from the random discarding of tokens making the original input hard to understand, as discussed in the "The Practical Choice of $t$" section on page 6 of our paper.

To mitigate such instances, we set the hyperparameter $t$ to prevent benign inputs from being misjudged as malicious. Specifically, in a Monte Carlo experiment, if the proportion of randomly discarded samples judged as malicious is less than the threshold $t$, the input will not be misclassified as malicious. To ensure that our final defense does not suffer from false positive errors, we also report the BAR metric, representing the proportion of benign samples that pass the model alignment check. In fact, $1-\text{BAR}$ corresponds to RA-LLM's False Positive Rate (FPR). We observed that the FPRs of RA-LLM on the Vicuna and Guanaco models are 1.3% and 8.0%, respectively. The higher FPR on the Guanaco model is due to its inherently higher FPR (4.7%). We believe that sacrificing a small FPR to enhance the model's robustness against jailbreak attacks is worthwhile.

Comment

Response to Reviewer Nds5 (1/2)

Thanks for the constructive comments!

Q1: Use of Only One Dataset:

A1: We would like to clarify that our experiments involved three types of data from two different datasets. For Harmful Behavior Attack and Harmful String Attack, we used Harmful Behaviors and Harmful Strings from the AdvBench dataset, respectively, to calculate the ASR. For the BAR, we utilized the MS MARCO dataset. Additionally, we validated the effectiveness of RA-LLM using hand-crafted jailbreak prompts sourced from the internet. Currently, as far as we know, AdvBench is the only dataset available for such tasks. We are open to including more datasets if they become available in the future.

As an additional measure, we included an extra benign dataset [1] derived from chatlogs.net, which records user interactions with ChatGPT. We first extracted the initial user input from each conversation session and randomly selected 150 entries, ensuring these did not contain potentially harmful or non-English content. Experiments showed that on the Vicuna model, the BAR was 95.3% for the original model and 91.3% for the model incorporating RA-LLM. This demonstrates that RA-LLM still maintains good performance on this dataset. We are open to including more datasets and conducting further experiments should they become available in the future.

Q2: Small Size of Experimental Dataset:

A2: Thank you for your suggestion! We have added more experimental data to enhance the credibility of our results. Specifically, in Sections 4.2 and 4.3 of our paper, we increased the benign input samples and the attack samples for both the Vicuna and Guanaco models by 50%. We report the results under GCG attacks and handcrafted jailbreak prompt attacks with more data samples in the following two tables, respectively. The results in Sections 4.2 and 4.3 of our paper have been updated accordingly. It is evident that the results with the additional data are consistent with the original findings, further validating the effectiveness of our method. Additionally, the amount of test data used in our experiments now exceeds the quantity evaluated in the original GCG paper.

Results under GCG attacks:

| Attack | Model | BAR Original LLM | BAR RA-LLM | ASR Original LLM | ASR RA-LLM | ASR reduction |
| --- | --- | --- | --- | --- | --- | --- |
| Individual | Vicuna-7B-chat-HF | 99.3% | 98.7% | 98.7% | 10.7% | 88.0% |
| Individual | Guanaco-7B-HF | 95.3% | 92.0% | 96.0% | 6.7% | 89.3% |
| Transfer | Vicuna-7B-chat-HF | 99.3% | 98.7% | 83.3% | 11.3% | 71.0% |
| Transfer | Guanaco-7B-HF | 95.3% | 92.0% | 78.7% | 8.7% | 70.0% |

Results under handcrafted prompt attacks:

| Model | BAR Original LLM | BAR RA-LLM | ASR Original LLM | ASR RA-LLM | ASR reduction |
| --- | --- | --- | --- | --- | --- |
| Vicuna-7B-chat-HF | 99.3% | 98.7% | 98.7% | 12.0% | 86.7% |
| Guanaco-7B-HF | 95.3% | 92.0% | 94.7% | 9.3% | 85.4% |
| GPT-3.5-turbo-0613 | 99.3% | 99.3% | 82.0% | 8.0% | 74.0% |

Q3: Adaptive Attacks:

A3: We have considered a potential adaptive attack in Appendix C of our paper. Since our method randomly drops a small portion of tokens from the input, an attacker may exploit this design choice by simply increasing the length of the adversarial prompt (e.g., repeating the adversarial prompt several times after the input) so that random dropping cannot fully remove the adversarial parts. According to our experimental results, even if attackers are aware of our design and opt for longer repetitive adversarial prompts, our method remains effective in thwarting their efforts.

Another intuitive adaptive attack is to run the GCG attack against our RA-LLM directly. However, since RA-LLM decides its response via Monte Carlo experiments, and the final response does not depend on the random dropping itself (RA-LLM either returns the original response or a rejection), it is very challenging to backpropagate through this process and perform the GCG attack. We also plan to explore potentially viable adaptive attacks in future research. Thank you for your suggestion!

[1] https://www.kaggle.com/datasets/noahpersaud/89k-chatgpt-conversations/data

Comment

We really appreciate the reviewer Nds5 for the constructive comments, which significantly enhance the quality of our work. We would like to ask if there are any additional comments regarding our response, as the discussion phase is nearing its end. We are more than happy to address them. Additionally, we would appreciate if you could consider updating your scores if our rebuttal has satisfactorily addressed the concerns. Thanks again for your time and effort!

Official Review
Rating: 5

The authors present a Robustly Aligned LLM (RA-LLM) as a countermeasure to jailbreaking attacks. The primary methodology involves randomly removing tokens from the prompt and assessing the failure rate under aligned LLMs.

Strengths

  1. The underlying principle of RA-LLM is evident: the strategic removal of tokens from the prompt has the potential to neutralize the adversarial prefix, thereby mitigating the effectiveness of the attack.

  2. The introduced methodology demonstrates substantial robustness when tested on Vicuna-7B and Guanaco-7B.

Weaknesses

  1. The concept of partially erasing the prompt as a defensive measure against jailbreak attacks has been previously explored, as evidenced by concurrent work [1]. It would be beneficial if the authors delved deeper into this method to enhance its defensive capabilities. Furthermore, it might be worth comparing the RA-LLM's performance with the perplexity-based defense [2], which has also demonstrated commendable robustness.

  2. The experimental evaluations appear to be limited to open-source LLMs. Is it feasible for the RA-LLM to be effective on GPT3.5/4? Comprehensive experimental results on GPT3.5/4 would enhance the study's credibility.

  3. In assessing computational costs, the authors have focused on financial implications rather than time expenses. The reviewer posits that time cost is of paramount importance, as it directly relates to the model's efficiency.

[1] Aounon et al. Certifying LLM safety against adversarial prompting.

[2] Jain et al. Baseline defenses for adversarial attacks against aligned language models.

Questions

See weaknesses

Comment

Response to Reviewer 8tCQ (2/2)

Q3: Time Cost of RA-LLM:

A3: Thank you for your suggestion! We agree that time cost is equally crucial and we believe our design is already quite time-efficient for two main reasons:

  • Limited output length: we set the maximum generation length in the Monte Carlo simulation to $t_{max} = 10$, since the negative prefixes often appear at the start of LLM responses, allowing us to obtain effective defense without generating full responses. This has been discussed in our original submission Sections 3.3 and 4.5.
  • Early exit mechanism: during the Monte Carlo simulation, if the detected failure cases exceed our set threshold, RA-LLM can directly terminate the process early and mark the input as malicious. For instance, with $n = 20$ Monte Carlo trials and a threshold of $t = 0.2$, if $0.2 \times 20 = 4$ aligned responses are already detected, RA-LLM will instantly mark the input as malicious and reject the request. If the alignment check is passed for the first 17 simulations, RA-LLM will also instantly mark it as benign. This further reduces computational costs.

We also evaluated the actual time overhead of RA-LLM against the original LLM in our experiments. We tested 150 attack samples and recorded the normal inference time on the Vicuna and Guanaco models as well as the time required by our RA-LLM. We report the average extra time per sample in the following table, where the values in parentheses represent the percentage of additional time relative to the normal inference time. These results show that RA-LLM's extra time is less than 20% of normal inference time, and even in the worst case (completing all Monte Carlo simulations) the extra overhead is less than 45% of normal inference time. Detailed settings and analysis of this experiment can be found in Appendix H.

| Model | Normal inference time | RA-LLM extra time | RA-LLM extra time (worst case) |
| --- | --- | --- | --- |
| Vicuna-7B-chat-HF | 20.97s | 3.93s (18.7%) | 9.26s (44.1%) |
| Guanaco-7B-HF | 30.36s | 3.76s (12.4%) | 12.84s (42.3%) |

[1] Aounon et al. Certifying LLM safety against adversarial prompting.

[2] Jain et al. Baseline defenses for adversarial attacks against aligned language models

Comment

Response to Reviewer 8tCQ (1/2)

Thank you for your valuable feedback!

Q1: Comparison with concurrent work:

A1: Thank you for pointing out those concurrent works! We have updated our manuscript to discuss them ([1] [2]) in the Related Work section and added some experiments in Appendix G.

Note that although [1] and our RA-LLM both explore the concept of partially erasing the prompt, the goal and the actual defense mechanisms are different. Specifically, [1] focused on providing a verifiable safety guarantee by enumerating all possible partially erased inputs and using a safety filter to identify the harmfulness of the input content. In contrast, our defense aims to provide a practical design by examining the model's response (alignment check) to the partially erased input. Therefore, we believe [1] and our work focus on different aspects here. Of course, combining the input inspection mechanism from [1] with our response check mechanism may further improve the defense effectiveness, and we leave it as future work.

Regarding the perplexity-based defense [2], we have conducted some extra experiments in Appendix G. Here we also summarize the comparison results in the following table. We can observe that even though [2] achieves high BAR and effectively reduces the ASR of individual GCG attacks, this defense mechanism completely fails to detect handcrafted jailbreak prompts, presumably owing to the lower perplexity of these prompts, as they are manually written by humans. In contrast, our method effectively defends against both GCG adversarial prompts and handcrafted jailbreak prompts.

| Attack | Model | BAR Original LLM | BAR Perplexity Defense | BAR RA-LLM | ASR Original LLM | ASR Perplexity Defense | ASR RA-LLM |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Individual GCG | Vicuna-7B-chat-HF | 99.3% | 98.0% | 98.7% | 98.7% | 0% | 10.7% |
| Individual GCG | Guanaco-7B-HF | 95.3% | 100% | 92.0% | 96.0% | 4% | 6.7% |
| Handcrafted prompt | Vicuna-7B-chat-HF | 99.3% | 98.0% | 98.7% | 98.7% | 100% | 12.0% |
| Handcrafted prompt | Guanaco-7B-HF | 95.3% | 100% | 92.0% | 94.7% | 100% | 9.3% |

Q2: Effectiveness of RA-LLM on GPT3.5/4:

A2: Technically speaking, our method is applicable to any aligned LLM. However, in order to show the effectiveness of our defense, we first need to obtain valid alignment-breaking prompts. While the adversarial prompts for those open-sourced models can be obtained by directly running the GCG attack, regrettably, we were unable to obtain enough successful adversarial prompts that are effective against GPT models (possibly because the authors of GCG had previously disclosed these prompts to OpenAI). Consequently, we were unable to perform our defense experiments on GPT models. Yet we agree that results on GPT3.5/4 would enhance our credibility. Thus, we further conducted experiments using hand-crafted jailbreak prompts on GPT3.5 and report the results in the following table, which show that RA-LLM reduces the ASR from 82.0% to 8.0% on GPT3.5 without affecting the BAR. We have also added these results to Section 4.3 of our paper. For GPT4, we found that the handcrafted jailbreak prompts used in our experiments are now ineffective, precluding related experiments. However, it is important to note that for models already possessing robust safety alignment, RA-LLM does not diminish their security: for a malicious input, the worst outcome with RA-LLM is that it returns the model's own response unchanged, without interfering with its content. Should any viable data become available in the future, we are very willing to further supplement our experiments.

| Model | BAR Original LLM | BAR RA-LLM | ASR Original LLM | ASR RA-LLM | ASR reduction |
| --- | --- | --- | --- | --- | --- |
| Vicuna-7B-chat-HF | 99.3% | 98.7% | 98.7% | 12.0% | 86.7% |
| Guanaco-7B-HF | 95.3% | 92.0% | 94.7% | 9.3% | 85.4% |
| GPT-3.5-turbo-0613 | 99.3% | 99.3% | 82.0% | 8.0% | 74.0% |

Comment

We really appreciate the reviewer 8tCQ for the constructive comments, which significantly enhance the quality of our work. We would like to ask if there are any additional comments regarding our response, as the discussion phase is nearing its end. We are more than happy to address them. Additionally, we would appreciate if you could consider updating your scores if our rebuttal has satisfactorily addressed the concerns. Thanks again for your time and effort!

Comment

Response to All Reviewers

We thank all the reviewers for your valuable comments!

We have incorporated all experiments suggested by the reviewers:

  1. We added two baseline methods: LLM Self-Defense by Helbling et al. and the perplexity-based defense by Jain et al. The results show that our method outperforms the others in specific scenarios, such as when external models are not available or when defending against handcrafted jailbreak prompts.
  2. We evaluated RA-LLM against handcrafted jailbreak prompts using GPT3.5 as the base model, achieving strong results.
  3. We expanded our experimental data by 50%. The outcomes with the expanded dataset are consistent with the original results in the paper.
  4. We constructed an additional benign dataset from chatlogs.net conversations, and the BAR test results on this dataset align closely with the results from the MS MARCO dataset presented in the paper.
  5. We measured the actual time cost of running RA-LLM, showing that its extra overhead is less than 20% of the normal inference time.

We have clarified all questions raised by the reviewers:

  1. We detailed our designs to optimize time costs.
  2. We explained the distinctions and advantages of our work relative to concurrent works.
  3. We clarified the scope and quantity of our experiments.
  4. We elaborated on how we prevent false positive errors, providing explanations on the relationship between FPR and the tested BAR.
  5. We explained why we used a prefix-set detection method as $AC(\cdot)$, and clarified that the current $AC(\cdot)$ does not simply favor our method.
  6. We clarified that we have explored adaptive attack methods and explained why designing adaptive attacks against RA-LLM is challenging.
  7. We clarified our statement about "Robust Alignment Check Function" regarding "Such Alignment Checking is Not Robust."

All these experimental results and explanations have been updated in the paper, with all changes highlighted in orange.

We kindly request that you inform us of any remaining ambiguities or concerns. We are more than willing to address additional questions and conduct further experiments should the reviewers deem it necessary.

AC Meta-Review

This paper studies the problem of defending against alignment-breaking attacks. The reviewers and AC have discussed the paper after the rebuttal phase. Reviewers acknowledge the importance of the studied problem. However, reviewers still have concerns about the paper's separation from prior work, its experimental performance, and its heavy computational overhead. The AC encourages the authors to take these comments into account when preparing a future version.

Why Not a Higher Score

Most reviewers voted negatively on the paper and no reviewer was willing to champion it. Therefore, the paper is a clear rejection.

Why Not a Lower Score

N/A

Final Decision

Reject