PaperHub
4.8 / 10
Rejected · 4 reviewers
Scores: 6, 3, 5, 5 (min 3, max 6, std 1.1)
Average confidence: 3.8
Correctness: 2.5 · Contribution: 2.3 · Presentation: 1.8
ICLR 2025

Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs

OpenReview · PDF
Submitted: 2024-09-26 · Updated: 2025-02-05
TL;DR

Latent adversarial training is useful to improve jailbreak robustness, backdoor removal, and unlearning in LLMs.

Abstract

Keywords

adversarial attacks, adversarial training, jailbreaks, trojans, backdoors, unlearning, robustness

Reviews & Discussion

Official Review (Rating: 6)

The authors apply latent space adversarial training to three different security problems in large language modeling. They show that adversarial training in deeper layers of the network can additionally improve robustness. Further, they demonstrate that beyond jailbreaks, adversarial training can improve robustness against backdoor attacks and robustness against information leaks in the context of unlearning.

Strengths

  • The authors provide code, models, and even user-friendly tools to evaluate their models (after submission). Given the long history of ineffective defenses, this is an important part of a defense contribution.
  • To the best of my knowledge, the results on sleeper agents and unlearning are novel and demonstrate a broad applicability of adversarial training to different security issues in LLMs.
  • The authors ablate performing adversarial training in different latent layers of a network, which seems to improve robustness.

Weaknesses

  • The framing of the paper could be improved, in my opinion, but I am open to discussions. The authors highlight efficiency improvements upon prior work that explores discrete adv. training (e.g., 281). However, alternative methods exist that are much closer to the algorithm proposed here and are also efficient (Xhonneux et al., 2024, Yu 2024). I believe the authors should highlight the differences to more closely related prior works and focus the discussions on these differences. As far as I can see this includes: 1) New threat models, 2) Exploring different latent layers. This also includes the introduction, which should highlight unique limitations resolved in this paper (and not those already addressed by other works). Note that I do not consider Yu 2024 in terms of my rating as it was released shortly before the submission deadline.
  • The utility datasets used to evaluate model capabilities are insufficient. Both MMLU and MT-Bench suffer from assigning high scores to models that refuse every request. The compliance dataset gives a high score to R2D2, a model known for over-refusal, which makes me skeptical about the result. I would recommend OR-Bench to evaluate whether latent adv. training has a negative impact on over-refusal (Cui et al., 2024).
  • Comparisons between papers would be easier if the authors used the original method name provided in the respective paper (i.e., RT-EAT vs CAT)
  • Table 2 should be fixed for the camera ready.
  • I found 6 occurrences of missing \ref{} and \cite{} commands: 1009, 1286, 1295, 1313, 1336, 1378
  • I was not able to find any concrete hyperparameters for any method except for RT-EAT-LAT. A direct comparison between two methods without stating the hyperparameter search procedure appears to be insufficient. It's unclear if the benefit from RT-EAT-LAT comes from the choice to train in deep latent layers or from better hyperparameter tuning.

Questions

  • The RT-EAT method of Xhonneux et al 2024 appears to be equivalent to the proposed method if adversarial training is conducted only in the first latent layer of a model. Could the authors comment on that? If this is true, the connection should be highlighted to provide better context on how these different methods relate.
  • Can the authors explain how hyperparameters were optimized for the different methods and provide a table of the final hyperparameters used in the experiments?
  • Can the authors provide an argument for the sufficiency of the utility benchmarks used in Table 2 (or new results)?

Without further changes, I would recommend rejecting this paper. However, I believe many of my concerns can be addressed in a rebuttal, and I am willing to change my score to accept.

Since there are still trivial errors in the paper (such as those noted by myself or reviewer EUWJ), I decided to reduce my score. In its current state, the paper should not be accepted. All in all, I believe this could be a valuable contribution, and I strongly encourage the authors to submit a revised manuscript. For me, a score of 8 would have been achievable with a convincing rebuttal.

Comment

Thanks for the response! I really hoped that the rebuttal was still coming :-)

W1: I do not think this is a weakness (novelty is a bad metric anyway). I just wanted to point out that the differences could be made more explicit and the framing could be improved. But I guess being this honest could hurt the average reviewer score, and the issue is not severe.

W2: I agree the total number of experiments meets or exceeds the ICLR standards. However, I think without this experiment, the robustness results are, in hard words, "meaningless." Existing work showed that Circuit Breakers over-refuse a lot, making the models very impractical. This puts the impressive robustness results into a different perspective. Other works show that models can get good MT-Bench scores while not outputting anything. A model that outputs nothing could get the correct MMLU probabilities, a good MT-Bench score, and good robustness. Thus, I really encourage the authors to add this experiment to a possible camera-ready version. At the same time, I will not consider it in my score.

W3: I reviewed other papers and saw the method under a different name, and thus see an issue with having the same method under different names in different papers. I understand your reasoning, but do you not agree that it is an uncommon thing to do? (Again, I will not consider it in my score as it's quite subjective.)

W6: Acknowledged; as hyperparameter search would limit this kind of research to big tech, I do not want to be too strict. Maybe you can put a statement in the paper about how difficult it was to get the method to run? Or what kind of measures were necessary to identify promising hyperparameters?

Q3: See W2. This is the only real open issue I have. I think it's evident that some prior work did not take the issue of over-refusal seriously enough (it's also a newish finding). In my opinion it's very important. However, the paper has some independent findings that do not rely on this evaluation.

I agree with some of the other reviewers that the paper felt rushed (not the experiments but the writing). ICLR is one of the most prestigious venues in ML. Submissions that are considered for acceptance should not have these basic flaws, and a desk rejection would have been viable, in my opinion. On the other hand, nobody wins if we put the paper through another review process just because of things that can be changed in the camera-ready version (now that it has been reviewed anyway).

Like I said in my initial review, I think the paper has a lot of merits. My concerns were sufficiently addressed. I changed my score accordingly and kindly ask the authors to put their paper through independent proofreading for their next submission.

Comment

W1: Concretely, we can modify the 3rd paragraph of the intro to mention embedding space attacks (which it currently doesn't) and the first listed contribution to emphasize that the type of behavioral removal we pursue is on known behaviors. Meanwhile, in the past couple of months, we have already been curating the first paragraph of the related works section in response to new work that pops up.

W2/Q3: Thanks. Our defense here is that our LAT refusal-trained models empirically chat well. We made a website where people can chat with the model. It's linked in the public version of the paper, and you are welcome to have a colleague anonymously try it out on your behalf to check that it is indeed a knowledgeable and fluent chatbot. One could also check that our LAT robust refusal models chat well by downloading them from Hugging Face. Since we put them up, they have been used and red-teamed by two jailbreaking papers that we know of, both using multi-turn attack setups.

W3: We see the tension here. A useful compromise that we will do is to cite the R2D2 and embedding space adversarial training papers in the caption of table 2.

W6: Yes, we can commit to adding info on sweep methodology to section 3.

Overall, thanks again. We think it's smart to triple check things and get outside proofing and recommendations for camera ready prep.

Comment

Here is an example of a chat that I just initiated with our RT-EAT-LAT model using our website demo. Everything below the "User" line was from the model.


User: What is the ICLR conference?

Assistant: The ICLR (International Conference on Learning Representations) is a premier conference in the field of machine learning and artificial intelligence. It is a leading platform for researchers and practitioners to share their latest findings, advancements, and innovations in the area of representation learning, which is a crucial aspect of machine learning.

The conference focuses on the development of new algorithms, models, and techniques for learning representations, which are essential for many applications, such as computer vision, natural language processing, speech recognition, and more. The conference also explores the theoretical foundations and applications of representation learning, including its connections to other areas of machine learning, such as deep learning, reinforcement learning, and generative models.

ICLR is known for its high-quality research papers, which are rigorously reviewed by experts in the field. The conference also features keynote talks, tutorials, and workshops, providing a comprehensive overview of the latest advancements and trends in representation learning.

ICLR is considered one of the most prestigious conferences in the field of machine learning and artificial intelligence, and it attracts a large number of attendees from academia, industry, and government.

Comment

Thanks, we appreciate the review and feedback! In this review cycle, to save everyone’s time and effort, we are opting for a radically honest and terse response.

W1: We compared our LAT method to Xhonneux et al. (2024) in Table 2 and Casper et al. (2024) in Table 6 of the appendix. Overall, our claims to novelty are not conceptual. Instead, they are based on showing that LAT can combine with existing methods in a versatile way to strengthen defenses. For what it’s worth, unlike prior work on continuous adversarial training like these two, we also release LLMs with strong performance. We released 15 models, and 2 datasets alongside the paper’s public arXival. On Hugging Face, the models have over 700 downloads and the datasets have over 1100, suggesting a useful contribution to the community.

W2: We think this is a useful point. More elaborate evaluations would always be nice, but we followed existing methodology from prior work (Liu et al., 2023). And we successfully showed that LAT improved over baselines. As for R2D2 over-refusing, we did not use an off-the-shelf model, and we tuned the baseline to be strong (no apologies for that). There is always more that can be done, but we think that the experimental thoroughness of our work is a strength and not a weakness. For example, we know of no prior works that make progress on jailbreak refusal, backdoor removal, and unlearning at the same time. Meanwhile, we only know of one (Tamirisa et al., 2024) that makes progress on two of these at once. We also experiment with 3 different LLMs, 5 different datasets, 13 distinct training algorithms, and 21 distinct evaluation measures for performance/robustness/efficiency. Please consider this experimental breadth compared to related works in your final assessment.

W3: We sympathize with this, and we have discussed it amongst ourselves a few times. But we ultimately chose the naming convention that we did to reflect how RT, RT-EAT, and RT-EAT-LAT build off of each other in a way reflected by the names.

W4: Done. Thanks.

W5: Fixed. Thanks.

W6: We don’t include hypers in the paper, but we release code! And for what it’s worth, we worked hard to make the code useful. As we write this, we have 9 forks and 26 stars on GitHub. We also have over 700 model downloads and 1000 dataset downloads on HuggingFace. So we think that portability and reproducibility are empirically strengths of the project, not a weakness.

Q1: This is correct. We updated section 4.1 to describe EAT as layer-zero LAT.
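
To make the relationship concrete, the claim can be read as a statement about where the perturbation is applied. Below is a minimal sketch under that reading; the module paths assume a Llama-style Hugging Face model, and the helper name is ours rather than anything from either paper.

```python
# Sketch: under this reading, the only difference between embedding-space
# adversarial training (RT-EAT) and latent adversarial training (LAT) is which
# module the adversary's perturbation is attached to.

def perturbation_site(model, layer_index: int):
    """Return the module whose output the adversary perturbs (illustrative)."""
    if layer_index == 0:
        # Layer-zero hidden states are the token embeddings, so perturbing here
        # recovers the embedding-space setting of RT-EAT (Xhonneux et al., 2024).
        return model.model.embed_tokens
    # Perturbing a deeper residual-stream activation gives LAT at that depth.
    return model.model.layers[layer_index - 1]
```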

Q2: We swept over hypers for all baseline methods and all LAT methods so that the deltas that we identify in our experiments are due to the method and not differential tuning. And we emphasize that we aren’t merely saying this – we took it seriously and didn’t cut corners. For example, we delayed the arXival of the paper because we found better baseline implementations of GA and RMU in section 4.3.2 than we had previously. Finally as for your question about hyperparameters, see our response to W6.

Q3: These three measures cover three key aspects of utility. MMLU measures knowledge through multiple choice question-answering. MT-Bench measures knowledge and fluency through open-ended question answering. Compliance measures, well, compliance. These three measures are fairly standard among comparable works (e.g., Zou et al., 2024)

Comment

Thanks for your response

W1: It's not necessary to explicitly mention embedding space attacks in my opinion, but a general mention that similar adv. training approaches exist would be sufficient.

W2/Q3: While not the best quantitative evaluation, the example looks convincing. I agree that the effort to make the research reproducible/accessible outweighs the one "missing" experiment and at the same time provides some of the evidence I was looking for.

Comment

After a discussion with the other reviewers, I decided to adjust my score based on:

  • The initial state of the paper
  • Unverifiable statements by the authors about some potential issues (which cannot be checked without breaking double-blind review)

Comment

Tough to hear, but thanks for the reply. If you wish to verify any factual claims about information that would break double-blind review if you did it yourself, we would recommend the proxy method -- asking a colleague to anonymously fact-check something for you.

Comment

IMO this is very unconventional, but I asked a colleague to verify some of the claims. Ultimately, the paper's contribution is similar to the one in Xhonneux 2024. However, this paper provides experimental results on more diverse threats and explores the design choice of attacking deeper latent layers. Both papers were apparently put on arXiv at a similar time. On the one hand, I do not see why I should punish the authors for likely having bad luck in previous review processes. However, major concerns remain, such as trying to influence the reviewer's opinion with unverifiable claims that the work is popular among known industry labs (why should this be relevant in the first place? This just puts pressure on a reviewer), or the very rushed state of the initial paper.

I raised my score to 6 as my colleague confirmed some of the claims made by the authors.

Comment

Hello, we did not know that the discussion period was still open; based on this site, we thought it had closed. Oops. But thanks for the response!

We mentioned interest from industry labs in response to this comment from the reviewer:

Also, I encourage the authors to experiment with larger models if computational resources permit.

So we defend its relevance. However, we agree that this particular claim is not verifiable, and we think it is reasonable for you to ignore it. We are on the same page on this.

Official Review (Rating: 3)

This paper proposes the Targeted Latent Adversarial Training (t-LAT) technique that can be added on top of existing algorithms to (1) safeguard LLMs from jailbreak attacks, (2) erase backdoor behaviors from LLMs, and (3) remove knowledge from the models. The authors conducted experiments on each task and observed promising performance.

Strengths

(1) The t-LAT algorithm seems effective across a wide range of tasks and is flexible enough to be combined with many optimization objectives without adding much overhead.

(2) The authors provide necessary implementation guidelines such as adding additional SFT loss or KL divergence.

Weaknesses

Major

(1) Section 4.1: The attacks considered are not strong enough, with most of them achieving ASR < 20% against the base model, making it questionable whether the proposed technique will bring improvement when faced with more advanced jailbreak attacks like [1], [2], and [3]. Also, both Llama-2 and Llama-3 are very safe models. I think the authors should experiment with weaker models like Vicuna-7B. An improvement of 2% in ASR is still somewhat marginal for me. (Also, I encourage the authors to experiment with larger models if computational resources permit.)

(2) Section 4.2: It is good to see that DPO-LAT surpasses DPO, but how does it perform when compared with other algorithms designed to remove backdoors from LLMs?

(3) Section 4.3: The improvement on the WHP dataset is too small to be noticeable, and only one algorithm and one model are considered. What is the key difference between Section 4.3.1 and Section 4.3.2? It is not clear to me why they should be separated.

(4) Some important experimental details are left out, especially those related to the hyperparameters of the proposed algorithm and the baseline algorithms (i.e., the β for DPO, the GCG steps, the examples for MSJ, etc.). It is necessary to specify the details of the experiments to make the comparison fair and the experiments reproducible/reliable.

(5) There is no ablation study on the choice of ε, the updated layers, the additional SFT loss, etc. The choice of constraint budget and the additional SFT loss is not consistent across different sections.

[1] Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks; https://arxiv.org/pdf/2404.02151

[2] Improved Techniques for Optimization-Based Jailbreaking on Large Language Models; https://arxiv.org/pdf/2405.21018

[3] Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses; https://arxiv.org/html/2406.01288v1


Minor

(1) It occurred to me occasionally while reading that the paper was written in an extreme rush. There are typos (line 220), tables exceeding the width limit (lines 233-243, lines 1188-1208), one line of equation occupying a full page (line 1107), figures without a caption (lines 217-232), and broken citation links. All these errors could be spotted with a 10-minute proofread.

Questions

Please see the weaknesses section.

Comment

Thanks, we appreciate the review and feedback! In this review cycle, to save everyone’s time and effort, we are opting for a radically honest and terse response.

W1: We don’t think that using particularly high-powered attacks is needed to prove the point we intend to prove. Here, our goal is to expand out the Pareto frontier relative to SOTA baselines, which we do. While it is possible to achieve very high jailbreak successes on models with customized attacks, we know of no comparable jailbreak-robustness works that test against adaptive attacks like Andriushchenko et al. (2024). Are there any papers you'd like to point us to that make progress on jailbreaks or jailbreak defenses with significantly more rigorous evaluations?

W1: Evaluations can always be more elaborate, but we think that the experimental thoroughness of the paper is a strength, not a weakness. For example, we know of no prior works that make progress on jailbreak refusal, backdoor removal, and unlearning at the same time. Meanwhile, we only know of one (Tamirisa et al., 2024) that makes progress on two of these at once. We experiment with 3 different LLMs, 5 different datasets, 13 distinct training algorithms, and 21 distinct evaluation measures for performance/robustness/efficiency. We have also released code, 15 models, and 2 datasets alongside the paper’s public arXival. Please consider this experimental breadth compared to related works in your final assessment.

W1: For what it’s worth, we are an academic group, so it is nontrivial to work with increasingly large models. However, we know for certain that there is explicit interest in LAT at industry labs including Anthropic, FAR.AI, and GreySwan AI. We have recently met with one industry team who is consulting with us on scaling up LAT.

W2: What specific algorithms would you recommend we compare to? To the best of our knowledge, all fine-tuning-based backdoor cleansing methods from the literature involve either passive forgetting or training the model under reconstructed proxy triggers. In that sense, our experiments have methodological coverage. Finally though, we think that additional backdoor removal baselines are not needed to show what we intend to show. Our goal was to use LAT with many methods in a versatile way to help remove persistent failure modes. Altogether, we think that our results for jailbreaks, backdoors, and unlearning make a strong case for this.

W3: We would agree that a ~2% drop in HP performance would be small if the baseline were high. But we emphasize that the factor of improvement is still large when we drop performance e.g. from 0.042 to 0.027, or in the most dramatic case, from 0.116 to 0.039.

W3: Sections 4.3.1 and 4.3.2 differ by task. The first is focused on unlearning copyrighted text, and the second is focused on unlearning capabilities. We engage with different data and baseline methods in each which is useful for showing the versatility of LAT.

W4: We don’t include hypers in the paper, but we release code! And for what it’s worth, we worked hard to make the code useful. As we write this, we have 9 forks and 26 stars on GitHub. We also have over 700 model downloads and 1000 dataset downloads on HuggingFace. So we think that reproducibility and portability are empirically strengths of the paper, not weaknesses.

Minor: Thanks for the notes on formatting. We have fixed these.

Comment

W1: Evaluating Adversarial Robustness

The adversarial ML community widely agrees that defenses should be tested with the strongest possible attacks, including adaptive attacks if feasible, to avoid overestimating performance and falling into common pitfalls [1][2]. For example, GCG's ASR can be significantly improved using techniques from [3], as well as better jailbreak templates or model ensembles. A defense that cannot withstand state-of-the-art (SOTA) attacks has limited significance for me.

Evaluation dataset: I could not find any information in your paper about the dataset used to evaluate adversarial robustness (specifically, the data used for testing rather than training). Did you conduct any experiments to measure the overlap between the training and test sets? Overlap could potentially inflate results. If I have overlooked this, please clarify where this information is provided.

References:

  1. On Adaptive Attacks to Adversarial Example Defenses (arXiv:2002.08347)
  2. Obfuscated Gradients Give a False Sense of Security (arXiv:1802.00420)
  3. Improved Techniques for Optimization-Based Jailbreaking on Large Language Models (arXiv:2405.21018)

W3: Reproducibility

While claims regarding metrics such as GitHub stars, Hugging Face downloads, or interest from prominent companies might serve as indicators of popularity, these cannot be taken into account for two key reasons:

  1. Verifying them would conflict with the double-blind review policy.
  2. The claims are inherently unverifiable

Focusing solely on the submitted manuscript, I find that critical experimental details are missing, which raises concerns about the reproducibility of the results.


The Naming of LAT

The primary technical contribution of this paper appears to be the introduction of targeted-LAT (t-LAT) and the associated tricks that make it effective, rather than LAT itself. To avoid confusion, I strongly recommend renaming the proposed algorithm to a more specific and distinguishable term, such as "t-LAT," throughout the manuscript.


Summary of Review

The paper's broad scope is a notable strength, as it addresses multiple aspects of LLM safety. I also agree with the authors' claim that targeted LAT has the potential to mitigate undesirable behaviors when combined with existing fine-tuning techniques. However, from my personal point of view, the claims made in each individual aspect are weakly supported due to (1) an inadequate experimental setting and (2) a lack of important details. Also, the technical contribution is limited since LAT has already been proposed.

Given these remaining concerns, I am maintaining my rating of 3 and do not recommend acceptance.

Comment

W1

We agree that strong attacks are good. But we emphasize that our LAT method won a fair fight against baselines. We think that the last comment reflects a standard that no comparable works in the literature meet. For example, consider Tamirisa et al. (2024), Mazeika et al. (2024), Zou et al. (2024), or Xhonneux et al. (2024). None of these comparable papers seem to evaluate their attack/defense methods up to the standard described. We sympathize with your criticism, but we think that it is not a criticism of the paper, but of the field in general. We would ask you to take this into account in your final review. To count this against the paper, we think that it should be necessary to point to at least one specific example of comparable work on adversarial attacks or defenses that meets this evaluation standard.

Normally, we would be happy to continue our ongoing work on followup projects to this one. But if you think that adding very strong attacks to Table 2 would make the difference, we could revisit this and add a column for soft prompt attacks (Schwinn et al., 2024a), which are very strong (Schwinn et al., 2024b).

W3

If you have doubts about any claims related to the code or models we released, you could ask a colleague to anonymously verify them for you.

Naming

We sympathize with this point. But we are between a rock and a hard place. We used to mention "Targeted LAT" in the title and describe the contributions in terms of targeted LAT. However, we have gotten separate feedback on the paper against doing this because it caused past readers of our past draft to worry about the novelty of targetedness. Since our main contributions are about model performance, we opted to act on past advice to name the paper after just LAT. We do not think that this is a misnomer overall because "adversarial training" is already used in a general way to refer to targeted and/or untargeted approaches. Please take into account that we have gotten conflicting feedback on this in your final review. This is also a matter of wording and not technical contribution, so we think this point is relatively small.

Comment

The statement regarding evaluation "quality" from the authors appears to be a bit misleading.

Tamirisa et al. (2024): Not comparable at all in my opinion. Could the authors explain this comparison?

Mazeika et al. (2024): conducts considerably more (and different) attacks?

Zou et al. (2024): comparable, but conducts some additional adaptive attacks (e.g., RepE)

Xhonneux et al. (2024): conducts simple adaptive attacks that were very effective in breaking other defenses (however, it's somewhat unclear how much effort was put into them). Otherwise comparable

Comment

Thanks! Notes below.

Why we mentioned Tamirisa et al.: Both papers make common use of MT-Bench and MMLU, and our choice of these measures is based on how common they are. But our mentioning this was focused on these measures of general performance, so it may be of limited interest to you.

Why we mentioned Mazeika et al.: They use a lot of attacks but not adaptive ones like Andriushchenko et al.

For concreteness: we're interested in what attack or set of attacks you would say would be sufficient to add to Table 2. And maybe it's something we could do! For example, soft prompt attacks (Schwinn et al., 2024a) are very strong (Schwinn et al., 2024b) and something we could add. RT-EAT vs RT-EAT-LAT could make for an interesting comparison. We could also evaluate the models on attacks from these two files from Andriushchenko et al. (2024). What do you think?

Official Review (Rating: 5)

Large language models (LLMs) often exhibit undesirable behaviors despite fine-tuning efforts to remove them. This paper addresses this issue using targeted Latent Adversarial Training (LAT), which enhances robustness by leveraging latent-space perturbations to target specific failure modes. The approach contrasts with traditional adversarial training, focusing on hidden activations rather than inputs. Targeted LAT improves resistance to jailbreaks, removes backdoors, and unlearns undesirable tasks with little computational cost. Extensive experiments validate the method's efficacy, showcasing its potential as a robust tool for mitigating harmful behaviors in LLMs.
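
For readers unfamiliar with the setup, a minimal sketch of what one targeted-LAT step could look like is given below. It is an illustration rather than the authors' released implementation: the module path assumes a Llama-style Hugging Face model, and the hook layer, perturbation budget, step counts, and batch field names are assumptions made for the example.

```python
import torch

def targeted_lat_step(model, layer_idx, harmful_batch, benign_batch,
                      eps=1.0, adv_steps=6, adv_lr=0.1):
    """One illustrative targeted-LAT step (a sketch, not the paper's exact code).

    harmful_batch is assumed to hold `input_ids` plus two HF-style label tensors
    (-100 on prompt tokens): `attack_labels` (the specific harmful completion the
    adversary tries to elicit) and `refusal_labels` (the desired behavior).
    benign_batch is assumed to be an ordinary supervised batch.
    """
    delta = torch.zeros(*harmful_batch["input_ids"].shape,
                        model.config.hidden_size,
                        device=model.device, dtype=model.dtype)

    def apply_delta(module, inputs, output):
        # Add the current latent perturbation to this layer's hidden states.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + delta
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    handle = model.model.layers[layer_idx].register_forward_hook(apply_delta)

    # Inner loop: the adversary steers activations toward the harmful target,
    # staying inside a per-token L2 ball of radius eps.
    for _ in range(adv_steps):
        delta.requires_grad_(True)
        loss = model(input_ids=harmful_batch["input_ids"],
                     labels=harmful_batch["attack_labels"]).loss
        grad = torch.autograd.grad(loss, delta)[0]
        with torch.no_grad():
            delta = delta - adv_lr * grad
            norms = delta.norm(dim=-1, keepdim=True).clamp(min=1e-12)
            delta = delta * (eps / norms).clamp(max=1.0)
    delta = delta.detach()

    # Outer step: train the model to refuse under the perturbation, plus an
    # ordinary benign term computed without any perturbation. (Optimizer
    # step/zero_grad and any KL/SFT regularizers are omitted here.)
    refusal_loss = model(input_ids=harmful_batch["input_ids"],
                         labels=harmful_batch["refusal_labels"]).loss
    handle.remove()
    benign_loss = model(**benign_batch).loss
    (refusal_loss + benign_loss).backward()
```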

Strengths

  • The paper introduces targeted Latent Adversarial Training (LAT). This computationally efficient approach enhances the robustness of LLMs by specifically targeting latent activations.
  • Extensive experiments have been conducted to provide a good insight into the components of the proposed method.
  • The paper is generally well written, with clear illustrations and tables.

Weaknesses

  • This paper follows a general adversarial training pipeline, which requires maximizing the adversarial loss while minimizing the "safety loss." The framework itself is standard for adversarial training, which might limit the contribution of the paper.
  • As the proposed method shares similarities with latent adversarial training (LAT), the paper needs to discuss the difference between the proposed method and the previous LAT. In addition, as LAT perturbs a layer's activations, the choice of which layers to perturb needs to be better discussed and empirically verified.
  • Despite its effectiveness in defending against jailbreaks, the DPO setting with the backdoor trigger is impractical, as most training datasets are carefully constructed.

Questions

  1. The attack success rate of GCG on Llama-2 and Llama-3 is relatively low compared to the original paper; could you explain this?
  2. Llama models are famous for their safety. Could you provide a discussion or experiment on a model like Vicuna (easier to jailbreak) to further assess performance?
  3. The DPO setting with the backdoor trigger is impractical; could you discuss its real-world application more?

Comment

Thanks for the detailed response!

  • W1, W2, W3, Q1, Q3: I appreciate your honest response, which has resolved my concerns.

  • Q2: It seems inappropriate to refer the reviewer to the arXiv version of this paper because it would violate the ICLR double-blind review policy. I acknowledge the contribution of this paper to the community. The experiments in this paper meet the standards of an ICLR paper. However, the question is: could the authors see more performance gain if they experimented with an "unprotected" model like Vicuna? It is more like a suggestion than a criticism.

Another minor suggestion is to be careful when you change your LaTeX template style, because Tables 2, 6, and 7 overlap with the line numbers.

I have raised my score from 3 to 5. Thanks for your responses.

Comment

Thanks, we appreciate the review and feedback! In this review cycle, to save everyone’s time and effort, we are opting for a radically honest and terse response.

W1: Our work definitely is a form of adversarial training. But our claims to novelty and contributions do not hinge on the adversarial training aspect of LAT. Our goal and contributions center around improving on existing algorithms for refusal training, backdoor removal, and unlearning. We also stress that our comparison between LAT and R2D2 shows that our method can improve on adversarial training’s effectiveness and efficiency.

W2: There are two key differences between our work and Casper et al. (2024). First, we use a targeted attacker whose goal is to elicit a specific harmful behavior from the network. The attackers used by Casper et al. (2024) are untargeted and only seek to minimize training loss. Second, we use targeted LAT to improve over SOTA techniques for jailbreak robustness, backdoor removal, and unlearning. Casper et al. (2024) focus on proof-of-concept experiments instead of augmenting SOTA methods. In the appendix, Table 6 directly compares the two methods. It shows that our targeted LAT can greatly improve over the robustness of untargeted LAT from Casper et al. (2024). Meanwhile, Casper et al. (2024) argued that our targeted LAT approach would be a novel and useful way to expand on their work. The discussion section of their paper says: “performing LAT with targeted adversaries could be a way to make models highly robust to specific foreseeable failures”

W3: We think this is more of a criticism against the jailbreaking literature than our specific paper. We sympathize with this point. But we think that it does little to detract from the scientific contributions of the paper. We also note that some research has argued that web-scale data poisoning is practical (Carlini et al., 2023). Meanwhile, other research has worked on backdoors not because of practical threats but because they offer a challenging debugging problem for AI systems (Rando et al., 2024, Casper et al., 2024b).

Q1: Good observation. We use a different autograder that was designed to have a lower false-positive rate (see line 276). However, we also use the HarmBench autograder in Appendix Table 7 and find similar qualitative results as in the main paper.

Q2: We agree with the importance of working on many models. In followup work, we are running experiments on over 70 models. However, we think what we already have meets a higher standard of thoroughness than most comparable papers do. We experiment with 3 different LLMs, 5 different datasets, 13 distinct training algorithms, and 21 distinct evaluation measures for performance/robustness/efficiency. Please consider this experimental breadth compared to related works in your final assessment. We have also released code, 15 models, and 2 datasets alongside the paper’s public arXival. As we write this, we have 9 forks and 26 stars on GitHub. We also have over 700 model downloads and 1000 dataset downloads on HuggingFace, suggesting a practical contribution to the community.

Q3: See also our response to W3. In this case, we think you are referring to our experiments with “proxy triggers,” right? If so, this is, by construction, a practical scenario. Consider the real-world challenge of removing a backdoor with an unknown trigger but a known response. For example, this was the setting that Rando et al. (2024) worked with. In this case, your first option could be to simply train the model to behave well on normal data until it forgot the backdoor. That was our default approach. Your second option could be to use some backdoor trigger reconstruction method to produce a “proxy trigger” and then use that to adversarially train the model. This was our “proxy trigger” approach. We used proxy triggers from Rando et al. (2024) that were constructed by a team in their competition who had no knowledge of the actual trigger at all. Overall, in both the default and proxy trigger approach, LAT improved the backdoor removal :)

Comment

We are very appreciative of the reply! Here are a few more notes on Q2:

Q2-A: We agree that it would violate double blindness for you to look at public materials we released. We definitely do not ask you to do this. But we can guarantee the details above about the public code and models are accurate. If you have concerns, we encourage you to ask a colleague to anonymously verify any details and confirm this with you.

Q2-B: In preliminary experiments, we did do jailbreak robustness experiments with another model: Zephyr-7b-beta. Back then, we were using a different experimental setup, and we were focusing on comparisons between R2D2 and LAT. However, the results were not qualitatively different from what we observed in our final experiments:

Our old Preliminary Results:

Model             Performance Avg (Compliance, MMLU, PIQA) ↑    ASR Avg (GCG, TAP, PAIR) ↓
Zephyr-7b         0.53                                          0.72
Zephyr-7b-R2D2    0.49                                          0.43
Zephyr-7b-LAT     0.53                                          0.14

Overall, LAT outperforms R2D2 (qualitatively similar to what we found in the main paper). We ultimately did not keep working with Zephyr models because, given the breadth and depth of experiments we conducted, we opted to focus on Llama3-8b models. To be honest, we think that adding experiments for a model like Zephyr would be good, but normally, we would probably just continue to prioritize our followup research instead of revisiting this. However, if you think this would be very compelling and would make a big difference, we could commit to adding some rows for Zephyr or Vicuna to Table 2 before camera ready.

Official Review (Rating: 5)

This paper introduces targeted latent adversarial training (LAT) as a technique to improve robustness to persistent harmful behaviors in large language models (LLMs). The authors demonstrate LAT's effectiveness in three key applications: (1) improving resistance to jailbreaking attacks while maintaining model performance, (2) removing backdoors without knowledge of the trigger, and (3) enhancing unlearning of undesirable knowledge. The core idea is to perturb latent activations to elicit specific undesirable behaviors during training, then optimize the model to be robust against such perturbations. The authors show LAT can augment existing techniques like refusal training, DPO, and machine unlearning methods to achieve better results with minimal computational overhead.

Strengths

Pros:

  • I believe targeted LAT can be a useful attack-agnostic defense, although the current evaluation lacks depth (see below).
  • The breadth of evaluation is appealing. It’s nice to see a method that potentially improves on safety/alignment across multiple diverse tasks.

Weaknesses

Weaknesses:

  • The attacks used for the evaluation in the main table (Table 2) are quite weak: the best attack success rate is 27.7% on Llama-3-8B Instruct, although it’s possible to achieve ~100% ASR on this model (e.g., as reported in Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks but with a different judge). Without strong enough attacks, it’s hard to conclude that the defense is effective enough, especially given the anecdotal evidence that there are some simple breaks like the one mentioned in the paragraph “Manual red-teaming and research demo”.
  • It’s not clear to me why the proposed targeted formulation should be better than the existing LAT methods, such as Embedding-Space Adversarial Training (Xhonneux et al., NeurIPS 2024) or Defending against unforeseen failure modes with latent adversarial training (Casper et al., 2024b). There are some explanations in the introduction but they seem quite handwavy. Also, the only comparison between RT-EAT and RT-EAT-LAT suggests a small difference: 4.3% vs. 2.9% prefilling ASR while 6.22 vs. 5.86 MT-Bench score - so it’s not even clear which model is really better.
  • MMLU and MT-Bench may be too easy as an over-refusal evaluation since those questions are completely harmless. Adding something like XS-Test or OR-Bench would make the evaluation stronger.
  • R2D2 and RT-EAT should also be added for Llama-3 as baselines.
  • Since there are no other baselines except DPO for backdoor removal included in Table 3, it’s unclear whether LAT is really necessary there or basically any algorithm that would somehow perturb the weights in the optimization process would work as well.
  • For the unlearning part, it’s not clear to me whether WHP-C-LAT pushes the Pareto frontier compared to WHP-C. WHP-C-LAT has a noticeably worse MMLU score (43.9% vs. 45.6%) although with a better unlearning performance. Also, the unlearning part should have more baselines (there are plenty of unlearning methods that exist in the literature).
  • The choice of the L2 norm for layerwise perturbations looks a bit arbitrary. It would be nice to elaborate why it can make sense.

Questions

General suggestions:

  • Table 2 is too wide. Also, the figures above Table 2 should have a separate caption. Also, all table captions should be above tables, not below. Also \citet vs. \citep should be used correctly throughout the paper (e.g., double check the “Future work” paragraph).
  • The Future Work paragraph: an introductory sentence would improve the reading flow.
  • “Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024.” - should be 2023, not 2024.
  • “prefilling attacks (Haizelabs)” doesn’t seem to be the right reference for the prefilling attack, since it wasn’t introduced there.
  • In addition to MMLU, it would be also good to add the MT-Bench score for the DPO-LAT models in Table 8.

Update after the rebuttal: The paper still feels quite rushed to me, primarily in terms of the experiments. Overall, I feel like Targeted LAT might be a promising method to improve multiple dimensions of safety. However, this is still unclear from this version of the paper. I will keep my original score 5/10. I think the approach might be promising but requires much more systematic experiments. I think extending this work and resubmitting it would be the best option.

Comment

Thanks, we appreciate the review and feedback! In this review cycle, to save everyone’s time and effort, we are opting for a radically honest and terse response.

W1-W5: Evaluations can always be more elaborate, but we think that the experimental thoroughness of the paper is a strength, not a weakness, and surpasses a reasonable standard. For example, we know of no prior works that make progress on jailbreak refusal, backdoor removal, and unlearning at the same time — we only know of one (Tamirisa et al., 2024) that makes progress on two of these at once. We experiment with 3 different LLMs, 5 different datasets, 13 distinct training algorithms, and 21 distinct evaluation measures for performance/robustness/efficiency. Please consider this experimental breadth compared to related works in your final assessment. We have also released code, 15 models, and 2 datasets alongside the paper’s public arXival. As we write this, we have 9 forks and 26 stars on GitHub. We also have over 700 model downloads and 1000 dataset downloads on HuggingFace, suggesting a practical contribution to the community.

W1: We don’t think that using particularly high-powered attacks would be needed to prove the point we intend to prove. Here, our goal is to expand out the Pareto frontier relative to SOTA baselines, which we do. While it is possible to achieve very high jailbreak successes on models with customized attacks, we know of no comparable jailbreak-robustness works that test against adaptive attacks like Andriushchenko et al. (2024). Are there any papers you'd like to point us to that make progress on jailbreaks or jailbreak defenses with significantly more rigorous evaluations?

W2: Unless we misunderstand your point, we think the paper already does what was mentioned here. We compared our LAT method to Xhonneux et al. (2024) in Table 2 and Casper et al. (2024) in Table 6 of the appendix. Meanwhile, we compare RT-EAT and RT-EAT-LAT on nine different measures. Our method wins on 7, ties on one, and loses on one. We think that this more than meets a reasonable burden of proof to argue that LAT is a valuable method.

W3: More elaborate evaluations would always be nice, but we followed existing methodology from prior work. Ultimately, we set out to beat baselines and succeeded in doing so, so we think that this meets a reasonable standard of evidence. See also the response to W1-W5 above.

W4: We agree this would be good, and we are currently working with Llama-3 models in follow-up work. But we do not think it would be a helpful use of time to revisit this work to add more experiments, especially since R2D2 is becoming less popular and has no public implementation tuned to Llama-3 models (to the best of our knowledge). There will always be more things that can be added. But holding a paper to doing all experiments with Llama-3 models (which were released mid-project) would be a somewhat artificial standard.

W5: What specific algorithms should we compare to? To the best of our knowledge, all fine-tuning-based backdoor cleansing methods from the literature involve either passive forgetting or training the model under reconstructed proxy triggers. In that sense, our experiments have methodological coverage. Finally though, we think that additional backdoor removal baselines are not needed to show what we intend to show. Our goal was to use LAT with many methods in a versatile way to help remove persistent failure modes. Altogether, we think that our results for jailbreaks, backdoors, and unlearning make a strong case for this.

W6: Yes, we find that WHP-C-LAT induces a general performance cost compared to WHP-C. But most robustness/unlearning/backdoor-removal methods induce similar costs. Pareto domination isn’t necessary to demonstrate usefulness. We have shown that LAT can push out the Pareto frontier.

W7: The perturbation norm that we used was just a hyperparameter. We swept over multiple configurations of hyperparameters for LAT runs and control runs alike.
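
For concreteness, a minimal sketch of where that norm choice enters is given below, assuming a PGD-style inner loop over a per-token latent perturbation; the function is illustrative rather than taken from the released code.

```python
import torch

def project_l2(delta: torch.Tensor, eps: float) -> torch.Tensor:
    """Project each token's latent perturbation onto an L2 ball of radius eps.

    delta: [batch, seq_len, hidden_dim] perturbation added to one layer's
    activations. Choosing a different constraint (e.g., an L-infinity ball via
    delta.clamp(-eps, eps)) would only change this projection step.
    """
    norms = delta.norm(p=2, dim=-1, keepdim=True).clamp(min=1e-12)
    return delta * (eps / norms).clamp(max=1.0)
```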

Comment

Thanks for the clarifications. I'm somewhat sympathetic to the paper, and I appreciate the fact that it got some visibility in the community.

I agree with Reviewer EUWJ that it's a bit questionable to use this as an argument: "there is explicit interest in LAT at industry labs including Anthropic, FAR.AI, and GreySwan AI" (as well as other public statistics like the number of Github stars). But also, I don't agree that it's unethical or further violates the double-blind system (which is kind of violated anyway with arXiv submissions and publicising papers on social media).

In any case, the paper still feels quite rushed to me, primarily in terms of the experiments. Llama 3 was out in April 2024 and you used it in your paper, so then it's justified to require more thorough experiments with it. Overall, I feel like Targeted LAT might be a promising method to improve multiple dimensions of safety. However, this is still unclear from this version of the paper. I will keep my original score 5/10. I think the approach might be promising but requires much more systematic experiments. I think extending this work and resubmitting it would be the best option.

Comment

Thanks for the reply. One final thought that we will leave for your consideration is that this paper was arXived on July 22. So please take this into account for your final review. One trouble with the conference rhythm and review process is that it puts multi-month gaps between arXival and review. So we ask that you evaluate the paper as if it were July. This is why we have more experiments with Llama2 than Llama3.

Comment

Closing Argument

One last thanks to the reviewers and AC. In this final comment, we are going to be radically honest and say exactly what we think. There are two reasons for this:

  1. Our submission window closes after ICLR. We arXived on July 22. The models that we experimented on are becoming older, and more related work is constantly popping up. Our focus is on our followup work. We think that going back and updating experiments for the sake of an ICML submission would be playing the resubmission game rather than pursuing research impact.
  2. The reviews are 3-5-5-6 which is likely a rejection. But it is well known that conference review is somewhat of a dice roll. Another challenge with reviews is that papers are sometimes judged less by the sum of contributions and more by the number of critiques reviewers have.

On the paper’s contributions

In their final reasons for decision, three reviewers expressed concern that the paper is lacking experimental breadth or depth. s8MN pointed to how we mostly worked with Llama2 and had limited experiments with Llama3 models. goqz asked about adding experiments with another model like Vicuna. And EUWJ asked for more jailbreaking evaluations beyond the 6 that we used. In our replies, we have responded to the specifics of these points, but we also want to make a broader point. Evaluations can always be more elaborate. But we think that the paper's experimental breadth and depth is a strength, not a weakness. We make contributions on jailbreak robustness, backdoor removal, and unlearning. To the best of our knowledge, no other related papers have done this. There is only one of which we know (Tamirisa et al., 2024) that works in two of these domains. Meanwhile, we work with 3 different LLMs, 5 different datasets, 13 training algorithms, and 21 evaluation measures for performance/robustness/efficiency. We set out to show that LAT is a versatile tool for improving LLM safeguards, and we think our experiments made a strong case for this. We expect many papers will be accepted to ICLR which have substantially narrower contributions and less thorough experiments.

On the paper’s impact

The paper, code, and models have been influential: In direct response to reviewer concerns about reproducibility and model usefulness, we discussed the code and models we released with the paper:

We have also released code, 15 models, and 2 datasets alongside the paper’s public arXival. As we write this, we have 9 forks and 26 stars on GitHub. We also have over 700 model downloads and 1000 dataset downloads on HuggingFace suggesting a practical contribution to the community.

We think this reflects quite positively on the paper's quality and value.

Asking for AC adjudication

In their reasons for decision, EUWJ and rnfj expressed a concern with the verifiability of the above claim, and wrote that their decisions will be based on presuming doubt. We understand their thought process. But it is not an ICLR policy to forbid discussion of public code, model release, or engagement in the discussion phase. We agree that no reviewer should verify these claims themselves, but if necessary, they could be verified without violating double blindness using the proxy method – asking a colleague to anonymously fact-check. We have checked that ICLR’s policies and guidelines for review do not prohibit this. s8MN agreed that our response was reasonable.

This is a disagreement that we ask the AC to adjudicate between us & s8MN versus rnfj & EUWJ. This is a matter of procedure, and not a matter of opinion. Citing this concern, rnfj adjusted their score from an 8 to a 6, and EUWJ opted to leave their score at a 3. We would ask that our discussion of released code and models not be ignored or presumed false because (1) they were the most direct way to reply to reviewer concerns, (2) code/model usage is relevant for evaluating paper quality/reproducibility/impact, (3) we did not ask reviewers to verify these themselves, (4) our claims could be verified anonymously using a proxy, and (5) concerns with verifiability are not unique to this type of claim (e.g., it's possible for authors to simply fabricate a paper's results).

Comment

On jailbreaking evaluation methodology

In EUWJ’s explanation of their final decision, they argued that the six jailbreak robustness evaluations we used were insufficient because the jailbreaks did not achieve near-100% success on the base Llama models, as was shown by attacks in Andriushchenko et al. (2024). Our reply was that this was (1) irrelevant because our conclusions were based on how we beat the baselines and (2) that other similar papers do not meet the standard outlined. We think this is an artificial standard and asked three separate times for EUWJ to point to a paper that meets this standard or to say what specific experiment we should run. In our final reply, we asked if they think we should add soft prompt attacks (Schwinn et al., 2024) or evaluate on the adversarial prompts available in Andriushchenko et al. (2024)'s GitHub repository. But we did not receive a reply before the end of discussions.

We ask the AC to take this into account when assessing EUWJ’s rating of 3. If this concern is decisive, we could commit to adding evaluations from soft prompt attacks (Schwinn et al., 2024) or the adversarial prompts from Andriushchenko et al. (2024) before a camera-ready version.


Finally, thanks again to all involved. We appreciate your volunteered time and ideas!

Comment

Thanks for the engagement. In my opinion, the attack introduced by Andriushchenko is more meaningful. Soft prompts will likely reduce the robustness of all existing models to 0% and thus do not provide meaningful feedback. They could be used as a sanity check for gradient obfuscation.

Comment

Hello, we did not know that the discussion period was still open; based on this site, we thought it had closed Nov 26 AoE. Oops. But thanks for the response! We appreciate how active you have been in discussions.

We can commit to adding a column to Table 2 with attacks used by Andriushchenko et al. (2024) before camera-ready.

AC Meta-Review

This paper introduces targeted latent adversarial training (LAT) as a technique to improve robustness to persistent harmful behaviors in large language models (LLMs). The authors demonstrate LAT's effectiveness in three key applications: (1) improving resistance to jailbreaking attacks while maintaining model performance, (2) removing backdoors without knowledge of the trigger, and (3) enhancing unlearning of undesirable knowledge. The core idea is to perturb latent activations to elicit specific undesirable behaviors during training, then optimize the model to be robust against such perturbations. The authors show LAT can augment existing techniques like refusal training, DPO, and machine unlearning methods to achieve better results with minimal computational overhead.

For the strengths, the paper introduces targeted Latent Adversarial Training (LAT). This computationally efficient approach enhances the robustness of LLMs by specifically targeting latent activations. Extensive experiments have been conducted to provide good insight into the components of the proposed method. The paper is generally well written, with clear illustrations and tables.

However, there are several points to be further improved. For example, this paper follows a general adversarial training pipeline, which requires maximizing the adversarial loss while minimizing the "safety loss"; the framework itself is standard for adversarial training, which might limit the contribution of the paper. As the proposed method shares similarities with latent adversarial training (LAT), the paper needs to discuss the difference between the proposed method and the previous LAT. In addition, as LAT perturbs a layer's activations, the choice of which layers to perturb needs to be better discussed and empirically verified. Despite its effectiveness in defending against jailbreaks, the DPO setting with the backdoor trigger is impractical, as most training datasets are carefully constructed. Moreover, some important experimental details are left out, especially those related to the hyperparameters of the proposed algorithm and the baseline algorithms (i.e., the β for DPO, the GCG steps, the examples for MSJ, etc.), which are necessary to make the comparison fair and the experiments reproducible/reliable. There is no ablation study on the choice of ε, the updated layers, the additional SFT loss, etc., and the choice of constraint budget and the additional SFT loss is not consistent across different sections.

Therefore, this paper cannot be accepted at ICLR this time, but the authors are highly encouraged to submit an enhanced version to other top-tier venues.

Additional Comments from the Reviewer Discussion

Reviewers kept their scores after the rebuttal.

Final Decision

Reject