PaperHub

ICLR 2025 · Decision: Rejected · 3 reviewers
Overall rating: 4.7/10 (individual ratings: 3, 6, 5; min 3, max 6, std 1.2)
Average confidence: 3.7 · Correctness: 2.7 · Contribution: 2.3 · Presentation: 2.7

Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons

Links: OpenReview · PDF
Submitted: 2024-09-27 · Updated: 2025-02-05
TL;DR

In this paper, we interpret the mechanism behind safety alignment via neurons and analyze their properties.

Abstract

Large language models (LLMs) excel in various capabilities but pose safety risks such as generating harmful content and misinformation, even after safety alignment. In this paper, we explore the inner mechanisms of safety alignment through the lens of mechanistic interpretability, focusing on identifying and analyzing *safety neurons* within LLMs that are responsible for safety behaviors. We propose *inference-time activation contrasting* to locate these neurons and *dynamic activation patching* to evaluate their causal effects on model safety. Experiments on multiple prevalent LLMs demonstrate that we can consistently identify about $5$% safety neurons, and by only patching their activations we can restore over $90$% of the safety performance across various red-teaming benchmarks without influencing general ability. The finding of safety neurons also helps explain the "alignment tax" phenomenon by revealing that the key neurons for model safety and helpfulness significantly overlap, yet they require different activation patterns for the same neurons. Furthermore, we demonstrate an application of our findings in safeguarding LLMs by detecting unsafe outputs before generation.
Keywords

Large Language Models, Mechanistic Interpretability, Safety Alignment, Neuron

Reviews and Discussion

Review (Rating: 3)

Focusing on the safety mechanism of LLMs, this paper proposes (1) inference-time activation contrasting, to locate safety neurons, and (2) dynamic activation patching, to evaluate their causal effects on model safety. The key observation is that only a few (5%) neurons contribute to the safety of the model. This paper also proposes applications of the observations.

Strengths

  1. Understanding the safety mechanism of LLMs is a crucial research problem.
  2. This paper focuses on various aspects of the proposed interpretability methods, including empirical observation on neurons, transferability, and potential application, making the contribution a comprehensive framework.

Weaknesses

  1. The presentation of this paper can be substantially improved. Many terms are not well explained in the paper, e.g., cost scores in Table 3 and (IA)$^3$ in Section 4.1.
  2. The observation that a few safety neurons contribute to the safety of LLMs has already been spotted in some related work, but these works are not cited or discussed:
  • On Prompt-Driven Safeguarding for Large Language Models. ICML 2024
  • Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications. ICML 2024
  3. It seems that the 3 LLMs used are already aligned for safety (at least to a certain degree) before they are released. What, then, is the alignment in Section 4.1?
  4. In my opinion, it would be necessary to include some advanced jailbreaking attacks in the evaluation (both for the main observation and the application), since current LLMs can easily refuse to answer vanilla harmful questions.
  5. Although 3 models are evaluated, I still think the model scope is quite limited, e.g., all 3 models are 7B in size. Can the conclusions generalize to larger models?

Questions

See weaknesses.

Comment

Thank you for your constructive and valuable comments. Here is our response to the weaknesses and questions.

W1 The presentation of this paper can be substantially improved

Thank you for your suggestion, and we apologize for any unclear statements. We have revised the corresponding sections in the newly submitted version to improve clarity.

W2 Safety neurons in related work

Please refer to our general response. We also added a discussion on this aspect in the related work section of our newly uploaded paper.

W3 What is the alignment in 4.1?

We agree that the released chat or instruct versions of current LLMs are generally safe, but to conduct experiments in a more controllable setting, we chose to begin alignment from the pre-trained base models. As mentioned at the end of Section 3, we refer to "the pre-trained LLMs before SFT (denoted as Base)." Additionally, the names used in Section 4.1, such as Mistral-7b-v0.1, refer to the official names of the base models, not abbreviations of the aligned chat models. We apologize for the potential misunderstanding. The results in Table 2 show that the base models used in this paper are not safe enough and that our alignment improves safety.

W4 It is necessary to include some advanced jailbreaking attacks for evaluation

We believe our practice is appropriate, since using red-teaming benchmarks to evaluate model safety is a very common setting in existing works, including the two papers mentioned in your review. Moreover, similar to the misunderstanding in Weakness #3, we agree that current LLMs can generally refuse harmful questions, but this holds for chat and instruct models. The base models used in our experiments are clearly not safe enough on the adopted benchmarks (shown in Table 2).

W5 Can the conclusion generalize to larger models?

In general, we believe using this model size is common practice in interpretability research [1-5], but we agree this is a valid concern and are happy to add more experiments. Due to the greater computational cost and our limited resources, additional experiments are still running. We provide the results for Llama2-13B below (in a format similar to Table 2), which show trends similar to the original experiments in the paper.

| Llama2-13B | BT | RT | GSM | BBH | MMLU | TQA |
|---|---|---|---|---|---|---|
| Base | -4.5 | -4.0 | 0.22 | 0.151 | 0.507 | 0.268 |
| Base* | -8.7 | -8.4 | 0.2 | 0.142 | 0.483 | 0.272 |
| SFT | -7.5 | -5.8 | 0.165 | 0.133 | 0.525 | 0.268 |
| SFT* | -11.2 | -10.3 | 0.165 | 0.132 | 0.528 | 0.278 |
| DPO | -12.2 | -11.2 | 0.185 | 0.122 | 0.520 | 0.288 |

[1] Zhenhong Zhou, Haiyang Yu, Xinghua Zhang, Rongwu Xu, Fei Huang, Kun Wang, Yang Liu, Junfeng Fang, & Yongbin Li. (2024). On the Role of Attention Heads in Large Language Model Safety.

[2] Shen Li, Liuyi Yao, Lan Zhang, & Yaliang Li. (2024). Safety Layers in Aligned Large Language Models: The Key to LLM Security.

[3] Zeping Yu, & Sophia Ananiadou. (2024). Neuron-Level Knowledge Attribution in Large Language Models.

[4] Ameen Ali, Lior Wolf, & Ivan Titov. (2024). Mitigating Copy Bias in In-Context Learning through Neuron Pruning.

[5] Alessandro Stolfo, Ben Wu, Wes Gurnee, Yonatan Belinkov, Xingyi Song, Mrinmaya Sachan, & Neel Nanda. (2024). Confidence Regulation Neurons in Language Models.

Review (Rating: 6)

Summary

The authors propose methods to identify "safety neurons" within large language models (LLMs) that are responsible for safety behaviors. They introduce "inference-time activation contrasting" to pinpoint neurons active in aligned models but inactive in unaligned ones, and "dynamic activation patching" to assess the causal impact of these neurons on safety. These findings suggest a pathway toward more controlled and robust alignment of LLMs with human values and safety requirements.

Strengths

  • The topic of LLM safety is highly relevant and timely.
  • The paper makes solid contributions by:
    • Identifying safety neurons in three open-source LLMs.
    • Proposing an effective safeguard application.

Weaknesses

  • Novelty Concerns: The novelty of the proposed approach is unclear. Previous studies have investigated critical neurons within LLMs. The authors should clarify how their methods differ from or improve upon existing approaches.
  • Limited Discussion: The paper lacks a sufficient discussion on how the proposed methods relate to existing representation engineering techniques (https://arxiv.org/pdf/2310.01405). A deeper comparison would help contextualize their contributions.

Questions

  • How does the proposed approach for identifying "safety neurons" differ from prior methods that target other types of critical neurons in LLMs?
  • Can the "dynamic activation patching" method be generalized to other alignment applications, such as aligning models with values beyond safety (e.g., fairness)?
  • Do you find any mechanistic insight? For example, did you observe specific patterns among the "safety neurons" related to particular types of safety risks, such as misinformation or toxicity?
  • For safeguard applications, what is the overhead of your proposed approach?
Comment

Thank you for your constructive and valuable comments. Here is our response to the weaknesses and questions.

W1 & Q1 How does our method differ from prior ones that target other types of critical neurons?

Please refer to Section 7 of the revised paper (Lines 499-516) for discussions on related neuron-based works.

Also please refer to our general response for discussions on related works on interpreting safety. Thanks for the question and we will add the discussions if more space is permitted.

W2 Limited Discussion

Please refer to our general response.

Q2 Can the "dynamic activation patching" method be generalized

Dynamic activation patching is task-agnostic. At the very least, Figure 4(b) in our paper demonstrates its effectiveness in altering the model’s helpfulness. As for whether it can be extended to other aspects of value alignment, we believe it is possible and this is an interesting direction for future work.
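To make the mechanics concrete, here is a minimal sketch of how patching selected MLP-neuron activations could be implemented with PyTorch forward hooks; the module list, recorded source activations, and neuron indices are illustrative assumptions, not the authors' released implementation.

```python
import torch

def patch_neuron_activations(layer_modules, source_acts, neuron_ids):
    """Overwrite selected MLP-neuron activations in the target model with
    activations recorded from a source (e.g., safety-aligned) model.

    layer_modules: per-layer MLP activation modules of the target model
    source_acts:   {layer: tensor of shape (batch, seq, d_mlp)} recorded from
                   the source model on the same (teacher-forced) inputs
    neuron_ids:    {layer: list of neuron indices to patch}
    Returns the hook handles; call .remove() on each after generation.
    """
    handles = []
    for layer, module in enumerate(layer_modules):
        idx = neuron_ids.get(layer)
        if not idx:
            continue

        def hook(mod, inputs, output, layer=layer, idx=idx):
            patched = output.clone()
            # Replace only the selected neurons; all other activations stay untouched.
            patched[..., idx] = source_acts[layer][..., idx].to(output.device, output.dtype)
            return patched

        handles.append(module.register_forward_hook(hook))
    return handles
```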

Q3 Mechanistic insight from safety neuron

Our current mechanistic insight suggests that safety and helpfulness may share the same set of neurons but exhibit different activation patterns on these neurons, which could potentially explain the alignment tax phenomenon. We find your suggestion about exploring patterns among the "safety neurons" related to specific types of safety risks to be a very interesting perspective. We plan to investigate this further in future work and sincerely thank you for proposing this valuable direction.

Q4 Overhead of safeguard applications

The overhead of our safeguard mechanism primarily comes from a logistic regression classifier. When using activations from only 1,500 neurons, this requires merely computing the inner product of 1,500-dimensional vectors, which is negligible compared to the billions of parameters in an LLM. In fact, if certain outputs can be rejected early, the process could even accelerate generation. Based on our measurements, the classification step takes less than 0.01 seconds, accounting for less than 1/2500 of the total inference time.
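As a rough illustration of the kind of lightweight check described above, the sketch below fits a logistic-regression probe on safety-neuron activations; the file names and label collection are placeholders rather than artifacts from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder arrays: X holds one ~1,500-dimensional vector of safety-neuron
# activations per prompt; y marks whether the eventual generation was unsafe.
X_train = np.load("safety_neuron_acts_train.npy")   # shape (n_examples, 1500)
y_train = np.load("unsafe_labels_train.npy")        # shape (n_examples,)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

def flag_before_generation(acts: np.ndarray) -> bool:
    # A single inner product plus a sigmoid -- negligible next to one forward
    # pass of a 7B-parameter model, so unsafe prompts can be rejected early.
    return bool(probe.predict(acts.reshape(1, -1))[0])
```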

[1] Xiaozhi Wang, Kaiyue Wen, Zhengyan Zhang, Lei Hou, Zhiyuan Liu, & Juanzi Li. (2022). Finding Skill Neurons in Pre-Trained Transformer-Based Language Models.

[2] Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, & Furu Wei. (2022). Knowledge Neurons in Pretrained Transformers.

[3] Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, & Dimitris Bertsimas. (2023). Finding Neurons in a Haystack: Case Studies with Sparse Probing.

[4] Zeping Yu, & Sophia Ananiadou. (2024). Neuron-Level Knowledge Attribution in Large Language Models.

[5] Ameen Ali, Lior Wolf, & Ivan Titov. (2024). Mitigating Copy Bias in In-Context Learning through Neuron Pruning.

[6] Alessandro Stolfo, Ben Wu, Wes Gurnee, Yonatan Belinkov, Xingyi Song, Mrinmaya Sachan, & Neel Nanda. (2024). Confidence Regulation Neurons in Language Models.

[7] Wes Gurnee, Theo Horsley, Zifan Carl Guo, Tara Rezaei Kheirkhah, Qinyi Sun, Will Hathaway, Neel Nanda, & Dimitris Bertsimas. (2024). Universal Neurons in GPT2 Language Models.

Comment

About Answer to W1 & Q1: Thank you for the explanation of the methods and comparisons with [6] and [7]. However, there are areas where further clarification would greatly enhance understanding: 

  • Clarification of "Properties": At the beginning of Section 4, you refer to properties like sparsity, causal effect, transferability, and stability during training. Could you elaborate on what these "properties" specifically mean in the context of your work? How are they quantitatively or qualitatively defined, and how are they measured?
  • Controlling Properties: When you state that [7] "cannot control the specific properties of the neurons identified," could you clarify what it means to "control" a property, particularly for transferability? How does your approach enable control (or lack thereof) of such properties, and why is this ability to control properties important for identifying safety neurons?
  • Comparison with [6]: In your comparison with [6], you mention that their method starts from desired neuron properties and identifies neurons mechanistically, but suffers from reproducibility issues due to predefined assumptions. You also state that for safety neurons, "we lack prior knowledge about the properties they should exhibit, making such an approach unsuitable." This could use more elaboration: Could you explain why the lack of prior knowledge about safety neurons' properties makes the approach of [6] unsuitable? Does your method completely avoid making assumptions about neuron properties, or do you rely on certain implicit assumptions? If so, how do these differ from those of [6]?
Comment

Thank you for your acknowledgment and for giving us the chance to explain. We apologize for the potential confusion caused by our original response, which was imprecise and could be misread as criticizing the related works. Here are our clarifications regarding your questions. To avoid spreading confusion, we have revised the last comment (the old version is visible in the revision history if needed) and the PDF.

Question 1: Clarification of "Properties"

  • Causal Effect: In this paper, the causal effect is measured by the formula in Equation 4.
  • Sparsity: The sparsity in this work is measured by the proportion of neurons required to achieve a decent level of causal effect (e.g., 0.9) on model safety. In our main results, safety neurons constitute approximately 5% of all neurons.
  • Transferability: Transferability refers to how well the safety neurons identified in one dataset also work on others. In our experiments, we identified safety neurons in Beavertails, and these neurons also work well on other red-teaming benchmarks, such as in Table 2.
  • Stability during Training: This evaluates the consistency of the safety neurons identified across models trained with different random seeds. We measured this using both neuron overlap and Spearman rank correlation, achieving values above 0.95 across the three model families; the narrow error bars in Figure 2 also reflect this stability. (A sketch of these two measures follows this list.)
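A minimal sketch of the two stability measures mentioned in the last item, assuming the per-neuron change scores from two training seeds are available as dictionaries (the data structures are illustrative):

```python
from scipy.stats import spearmanr

def stability(scores_a, scores_b, k):
    """Compare safety-neuron rankings from two runs with different seeds.

    scores_a / scores_b: {neuron_id: change score} from inference-time
    activation contrasting for each run.
    Returns (top-k overlap, Spearman rank correlation over shared neurons).
    """
    top_a = set(sorted(scores_a, key=scores_a.get, reverse=True)[:k])
    top_b = set(sorted(scores_b, key=scores_b.get, reverse=True)[:k])
    overlap = len(top_a & top_b) / k

    shared = sorted(scores_a.keys() & scores_b.keys())
    rho, _ = spearmanr([scores_a[n] for n in shared],
                       [scores_b[n] for n in shared])
    return overlap, rho
```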

Question 2: Controlling Properties

We apologize for the imprecise expression. The properties here do not refer to the properties of safety neurons in Section 4. Here we use “properties” to refer to the phenomena or properties of LLMs that we want to interpret by identifying neurons. For instance, the property corresponding to safety neurons in this context is model safety. In [7], the authors identified universal neurons that are responsible for “properties” like alphabet, position, suppression, etc. However, the method described in [7] does not allow specifying the “properties” to be interpreted before identifying neurons, and thus cannot be directly applied to the goal of this work, i.e., interpreting model safety.

Question 3: Comparison with [6]

First, we need to clarify that the "reproducibility issue" does not mean we had difficulties reproducing the original results. It was a poor choice of words for "the method in [6] cannot be directly reused in our work", and we apologize for the unintended implication. In [6], the authors identified entropy neurons by searching for neurons with a high weight norm and minimal impact on the logits. The underlying assumption is that these neurons act as a near-constant addition to all logits before the softmax, resulting in a minimal effect on output probabilities while increasing their entropy.

For model safety, however, we lack such a clear mechanistic intuition about how safety-related neurons should work. Therefore, our only assumption is that safety neurons should work in different ways between safety-aligned and unaligned models, and we design the inference-time activation contrasting to identify them.
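As a rough sketch of this idea (not the paper's exact scoring function), one could rank neurons by how much their activations change between the unaligned and aligned model on the same prompts; collecting the activation tensors, e.g. with forward hooks, is assumed to happen elsewhere.

```python
import torch

def contrast_neurons(aligned_acts, unaligned_acts, top_frac=0.05):
    """Rank MLP neurons by activation change between aligned and unaligned models.

    aligned_acts / unaligned_acts: tensors of shape (n_layers, n_examples, d_mlp)
    holding neuron activations recorded on the same prompts.
    Returns {layer: indices of the top `top_frac` neurons by mean absolute change},
    a simplified stand-in for the paper's change score.
    """
    change = (aligned_acts - unaligned_acts).abs().mean(dim=1)   # (n_layers, d_mlp)
    k = int(top_frac * change.shape[1])
    top = change.topk(k, dim=1).indices                          # (n_layers, k)
    return {layer: idx.tolist() for layer, idx in enumerate(top)}
```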

Review (Rating: 5)

This paper introduces a novel methodology for identifying specific MLP neurons that contribute to safety alignment in large language models. The authors present two complementary techniques: inference-time activation contrasting, which identifies neurons by comparing their activation patterns between pre- and post-safety-finetuned model checkpoints; and dynamic activation patching, which employs causal interventions to quantify the extent to which the identified neurons are responsible for the model's safety behaviors.

The authors show that inference-time activation contrasting can robustly identify neurons that are causally responsible for safety behavior (as measured by dynamic activation patching), on a wide range of benchmarks.

Through extensive experimentation, the authors demonstrate several key findings. When safety neurons are patched into instruction-trained models that were finetuned for helpfulness, it increases safety but reduces helpfulness. The reverse effect is also observed, suggesting that safety and helpfulness behaviors rely on similar neural mechanisms - providing mechanistic evidence for the alignment tax hypothesis. Additionally, the identified safety neurons can be used for harmful prompt classification to prevent unsafe model outputs.

Strengths

  • The authors tested their method on a variety of model families (LLaMa2, Mistral, and Gemma), and used a variety of different datasets and cost models to evaluate safety. This helps increase confidence that the neurons are actually responsible for general safety behavior, and not just patterns present in a particular dataset/grading scheme.

  • The authors show that the projections of their safety neurons onto the unembedding of the model result in different tokens than those of the toxicity neurons identified in previous work [1]. This distinction highlights that more complex instruction-tuned models have more nuanced mechanisms for dealing with safety than simply downweighting neurons that respond with toxic content.

[1] Andrew Lee, Xiaoyan Bai, Itamar Pres, Martin Wattenberg, Jonathan K. Kummerfeld, & Rada Mihalcea. (2024). A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity.

Weaknesses

  • The primary contribution of this work lacks sufficient novelty in the context of existing research. Prior work has already demonstrated successful localization of safety-relevant components in language models across multiple architectural levels, including neurons [1], parameters [2], residual activations [3], attention heads [4] [5], and layers [6]. While the authors occasionally reference some of these works throughout the paper, they fail to provide a comprehensive discussion of this existing research in either the related work section or the discussion.

  • The authors fail to adequately justify their focus on MLP neurons as the optimal level of abstraction for localizing safety behavior in language models. While they concentrate exclusively on neurons, prior work has demonstrated that safety behaviors emerge across multiple architectural components, particularly in attention heads and residual stream activations. The decision to analyze only neurons, while excluding these other important components, requires stronger theoretical or empirical justification. This limitation is particularly notable given that existing research has specifically identified attention heads as crucial contributors to refusal behavior [4].

  • The paper’s main contribution beyond identifying safety neurons is showing that helpfulness and safety training utilize similar mechanisms, which accounts for the “alignment tax” seen during safety training. However, the evidence provided in favor of this hypothesis is limited. The evidence can also be explained by dynamic activation patching not being a very good way of transferring specific mechanisms between different checkpoints. The authors should also look at models finetuned on both helpful and harmful data at the same time (HHH trained model), and test whether safety and helpful neurons still conflict.

  • The classification results in Section 6 are very misleading. The authors suggest that safety neurons show promise in assisting with harmfulness classification. However, the results in Appendix E suggest that safety neurons aren’t that much more useful for classifying harmfulness compared to random neurons (with random neurons being better when using 1500 neurons). This suggests that the method does not actually localize safety neurons, or that localization is not very useful for probing for harmfulness. Also, if the authors are going to claim that safety neurons are useful for building defenses that improve safety, they should compare it against similar setups such as in [3].

[1] Andrew Lee, Xiaoyan Bai, Itamar Pres, Martin Wattenberg, Jonathan K. Kummerfeld, & Rada Mihalcea. (2024). A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity.

[2] Boyi Wei, Kaixuan Huang, Yangsibo Huang, Tinghao Xie, Xiangyu Qi, Mengzhou Xia, Prateek Mittal, Mengdi Wang, & Peter Henderson. (2024). Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications.

[3] Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, & Dan Hendrycks. (2023). Representation Engineering: A Top-Down Approach to AI Transparency.

[4] Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, & Neel Nanda. (2024). Refusal in Language Models Is Mediated by a Single Direction.

[5] Zhenhong Zhou, Haiyang Yu, Xinghua Zhang, Rongwu Xu, Fei Huang, Kun Wang, Yang Liu, Junfeng Fang, & Yongbin Li. (2024). On the Role of Attention Heads in Large Language Model Safety.

[6] Shen Li, Liuyi Yao, Lan Zhang, & Yaliang Li. (2024). Safety Layers in Aligned Large Language Models: The Key to LLM Security.

Questions

  • What motivated your decision to focus exclusively on MLP neurons, given that prior work has shown attention heads are crucial for refusal and safety behavior?

  • Have you considered validating your hypothesis about helpfulness and safety mechanism overlap using models simultaneously trained on both helpful and harmful data?

  • Are the probing results primarily a negative result? If so, the section should be edited to clarify that.

Comment

Thank you for your constructive and valuable comments. Here is our response to the weaknesses and questions.

W1 Lack of comprehensive discussion

Please refer to our general response.

W2 & Q1 Why focus on neurons?

  1. From the perspective of research objectives: Our goal is to develop a mechanistic understanding of LLM safety. The residual stream studied in RepE represents the combined effects of attention heads and MLPs. Understanding how these effects are formed requires digging deeper into the MI perspective. Since MLP neurons account for approximately two-thirds of the model’s parameters and serve as the basic functional units, we chose neurons as the focus of our research.
  2. From a technical perspective: Compared to attention heads, identifying specific neurons presents a greater technical challenge due to their vastly larger quantity. For instance, in LLaMA2-7B, the number of neurons is about 340 times that of attention heads (a quick arithmetic check of these counts is given after this list), and neurons often function in combinations, making it difficult to pinpoint them using DFA methods like those in [1]. These challenges motivated us to prioritize the study of neurons. Of course, we do not claim that neurons alone provide a complete understanding of the safety mechanism; given the complexity of safety, it likely requires the joint participation of neurons and attention heads. Additionally, the methods proposed in this paper can also be applied to identify attention heads, facilitating further exploration of their interactions, which we consider an interesting direction for future work.
  3. Regarding related work on attention heads: [2] was published after our submission, while [1] identifies attention heads correlated with refusal directions from the RepE perspective but does not verify their causal effects on safety. This focus differs from the scope of our study.
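A quick sanity check of the counts cited in point 2, using the publicly documented LLaMA2-7B configuration (32 layers, hidden size 4096, MLP intermediate size 11008, 32 attention heads, vocabulary 32000):

```python
n_layers, d_model, d_mlp, n_heads, vocab = 32, 4096, 11008, 32, 32000

mlp_neurons = n_layers * d_mlp                   # 352,256 MLP neurons
attn_heads = n_layers * n_heads                  # 1,024 attention heads
print(mlp_neurons / attn_heads)                  # ~344x, consistent with "about 340 times"

mlp_params = n_layers * 3 * d_model * d_mlp      # gate/up/down projections ~= 4.33B
attn_params = n_layers * 4 * d_model * d_model   # Q/K/V/O projections      ~= 2.15B
embed_params = 2 * vocab * d_model               # input + output embeddings ~= 0.26B
print(mlp_params / (mlp_params + attn_params + embed_params))  # ~0.64, roughly two-thirds
```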

W3 & Q2 Evidence provided in favor of the alignment tax hypothesis is limited

Our experiments (e.g., the curves in Figure 4(b)) rely on testing the causal effects of different neurons (e.g., safety, helpfulness, reasoning) on model safety and helpfulness, and do not involve transferring other abilities. The effectiveness of dynamic activation patching in transferring safety and helpfulness has been validated by previous experiments such as Figure 2 and Table 2. Therefore, we do not see an alternative interpretation of our experimental results. We are willing to address your concerns but are unsure about the reasoning behind your interpretation or the specific experiments you propose. We would really appreciate it if you could elaborate, for example by describing the suggested experiments in detail, and we would be more than happy to verify them.

W4 & Q3 The classification results in Section 6 are very misleading

Thank you for your valuable suggestions. We agree that the margin of the original results is not clear enough. To more comprehensively verify whether safety neurons encode more safety-related information compared to random neurons, we conducted additional experiments:

  1. For the datasets used in the experiments, we selected one dataset as the training set and merged the others into a single test set at a time, averaging the results across all rotations (a code sketch of this rotation follows this list).
  2. We excluded safety neurons from the randomly sampled set.
  3. We added a group of random neurons sampled from all layers, as following the layer distribution of safety neurons may inherently carry safety-related information.
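The leave-one-dataset-in rotation from step 1 could be sketched as follows; the containers and the probe are illustrative, not the authors' evaluation code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def rotated_probe_accuracy(acts, labels):
    """Train a harmfulness probe on one red-teaming dataset at a time, test on
    the merged remaining datasets, and average the accuracy over rotations.

    acts[name]:   (n_examples, n_neurons) safety-neuron activations
    labels[name]: (n_examples,) harmfulness labels
    """
    scores = []
    for train_name in acts:
        X_tr, y_tr = acts[train_name], labels[train_name]
        X_te = np.concatenate([a for n, a in acts.items() if n != train_name])
        y_te = np.concatenate([l for n, l in labels.items() if n != train_name])
        probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        scores.append(probe.score(X_te, y_te))
    return float(np.mean(scores))
```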

The updated results have been added to the appendix, and we revised the corresponding descriptions in the main paper to be more precise. Below is a brief summary of the results and our explanations:

|  | 150 neurons | 1500 neurons |
|---|---|---|
| Safety neurons | 71.1 | 76.2 |
| Random neurons (last layer) | 67.7 | 74.2 |
| Random neurons (same distribution) | 68.3 | 74.8 |
| Random neurons (all layers) | 67.0 | 74.7 |

From the results, we observe that safety neurons are indeed more effective than random neurons in predictions. Additionally, random neurons with the same layer distribution as safety neurons are more effective than those sampled from other layers, which indicates the layer distribution of safety neurons may also encode safety information. This may partially explain the results in Appendix E. We sincerely apologize for our oversight and thank you for pointing this out.

Lastly, we would like to note that the differences in prediction performance are not very significant, which may be due to the following reasons:

  1. Safety neurons may not directly encode information about whether harmful content will be generated but instead exert their effects through subsequent components.
  2. Random neurons may still receive information from safety neurons.

We plan to further investigate these aspects in future work.

Comment

[1] Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, & Neel Nanda. (2024). Refusal in Language Models Is Mediated by a Single Direction.

[2] Zhenhong Zhou, Haiyang Yu, Xinghua Zhang, Rongwu Xu, Fei Huang, Kun Wang, Yang Liu, Junfeng Fang, & Yongbin Li. (2024). On the Role of Attention Heads in Large Language Model Safety.

Comment

Regarding W1 & W2

I acknowledge the engineering challenges involved in interpreting neurons compared to more coarse-grained units like layers and attention heads. The authors correctly note that previous works have examined safety behavior at both higher granularity (parameters) and lower granularity (attention heads/layers). However, my concern about novelty remains: the central question is not whether one can study safety behaviors at different granularities, but whether doing so provides meaningful new insights.

Even if the proposed probing method were significantly better on safety neurons versus random neurons, I'd still be unsure what the actionable insight would be. Researchers building latent space defenses (in the RepE camp) aren't probing at random neurons right now, so it feels like an artificial comparison. A more convincing demonstration would have been to show that neuron-level probing outperforms probing at the residual stream (representation engineering), which would have provided stronger validation for the need for additional mechanistic interpretability approaches.

I question whether papers that merely localize the same behaviors at different levels of granularity, without demonstrating novel insights or superior practical utility, represent sufficiently substantial contributions to warrant publication.

Regarding W3

I think there are two related but distinct claims here

  1. There is some sort of fundamental conflict between the neurons the model tends to use for helpful behavior, versus the neurons that the model tends to use for safety behavior.
  2. Finetuning a base model on just helpfulness allows it to repurpose some of its safety neurons. Similarly, fine-tuning a base model on just safety allows it to repurpose some of its helpful neurons.

The results in the paper suggest 2, but do not prove 1. I think proving 1 will be a challenge that requires a more careful methodology, for example, fine-tuning a model on both helpful and harmless behavior, then patching its safety mechanism over to a model trained only on helpfulness, and measuring how safe it is. I think there is a lot of subtlety that could be discussed here.

Regarding W4

I think the authors should move Figure 10b into Section 6. Without any additional context, it is hard to understand how good an accuracy of 76.2% is. I would have appreciated additional baselines here, such as probing directly on the residual stream. I would also appreciate error bars for the random neuron lines, considering that the margins are so small.

While I appreciate the authors' effort to strengthen the paper, particularly with the addition of a more comprehensive related works section, my core concerns remain. Therefore, I maintain my score of 5.

Comment

Thanks for the response and experiment suggestions. We provide further explanations about our novelty and contribution, and we also add new experiments as suggested.

Regarding W1&W2

We agree that merely providing a new localization at a different granularity is of limited value and that it is important for an interpretability work to provide new insights. This is why we provide the interpretation of the alignment tax with safety neurons. Also, to demonstrate the potential utility, we included the safeguard experiments. We understand your concerns about these two parts (W3 and W4) and have added more experiments as suggested.

To summarize, our contributions are two-fold:

  1. New techniques for Localizing Model Components: Our framework (inference-time activation contrasting and dynamic activation patching) identifies model components (not limited to neurons) that have a causal effect on specific behaviors (not limited to safety), even in the absence of ground-truth labels. This expands the scope for investigating various behaviors at different granularities.

  2. New insights from the safety neuron interpretation: The localization of safety neurons in this paper enables new insights about LLMs’ inner workings. For example, we are the first, to our best knowledge, to propose a mechanistic explanation for the alignment tax phenomenon, and we believe more insights about model safety can be revealed by more careful studies on the properties of safety neurons.

Regarding W3

Thank you for raising this important point and for the valuable suggestion. We acknowledge that our paper suggests point 2 but does not directly prove point 1. To address this gap, we conducted an additional experiment to verify point 1:

  1. We used DPO to train two models based on the same SFT model: one trained on HH-helpful (denoted as Helpful DPO), and the other one trained on both HH-harmless and HH-helpful (denoted as HH DPO).

  2. We patched 5% of neuron activations from HH DPO into Helpful DPO, where the neurons were identified from Helpful DPO for model helpfulness using the same pipeline as for identifying safety neurons in the paper.

|  | BT | RT | HB | JL |
|---|---|---|---|---|
| Helpful DPO | 3.42 | 0.65 | 6.68 | 6.66 |
| Helpful DPO (patched) | -11.77 | -11.09 | -5.57 | -8.28 |
| HH DPO | -11.81 | -12.42 | -10.41 | -11.76 |

The results indicate that the neurons identified as helpful neurons are also crucial for improving model safety during HH training, which matches the scenario described in your hypothesis. We will add these important results to the paper in the next permitted revision. If you have any concerns or further questions, we are more than happy to continue the discussion.

Regarding W4

Thank you for your suggestion. As the submission deadline has passed, we are unable to update the PDF with figures to showcase our results. Below are the details of our experimental setup and corresponding results:

We introduced a baseline using the residual stream and reported best-performing and average results across all layers. A partial summary of the results is presented in the table below (± is the standard error across random experiments):

|  | 150 neurons | 1500 neurons | 3000 neurons |
|---|---|---|---|
| Safety Neurons | 71.08 | 76.24 | 76.89 |
| RN-Same Distribution | 68.30 ± 1.02 | 74.80 ± 1.31 | 76.35 ± 0.67 |
| RN-Last | 67.66 ± 0.53 | 74.21 ± 1.19 | 74.91 ± 0.29 |
| RN-All | 67.05 ± 1.15 | 72.38 ± 0.54 | 74.34 ± 0.50 |
| Residual stream | 77.80 (best, layer 15) | 71.75 (average over layers) | – |

The effect of safety neurons is on par with the best performance of the residual stream. Considering that the neuron-level interpretation has the unique advantage of providing mechanistic interpretations (showcased in the interpreting alignment tax part) over the representation interpretations like using the residual stream, we believe this result is satisfactory.
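For comparison, a residual-stream probing baseline of the kind reported in the last table row could be sketched as follows; the layer choice and helper names are illustrative, and the probe itself would be the same logistic-regression classifier used for the neuron features.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

@torch.no_grad()
def residual_features(model, tokenizer, prompts, layer=15, device="cuda"):
    """Collect last-token residual-stream states at one layer for probing."""
    feats = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(device)
        out = model(**inputs, output_hidden_states=True)
        # hidden_states[layer] has shape (1, seq_len, d_model); keep the last token.
        feats.append(out.hidden_states[layer][0, -1].float().cpu())
    return torch.stack(feats)

# Usage sketch (model path is a placeholder):
# model = AutoModelForCausalLM.from_pretrained("path/to/aligned-model").to("cuda").eval()
# tokenizer = AutoTokenizer.from_pretrained("path/to/aligned-model")
# X = residual_features(model, tokenizer, prompts).numpy()
```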

Comment

Overall, I believe that the addition of the new experiments, particularly the HH-DPO patching and the residual stream probing, are valuable contributions to the paper. However, I still believe that my initial score is appropriate.

Comment

I disagree with this claim

> Considering that the neuron-level interpretation has the unique advantage of providing mechanistic interpretations (showcased in the interpreting alignment tax part) over the representation interpretations like using the residual stream

It is also possible to perform patching over the residual activations, as identified by these two works [1] [2].

[1] Curt Tigges, Oskar John Hollinsworth, Atticus Geiger, & Neel Nanda. (2023). Linear Representations of Sentiment in Large Language Models.

[2] Atticus Geiger, Zhengxuan Wu, Christopher Potts, Thomas Icard, & Noah D. Goodman. (2024). Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations.

Comment

We agree that patching over the residual activations is possible, but this paper is not about deciding whether neuron-level or representation-level interpretation is better. This paper has been about interpreting model safety from the very beginning, and we are only saying that, on this topic, our neuron-level interpretation demonstrates an advantage in understanding the mechanism (of the alignment tax) over the representation-engineering related works. To avoid further misunderstanding, we are not saying that representation engineering cannot reach a better mechanistic interpretation of model safety; rather, this has not yet been achieved (or demonstrated by a clear case like the alignment tax interpretation in our paper), and thus we believe our findings have unique value for now. We understand that one can favor one technical direction over another, but we do not think the possibility of another direction should diminish the value of actual findings.

Also, thank you for acknowledging our new experiments. Although we hold different positions, we respect your opinion.

Comment

We sincerely thank the reviewers for their thoughtful feedback and constructive suggestions. We have revised the paper based on the comments provided, with all changes highlighted in blue for clarity. We note that all the reviewers commented on our novelty and relationship to prior work. Below, we discuss this:

First of all, we believe that the safety mechanism of LLMs is an important topic that is far from being solved. Therefore, it is worthwhile to have multiple papers working on this topic. Existing interpretability research on LLM safety can be broadly categorized into two perspectives: Representation Engineering (RepE) and Mechanistic Interpretability (MI). We acknowledge the importance of RepE-focused studies, as they often demonstrate strong practical effectiveness in steering model behavior. For instance, [3][4] are firmly grounded in the RepE perspective, and [1][6] also incorporate some perspectives of this approach.

In contrast, our work adopts the MI perspective, which seeks a bottom-up understanding of models’ inner workings. This perspective emphasizes the importance of localizing model functionality to the most fundamental operational units—a core principle of mechanistic interpretability. In the case of transformers, MLP neurons constitute approximately two-thirds of the model's parameters and serve as the foundational units for functionality. Therefore, we focus our study on neurons as the target of analysis to uncover safety mechanisms.

For works categorized under the MI perspective, [1] has been discussed in our paper, where we point out that toxicity covers only part of the model safety concerns studied in our work, a view also acknowledged in Strength 2 of Review 9rqA and in recent work [7]. [2] adopts a different definition of "neuron", describing individual parameters rather than the complete functional units considered in this paper. Since features in transformers are usually represented as vectors, it is difficult to interpret how different parameters within a single vector play different mechanistic roles. [5] was published after our submission, so we could not include it in the paper. We believe the functionalities of neurons and attention heads are not in conflict; instead, complex functions like safety are more likely to result from their collaboration, and we plan to further explore their relationship in future work. [6] adopts a safety-layer perspective, which we consider too coarse-grained compared to neurons and attention heads for providing a mechanistic understanding.

Thanks again for referring to the related works. We have added the discussions in the revision.

[1] Andrew Lee, Xiaoyan Bai, Itamar Pres, Martin Wattenberg, Jonathan K. Kummerfeld, & Rada Mihalcea. (2024). A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity.

[2] Boyi Wei, Kaixuan Huang, Yangsibo Huang, Tinghao Xie, Xiangyu Qi, Mengzhou Xia, Prateek Mittal, Mengdi Wang, & Peter Henderson. (2024). Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications.

[3] Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, & Dan Hendrycks. (2023). Representation Engineering: A Top-Down Approach to AI Transparency.

[4] Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, & Neel Nanda. (2024). Refusal in Language Models Is Mediated by a Single Direction.

[5] Zhenhong Zhou, Haiyang Yu, Xinghua Zhang, Rongwu Xu, Fei Huang, Kun Wang, Yang Liu, Junfeng Fang, & Yongbin Li. (2024). On the Role of Attention Heads in Large Language Model Safety.

[6] Shen Li, Liuyi Yao, Lan Zhang, & Yaliang Li. (2024). Safety Layers in Aligned Large Language Models: The Key to LLM Security.

[7] Yushi Yang, Filip Sondej, Harry Mayne, & Adam Mahdi. (2024). Ablation is Not Enough to Emulate DPO: How Neuron Dynamics Drive Toxicity Reduction.

AC Meta-Review

The recommendation is based on the reviewers' comments, the area chair's evaluation, and the author-reviewer discussion.

While the reviewers see some merits in using a mechanistic interpretability approach to study safety neurons in LLMs, this submission should not be accepted in its current form due to several fundamental issues, as pointed out by the reviewers, including

  • Distinction and novelty in comparison to existing works, especially "Boyi Wei, Kaixuan Huang, Yangsibo Huang, Tinghao Xie, Xiangyu Qi, Mengzhou Xia, Prateek Mittal, Mengdi Wang, & Peter Henderson. (2024). Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications."
  • Soundness of the methodology, especially the comments made by Reviewer 9rqA

During the final discussion phase, the reviewers suggested rejecting this submission, and no reviewer was willing to champion the paper in its current form. I also believe the presentation and positioning of the paper can be improved, which would demand another round of full reviews. I hope the reviewers' comments can help the authors prepare a better version of this submission.

Additional Comments from the Reviewer Discussion

This submission should not be accepted in its current form due to several fundamental issues, as pointed out by the reviewers, including

  • Distinction and novelty in comparison to existing works, especially "Boyi Wei, Kaixuan Huang, Yangsibo Huang, Tinghao Xie, Xiangyu Qi, Mengzhou Xia, Prateek Mittal, Mengdi Wang, & Peter Henderson. (2024). Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications."
  • Soundness of the methodology, especially the comments made by Reviewer 9rqA

During the final discussion phase, the reviewers suggested rejecting this submission, and no reviewer was willing to champion the paper in its current form. I also believe the presentation and positioning of the paper can be improved, which would demand another round of full reviews.

Final Decision

Reject