Redefining Experts: Interpretable Decomposition of Language Models for Toxicity Mitigation
Our method steers LLMs away from toxic words in real time, guiding generation toward safe alternatives via a singular value decomposition (SVD) of the output layer. No retraining is needed, and fluency and context are preserved.
Abstract
Reviews and Discussion
The manuscript proposes a geometric characterization of toxic behavior identification in LLMs based on aggregated features of multiple layers (as opposed to neuron-based analysis). Experiments are provided on numerous architectures and datasets, showing promise.
Strengths and Weaknesses
The paper considers a very important problem within LLM research. While toxicity detection is often done ad hoc by having another LLM judge a previous LLM's output, or through extensive mitigation strategies and prompt tuning, there exists prior work on using internal layer features for toxicity detection and classification that the manuscript failed to cite:
- Characterizing Large Language Model Geometry Helps Solve Toxicity Detection and Generation, Randall Balestriero, Romain Cosentino, Sarath Shekkizhar, ICML 2024. While the procedure is slightly different, this does not reduce the significance of the given submission.
One possible limitation is the lack of study on larger models. While this could be a computational bottleneck, it would be great to verify whether the method also scales to the latest and larger models, showing wider applicability of the method (since the models studied in the paper would not be relevant for real-world applications).
Questions
Please see above.
Limitations
Please see above.
Final Justification
Please see my last answer to the authors
Paper Formatting Concerns
None
We sincerely thank the reviewer for the thoughtful and constructive feedback. Below, we address the key concerns and suggestions raised.
Addressing the Missing Citation of Prior Work
We appreciate the pointer to the work by Balestriero et al. (ICML 2024):
Balestriero, R., Cosentino, R. and Shekkizhar, S., 2023. Characterizing Large Language Model Geometry Helps Solve Toxicity Detection and Generation. arXiv preprint arXiv:2312.01648.
We acknowledge the significance of the referenced work and appreciate the reviewer for highlighting it. We will incorporate a proper citation and provide a detailed discussion in the Related Work section of the revised manuscript, clearly outlining both the similarities and key differences between their approach and ours.
Scalability to Larger Language Models
Thank you for the valuable suggestion regarding evaluation on larger models. We fully agree with the importance of demonstrating scalability and real-world applicability. In response, we have extended our evaluation to include several larger LLMs. The results below highlight the effectiveness of our method across scales:
| Model | Toxicity (Before) | PPL (Before) | Toxicity (After) | PPL (After) | ↓ Toxicity Reduction | ↓ PPL Change |
|---|---|---|---|---|---|---|
| LLaMA-70B | 11.20% | 4.49 | 5.50% | 4.70 | 50.88% | -4.68% |
| Falcon-30B | 11.10% | 7.29 | 4.39% | 8.37 | 60.43% | -14.81% |
| LLaMA-13B | 9.99% | 5.69 | 3.89% | 7.83 | 61.06% | -37.61% |
| Mixtral 8x7B | 10.85% | 4.93 | 4.38% | 4.99 | 59.64% | -1.22% |
These results confirm that our method maintains strong performance even when scaled to more powerful and practically relevant LLMs.
We are deeply committed to building safe and interpretable AI systems. EigenShift is a meaningful step in that direction: lightweight, transparent, and generalizable. We appreciate the reviewer’s thoughtful questions and hope these clarifications demonstrate the rigor, novelty, and practicality of our approach.
We respectfully request reconsideration of our score in light of these improvements and experiments.
Sincerely,
Authors
Dear Reviewer mLbZ,
As the discussion period is closing soon, we wanted to kindly ask for your acknowledgment of our responses. We have invested significant efforts into understanding and addressing each of your feedback, including running additional experiments to bring more clarity. If our response has addressed your feedback, we would appreciate it if you could revise your ratings accordingly.
Thank you again for your thoughtful review and for helping us improve our work.
Dear Reviewer mLbZ,
As the discussion phase ends today, we hope our responses and additional experiments have addressed your feedback. If our clarifications align with your expectations, we would be glad for this to be reflected in your updated ratings.
Thank you again for your thoughtful review and for helping us improve our work.
Dear authors,
Thank you for your rebuttal, I have no further questions at this point, and I updated my score accordingly.
This paper investigates toxicity mitigation in large language models by questioning the reliability of neuron-level interventions and proposing a novel approach called EigenShift. The authors demonstrate that individual neurons are unreliable toxicity indicators due to their stochastic nature, while layer-wise representations provide more robust signals. They introduce EigenShift, which uses eigen-decomposition of the final output layer (lm_head) to identify and selectively suppress "generation experts" responsible for toxic outputs while preserving "detection experts." The method requires no additional training and is evaluated on Jigsaw and ToxiCN datasets, showing superior performance with a novel TPH (Toxicity-Perplexity Harmonic) score.
Strengths and Weaknesses
I find the distinction between "detection experts" and "generation experts" conceptually interesting, and it addresses a key limitation in prior work that conflates these two roles. The eigen-decomposition approach provides a principled way to target generation-specific components. The authors also conduct a systematic comparison between neuron-level and layer-wise representations across multiple models (BERT, BART, LLaMA, Mistral) and datasets (English Jigsaw, Chinese ToxiCN). I also appreciate that the authors test their methods on both English and Chinese datasets, which strengthens the generalizability claims.
The major concern that I have is how such decomposition methods would influence the performance of large language models. Without extensive evaluation on this, it's hard to conclude how those methods would be useful for the community. Furthermore, while the SVD decomposition is mathematically sound, the paper lacks rigorous theoretical justification for why eigen-directions should correspond to semantic concepts like toxicity. The hypothesis that eigenvectors represent "semantic choices" is largely empirical and could benefit from stronger theoretical grounding. While the paper compares against neuron-level methods (Det-0, DAMP, AURA), it lacks comparison with other structural approaches or more recent safety techniques like constitutional AI or preference learning methods.
Questions
See above
Limitations
yes
Final Justification
The authors provide some evidence for how such detox methods would influence the performance of LLMs, but further experiments are needed.
Paper Formatting Concerns
No
We sincerely thank the reviewer for their positive assessment of our work's core contributions, including:
- The novel distinction between "detection experts" and "generation experts"
- Our principled eigen-decomposition approach
- The systematic cross-lingual evaluation
We appreciate the thoughtful concerns raised and address them below.
W1: The major concern that I have is how such decomposition methods would influence the performance of large language models. Without extensive evaluation on this, it's hard to conclude how those methods would be useful for the community.
[W1] Response:
We believe our paper provides an extensive and direct evaluation of the performance impact:
- Direct Performance Measurement: We use perplexity, the standard metric for language model fluency, to measure performance impact. As stated in our Evaluation Setup, perplexity is computed on a fixed snapshot of the English Wikipedia corpus, a widely used benchmark for this purpose.
- Comprehensive Experimental Results: Table 2 presents a detailed, side-by-side comparison of performance trade-offs between EigenShift and three other baselines across five different LLMs. This clearly illustrates changes in perplexity versus toxicity reduction, the most direct evaluation for our goal.
- Negligible Impact of SVD Itself: To isolate the impact of the decomposition from the intervention, we analyzed the reconstruction error from SVD. As shown in Appendix C.4 and Table 5, the Frobenius reconstruction loss is negligible (e.g., 8.00×10⁻⁵ for LLaMA-7B), confirming that the decomposition does not measurably affect perplexity.
- Qualitative Performance: In addition to quantitative results, Table 4 provides qualitative examples showing that while baseline methods lead to "incoherent generation" or "catastrophic forgetting", our approach preserves fluency and semantic intent, replacing toxic terms with neutral alternatives.
We believe these four points (perplexity benchmarking, cross-model comparisons, reconstruction analysis, and qualitative insights) together demonstrate a thorough evaluation of performance impact and the practical utility of EigenShift.
W2: While the SVD decomposition is mathematically sound, the paper lacks rigorous theoretical justification for why eigen-directions should correspond to semantic concepts like toxicity.
[W2] Response:
We agree that a formal theoretical connection between eigenvectors and semantic concepts is a challenging open research problem. Our contribution lies in proposing a novel and interpretable hypothesis, supported by robust empirical evidence across multiple models and languages.
- A Principled Hypothesis: Our hypothesis is grounded in model architecture: the final linear layer (lm_head) serves as a semantic decision bottleneck. The eigenvectors of this transformation are hypothesized to correspond to principal axes of semantic choice during generation.
- Validation Through Rigorous Experimentation: The core of our paper is an empirical validation of this hypothesis. We demonstrate that (a) identifying eigen-directions via activation differences isolates toxicity signals (see Equation 1), and (b) selective damping of these directions reduces toxic generation while preserving fluency (Table 2). This empirical validation is a key contribution, and we hope it lays the groundwork for future theoretical investigations.
- Interpretability Justification: LLMs are widely regarded as black-box systems. Our work takes a significant step toward interpreting their internal decision mechanisms, bridging the gap between structure and semantics.
If these clarifications address the concerns and misalignments in understanding, we kindly ask you to consider revisiting your score.
We remain deeply committed to building safe and interpretable AI systems. EigenShift represents a meaningful step in that direction: it is lightweight, transparent, and broadly generalizable. We sincerely appreciate the reviewer’s thoughtful questions and engagement with our work. We hope that the clarifications and additional experiments presented here underscore the rigor, novelty, and practicality of our approach.
We respectfully request a reconsideration of our score in light of these responses.
Sincerely,
The Authors
Thank you for the response. I believe that relying solely on perplexity is insufficient for evaluating LLMs, especially given the wide range of established public benchmarks available.
Dear Reviewer j4b1,
We appreciate your feedback. However, in line with prior work on toxicity mitigation in LLMs [1,2,3,4], we adopted perplexity and toxicity shift as our primary evaluation metrics. We kept our evaluation to these metrics to ensure fair comparison. While we agree that broader evaluation can be valuable, we note that no specific alternative metrics were suggested in the review. If there are particular metrics you believe would be relevant to assess the performance of our method, we would appreciate your recommendations. Given that our method is training-free, we are open to incorporating additional evaluations where applicable.
References
[1] Suau, X., Delobelle, P., Metcalf, K., Joulin, A., Apostoloff, N., Zappella, L. and Rodríguez, P., 2024. Whispering experts: Neural interventions for toxicity mitigation in language models. arXiv preprint arXiv:2407.12824.
[2] Geva, M., Caciularu, A., Wang, K.R. and Goldberg, Y., 2022. Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space. arXiv preprint arXiv:2203.14680.
[3] Wang, Y. and Demberg, V., 2024. RSA-Control: A Pragmatics-Grounded Lightweight Controllable Text Generation Framework. arXiv preprint arXiv:2410.19109.
[4] Lee, A., Bai, X., Pres, I., Wattenberg, M., Kummerfeld, J.K. and Mihalcea, R., 2024. A mechanistic understanding of alignment algorithms: A case study on dpo and toxicity. arXiv preprint arXiv:2401.01967.
I think that at least additional general language understanding benchmarks should be evaluated, such as MMLU. It's unclear to me why previous works do not adopt these more realistic evaluations than perplexity (it could be that those benchmarks are not widely adopted at the moment).
Dear reviewer,
Thank you for your response. We would like to highlight that other reviewers raised the same point in their initial reviews, to which we responded with detailed additional benchmarking experiments (see responses to Reviewers 6xMk, W9uw, and mLbZ). We have included evaluations across three representative MMLU categories:
MMLU - Algebra (Mathematical Reasoning)
| Base Model | Before Intervention | After Intervention |
|---|---|---|
| LLaMA | 35 | 34 |
| Mistral | 29 | 32 |
| GPT-2 | 22 | 24 |
| Falcon | 27 | 25 |
| MPT | 22 | 21 |
MMLU - Logical Fallacies (Logical Reasoning)
| Base Model | Before Intervention | After Intervention |
|---|---|---|
| LLaMA | 46 | 46 |
| Mistral | 73 | 73 |
| GPT-2 | 19 | 18 |
| Falcon | 31 | 30 |
| MPT | 32 | 31 |
MMLU - U.S. Foreign Policy (Factual Knowledge)
| Base Model | Before Intervention | After Intervention |
|---|---|---|
| LLaMA | 59 | 58 |
| Mistral | 82 | 81 |
| GPT-2 | 28 | 29 |
| Falcon | 32 | 30 |
| MPT | 31 | 29 |
These results confirm that EigenShift’s intervention maintains the reasoning, factual knowledge, and problem-solving skills of the base models. Therefore, while there is a moderate perplexity increase, it is a carefully engineered and justifiable trade-off that avoids the catastrophic degradation seen in prior work.
Thanks for the response. It clears some of my concerns, thus I have updated my score to 4.
The authors introduce an approach "EigenShift" that decomposes the weights of the last layer (that maps to the vocabulary) with an SVD, and adjusts the weights along the principal components to mitigate toxic text generation. To do this, they measure the activations from the last transformer block, and pick components where the expected difference in activation between toxic and non-toxic texts is largest. The authors also evaluate neuron-level interventions for toxicity mitigation, and show that their layer-wise approach is significantly more effective. The method is evaluated both on encoder-only models like BERT and decoder only models like Llama, and shows improvements across baselines across the board.
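For concreteness, below is a minimal sketch of the decomposition-and-damping idea described in this summary. All names (eigenshift_like_edit, H_toxic, H_clean), the default hyperparameters, and the exact selection rule are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch: SVD the lm_head weight, rank components by the
# toxic-vs-non-toxic activation gap, damp the selected singular values,
# and rebuild the weight matrix. Illustrative only.
import torch

@torch.no_grad()
def eigenshift_like_edit(W, H_toxic, H_clean, top_k=41, alpha=0.9):
    """W: lm_head weight of shape (vocab, hidden).
    H_toxic / H_clean: final-block activations of shape (n, hidden)."""
    # 1) Decompose the output projection: W = U diag(S) Vh
    U, S, Vh = torch.linalg.svd(W.float(), full_matrices=False)

    # 2) Project activations onto the right singular directions and rank
    #    components by the mean gap between toxic and non-toxic text.
    gap = (H_toxic.float() @ Vh.T).mean(0) - (H_clean.float() @ Vh.T).mean(0)
    idx = gap.abs().argsort(descending=True)[:top_k]

    # 3) Damp only the selected singular values (alpha < 1 shrinks them).
    S_damped = S.clone()
    S_damped[idx] *= alpha

    # 4) Rebuild the edited weight; no gradients or retraining involved.
    return ((U * S_damped) @ Vh).to(W.dtype)  # U diag(S_damped) Vh
```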
Strengths and Weaknesses
Strengths:
- The study of neuron-level interventions, and demonstrating their consistent ineffectiveness is very interesting. It highlights that representations for semantic concepts in LLMs tend to be spread across layers as opposed to individual neurons (it would be nice if the authors could contrast their results with the findings in https://arxiv.org/abs/1704.01444, where they find ~90%+ accuracy from a single neuron intervention).
- The layer-wise method is simple to understand/implement and provides significant improvements over the baselines (though not enough to be considered a viable method yet, see weaknesses below).
Weaknesses:
- A 58% increase in perplexity is substantial -- while better than the neuron-level intervention baselines, it would certainly raise eyebrows if used in a production system. A 58% increase corresponds to the perplexity of a model several times smaller in size (e.g., you're using Llama 7B but getting the performance of a 2B/3B model, which sounds not so ideal).
- The AUROC for the layer-wise method EigenShift proposed in this work is still ~60%. While much higher than the baselines, which perform nearly at the level of a random classifier, it is still not a very high accuracy for a classifier.
Questions
Questions:
- How does the computational cost of "finetuning" or "RLHF" compare with your method? It sounds like you still need to gather activations across many thousands of inputs.
- It would be interesting to repeat the experiment across other semantic concepts beyond "toxic", and see if the final layer is still effective at capturing those? (Maybe it's much better for certain concepts)
- Would it be possible to add a histogram of the input activations to the final layer for both toxic and non-toxic text, at least for the key principal components where you decide to intervene? It would make for an interesting visualization.
- What would happen if you ensembled the interventions across multiple layers? It would be interesting to look at this for 2-3 layers at least, maybe this gives significant improvements in performance?
- I think it would make the paper stronger to position this as work on interpreting LLMs, rather than an approach for toxicity mitigations. The low-ish performance would definitely be a lot less concerning in that case. It would also be interesting to search across the layers and remark on which layers encode what concepts, and so on.
Limitations
yes
Final Justification
I think the two weaknesses I highlight above have not been adequately addressed. I would like to stay with my rating for now.
Paper Formatting Concerns
None
We sincerely thank the reviewer for their thoughtful review and constructive feedback. We appreciate the time you have taken to analyze our work and are glad you found our contributions valuable. We address your concerns and questions below.
Rebuttal to Weaknesses
W1: A 58% increase in perplexity is substantial.
We thank the reviewer for highlighting this important point. We agree that a 58% increase in perplexity (from 6.23 to 9.84 for LLaMA-7B) is significant and merits close scrutiny.
However, it is crucial to view this increase in context. Our method, EigenShift, prioritizes a careful balance between toxicity reduction and linguistic fluency. Unlike other zero-training intervention approaches that severely degrade language quality, our method achieves this trade-off with minimal sacrifice to coherence. As shown in Table 2:
- Det-0 increases perplexity from 6.23 to 43,517.97 — a staggering 700,000% increase.
- Damp increases it to 741.65 — an 11,724% rise.
- Aura, the best-performing baseline prior to our work, raises perplexity to 19.30, a 210% increase.
In contrast, EigenShift yields a final perplexity of 9.84, which, although increased by 58%, represents a massive preservation of fluency relative to existing methods. To quantify this trade-off, we introduced the Toxicity-Perplexity Harmonic (TPH) Score, where EigenShift attains a leading value of 60.37%, demonstrating its superior balance of toxicity mitigation and coherence.
To further address concerns regarding potential degradation of the model’s core abilities, we evaluate our intervened models on widely accepted downstream benchmarks using the MMLU (Massive Multitask Language Understanding) suite across diverse reasoning domains. Results below indicate that model capabilities remain largely intact:
MMLU - Algebra (Mathematical Reasoning)
| Base Model | Before Intervention | After Intervention |
|---|---|---|
| LLaMA | 35 | 34 |
| Mistral | 29 | 32 |
| GPT-2 | 22 | 24 |
| Falcon | 27 | 25 |
| MPT | 22 | 21 |
MMLU - Logical Fallacies (Logical Reasoning)
| Base Model | Before Intervention | After Intervention |
|---|---|---|
| LLaMA | 46 | 46 |
| Mistral | 73 | 73 |
| GPT-2 | 19 | 18 |
| Falcon | 31 | 30 |
| MPT | 32 | 31 |
MMLU - U.S. Foreign Policy (Factual Knowledge)
| Base Model | Before Intervention | After Intervention |
|---|---|---|
| LLaMA | 59 | 58 |
| Mistral | 82 | 81 |
| GPT-2 | 28 | 29 |
| Falcon | 32 | 30 |
| MPT | 31 | 29 |
These results confirm that EigenShift’s intervention maintains the reasoning, factual knowledge, and problem-solving skills of the base models. Therefore, while there is a moderate perplexity increase, it is a carefully engineered and justifiable trade-off that avoids the catastrophic degradation seen in prior work.
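For transparency about what such an MMLU check involves, here is a self-contained sketch of a likelihood-based MMLU accuracy evaluation. The dataset identifier follows the public cais/mmlu release; the checkpoint, prompt format, and scoring details are illustrative assumptions, not necessarily the exact protocol behind the numbers above.

```python
# Hedged sketch: score each MMLU answer choice by log-likelihood and count
# exact matches. Swap model_name for the intervened checkpoint.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

data = load_dataset("cais/mmlu", "logical_fallacies", split="test")

@torch.no_grad()
def loglik(text):
    ids = tok(text, return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss        # mean next-token NLL
    return -loss.item() * (ids.shape[1] - 1)  # total log-likelihood

correct = 0
for ex in data:
    scores = [loglik(ex["question"] + " " + c) for c in ex["choices"]]
    correct += int(max(range(len(scores)), key=scores.__getitem__) == ex["answer"])
print(f"Accuracy: {correct / len(data):.1%}")
```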
W2: The AUROC for the layer-wise method...is still ~60%...not a very high accuracy for a classifier.
We appreciate the reviewer's observation. We wish to clarify that the goal of our layer-wise AUROC analysis was not to build a state-of-the-art classifier, but to address RQ1 and RQ2 by proving that layer-wise representations are a more stable source of signal for toxicity than individual neurons. Our results in Table 1 confirm this, showing a +15.84 percentage point improvement in AUROC on the Jigsaw dataset over neuron-based methods.
The main contribution, EigenShift, uses this signal for a targeted intervention, not classification. Its success is therefore measured by the final generation quality and toxicity reduction, not the intermediate AUROC score. The table below, extracted from our LLaMA-2 results, highlights the final performance where EigenShift excels:
| Method | Toxicity Reduction | Perplexity Change | TPH Score |
|---|---|---|---|
| EigenShift (Ours) | 57.47% | +58% | 60.37% |
| Aura (SOTA baseline) | 67.38% | +210% | 43.73% |
This demonstrates that even a moderate signal, when used in a principled way, can lead to state-of-the-art intervention outcomes.
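For readers unfamiliar with the TPH score, one plausible formulation (our assumption for illustration; the paper gives the authoritative definition) is a harmonic mean of toxicity reduction and a fluency-retention term:

$$\mathrm{TPH} = \frac{2\,R\,F}{R + F}, \qquad F = \frac{\mathrm{PPL}_{\text{before}}}{\mathrm{PPL}_{\text{after}}},$$

where $R$ is the relative toxicity reduction. Plugging in the table above, $R = 0.5747$ with $F \approx 1/1.58$ gives roughly 0.60, and $R = 0.6738$ with $F \approx 1/3.10$ gives roughly 0.44, approximately matching the reported 60.37% and 43.73%.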
Answers to Questions
Q1: How does the computational cost of "finetuning" or "RLHF" compare with your method?
The computational cost of EigenShift is orders of magnitude lower than finetuning or RLHF.
- EigenShift Cost:
- Involves a single forward pass over a sample dataset to collect activations of shape n × d, where n is the number of tokens and d is the hidden dimension.
- A one-time Singular Value Decomposition (SVD) of the final-layer weight matrix W of shape V × d, with cost O(V·d²), which is computationally inexpensive on modern hardware (for LLaMA-7B, V ≈ 32,000 and d = 4,096).
- Replace the weight matrix with the modified one, W′ = U Σ′ Vᵀ (for 7B models, ~30 seconds); a brief timing sketch follows this answer.
This intervention takes less than 2 minutes in practice, but more importantly, has tractable polynomial complexity. Our implementation is open-sourced to ensure faithful reproducibility, underlining the practicality and superiority of our approach.
- Finetuning/RLHF Cost:
- Requires full backpropagation through a multi-billion-parameter model, with cost on the order of O(E · P), where E is the number of epochs and P is the number of model parameters.
- RLHF further introduces additional costs for human preference data collection and policy training.
Conclusion: EigenShift is a lightweight, efficient model-editing method that avoids retraining and is highly accessible, unlike conventional heavy approaches like RLHF and finetuning.
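As a sanity check on the cost claim above, here is a tiny, self-contained timing sketch for the one-time SVD (the 32,000 × 4,096 shape assumes a LLaMA-7B-sized lm_head and a random stand-in matrix; wall-clock time depends on hardware):

```python
# Illustrative timing of the one-off SVD step; the matrix is a random
# stand-in with assumed LLaMA-7B lm_head dimensions, not real weights.
import time
import torch

W = torch.randn(32000, 4096)  # vocab_size x hidden_dim stand-in
start = time.time()
U, S, Vh = torch.linalg.svd(W, full_matrices=False)
print(f"SVD of {tuple(W.shape)} took {time.time() - start:.1f}s")
```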
Q2: It would be interesting to repeat the experiment across other semantic concepts beyond "toxic."
We agree that this is an exciting and promising direction. As noted in our "Limitations and Future Work" section, EigenShift is a generalizable framework. It can be extended to concepts like hate speech, vulgarity, cultural references, and emotional tone, by identifying and steering the corresponding eigen-directions.
We selected toxicity for this study as it presents a high-impact, well-bounded, and measurable case to demonstrate our method's effectiveness.
Q3: Would it be possible to add a histogram of the input activations to the final layer for both toxic and non-toxic text?
Absolutely, we appreciate this suggestion. We will include this visualization in the appendix of the camera-ready version.
Q4: What would happen if you ensembled the interventions across multiple layers?
This is a valuable and insightful question. In fact, we discuss this in our Future Work section. While ensembling interventions across multiple layers could potentially capture richer semantics, it is currently computationally expensive for large models with 30+ transformer layers (e.g., LLaMA-7B).
Our current design intentionally focuses on the final linear layer (lm_head) as it offers a precise, interpretable, and minimally invasive point of control. Future work will explore whether multi-layer decompositions can provide additional gains without compromising fluency.
Q5: I think it would make the paper stronger to position this as work on interpreting LLMs, rather than an approach for toxicity mitigation.
We wholeheartedly agree with this perspective and appreciate the suggestion. In fact, our central thesis is that interpretability enables principled mitigation.
- Our title, "Redefining Experts: Interpretable Decomposition of Language Models for Toxicity Mitigation", reflects this dual focus.
- RQ3 directly addresses interpretability: "Can we uncover interpretable components...aiming to make black box models more understandable?"
- Toxicity mitigation serves as a practical validation of the decomposition technique. Without a high-impact application, the interpretability would remain abstract.
Thus, our method represents a novel interpretability-driven model editing technique that is not only transparent but also effective in real-world interventions. We hope this framing clarifies our contribution and vision.
We are deeply committed to building safe and interpretable AI systems. EigenShift is a meaningful step in that direction: lightweight, transparent, and generalizable. We appreciate the reviewer’s thoughtful questions, and hope these clarifications demonstrate the rigor, novelty, and practicality of our approach.
We respectfully request reconsideration of our score in light of these improvements and experiments.
Sincerely,
Authors
Dear Reviewer W9uw,
As the discussion period is closing soon, we wanted to kindly ask for your acknowledgment of our responses. We have invested significant efforts into understanding and addressing each of your feedback, including running additional experiments to bring more clarity. If our response has addressed your feedback, we would appreciate it if you could revise your ratings accordingly.
Thank you again for your thoughtful review and for helping us improve our work.
Dear Reviewer W9uw,
As the discussion phase ends today, we hope our responses and additional experiments have addressed your feedback. If our clarifications align with your expectations, we would be glad for this to be reflected in your updated ratings.
Thank you again for your thoughtful review and for helping us improve our work.
"We wish to clarify that the goal of our layer-wise AUROC analysis was not to build a state-of-the-art classifier"
I understand the goal is not to build a SoTA classifier, but for any classifier, claiming it works well at an AUROC of 60% is a fairly weak claim even if it works slightly better than some weaker baselines.
Thank you for the response, it clarifies some details. However, I think my primary criticisms still hold:
- The degradation in model quality seems to be far too substantial to use this method for model editing in the real-world. While other methods are more computationally expensive or require human annotation, arguably, they are still more practical because of the preservation of model capabilities.
- I think a large section in the paper is dedicated to showing that the classification abilities improve, and hence why these features are useful for interpreting the model. I think the classification performance is not substantial for us to draw strong conclusions, and the authors suggest the focus should be on the editing evaluation (it is unclear why the classification is then such a big part of the narrative of the paper). I have concerns regarding the editing performance as well (see point 1).
In my opinion, while mitigating toxicity is a high-impact application, the method is not ready for being used in practice. In such a case, focusing on the interpretability component and examining various recent models is far more interesting than demonstrating the editing ability on toxicity that also significantly degrades the model performance.
Dear Reviewer,
We thank you for taking your time to go deep into our paper’s findings, and we sincerely appreciate you raising points that can definitely raise the standard of the paper (including the additional experiments we conducted in our discussion so far). Regarding your current feedback:
- We agree that the current method is not positioned for immediate deployment in products, in line with prior state-of-the-art works in this area, which also focused on improving methodology rather than immediate productization. Our primary contribution is methodological: reducing the perplexity increase from roughly 700,000% or 210% in earlier approaches (see previous response) to only 58% while retaining a training-free pipeline. Respectfully, we emphasise this as a 'substantial', not 'slight', improvement in the trade-off between toxicity mitigation and fluency, especially compared to methods requiring extensive retraining or annotation. Moreover, relative to such computationally heavy and time-consuming approaches, our training-free method offers a different trade-off that may be preferable in scenarios where efficiency is critical.
- A significant part of the paper first assesses the efficacy of layer-based vs. neuron-based interventions, and the classification experiments are included to support this methodological choice. Although our main narrative focuses on toxicity mitigation, the underlying theory is built on top of these early classification findings. This justification is explicitly discussed in Lines 79-90 (cf. Section 2) of the paper, and the classification component should be viewed as foundational analysis rather than the primary performance claim.
We also thank you for constructively highlighting the interest of our study, as our core aim was to focus on how mitigation can be achieved both (a) efficiently and (b) effectively.
Thank you again for your constructive feedback.
This paper introduces EigenShift, a training-free intervention technique for toxicity mitigation in large language models. Rather than operating at the neuron-level, EigenShift performs a singular value decomposition of the model's final linear layer, identifying semantic directions aligned with toxic content then damping its singular values. The method requires no gradient updates for the model parameters, adds only two tunable hyperparameters (damping factor and for the top‑ eigenvectors). In the experiments, the authors demonstrate the improvements of EigenShift over neuron-level baselines with Jigsaw and ToxiCN dataset.
Strengths and Weaknesses
Strengths
- The proposed method does not require additional fine-tuning, making it computationally efficient and easy to adopt in practice.
- Compared to prior neuron-level interventions, the eigen-decomposition approach shows greater stability and consistency across contexts and languages.
- The proposed method frames toxicity control as an eigen-decomposition, grounding the intervention in linear-algebraic structure, which provides clearer interpretability than ad-hoc neuron masking.
- Despite being simple to implement, the proposed method achieves superior performance than the baseline algorithms.
Weaknesses
- The method introduces two tunable hyperparameters (damping factor α and top-k selection), whereas most baselines tune only a single damping coefficient. Selecting two hyperparameters per model/dataset possibly limits practicality.
- It remains unclear whether the experimental comparisons fairly highlight the advantages of the proposed method over existing baselines (e.g., cost for hyperparameter selection, only using automatic evaluation metrics).
- All conclusions rely on automated toxicity evaluation and perplexity. While automatic metrics are used extensively, it is unclear whether the gains translate to real user perception.
- The potential downsides of toxicity suppression (e.g., performance degradation on other tasks or semantic shifts) are not thoroughly analyzed.
- While the method is described as training-free, this point is emphasized only in the conclusion section. It would be helpful to highlight and justify this benefit more explicitly throughout the main text.
- The paper does not compare EigenShift to training-based toxicity mitigation approaches, leaving unclear how much performance is sacrificed for the benefit of being training-free.
Questions
- Hyperparameter Sensitivity and Practicality: Results in Figure 4 and Table 6 suggest sensitivity to both the damping factor and the top-k selection. Are these parameters highly model/dataset-specific? If so, does this limit real-world deployment without per-case tuning? Can you show a comparison of optimal hyperparameter search results for a wider variety of models and datasets?
- Hyperparameter Selection Criterion: What criterion was used to select optimal hyperparameters for each method? Was it based on maximizing TPH or some other combination of toxicity and perplexity?
- Comparison with Training-Based Methods: What are the results when comparing performance experimentally with training-based methods? It doesn't necessarily outperform training-based methods, but this comparison would help demonstrate the advantages of training-free methods.
- Performance with Limited Data: How does the method perform under low-resource settings? For example, would its advantage grow when the amount of training data is limited compared to training-based methods?
- Impact on Other Downstream Tasks: Does the proposed intervention negatively impact the model’s performance on non-toxic generations or other downstream tasks? Any empirical evidence would be valuable.
- Toxicity-Perplexity Frontier Visualization: Instead of showing only TPH scores, could the authors present a toxicity-perplexity trade-off curve for each method with various hyperparameter settings? This would better illustrate the behavior of different intervention strategies under varying conditions.
- Need for Human Evaluation: While automatic metrics like perplexity and classifier-based toxicity scores are used, human evaluation (for toxicity and naturalness) would greatly strengthen the conclusions.
- Comparison with Final-Layer Neuron-Based Methods: Given the claim that the final LM head is a critical semantic decision point, have the authors compared EigenShift with neuron-level interventions applied only to the last layer?
- Clarification on Neuron-Based Baselines in Table 1: What specific neuron-based method was used for the baseline comparisons? Were the hyperparameters of all baselines sufficiently tuned?
- Generalizability to Other Tasks: Can the proposed method be extended to other controllable generation tasks beyond toxicity (e.g., sentiment, style)? Is it also extend to the multi-objective LLM control settings?
Limitations
Yes, the authors adequately addressed the limitations and potential negative societal impact of their work.
Final Justification
I will maintain my score (4: Borderline accept). The author's response addressed most questions, and I'm still in favor of accepting this paper. However, concerns remain regarding the experimental aspects of the paper, making it difficult to raise the rating.
Paper Formatting Concerns
There are no paper formatting concerns for this submission.
We thank the reviewer for their detailed feedback and insightful questions. We are encouraged that the reviewer found our work to be "technically solid" with good quality, significance, and originality. We address each feedback and question below.
[W1/Q1] We thank the reviewer for this practical concern. EigenShift has two hyperparameters because, unlike baselines that apply a uniform damping factor, our method first identifies and ranks novel 'generation experts' based on their directional influence (Δi). There is no existing metric to pre-determine the optimal number of these experts to target. Therefore, the Top-k parameter is a necessary and natural mechanism for selecting the most influential directions for intervention. The damping factor, α, then controls the intensity of this intervention.
Our ablation study in Appendix C.5 (Table 6, Figure 4) was included to provide clear guidance and avoid the need for a complex search. Our findings show that setting the damping factor α=0.9 is a relatively stable choice that consistently preserves model fluency (perplexity) across various Top-k values, as shown in the Table below. This simplifies tuning by effectively reducing the search to a single dimension. With α fixed, Top-k becomes a direct and interpretable control knob for the user to manage the toxicity-fluency trade-off.
The table below is a snapshot from Table 6 in the paper (c.f. Appendix C.5 for the complete table).
| Alpha | Top_k | Toxicity-Rate | PPL | Toxicity - Delta | PPL - Delta | TPH score |
|---|---|---|---|---|---|---|
| 0.9 | 1024 | 7.28% | 6.78 | 16.64% | 0.00% | 28.53% |
| 0.9 | 5 | 9.28% | 6.23 | 15.64% | 0.00% | 28.53% |
| 0.9 | 41 | 9.37% | 6.23 | 15.79% | 0.00% | 27.27% |
| 0.9 | 410 | 9.38% | 6.23 | 15.69% | 0.00% | 27.12% |
| 0.9 | 1024 | 8.97% | 6.23 | 19.39% | 0.00% | 32.48% |
[W2] We ensured a fair comparison by re-running the baselines from their original source. We used established configurations and standard (consistent) practices for these baselines. The hyperparameter selection for all methods, including our own, was guided by optimizing the trade-off between toxicity and perplexity, which we unified with our proposed TPH score (which is also consistent throughout the work). The use of automatic metrics is addressed in W3.
[W3 / Q7] We utilized widely accepted and rigorous automated benchmarks from the previous literature [1, 2, 3]:
RealToxicityPrompts for toxicity and the Wikipedia corpus for perplexity. To bridge the gap with human perception, we included a qualitative case study in Table 4 (cf. Page 8).
This analysis demonstrates a real-world example where EigenShift successfully steers a toxic generation (e.g., "...who allegedly rapd...") to a neutral alternative ("...involved in the assault...") while preserving the original intent and coherence, which other methods fail to achieve. This provides qualitative evidence that our method's improvements are meaningful.
The RoBERTa-based classification model (s-nlp/roberta_toxicity_classifier) employed in our work is a standard toxicity classifier, widely used in prior studies [1]; on the Jigsaw dataset it reports a toxicity rate of 41.2% with an inter-annotator agreement (IAA) of κ = 0.66. The following table provides deeper insight into how human perspectives were integrated into the modeling process and highlights the rationale behind selecting s-nlp/roberta_toxicity_classifier as our primary toxicity classification model (a brief usage sketch follows the table).
| Model | Training data | Toxicity [%] | IAA [κ] |
|---|---|---|---|
| Perspective API | Jigsaw | 55.7 | — |
| s-nlp/roberta_toxicity_classifier | Jigsaw (2018, 2019, 2020) | 41.2 | 0.66 |
| MilaNLProc/bert-base-uncased-ear-mlma | MLMA | 87.8 | 0.12 |
| cardiffnlp/twitter-roberta-base-hate-latest | Collection of 13 datasets | 17.1 | 0.15 |
| Narrativaai/deberta-v3-small-finetuned-hate_speech18 | hate_speech18 | 18.6 | 0.13 |
| christinacdl/olid_offensive_bert_multilingual | OLID | 75.6 | 0.47 |
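For completeness, a minimal usage sketch of the chosen classifier is given below. The example texts, label names, and aggregation are illustrative assumptions; the paper's exact scoring protocol may differ.

```python
# Hedged sketch: score generations with s-nlp/roberta_toxicity_classifier
# and report the fraction flagged as toxic. Label names are assumed to be
# "neutral"/"toxic"; verify against the actual checkpoint's config.
from transformers import pipeline

tox_clf = pipeline("text-classification", model="s-nlp/roberta_toxicity_classifier")

generations = ["placeholder model output 1", "placeholder model output 2"]
preds = tox_clf(generations)
toxicity_rate = sum(p["label"] == "toxic" for p in preds) / len(preds)
print(f"Toxicity rate: {toxicity_rate:.1%}")
```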
[W4/Q5] We acknowledge the concern regarding the potential for catastrophic forgetting due to the suppression of generation experts, and the subsequent impact this might have on the model’s performance on downstream tasks. However, since our intervention selectively dampens only the generation experts, our findings show that the core semantic and reasoning capabilities of the model remain largely unaffected.
To validate this, based on your suggestion, we conducted further experiments to assess the impact of our intervention on general-purpose capabilities. Specifically, we evaluated the intervened models on the widely-used MMLU (Massive Multitask Language Understanding) benchmark, which spans a diverse set of reasoning domains, and report the accuracy metric below. The results, presented below, demonstrate that the model retains its core abilities with minimal degradation.
MMLU - Algebra (Mathematical Reasoning)
| Base Model | Before Intervention | After Intervention |
|---|---|---|
| LLaMA | 35 | 34 |
| Mistral | 29 | 32 |
| GPT-2 | 22 | 24 |
| Falcon | 27 | 25 |
| MPT | 22 | 21 |
MMLU - Logical Fallacies (Logical Reasoning)
| Base Model | Before Intervention | After Intervention |
|---|---|---|
| LLaMA | 46 | 46 |
| Mistral | 73 | 73 |
| GPT-2 | 19 | 18 |
| Falcon | 31 | 30 |
| MPT | 32 | 31 |
MMLU - U.S. Foreign Policy (Factual Knowledge)
| Base Model | Before Intervention | After Intervention |
|---|---|---|
| LLaMA | 59 | 58 |
| Mistral | 82 | 81 |
| GPT-2 | 28 | 29 |
| Falcon | 32 | 30 |
| MPT | 31 | 29 |
These results confirm that EigenShift’s intervention maintains the reasoning, factual knowledge, and problem-solving skills of the base models. Therefore, while there is a moderate perplexity increase, it is a carefully engineered and justifiable trade-off that avoids the catastrophic degradation seen in prior work.
[W5] We appreciate the reviewer’s suggestion. While we do state in the abstract that “our method requires no additional training or fine-tuning, incurs minimal computational cost, and is grounded in rigorous theoretical analysis,” we agree that it would be beneficial to reiterate this more clearly in the main text. We will ensure that this point is explicitly addressed in the camera-ready version.
[W6 / Q3 / Q4] The primary objective of this work is to interpret deep models, which are often considered black boxes, through an intuitive framework. To the best of our knowledge, this is the first approach that leverages such simplicity for interpretability while reducing toxicity without requiring any additional training. Our method incurs minimal computational overhead, which is comparable to a single forward inference pass, and mitigates toxicity with negligible catastrophic forgetting. This distinguishes our approach in terms of both computational efficiency and interpretability.
[Q2] We introduced the Toxicity-Perplexity Harmonic (TPH) Score precisely to serve as this selection criterion. The hyperparameters for EigenShift reported in Table 2 were selected by identifying the configuration that maximized the TPH score, ensuring a balanced optimization of both toxicity reduction and fluency preservation. For the baselines, we followed the code in their repos and configurations recommended in their respective original publications to ensure a fair comparison.
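A small sketch of this selection criterion is shown below, assuming the harmonic-mean reading of TPH discussed earlier in the thread. The evaluate() hook and the candidate grids are placeholders, not the authors' code.

```python
# Illustrative grid search: choose (alpha, top_k) maximizing a TPH-style score.
def tph(tox_reduction, ppl_ratio):
    # assumed harmonic mean of toxicity reduction and fluency retention
    fluency = 1.0 / ppl_ratio  # ppl_ratio = PPL_after / PPL_before
    return 2 * tox_reduction * fluency / (tox_reduction + fluency)

def pick_hyperparams(evaluate, alphas=(0.5, 0.7, 0.9), top_ks=(5, 41, 410, 1024)):
    best = None
    for alpha in alphas:
        for k in top_ks:
            tox_red, ppl_ratio = evaluate(alpha=alpha, top_k=k)  # placeholder hook
            score = tph(tox_red, ppl_ratio)
            if best is None or score > best[0]:
                best = (score, alpha, k)
    return best  # (tph_score, alpha, top_k)
```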
[Q4] This is a key strength of EigenShift. Our method is training-free and operates directly on the model's pre-existing weights. The data used is solely for identifying the toxic eigen-directions, not for fine-tuning the model's parameters. This process requires a relatively small set of example generations. Therefore, EigenShift is inherently well-suited to low-resource settings where the large, high-quality datasets required for fine-tuning are unavailable. Its advantage over training-based methods would indeed be more pronounced in such scenarios.
[Q6] This is a relevant suggestion. Figure 4 in the appendix already provides a visualization of this trade-off surface by plotting the TPH score against Top-k for different α values. However, we agree that a direct 2D plot of perplexity versus toxicity reduction would make the trade-off frontier even more explicit. We will incorporate such a visualization into the final version of the paper (as we are not allowed to post an image/URL here).
[Q8] Our work is premised on the hypothesis that the final layer is a critical semantic choice point, as it is the decision layer of an LLM. However, instead of intervening on unstable, individual neurons within that layer, we propose a more robust and interpretable approach. We also acknowledge in our future work that exploring how individual layers decompose to provide insights is an important research direction. Currently, this is computationally expensive, as LLMs often have many decoder layers (e.g., typical 7B models have 32 layers), making it relatively complex to analyze in depth. We consider this a valuable area for future research.
[Q9] The neuron-based baseline in Table 1 corresponds to the expert identification method proposed in prior work, which we evaluate to answer RQ1 and RQ2. We compare this fine-grained neuron-level classification with our proposed layer-wise approach, and show that the latter provides a more stable and generalizable signal across examples. We emphasize that this neuron-based analysis is distinct from the intervention baselines (e.g., DetZero, Damp, and Aura) evaluated in Table 2, which focus on modifying model behavior rather than interpretability. For all baselines, we reproduced results using the official implementations or publicly released code from prior literature. Where applicable, we adhered to the default hyperparameter settings used in those works to ensure a fair and consistent comparison.
[Q10] Yes. We state this explicitly as a key advantage and future direction in our Limitations and Future Work section, as well as our discussion of RQ3. The EigenShift framework is designed to be concept-agnostic. By providing examples of any desired semantic concept (e.g., formality, sentiment, specific writing styles), one can identify the corresponding eigen-directions and steer the model's generation accordingly. This transforms the "black-box" LLM into a set of interpretable semantic axes, opening promising avenues for multi-objective control, which we see as an exciting area for future research.
We are deeply committed to building safe and interpretable AI systems. EigenShift is a meaningful step in that direction: lightweight, transparent, and generalizable. We appreciate the reviewer’s thoughtful questions and hope these answers clarify your points.
References:
[1] Suau, X., Delobelle, P., Metcalf, K., Joulin, A., Apostoloff, N., Zappella, L. and Rodríguez, P., 2024. Whispering experts: Neural interventions for toxicity mitigation in language models. arXiv preprint arXiv:2407.12824.
[2] Geva, M., Caciularu, A., Wang, K.R. and Goldberg, Y., 2022. Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space. arXiv preprint arXiv:2203.14680.
[3] Wang, Y. and Demberg, V., 2024. RSA-Control: A Pragmatics-Grounded Lightweight Controllable Text Generation Framework. arXiv preprint arXiv:2410.19109.
Dear Reviewer 6xMk,
As the discussion period is closing soon, we wanted to kindly ask for your acknowledgment of our responses. We have invested significant efforts into understanding and addressing each of your feedback, including running additional experiments to bring more clarity. If our response has addressed your feedback, we would appreciate it if you could revise your ratings accordingly.
Thank you again for your thoughtful review and for helping us improve our work.
Thank you for the detailed response from the authors. Most of my questions were answered through the author's response. I do not have any further questions, and I'm still in favor of accepting this paper. I will also check all the other reviews and responses and decide whether to change the ratings or not.
Dear Reviewer 6xMk,
Thank you again for your earlier feedback and for indicating support for accepting our paper. Since your comment, we’ve had a productive discussion with the other reviewers that includes exchanging clarifications, additional analyses, and perspectives. This has led to a healthy convergence on key points and, in some cases, an increase in ratings.
If you feel that our rebuttal has fully addressed your original feedback and that the discussion with other reviewers has been constructive, we would appreciate it if you could consider updating your rating accordingly before the discussion deadline (in a few hours).
We sincerely appreciate the time you’ve put into your review and the discussion, and we truly enjoyed our engagement in this thread.
As announced by the program chairs, the “Mandatory Acknowledgement” button is to be submitted only when reviewers fulfill all conditions below (conditions in the acknowledgment form):
- read the author rebuttal
- engage in discussions (reviewers must talk to authors, and optionally to other reviewers and AC - ask questions, listen to answers, and respond to authors)
- fill in "Final Justification" text box and update “Rating” accordingly (this can be done upon convergence - reviewer must communicate with authors first)
We sincerely thank Reviewers 6xMk, W9uw, j4b1, and mLbZ for their constructive feedback.
- We are grateful for the recognition of addressing a 'very important problem' by moving beyond the prevalent ad-hoc and costly methods with a cost-effective and interpretable internal approach. (mLbZ)
- We are pleased by the reviewers' acknowledgement that our method is training-free and computationally efficient, yet achieves superior performance, making it highly practical. (6xMk, j4b1)
- Reviewers also recognized the novel contributions on two levels: first, introducing a conceptually interesting framework that separates 'detection' and 'generation' experts to address a key limitation, and second, grounding this framework in a principled, mathematically grounded eigen-decomposition method for superior interpretability. (j4b1, 6xMk)
- The work is supported by a systematic, cross-lingual evaluation across multiple models, which strongly supports its generalizability claims. (6xMk, j4b1)
- We made efforts to answer the reviewers' questions, and reviewers confirmed that our clarifications addressed their questions and expressed support for acceptance.
During the discussion phase, we expanded two major experiments to address reviewers’ feedback, to which we also received a positive acknowledgement:
- The preservation of core model capabilities, confirmed by evaluating on the MMLU benchmark, showed that our intervention maintains essential reasoning skills. This finding brings clarity to the method's performance trade-offs, confirming that the intervention does not cause catastrophic forgetting. (W9uw, 6xMk, j4b1)
- Demonstrated scalability and architectural generalization, validated by testing on larger models (LLaMA-70B, Falcon-30B) and diverse architectures like Mixture-of-Experts (Mixtral 8x7B), showing the method is not limited to specific model sizes or types. (mLbZ)
We appreciate the constructive dialogue and the convergence on key points during the discussion phase, and we look forward to incorporating the discussion findings into the camera-ready version, if accepted.
(a) Summary
The paper proposes EigenShift, a training-free intervention that SVD-decomposes the LM head to identify and damp “generation-aligned” components while preserving “detection” signals, arguing layer-wise features are more stable than neuron-level cues. The method aims to reduce toxic generation with minimal loss of fluency, introduces a TPH (toxicity–perplexity harmonic) score for evaluation, and presents cross-model, cross-lingual results (e.g., BERT/BART/LLaMA/Mistral; Jigsaw/ToxiCN). During discussion, the authors added MMLU checks (showing little capability loss) and scaling to larger/MoE models (e.g., LLaMA-70B, Falcon-30B, Mixtral 8×7B).
(b) Strengths
- The approach is training-free, simple to apply, and computationally efficient compared to fine-tuning or RLHF. [6xMk]
- The layer-wise perspective is more stable than neuron-level interventions and is conceptually well motivated. [6xMk, j4b1]
- The EigenShift decomposition provides an interpretable target (generation vs. detection experts) and yields stronger detoxification than prior neuron-level baselines. [j4b1, 6xMk]
- The problem is important and the study spans multiple models and languages, supporting generality. [mLbZ, j4b1]
(c) Weaknesses
- The perplexity increase is still sizable, raising practicality concerns for real deployments. [W9uw]
- The classification AUROC used to motivate layer-wise signals remains modest, limiting the strength of interpretability claims. [W9uw]
- The paper lacks comparisons to training-based safety methods (e.g., RLHF/constitutional AI) and to broader structural baselines. [6xMk, j4b1]
- The theoretical link between eigen-directions and semantic concepts is mostly empirical and could be better grounded. [j4b1]
- Results initially focused on smaller models; although larger-model tests were added, a human evaluation of toxicity/naturalness is still missing. [mLbZ, 6xMk]
(d) Discussion summary
- Authors reported MMLU results showing minimal degradation post-intervention; this alleviates “catastrophic forgetting” concerns. [6xMk, j4b1]
- New experiments on LLaMA-70B/Falcon-30B/Mixtral showed consistent detox with small or moderate fluency cost; this improves external validity. [mLbZ]
- Authors argued EigenShift runs in minutes (one pass to collect activations + SVD) and clarified α/top-k tuning via TPH; this partially addresses deployment questions but some sensitivity remains. [6xMk]
- Reviewers remained cautious about the ~60% AUROC as a basis for strong claims, even if it outperforms neuron-level baselines [W9uw]
- Authors added analyses but not human studies or direct training-based baselines; this limits conclusions about product readiness. [j4b1, W9uw]
(e) Decision rationale
The work offers a clear, interpretable training-free mechanism that consistently reduces toxicity with a comparatively favorable trade-off and new evidence of capability retention and scaling. As a resource-efficient, principled step toward interpretable safety interventions, I recommend accepting the paper.