Steering the CensorShip: Uncovering Representation Vectors for LLM "Thought" Control
We examine safety-tuned LLMs and discover representation vectors for measuring and controlling censorship imposed through refusal and thought suppression.
Abstract
Reviews and Discussion
This paper claims to present a novel approach to understanding and controlling censorship mechanisms within large language models (LLMs) that have undergone safety tuning. The authors introduce a method for extracting a "refusal-compliance vector" from the internal representations of these models, which can be used to either enforce or bypass censorship at inference time. The approach is tested on a range of open-weight, safety-tuned models. The work also explores a unique dimension of censorship in reasoning LLMs called "thought suppression", where models deliberately skip their reasoning process when handling politically sensitive topics.
Reasons to Accept
- Topical: Focus on Reasoning LLMs, which is a very topical and widely used type of model at present.
- Good evaluation: Tested on many models and evals
Reasons to Reject
- The lack of comparison to the previous method [1] is a major issue. I don't understand why this method would not be applicable as a baseline here.
- What are some comparisons to using another model like WildGuard for judging the prompts? It would be good to understand how the numbers may differ. I don't actually care whether it's better or worse, because the two methods are different.
[1] Programming refusal with conditional activation steering. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=Oi47wc10sm
Questions for Authors
Also, I am curious about how different sampling strategies might affect the performance of this model.
Could we see a Pareto frontier curve, scaling lambda, of harmlessness (the eval used for checking toxicity) vs. helpfulness (Alpaca, XSTest_safe) for a model?
W1: No comparison with Lee et al. (2025) as baseline
We refer the reviewer to our response to Reviewer LdFB’s W1, which provides additional results comparing our method and Lee et al. (2025).
W2: “What are some comparisons to using another model like WildGuard for judging the prompts?”
We have researched alternative models and chose WildGuard over others for several reasons: it is open-source, shows performance on par with GPT-4, and can detect both refusal and harmfulness. In contrast, other models have only been trained and tested for detecting harmfulness or safety.
Q1: “...how different sampling strategies might affect the performance of this model”
We’re not sure whether the reviewer is referring to the sampling strategy we used in evaluation or the one used to estimate refusal probability for vector extraction, nor which specific performance measure is meant.
Q2: “...scaling lambda on harmlessness vs helpfulness (Alpaca, XSTest_safe) for a model”
Please see our response to Reviewer hXbe’s second question (Q2), where we present results for the harmfulness and overall quality of generated outputs with varying λ values.
W1: Thank you for your response. I believe it adequately addresses the initial concern. I do have a follow-up: is the Pearson correlation referenced here conceptually similar to the Pearson correlation presented in Figure 2? Additionally, I noticed some inconsistencies in the models being evaluated throughout the paper. Are the core models meant to be Llama-3.1-8B and Qwen-2.5-7B? If so, I recommend standardizing the evaluated models across the main text and figures and bringing the relevant appendix figures into the main body. The current variation may unintentionally create the impression that something is being obscured, even though that is clearly not the case.
W2: I am glad to see that you used WildGuard, particularly because of its open-source nature. However, could you clarify the basis for the claim that it "shows performance on par with GPT-4"? Is this drawn from the original WildGuard paper, or are there additional experiments presented here that support this statement?
Regarding Q1: I was referred to lines 197–198: "We generate five model responses for each instruction using nucleus sampling with top-p=0.8 and a maximum token limit of 256." This sampling configuration differs from what I typically encounter, particularly for LLaMA models, which are often evaluated using temperature sampling with a temperature around 0.6. I am curious whether you observed any changes in results under that more standard setting. I do not expect major differences, but would appreciate confirmation. Could you also comment on how these specific hyperparameters were chosen?
Regarding Q2: I found the current table presentation somewhat difficult to interpret. Would it be possible to visualize this data in a graph and share it via Google Slides? This should be relatively straightforward, and I believe the slides can be made anonymous by using the “Publish to web” option in Google Slides.
I would be willing to revise my score to a 6 if these questions are addressed. I appreciate the authors' effort in conducting these experiments.
W1: We’re glad that our response addresses your comment. For the follow-up question, yes, we measure the Pearson correlation here following the same method used for Figure 2. We thank the reviewer for pointing out the inconsistencies, which we weren’t aware of, and will move the figure for Llama-3.1-8B into the main body in the final paper.
W2: The claim is based on the evaluation results in the original paper of WildGuard [1].
Q1: Thanks for the clarification. Prior work on red teaming mostly uses greedy decoding for evaluation [2], which may not accurately reflect most real-world use with sampling. We evaluate all models following the default configuration of Qwen-Chat models with top-p=0.8. We have not tried other configurations, but we hope that generating five outputs per instruction helps address potential discrepancies in the results.
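For concreteness, the sampling configuration described (five sampled responses per instruction, nucleus sampling with top-p=0.8, up to 256 new tokens) corresponds roughly to the sketch below; the model name and prompt are placeholders, and this is not the actual evaluation code.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"  # placeholder; any of the evaluated chat models
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# Build a chat-formatted prompt for a single instruction (placeholder text).
prompt_ids = tok.apply_chat_template(
    [{"role": "user", "content": "Example instruction"}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# Five sampled responses per instruction, nucleus sampling with top-p = 0.8,
# and a 256-token generation limit, matching the configuration described above.
outputs = model.generate(
    prompt_ids,
    do_sample=True,
    top_p=0.8,
    max_new_tokens=256,
    num_return_sequences=5,
)
responses = tok.batch_decode(outputs[:, prompt_ids.shape[1]:], skip_special_tokens=True)
```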
Q2: Yes, we’ve uploaded a graph of the results, which you can view at this link: https://ibb.co/Jw74XhKk (We tried the method suggested by the reviewer but found that it would display the account name when viewing the file.)
[1] Han et al., WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs, 2024
[2] Mazeika et al., HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal, 2024
Thank you for your response!
I have updated my score from a 3 to a 5.
Final Questions: Q1: I believe the sampling method used may give some readers pause. I suggest including, in the final version of the paper, results from a couple of different sampling methods to demonstrate that they yield similar outcomes and that the choice does not significantly affect performance.
Q2: Based on the plot, I think it’s missing the perspective I'm looking for. I would like to see refusal rates on SorryBench or JailbreakBench plotted against Alpaca or XSTest as a Pareto frontier. The goal is to understand the tradeoff between refusal and response rates across a mixture of harmful and harmless requests. I’m particularly interested in whether there is a default λ (lambda) value that achieves the best balance.
We thank the reviewer for updating the score based on our response!
Q1: We have tried using the default sampling configuration as suggested by the reviewer. Here are the results for Llama-3.1-8B (https://ibb.co/TDcYwQ8G) and Qwen-2.5-7B (https://ibb.co/4Z1pHbPt), comparing refusal/harmfulness of outputs with different coefficients on all the tasks. We do not find any noticeable difference from our previous results when using different sampling parameters.
Q2: The plots we shared for Q1 may be what the reviewer requested, though we’re not sure exactly what kind of tradeoff or balance the reviewer means. We believe this will depend on who is using our tool and in what context.
Thank you for the Q1 plot!
For Q2, you probably have a table with at least three columns: lambda, refusal rate on XSTest-safe, and refusal rate on XSTest-unsafe. What I am looking for is something like a scatter plot of the XSTest-safe refusal rate on the x-axis and the XSTest-unsafe refusal rate on the y-axis over these two columns. This would indicate whether the vectors can be used with some default value for a given task, or whether a classifier is required to determine what the lambda value should be.
The reviewer may have misunderstood our paper’s goal. Our steering vector is designed to capture the model’s censoring behavior. It does not encode information to distinguish harmful inputs. We do not assume a universal default value, but an adjustable value that can reliably control the degree of censoring in model outputs.
Although our method isn’t a one-size-fits-all solution, it offers flexibility for adapting to contexts where what is considered “safe” or “appropriate” may differ from the definition set by us ML researchers in benchmark tasks. In certain cases, rather than an outright refusal, a partial response may be more helpful, e.g., providing a warning before complying or offering an alternative answer (see examples in Appendix E.1).
Please let us know if you have any further questions!
Thank you for the clarification. However, based on the previous plot provided, I believe that if F1 scores were plotted against varying values of lambda, it would likely reveal the existence of a nonzero lambda (approximately 0.2) that maximizes the F1 score, potentially on a benchmark such as XSTest. I had hoped this analysis would be included, as it could strengthen the paper by demonstrating that steering vectors may be effectively used with a default lambda value.
That said, I find the current contributions of the paper to be sufficient, though not outstanding. The most significant contribution appears to be the application of steering vectors to reasoning models, which necessitated extending the method to N-token matching rather than single-token matching. The analysis presented toward the end of the paper is also interesting, though I do not believe it constitutes a major contribution. Furthermore, although the paper performs better than Lee et al. (2025), the improvement in Pearson correlation is not particularly substantial. That said, this work does appear to yield higher-quality responses, and I suggest the authors emphasize this point more clearly in a revised version, explaining this issue with previous methods and fleshing out the experiments in the camera-ready version of the paper.
Given the improvement over prior methods in an important dimension (i.e., response quality), I have raised my score to a 6. If the other reviewers believe the paper merits acceptance, I see no significant issues that would justify rejection. The reason I have not scored it higher is my view of the significance of the contributions.
I thank the authors for their time, effort, and thoughtful submission. Best of luck moving forward.
This paper studies the notion of refusal in language models, and how refusal (often defined through a safety or alignment training phase) is represented in the LLM’s activations.
They propose a technique to extract a refusal-compliance vector from the model’s hidden state (which is matched over N subsequent tokens). Refusal scores are defined by a string-matching function over a set of refusal strings (given in Appendix A.1), such as “I cannot” or “I am unable”. Hidden vectors are grouped and thresholded by the value of this refusal score.
Then, the authors propose an intervention technique that both detects and controls the amount of censorship in the model’s responses. To steer away from censorship, the authors subtract the steering vector (times some scalar λ) from the activations.
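A minimal sketch of the two operations described above; the difference-of-means grouping, the 0.5 threshold, the choice of token position, and the sign convention are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def extract_refusal_vector(acts: np.ndarray, refusal_scores: np.ndarray,
                           threshold: float = 0.5) -> np.ndarray:
    """Difference of mean activations between high- and low-refusal-score prompts.

    acts: (num_prompts, d_model) hidden states at a chosen layer and token position.
    refusal_scores: (num_prompts,) scores from the string-matching function.
    """
    high = acts[refusal_scores >= threshold].mean(axis=0)
    low = acts[refusal_scores < threshold].mean(axis=0)
    direction = high - low
    return direction / np.linalg.norm(direction)

def steer(acts: np.ndarray, direction: np.ndarray, lam: float) -> np.ndarray:
    """Shift activations along the refusal-compliance direction. Following the sign
    convention used in the rebuttal tables below, positive lam pushes toward refusal
    and negative lam toward compliance (this convention is an assumption)."""
    return acts + lam * direction
```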
Finally, they extend this to reasoning language models (e.g., DeepSeek) and find that an additional vector arises that precludes reasoning behavior, demonstrating that similar interventions allow users to better force reasoning models to actually perform reasoning (and not just terminate the <think> block immediately after two \n characters).
Reasons to Accept
The proposed method seems quite effective in steering models away from and towards refusal on the considered benchmarks.
Results are demonstrated across a wide variety of open-sourced models (e.g., different architectures and model scales).
Reasons to Reject
The somewhat heuristic nature of the string-matching sets limits the broad applicability of this approach to more general notions of refusal – results with a classifier-based approach would strengthen the results here.
For this method to be more practically useful, I believe activation steering shouldn’t significantly degrade performance when we don’t know a priori whether we want a refusal or compliance. For instance, if we want our models to be safer, then using this steering shouldn’t significantly degrade the quality of responses to helpful queries or introduce significant overrefusal on benign queries. Some results in this setting – where we don’t know the exact desired response type – would strengthen my perception of the practicality of the method.
Questions for Authors
Do the authors notice any scenarios in the reasoning experiments, where thought suppression occurs, but compliant responses still arise?
Regarding the first reason in “Reasons to Reject”, can you provide qualitative or quantitative evaluation of output helpfulness on benign queries when λ is set to –1, to assess any unintended degradation?
W1: Heuristic nature of string-matching sets limits applicability to more general notions of refusal – results with a classifier-based approach would strengthen results here
We considered using a classifier like WildGuard. However, this introduces computational overhead and potentially the classifier’s own biases into the resulting steering vectors. We chose to implement string matching for its efficiency, but it could be replaced within our method by any function that measures censoring signals. We also show how our method can be adapted to distilled DeepSeek-R1 models, which exhibit a different type of censoring behavior based on thought suppression.
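For illustration, a string-matching censoring-signal function might look like the toy sketch below; the phrase list is a placeholder (the full set is in Appendix A.1), and any scoring function, such as a classifier's refusal probability, could be substituted as noted above.

```python
# Placeholder phrase list standing in for the full set in Appendix A.1.
REFUSAL_PHRASES = ["i cannot", "i am unable", "i'm sorry, but"]

def is_refusal(response: str) -> bool:
    """Flag a response as a refusal if it contains any listed refusal phrase."""
    text = response.lower()
    return any(phrase in text for phrase in REFUSAL_PHRASES)

def refusal_probability(sampled_responses: list[str]) -> float:
    """Fraction of sampled responses (e.g., five per prompt) flagged as refusals."""
    return sum(is_refusal(r) for r in sampled_responses) / len(sampled_responses)
```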
W2: Practicality–settings where the desired response type (refusal or compliance) is unknown
Our method only targets the detection and control of model outputs. The setting described would require combining our approach with another method that classifies the input given to the model before applying our steering vector. A similar setup has been studied by Lee et al. (2025) [1].
Q1: Cases where thought suppression occurs, but compliant responses still arise
We evaluated these cases with WildGuard and found only ~3% classified as non-refusal. However, based on manual inspection, these responses are still censored—they either give answers that are only indirectly related to the question or answers that are more aligned with the Chinese government's values.
Q2: Qualitative/quantitative evaluation of output helpfulness on benign queries when λ is set to –1
We believe that helpfulness may depend on a user's background and would require further human assessment. Given the limited time, we use JudgeLM [2] to evaluate the overall output quality. We prompt JudgeLM 7B with the task instruction and a pair of responses—one without steering (baseline) and one after steering. Each response is rated on a scale of 1 to 10 based on helpfulness, relevance, accuracy, and level of detail. We compute a score ratio based on the ratings of five response pairs for each instruction. A ratio of 1 indicates the rating remains unchanged after steering, while a ratio < 1 means the steered response receives a lower rating than the baseline.
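One plausible reading of this score-ratio computation (assuming per-pair ratios averaged over the five response pairs; the exact aggregation may differ):

```python
def judge_score_ratio(steered_ratings: list[float], baseline_ratings: list[float]) -> float:
    """Average per-pair JudgeLM rating ratio (steered / baseline) for one instruction.
    A ratio of 1 means steering left the rating unchanged; < 1 means the steered
    response was rated lower than its baseline counterpart."""
    ratios = [s / b for s, b in zip(steered_ratings, baseline_ratings)]
    return sum(ratios) / len(ratios)
```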
The tables below show the results of Qwen-2.5-7B on Alpaca and XSTest_safe, along with the average refusal and harmfulness probabilities measured by WildGuard. We find that steering with a coefficient (λ) between 0 and -1 has little impact on the ratings on average. However, increasing λ from 0 to 1 results in lower ratings for steered outputs compared to their baseline counterparts—this is expected from the refusal behavior induced by steering.
Alpaca (300 instructions):
| coeff (λ) | score ratio (steered/baseline) | refusal | harmfulness |
|---|---|---|---|
| -1 | 1.04 | 0 | 2.5e-3 |
| -0.8 | 1.03 | 0.01 | 6e-3 |
| -0.6 | 1.05 | 0 | 5.8e-3 |
| -0.4 | 1.04 | 0.01 | 5.7e-3 |
| -0.2 | 1.04 | 0.01 | 4.5e-3 |
| 0 | 1.04 | 0.01 | 3.5e-3 |
| 0.2 | 1.07 | 0.04 | 1.7e-3 |
| 0.4 | 0.97 | 0.26 | 2.4e-3 |
| 0.5 | 0.87 | 0.43 | 3.7e-3 |
| 0.6 | 0.67 | 0.65 | 2.4e-3 |
| 0.8 | 0.38 | 0.95 | 5e-5 |
| 1 | 0.34 | 1 | 2.8e-5 |
XSTest_safe:
| coeff (λ) | score ratio (steered/baseline) | refusal | harmfulness |
|---|---|---|---|
| -1 | 1.01 | 0 | 1.0e-2 |
| -0.8 | 1 | 0 | 9.9e-3 |
| -0.6 | 1.01 | 0 | 8.0e-3 |
| -0.4 | 1.01 | 0.01 | 6.9e-3 |
| -0.2 | 1.04 | 0.01 | 5.5e-3 |
| 0 | 1.04 | 0.03 | 4.8e-3 |
| 0.2 | 1.02 | 0.09 | 4.5e-3 |
| 0.4 | 1.01 | 0.24 | 7.5e-3 |
| 0.5 | 0.99 | 0.4 | 4.0e-3 |
| 0.6 | 0.86 | 0.59 | 4.1e-3 |
| 0.8 | 0.61 | 0.88 | 2.6e-4 |
| 1 | 0.42 | 0.98 | 7.4e-5 |
Edit: We added a visualized result here: https://ibb.co/Jw74XhKk
[1] Lee et al., Programming refusal with conditional activation steering, ICLR 2025
[2] Zhu et al., JudgeLM: Fine-tuned Large Language Models are Scalable Judges, ICLR 2025
Thanks for your response and clarification!
I appreciate the new results -- and agree that this is the expected behavior of steering and that helpfulness would indeed degrade in settings where steering towards refusal is more highly encouraged.
I understand that the authors focus the scope of the paper on the setting where we have an input classifier present -- although I personally believe this narrows the utility of the contribution, since training such a classifier gives us the ability to simply use a default refusal message if we determine that "steering" is necessary.
I also agree with reviewer f3Kh that the value of λ seems a bit task / setting dependent (and perhaps even model dependent), and more analysis on a potential ideal value of such a λ would strengthen the contribution.
We recognize that the reviewer may have a specific problem setting in mind. However, this is not the goal of our paper. We believe that there is no universal agreement on what is considered “safe” or “harmful” and what should be “censored”. The ideal value of λ will vary depending on the context and the goals of the stakeholders involved. Our work aims to detect and control model censorship at a fine-grained level, which differs from prior work that controls models in a binary manner. Our method can be useful in settings where different levels of content moderation are needed for users in different age groups. It can also support audits that may require more nuanced measurements.
We’re happy to answer any further questions the reviewer may have!
This paper applies a method previously used to identify "gender" steering vectors (Cyberey et al., 2025), to manipulate LLM refusals along a "refusal-compliance" axis instead. Experiments on three red-teaming/jailbreak benchmarks demonstrate the efficacy of this approach on several open-weight LLMs; further experiments & qualitative analysis on distilled DeepSeek-R1 models further demonstrate the spectrum between suppression of CoT reasoning and uncensored output.
Reasons to Accept
- This paper follows a line of work on activation steering in the context of model safety/alignment/censorship that exploits linear separability of model representations. While the method is not strictly novel, it seems to be a novel application distinct from prior, non-concurrently developed methods in this context (e.g., Arditi et al., 2024), particularly the ability to adjust the magnitude of the refusal steering vector.
- The experiment demonstrating changes in refusal rate based on the steering vector magnitude (section 3) is reasonably convincing, which is further supported by the quantitative/qualitative analysis in section 4. (While there are of course more experiments that could be done, these should largely suffice for the aims of this paper.)
Reasons to Reject
- Baselines for other steering methods (e.g., one or more of those mentioned in section 2.2) would be helpful context for whether the method offers an improvement in refusal rate compared to others, or if the net benefit is more so the ability to vary the steering coefficient.
- This may be out of scope for the paper given space constraints, but: while the experiments focus on refusal rate / suppression of CoT reasoning, it could be of interest to validate that generation quality holds under this version of steering, or whether the uncensored outputs are factual (e.g., for prompts like the example in E.2).
- It would be helpful to elaborate on how this method differs conceptually from other adaptable methods for steering refusal/safety mentioned in L102.
Questions for Authors
Please feel free to correct any misunderstandings above. Thanks!
W3: Conceptual difference between our method and previous methods for steering refusal/safety (Line 102)
Here, we summarize the key differences:
- Steering for different goals: Scalena et al. (2024) introduce a method for steering multiple concepts simultaneously. Lee et al. (2025) propose conditional steering that can enable/disable refusal based on the input context (e.g., hate speech). While most work focuses on steering binary directions, our method allows fine-grained control and detection of model behavior.
- Methods for finding steering vectors: Our method is based on the assumption that activations of different inputs encode varied degrees of censoring signals. We use "soft labels" instead of binary labeled prompts, as used in prior work (Scalena et al., 2024; Lee et al., 2025; He et al., 2025).
W1: Baselines for other steering methods
The table below shows results evaluated on the same four tasks described in Section 3.3. In the third column, we assess the effectiveness of steering vectors in representing model refusal by the Pearson correlation between the refusal probability of a prompt and its scalar projection on the vector at the last token position. Compared with Lee et al. (2025), the steering vectors found by our method exhibit a higher correlation with model refusal. We also evaluate whether the direction of projections can be used for refusal detection. As shown in the fourth column, our method achieves a higher accuracy, with >90% for both models.
| Method | Model | Pearson r (refusal, proj) | Detection Acc |
|---|---|---|---|
| (Lee, 2025) | Llama-3.1-8B | 0.843 | 0.856 |
| Ours | Llama-3.1-8B | 0.908 | 0.953 |
| (Lee, 2025) | Qwen-2.5-7B | 0.883 | 0.586 |
| Ours | Qwen-2.5-7B | 0.909 | 0.912 |
We use the code from Lee et al. (2025)'s repository to extract vectors. The refusal probability is measured by WildGuard.
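For reference, the correlation and detection accuracy reported above could be computed roughly as in the sketch below; the sign-based decision rule and the 0.5 refusal cutoff are assumptions.

```python
import numpy as np

def evaluate_vector(last_token_acts: np.ndarray,
                    refusal_probs: np.ndarray,
                    direction: np.ndarray) -> tuple[float, float]:
    """Pearson correlation between refusal probability and scalar projection,
    plus refusal-detection accuracy from the sign of the projection."""
    v = direction / np.linalg.norm(direction)
    proj = last_token_acts @ v                      # scalar projection per prompt
    pearson_r = np.corrcoef(proj, refusal_probs)[0, 1]
    predicted_refusal = proj > 0                    # assumed decision threshold
    detection_acc = float(np.mean(predicted_refusal == (refusal_probs > 0.5)))
    return float(pearson_r), detection_acc
```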
Next, we compare methods used for applying steering vectors. Lee et al. (2025) use activation addition [1], which does not work well for fine-grained, precise steering for the following reasons:
- Different inputs require using different coefficients: We apply steering with increasing coefficient magnitudes to find the point at which a model would refuse (or comply with) all prompts. As activation addition applies a uniform steering strength, we find that the model would start producing unnatural sentences for some prompts before reaching the desired responses for others. Our method addresses this by “neutralizing” activations before steering with activation addition (Equation 4; a sketch of one possible reading of this step is given after this list). We provide quantitative results comparing both methods in our response to W2.
- Unknown coefficient range: With Lee et al. (2025)’s method, it’s difficult to identify a valid range for steering, whereas we offer a general approach for scaling vectors (Line 169) that allows steering within -1~1 and considers 0 as the “neutral” point where refusal probability ≈ 0.5.
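A sketch of one possible reading of the “neutralizing” step referenced in the first bullet; the paper's Equation 4 and its vector scaling (Line 169) are not reproduced here, so the exact form may differ.

```python
import numpy as np

def neutralize_then_steer(acts: np.ndarray, direction: np.ndarray, lam: float) -> np.ndarray:
    """Remove each activation's existing component along the refusal-compliance
    direction, then add back a uniform amount controlled by lam in [-1, 1]
    (0 corresponds to the 'neutral' point). acts: (..., d_model)."""
    v = direction / np.linalg.norm(direction)
    coeffs = acts @ v                         # current per-token component along v
    neutralized = acts - coeffs[..., None] * v
    return neutralized + lam * v
```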
W2: Generation quality, factuality of uncensored outputs
We evaluate the overall quality of uncensored outputs with JudgeLM 7B [2], prompting it to rate a pair of outputs—one produced using Lee et al. (2025)'s method and one using our steering method—on a scale of 1~10. We generate five outputs for each instruction and exclude ones with a refusal probability > 0.5. The table reports results for coefficients λ that yield comparable refusal and harmfulness probabilities on average. The last column reports the average ratio of ratings between the pair of outputs. A ratio of 1 indicates outputs produced by both methods are rated similarly; a ratio > 1 suggests our method receives a higher rating. Our results suggest that while both methods can effectively reduce the refusal rate, our method produces outputs with higher quality on average.
| Model | Task | Coeff λ | Refusal | Harmfulness | Rating ratio |
|---|---|---|---|---|---|
| Llama-3.1-8B | jailbreakbench | -1 / -3.5 | 0.11 / 0.13 | 0.68 / 0.65 | 1.54 |
| Llama-3.1-8B | sorrybench | -1 / -3.5 | 0.06 / 0.07 | 0.60 / 0.61 | 1.37 |
| Llama-3.1-8B | xstest_unsafe | -1.4 / -3.5 | 0.07 / 0.04 | 0.45 / 0.54 | 1.84 |
| Qwen-2.5-7B | jailbreakbench | -1 / -35 | 0.06 / 0.09 | 0.71 / 0.68 | 1.42 |
| Qwen-2.5-7B | sorrybench | -1 / -40 | 0.02 / 0.02 | 0.67 / 0.64 | 1.83 |
| Qwen-2.5-7B | xstest_unsafe | -1 / -40 | 0.05 / 0.08 | 0.56 / 0.54 | 2.07 |
Note: the 3rd to 5th columns report values as (ours / Lee).
In terms of factuality, we notice that smaller models, e.g., DeepSeek-R1-Distill-Qwen-1.5B, tend to exhibit more hallucination than larger models after bypassing censorship with steering. This may be due to their more limited inherent capabilities. We appreciate the reviewer’s insightful comment and will continue to investigate this in future work!
[1] Rimsky et al., Steering Llama 2 via Contrastive Activation Addition, ACL 2024
[2] Zhu et al., JudgeLM: Fine-tuned Large Language Models are Scalable Judges, ICLR 2025
A belated thank you to the authors for the additional results & clarifications, which sufficiently address my concerns -- would love to see these (especially the baseline results) in the paper/appendix, space permitting.
The goal of the paper is to understand how "censorship" works in models engineered to refuse to answer harmful requests and to produce responses that better align with "human" preferences. For this study, the authors use steering vectors to manipulate the LLM's censorship behavior. The authors provide a procedure for computing the steering vectors, which they call refusal-compliance steering vectors. The paper then presents experiments to test the effectiveness of these vectors in terms of censorship steering on various benchmarks and open-source models. The results show that the steering vectors are extremely effective. In addition, the paper considers reasoning models, discusses the nuance between thought suppression and refusal to answer, and shows the effectiveness of their steering vectors in this setting.
Reasons to Accept
The experiments and discussion on thought suppression are interesting.
Reasons to Reject
Lack of novelty: It is well understood that a broad range of concepts can be represented by one-dimensional projections. In particular, there is evidence that leads to the presumption that censorship lies in a one-dimensional subspace. The procedure for deriving the steering vectors is not novel.
"Lack of novelty"
We agree with the reviewer that steering vectors and one-dimensional projections are not a new contribution, and we do not claim this to be novel. Our work makes several novel technical contributions in this emerging area:
- While previous studies focus on refusal-based censorship in the safety context, we study censorship through a broader lens (Lines 40-42).
- We propose a method that enables precise control and fine-grained detection of LLM censorship. Previous work mostly focuses on steering binary directions (e.g., refusal or non-refusal) but not the levels of model behavior (e.g., varied degrees of refusal).
- Our work is the first to examine censorship in reasoning LLMs and present a novel countermeasure that targets “thought suppression”.
Thank you for your response. I am happy to increase the rating based on your response.
We thank all the reviewers for their time and valuable feedback! We will make sure to include the additional results with the previous baseline in our final version of the paper.
Reviewers have some concerns about the novelty of the work but it seems like there are some substantial contributions, along the lines of modifying the magnitude of the steering vector and applying the basic ideas to censorship. Based on the reviews this paper seems borderline, but I think a potentially under-appreciated aspect of the paper's contribution is the qualitative discussion and analysis of how censorship arises practically in models and how it changes under steering. It was genuinely interesting to read and I think is an important thing for the community to discuss.
However, in my view the paper overstates its contribution as currently written. I'm recommending acceptance under the assumption that the final version of the paper will be much clearer about the limits of the novelty of its contribution, acknowledging that the single refusal direction was already shown by Arditi et al. and providing an explicit comparison to Lee et al.