PaperHub
Overall rating: 5.5/10 · Poster · 4 reviewers
Scores: 2, 3, 4, 3 (min 2, max 4, std 0.7)
ICML 2025

Just Enough Shifts: Mitigating Over-Refusal in Aligned Language Models with Targeted Representation Fine-Tuning

OpenReview | PDF
Submitted: 2025-01-24 · Updated: 2025-07-24

Abstract

Keywords
LLM, Alignment, Over-refusal

Reviews and Discussion

Official Review (Rating: 2)

This paper addresses the over-refusal problem in aligned LLMs that unnecessarily reject benign user prompts. The authors identify specific layers whose latent representations best distinguish between benign and malicious prompts, then selectively adjust embeddings to move prompts "just enough" from rejection to acceptance. Their approach uses a proxy that measures refusal contribution through query projection onto the refusal vector, deriving the shift from a locally linear approximation of the refusal boundary. Experiments across three models demonstrate higher compliance rates than fine-tuning while maintaining safety scores and general functionality.
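For concreteness, a minimal sketch of the projection proxy described above might look as follows. This is an illustrative reconstruction, not the paper's implementation: it assumes a Hugging Face-style causal LM, and `get_hidden_state`, `layer_idx`, and the difference-of-means direction are placeholders rather than the authors' actual choices.

```python
# Illustrative sketch (not the paper's code): estimate a "refusal direction" at one
# layer as the difference of mean activations over harmful vs. benign prompts, then
# score a new prompt by its projection onto that direction.
import torch

def get_hidden_state(model, tokenizer, prompt: str, layer_idx: int) -> torch.Tensor:
    """Hidden state of the last prompt token at the chosen layer."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer_idx][0, -1, :]          # shape: (hidden_dim,)

def refusal_direction(model, tokenizer, harmful, benign, layer_idx) -> torch.Tensor:
    """Unit vector separating mean harmful from mean benign activations."""
    h_harm = torch.stack([get_hidden_state(model, tokenizer, p, layer_idx) for p in harmful])
    h_ben = torch.stack([get_hidden_state(model, tokenizer, p, layer_idx) for p in benign])
    r = h_harm.mean(dim=0) - h_ben.mean(dim=0)
    return r / r.norm()

def refusal_score(hidden: torch.Tensor, r_hat: torch.Tensor) -> float:
    """Scalar projection of a prompt's activation onto the refusal direction."""
    return torch.dot(hidden, r_hat).item()
```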

Questions for Authors

  1. Can you clarify what you mean by pseudo-harmful prompts?
  2. How does the learned shift generalize to unseen datasets? Have you considered training on some datasets and evaluating on others?
  3. Why were activation-based baselines evaluated on only one model? Would broader comparisons strengthen the claims?
  4. Given that the method is not applicable to black-box models, do you see potential adaptations that could make it usable in such settings?

Claims and Evidence

Yes

Methods and Evaluation Criteria

Yes

Theoretical Claims

N/A

Experimental Design and Analysis

Yes

Supplementary Material

Yes

Relation to Existing Literature

The paper mitigates the over-refusal issue of LLMs while maintaining a high safety score and general capability, enhancing the usability of LLMs.

Essential References Not Discussed

No

Other Strengths and Weaknesses

Strength:

  1. One of the key novelties of this paper lies in the proposal of the "just enough" shift. One motivation the paper gives is that a uniform shift leads to gibberish generation.
  2. The method targets specific layers for adjustment rather than modifying the entire model and requires minimal compute compared to full-model retraining.
  3. Results show improved compliance rates compared to fine-tuning approaches while maintaining safety scores and general model functionality

Weakness:

  1. The proposed method is not applicable to black-box models.
  2. The generalization of the learned shift remains untested. Training on some datasets and evaluating on non-overlapping datasets would provide insight into its robustness.
  3. The evaluation is limited in scope—only two architectures are tested, and activation-based baselines are only evaluated on one model.

Other Comments or Suggestions

The presentation can be improved. For example, in Figure 1, there is too much space inside the quote box and the text is not centered.

Author Response

Thank you for your insightful feedback. We have provided our responses below, and we hope they clarify the points you raised. If our responses have adequately addressed your initial concerns, we would be grateful if you would consider adjusting your evaluation accordingly.

(A)

We define pseudo-harmful prompts in line 39 of the Introduction: they are queries that appear potentially harmful but are benign in nature.

(B)

We address the question of generalization to unseen datasets through our experimental setup. We train our model on 25 examples each from XSTest, Scope, ORBench, and PHTest, and then evaluate on the corresponding held-out partitions of those datasets. In addition, we include OKTest as an entirely unseen benchmark to assess how well our learned shift extends beyond the trained domains. The results are summarized in Table 1.

To further address concerns about out-of-domain generalization, we also conduct a supplementary experiment in which training is restricted to only 25 queries from XSTest and ORBench along with the harmful and harmless queries from Hexphi and UltraChat. As shown in the experiments below, the model retains its out-of-distribution (OOD) robustness even when we vary the over-refusal distributions used during training.

| Method | XS Test CR (In-Dist) | SCOPE CR (OOD) | Orbench CR (OOD) | Phtest CR (OOD) | OKTest CR (OOD) | Avg OR Compliance Rate | Advbench Safety Score | Tradeoff Score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ACTOR (25 XS Test) | 96.00 | 90.39 | 73.09 | 95.03 | 94.00 | 89.70 | 99.03 | 94.37 |

| Method | XS Test CR (OOD) | SCOPE CR (OOD) | Orbench CR (In-Dist) | Phtest CR (OOD) | OKTest CR (OOD) | Avg OR Compliance Rate | Advbench Safety Score | Tradeoff Score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ACTOR (25 ORB-H) | 94.67 | 90.73 | 73.91 | 96.05 | 94.33 | 89.94 | 98.85 | 94.40 |

(C)

We agree that broader comparisons can further strengthen our work. We have now incorporated comparisons against multiple baselines. First, we present the following table discussing the features of all existing approaches and their comparison with ACTOR.

| Method | Mitigation of Over-Refusal | Train Time / Inference Time | Robustness |
| --- | --- | --- | --- |
| Self-CD | High | Inference Time - Requires 2 generations for the same input | - |
| DRO | Poor | Train Time - Introduces external parameters | - |
| Safety Patching | High | Train Time - Requires training twice | - |
| Safe-Decoding | Poor | Train Time - Requires 2 generations for the same input | - |
| SCANS | High | Inference Time | Low |
| Surgical | High | Inference Time | Low |
| ACTOR | High | Train Time | High |
  1. Results for Llama-2-7b-chat-hf

| Method | XS Test CR | SCOPE CR | Orbench CR | Phtest CR | OKTest CR | Avg OR CR | Advbench SS | Tradeoff Score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Default | 80 | 52.61 | 29.45 | 69.6 | 76 | 61.53 | 99.62 | 80.58 |
| Safe-Decoding | 29.12 | 15.32 | 7.45 | 26.03 | 45.24 | 24.63 | 100 | 62.32 |
| DRO | 58 | 21.25 | 14.11 | 62.22 | 76 | 46.32 | 100 | 73.16 |
| Self CD | 90.67 | 80.94 | 61.94 | 87.41 | 92 | 82.59 | 95.77 | 89.18 |
| SCANs | 95.33 | 76.72 | 40.52 | 90.44 | 99 | 80.40 | 99.23 | 89.82 |
| Surgical | 90.67 | 89.38 | 69.16 | 93.42 | 89.33 | 86.39 | 99.42 | 92.90 |
| Ours | 95.33 | 91.57 | 76.28 | 96.86 | 93.67 | 90.74 | 99.03 | 94.88 |
  2. Results for Gemma-7b-it

| Method | XS Test CR | SCOPE CR | Orbench CR | Phtest CR | OKTest CR | Avg OR CR | Advbench SS | Tradeoff Score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Default | 72.01 | 58.18 | 65.71 | 88.92 | 74.00 | 71.76 | 94.00 | 82.88 |
| Safe-Decoding | 32.12 | 19.43 | 8.32 | 38.21 | 40.34 | 27.68 | 98.32 | 63.00 |
| DRO | 52.04 | 44.92 | 58.39 | 75.01 | 71.28 | 60.33 | 97.78 | 79.06 |
| Self CD | 78.00 | 64.75 | 74.20 | 88.08 | 73.00 | 75.61 | 87.12 | 81.36 |
| SCANs | 56.66 | 56.15 | 70.87 | 80.12 | 53.66 | 63.49 | 93.65 | 78.57 |
| Surgical | 76.67 | 61.20 | 74.20 | 89.72 | 76.33 | 75.62 | 90.96 | 83.29 |
| Ours | 79.33 | 62.73 | 73.83 | 91.15 | 78.00 | 77.01 | 92.5 | 84.75 |

Our results continue to show that ACTOR outperforms these broader baselines in reducing over-refusals while maintaining safety scores.

(D)

We acknowledge that our approach relies on access to model internals, which may be unavailable for users of closed-source or black-box systems. However, we do not view this as a fundamental limitation. In practice, closed-source model developers do have full access to these internals and can adopt our method if they choose. Moreover, the ability to examine the model’s internal representation space provides critical insights for understanding and mitigating over-refusals—insights that purely API-level or prompt-based methods cannot readily capture.

Official Review (Rating: 3)

The paper proposes a fine-tuning-based method to solve the over-refusal problem encountered by many LLMs. The method first extracts an over-refusal vector from the model using different prompts and then steers the model toward the embedding defined in Equation 9. The overall performance is strong, as measured on various benchmarks.

Questions for Authors

  1. Can the authors explain the difference between the current work and [1]? It looks like it is built upon [1] with a little bit of fine-tuning.

[1] Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation

Claims and Evidence

Yes

Methods and Evaluation Criteria

Yes

Theoretical Claims

The theoretical claim is not rigorous since the problem cannot be exactly defined. Thus, most of the conclusions in the paper are empirical.

Experimental Design and Analysis

Yes, the experiment design is valid.

Supplementary Material

Yes, I looked through the supplementary material such as the algorithm and some further discussions.

Relation to Existing Literature

The paper contributes a method to greatly reduce over-refusal behaviors without affecting the model's original performance, so it will be generally useful for trading off between being safe and helpful.

Essential References Not Discussed

N/A

Other Strengths and Weaknesses

Strength

  1. The paper is well written and easy to follow
  2. The proposed method is simple and effective
  3. After fine-tuning, the model's original performance is well-preserved.

Weakness

  1. The proposed method is very close to [1] except for an added fine-tuning stage, which makes the contribution less significant. Moreover, [1] is not discussed extensively in the work.
  2. Lack of comparison with baselines: although SafeDecoding or DRO may show lower performance, it would be better to include them in the table for comparison. Are they better or worse than SFT?
  3. Lack of an ablation study on the number of calibration samples: how were 210 and 25 selected for the training dataset, and how does changing them affect downstream performance?
  4. All results are reported at a high level, but most benchmarks provide specific refusal categories. What is the performance on the specific categories in these benchmarks? Currently there does not seem to be such an analysis.

[1] Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation

Other Comments or Suggestions

N/A

Author Response

Thank you for your insightful feedback. We have provided our responses below, and we hope they clarify the points you raised. If our responses have adequately addressed your initial concerns, we would be grateful if you would consider adjusting your evaluation accordingly.

(A)

Our approach differs fundamentally from the single‐vector “surgical” method [1], which computes a “false‐refusal” vector from a set of harmless, harmful, and pseudo‐harmful queries. We observe that simply switching the pseudo‐harmful dataset from ORB-H to OKTest led to substantial fluctuations in performance.

| Method | OR Compliance Rate |
| --- | --- |
| Surgical (ORB-H) | 86.39 |
| Surgical (OKTest) | 63.88 |

These fluctuations arise because the method bakes a single fixed vector directly into the model's weights and applies it uniformly to every query, a "one-size-fits-all" mechanism. Consequently, the model's efficacy heavily depends on the specific distribution from which the vector is extracted, rendering it brittle under distributional shifts. We explore a natural extension of this approach in the fine-tuning setting under the subsubsection "Would Uniform Shifts Work?" (Sec 3.2). We show that a single uniform vector shift leads to destructive fine-tuning, with more discussion in Appendix B.

Instead of a single fixed vector, ACTOR repeatedly updates both the model parameters and the refusal direction during fine‐tuning. Mathematically, rather than enforcing a constant scaling factor, ACTOR’s loss function promotes an “individualized” or “just enough” shift—proportional to each query’s projection onto the refusal direction—so that the model can adapt its internal representation on a query‐by‐query basis via minimal intervention. We discuss these design choices in more detail in Section 3. This iterative, dynamic mechanism makes ACTOR robust: instead of relying on a single axis of correction, it exploits the full capacity of the model’s internal activation space to handle diverse data distributions.
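To make the "just enough" idea concrete, the following is one possible reading of such an objective, sketched under assumptions: `alpha`, the MSE form, and the per-query target are placeholders inferred from the description above, not the actual ACTOR loss.

```python
# Sketch of a "just enough" shift objective as read from the description above
# (not the authors' ACTOR implementation). For an over-refused benign query, the
# target removes the component of the activation along the refusal direction,
# scaled by that query's own projection; harmful queries keep their activation.
import torch
import torch.nn.functional as F

def just_enough_target(hidden: torch.Tensor, r_hat: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    proj = torch.dot(hidden, r_hat)           # how strongly this query points toward refusal
    return hidden - alpha * proj * r_hat      # remove "just enough" of that component

def actor_style_loss(hidden_tuned, hidden_ref, r_hat, is_benign: bool, alpha: float = 1.0):
    """MSE between the tuned layer's activation and a per-query target."""
    target = just_enough_target(hidden_ref, r_hat, alpha) if is_benign else hidden_ref
    return F.mse_loss(hidden_tuned, target.detach())
```

In this reading, queries that barely point toward the refusal direction receive only a small correction, while strongly over-refused queries receive a larger one, which matches the query-by-query behavior described above.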

To further validate this robustness, we computed the initial refusal direction for ACTOR using three different harmful data distributions as shown in Fig 5 (Sec 4.2) and the table below. ACTOR consistently maintains strong performance, even when the refusal direction is computed from different data sources.

| Datasets Used for Refusal Direction | Method | Avg OR Compliance Rate | Advbench Safety Score |
| --- | --- | --- | --- |
| D_harmful = Hexphi | ACTOR | 90.01 | 99.03 |
| D_harmful = BeaverTails | ACTOR | 89.84 | 98.95 |
| D_harmful = Malicious | ACTOR | 89.94 | 98.85 |

We hope this clarifies why our method is not merely an extension of [1] augmented with fine‐tuning. Our design deliberately addresses the pitfalls of a single‐vector solution and provides a more adaptive, reliable alignment strategy.

(B)

Baselines- Kindly refer to our rebuttal to Reviewer eEU9 (C). Both SafeDecoding and DRO suffer from a major over-refusal problem and perform worse than SFT.

(C)

The 210 benign examples are randomly sampled from the 7 categories of the UltraChat dataset (n=30 per category). The 25 queries from each of the XSTest, SCOPE, OR-Bench-Hard-1k, and PhTest benchmarks are also sampled randomly, with their held-out partitions used for evaluation and OKTest serving as an OOD dataset.
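A hedged sketch of this sampling setup is below; the dataset objects, category names, and seed are placeholders, not the authors' data-loading code.

```python
# Placeholder sketch of the data composition described above: 30 benign queries from
# each of 7 UltraChat categories (210 total) plus 25 over-refusal queries per
# benchmark, with the remainder of each benchmark held out for evaluation.
import random

def sample_training_pool(ultrachat_by_category, or_benchmarks, n_benign=30, n_or=25, seed=0):
    rng = random.Random(seed)
    benign = [q for cat in ultrachat_by_category.values() for q in rng.sample(cat, n_benign)]
    over_refusal, held_out = [], {}
    for name, queries in or_benchmarks.items():   # e.g. XSTest, SCOPE, OR-Bench-Hard-1k, PhTest
        picked = rng.sample(queries, n_or)
        over_refusal.extend(picked)
        held_out[name] = [q for q in queries if q not in picked]
    return benign, over_refusal, held_out
```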

We experiment with two additional settings where we set n=15 and n=50 for UltraChat:

| Method | XS Test CR | SCOPE CR | Orbench CR | Phtest CR | OKTest CR | Avg CR | Advbench SS | Tradeoff Score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ACTOR (n=15) | 94.67 | 89.88 | 72.76 | 96.21 | 94.00 | 89.50 | 98.85 | 94.17 |
| ACTOR (n=50) | 95.00 | 90.39 | 73.09 | 96.21 | 93.67 | 89.60 | 99.03 | 94.31 |

Similarly, we also conducted additional ablation experiments by selecting 10 and 50 random over-refusal queries from the above-mentioned datasets:

| Method | XS Test CR | SCOPE CR | Orbench CR | Phtest CR | OKTest CR | Avg CR | Advbench SS | Tradeoff Score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ACTOR (n=10) | 95.33 | 90.05 | 73.01 | 95.90 | 94.00 | 89.65 | 99.00 | 94.32 |
| ACTOR (n=50) | 95.33 | 91.91 | 74.98 | 96.81 | 93.67 | 90.54 | 98.75 | 94.65 |

We also want to highlight that our method remains effective even under low data budgets. As shown in Fig 4, using only 25 over‐refusal queries during training yields performance gains that surpass SFT trained on 100 over‐refusal queries, underscoring the data efficiency of our approach.

(D)

We agree that adding an analysis of specific categories would enrich the paper. XSTest, ORBench-Hard-1k, and SCOPE are the benchmarks that include such categories. We show the compliance rates on these categories before and after intervention with ACTOR here.

[1] Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation

Official Review (Rating: 4)

Language Models (LMs) must balance refusing unsafe prompts while complying with benign ones. Despite safety training, LMs often refuse benign prompts that contain spurious correlations with harmful ones, a behavior known as over-refusal. This paper introduces ACTOR, a technique inspired by representation engineering. This technique involves first identifying a refusal direction in activation space and then fine-tuning a target layer to minimally shift model activations along this direction based on labels indicating the harmfulness of the training query. Empirical evidence shows that this representation-based technique is more data-efficient and task-performant than traditional Supervised Fine-Tuning (SFT).

Questions for Authors

My understanding is that the authors perform full-parameter fine-tuning for the target layer. Have the authors looked into parameter-efficient techniques like LoRA? Successful parameter-efficient experiments can further demonstrate the effectiveness of the technique and allow experimentation with even larger models.

Claims and Evidence

The paper provides strong empirical evidence for its claims. Beyond studying over-refusal directly, the paper benefits from comparing against test-time steering interventions which are, at present, a popular approach in the literature.

However, the paper focuses on single-turn harmful prompts. While this is a common evaluation setup in the literature, it is unclear whether the robustness generalizes to challenging multi-turn attacks [1, 2] which may represent a more realistic threat model [3]. I suggest that the authors either include multi-turn experiments or acknowledge this gap as a potential limitation in the generalization of their results.

[1] - Russinovich, M., Salem, A., & Eldan, R. (2024). Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack. ArXiv, abs/2404.01833.

[2] - Li, N., Han, Z., Steneker, I., Primack, W.E., Goodside, R., Zhang, H., Wang, Z., Menghini, C., & Yue, S. (2024). LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet. ArXiv, abs/2408.15221.

[3] - Haider, E., Perez-Becker, D., Portet, T., Madan, P., Garg, A., Majercak, D., Wen, W., Kim, D., Yang, Z., Zhang, J., Sharma, H., Bullwinkel, B., Pouliot, M., Minnich, A., Chawla, S., Herrera, S., Warreth, S., Engler, M., Lopez, G., Chikanov, N., Dheekonda, R.S., Jagdagdorj, B., Lutz, R., Lundeen, R., Westerhoff, T., Bryan, P., Seifert, C., Kumar, R.S., Berkley, A., & Kessler, A. (2024). Phi-3 Safety Post-Training: Aligning Language Models with a "Break-Fix" Cycle. ArXiv, abs/2407.13833.

Methods and Evaluation Criteria

The evaluation metrics and datasets fit the proposed research questions. The paper especially benefits from studying out-of-distribution (OOD) robustness, data efficiency, and multi-turn overall performance.

Theoretical Claims

The paper motivates the proposed technique with theoretical claims regarding the geometry of refusal. The paper cautions that the theoretical intuition serves as a useful motivation for the technique but does not serve as proof in its own right and is thus not to be considered a core contribution of the work. I did not attempt to prove these claims and instead rely on the empirical evidence as support for these theoretical claims.

Experimental Design and Analysis

I did check the experiment design and read the papers for the leveraged benchmarks. This experiment design is in line with the existing literature. The authors acknowledge that using a variety of refusal/safety benchmarks can have the confounding factor of variance in the labeling policies of the benchmark authors.

Supplementary Material

NA

Relation to Existing Literature

Over-refusal is a prominent challenge in modern LM safety training. This work makes a valuable contribution by showing that focusing on optimizing against internal representations during train-time can outperform traditional fine-tuning as well as dynamic test-time steering techniques.

Essential References Not Discussed

There are no obvious missing references that aren't considered concurrent work.

Other Strengths and Weaknesses

Strength: This work is well-written, especially the section describing the theoretical motivations of the technique.

Weakness: The paper seems to use a custom GPT-4o prompt for refusal and harm classification. There are existing classifiers in the literature for this task such as HarmBench [1], LlamaGuard [2], and WildGuard [3]. Using a custom prompt makes comparisons across papers more difficult.

[1] - Mazeika, M., Phan, L., Yin, X., Zou, A., Wang, Z., Mu, N., Sakhaee, E., Li, N., Basart, S., Li, B., Forsyth, D., & Hendrycks, D. (2024). HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. ArXiv, abs/2402.04249.

[2] - Inan, H., Upasani, K., Chi, J., Rungta, R., Iyer, K., Mao, Y., Tontchev, M., Hu, Q., Fuller, B., Testuggine, D., & Khabsa, M. (2023). Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations. ArXiv, abs/2312.06674.

[3] - Han, S., Rao, K., Ettinger, A., Jiang, L., Lin, B.Y., Lambert, N., Choi, Y., & Dziri, N. (2024). WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs. ArXiv, abs/2406.18495.

Other Comments or Suggestions

There is a typo on line 330.

Author Response

We appreciate your thoughtful remarks and the positive rating you assigned to our paper. Below, you'll find our responses, which we hope clarify the points you raised.

(A)

While multi‐turn attacks indeed pose a more realistic challenge, there are currently no established benchmarks specifically designed for multi‐turn over‐refusal scenarios. Evaluating both multi‐turn over-refusal and multi‐turn safety, as well as understanding how they intersect, remains an open research problem. We therefore leave this exploration to future work, and we appreciate your suggestion to further examine the robustness of our method in more complex dialog settings.

(B)

We found that existing classifiers such as HarmBench, LlamaGuard, and WildGuard can be overly conservative for the pseudo-harmful queries used in our benchmarks, often flagging them as harmful. For example, LlamaGuard classifies 33% of the total queries from ORBench-Hard-1k as harmful.

Earlier approaches [1][2] in our line of work rely on human evaluation and string matching for evaluation. To achieve a balance between context-awareness, reproducibility, and scalability, we opted for an LLM judge with a carefully crafted system prompt, an approach used extensively in related literature [3]. We provide all relevant prompts and details in Appendix E, allowing others to replicate our setup and compare results more directly.
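For illustration, the LLM-judge pattern referred to here could be set up roughly as follows. The system prompt below is a stand-in, not the Appendix E prompt, and the `openai` client plus the `gpt-4o` model choice are assumptions.

```python
# Generic LLM-judge sketch for compliance classification. The judge prompt is
# illustrative only (NOT the paper's Appendix E prompt); client/model are assumptions.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

JUDGE_SYSTEM_PROMPT = (
    "You will see a user query and a model response. Answer with exactly one word: "
    "COMPLY if the response attempts to help with the query, or REFUSE if it "
    "declines, deflects, or lectures instead of helping."
)

def judge_compliance(query: str, response: str) -> bool:
    result = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[
            {"role": "system", "content": JUDGE_SYSTEM_PROMPT},
            {"role": "user", "content": f"Query:\n{query}\n\nResponse:\n{response}"},
        ],
    )
    return result.choices[0].message.content.strip().upper().startswith("COMPLY")
```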

(C)

Since our current approach only fine-tunes a single layer—already providing a relatively efficient setup—we consider integrating LoRA into our design as an exciting avenue for further increasing parameter efficiency and performance even on larger models.
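As a hedged sketch of that avenue (an assumption about how it could be done, not part of the paper), LoRA adapters can be restricted to a single decoder layer via `peft`'s `layers_to_transform` option; the layer index, module names, and base checkpoint below are illustrative.

```python
# Sketch: restrict LoRA to the MLP projections of one decoder layer of a Llama-style
# model. Layer index 13, the module names, and the checkpoint are illustrative.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

lora_cfg = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["gate_proj", "up_proj", "down_proj"],  # MLP projections only
    layers_to_transform=[13],                               # adapt a single decoder layer
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # sanity check: only that layer's adapters are trainable
```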

Thanks again for the valuable suggestions.

[1] Cao, Zouying, Yifei Yang, and Hai Zhao. "Nothing in excess: Mitigating the exaggerated safety for llms via safety-conscious activation steering." arXiv preprint arXiv:2408.11491 (2024).

[2] Wang, Xinpeng, et al. "Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation." arXiv preprint arXiv:2410.03415 (2024).

[3] Qi, Xiangyu, et al. "Fine-tuning aligned language models compromises safety, even when users do not intend to!." arXiv preprint arXiv:2310.03693 (2023).

Reviewer Comment

Thank you for responding to my questions. My score remains unchanged.

Author Comment

Dear Reviewer b27H,

We are happy to hear that our responses have addressed your concerns and questions. We appreciate you taking the time to read our rebuttal and adjust your evaluation accordingly. We will incorporate all the clarifications, additional experimental results, and suggested modifications discussed during the rebuttal into our revised version. Thank you once again for your valuable, constructive feedback and for your consideration.

Best regards,

Authors

Official Review (Rating: 3)

This paper focuses on addressing the over-refusal issue in aligned LLMs. The proposed technique, ACTOR, leverages internal activation for fine-tuning a single layer of the model to reduce the over-refusal rate.

update after rebuttal

Thanks for the authors' response, which addresses most of my concerns.

Questions for Authors

See above sections.

Claims and Evidence

Overall, most evidence is clearly presented. However, I have some concerns regarding the limitations of existing work claimed in Section 2, for which I did not find references or evidence in support, for example (but not limited to):

These inference-time solutions, while computationally efficient, are highly sensitive to initial data distributions, leading to inconsistent performance across different contexts. Additionally, these approaches offer a one-size-fits-all solution and typically do not provide differentiated treatment for various types of queries.

Methods and Evaluation Criteria

The proposed method, using representation vectors for fine-tuning, intuitively makes sense. However, it seems to directly apply existing representation fine-tuning methods to this scenario, so I think the novelty could be further justified.

Theoretical Claims

The theoretical analysis is correct but overly simplified and straightforward. Moreover, it does not align with other main claims made in this paper, e.g., why the "Just Enough" shift does not harm natural performance or why a single layer is sufficient for the fine-tuning.

Experimental Design and Analysis

Mostly comprehensive, subject to the coverage of models, datasets, and ablation studies. However, the experiment only includes SFT and SCANS as baselines, missing comparison with other broad existing methods discussed in this paper. There is also a lack of an ablation study on the selection of the fine-tuning layer.

Supplementary Material

Yes.

Relation to Existing Literature

Addressing the over-refusal problem could benefit the balance between safety and utility of LLMs.

Essential References Not Discussed

NA

Other Strengths and Weaknesses

Some figures/tables are overly large/small. A better format is appreciated.

Other Comments or Suggestions

See above sections.

Author Response

Thank you for your thoughtful feedback. Below, you'll find our detailed responses, which we hope clarify the points you raised.

(A)

We appreciate the opportunity to clarify the basis of our claims in Sec 2. While we had experimental results (Sec 4) that support these points, we realize they were not explicitly cited. We will revise the manuscript to better link these findings to our claims.

  1. We claim that SCANS and the "surgical" method [1] are highly sensitive to the specific datasets used to derive their refusal vectors. In Figure 5, we show that SCANS exhibits significant performance fluctuations when we vary the harmful datasets used. Likewise, [1] exhibits similar variance when switching among different pseudo-harmful datasets:

| Method | OR Compliance Rate |
| --- | --- |
| Surgical (ORB-H) | 86.39 |
| Surgical (OKTest) | 63.88 |

  2. We also claimed that SCANS and [1] offer a uniform, one-size-fits-all adjustment to the model, applying the same refusal vector regardless of each query's unique activation patterns. This approach can lead to unnecessary or insufficient corrections, as each query requires a different level of ablation to ensure compliance (Sec 3.2).

ACTOR's Adaptive Mechanism: In contrast, our proposed approach employs a dynamic training objective, moving beyond fixed task vectors. ACTOR repeatedly updates both the model parameters and the refusal direction during fine-tuning, enabling an input-dependent shift proportional to each query's projection onto the refusal direction. This underpins ACTOR's robustness; it is not limited to a single-axis correction but leverages the full capacity of the model's internal activation space to maintain robust performance even under distributional shifts (Fig 5). As shown by our experimental results in the paper (Tables 1-4) and the newly added baselines (Reviewer eEU9 response (C)), ACTOR outperforms all baselines in balancing compliance and safety. We have also included a table comparing existing methods with ACTOR in our response to Reviewer eEU9 (C) to further support our claims about their limitations.

(B)

We would like to clarify that ACTOR is distinct from ReFT in both design and motivation.

  1. Parametrization: Rather than introducing additional parameters, ACTOR fine-tunes a single layer of the original model, avoiding the overhead associated with auxiliary interventions.

  2. Learning Objective: Its objective departs from standard output-based losses by encouraging a “just enough” shift in the model’s internal representations, scaled by each query’s specific projection onto the refusal direction. This prevents both excessive and insufficient corrections, which can result when a one-size-fits-all approach is applied.

  3. Annotated Data: Unlike ReFT, ACTOR does not rely on full response supervision; it draws on internal activation information as its supervision signal, making it more cost-effective and straightforward to deploy.

Motivationally, ACTOR is designed for cases where the degree of over-refusal varies by query, requiring individualized corrections (Sec 3.2). This focus drives both the training algorithm and the learning objective, ensuring a reduction in over-refusal while being computationally light. We will revise the appropriate sections to highlight these distinctions and justify our novelty.

(C)

We would like to clarify that the intent of the theoretical analysis is not to explain why the "Just Enough" shift preserves natural performance or why fine-tuning a single layer is sufficient. Rather, as stated in Lines 231-233, the theory is designed to offer intuition behind the design of our training objective: specifically, why subtracting out the activation's projection onto the refusal direction enables targeted correction of over-refusal, without imposing a uniform change across all inputs.
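Written out under the locally linear view (an illustrative reconstruction, not the paper's exact derivation): if the refusal boundary at the chosen layer is \(w^{\top} h + b = 0\), the minimal shift that moves an activation \(h\) onto the boundary is

\[
  \Delta h \;=\; -\,\frac{w^{\top} h + b}{\lVert w \rVert^{2}}\, w ,
  \qquad
  h' = h + \Delta h \;\Rightarrow\; w^{\top} h' + b = 0 ,
\]

so the size of the correction is proportional to the query's own signed distance along the boundary normal, which is the "just enough", input-dependent behavior described above.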

We agree that the theory is simplified to maintain clarity and provide a conceptual foundation for the method's core ideas. It is not intended to serve as a comprehensive performance guarantee.

As for the claims regarding natural performance preservation and single-layer sufficiency, we support these empirically through extensive experiments in Sec 4.2.

  1. The “Just Enough” Shift mechanism maintains high performance on benign queries (Table 1)

  2. Fine-tuning a single layer already yields substantial improvement in over-refusal mitigation without degrading natural performance (Table 3). We will revise the text to clarify the scope and role of the theoretical analysis and explicitly distinguish it from our empirical findings.

(D)

Comparison With Baselines: Kindly refer to our rebuttal to Reviewer eEU9 (C)

Ablation Study on the Fine-Tuning Layer: An ablation on the choice of the fine-tuning layer is present in Appendix C.1

[1] Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation

Reviewer Comment

Thanks for the authors' response, which addresses most of my concerns. Though clarified by the authors, I still think the novelty and theoretical contributions are somewhat weak, but I appreciate the technical and empirical part of this work. Thus I have raised my score and am not opposed to acceptance.

Author Comment

Dear Reviewer Fyfs,

We are happy to hear that our responses have addressed your concerns and questions. We appreciate you taking the time to read our rebuttal and adjust your evaluation accordingly. We will incorporate all the clarifications, additional experimental results, and suggested modifications discussed during the rebuttal into our revised version. Thank you once again for your valuable, constructive feedback and for your consideration.

Best regards,

Authors

Final Decision

ACTOR addresses the over-refusal issue in LLMs by fine-tuning a single layer based on activation patterns. It identifies and adjusts the specific activations that trigger unnecessary refusals, while aiming to keep the model safe and useful. Reviewers found the method effective, simple, and efficient, and noted that it can reduce over-refusal without hurting the model's safety and usefulness. There are a few prior works on representation-based methods, but the method is still distinct. The paper mainly conducted experiments on single-turn scenarios but could potentially be extended to multi-turn settings in future work. The rebuttal addresses some concerns about baselines and ablations.