PaperHub
Rating: 7.8/10 · Poster · 2 reviewers
Scores: 4, 4 (min 4, max 4, std 0.0)
ICML 2025

POROver: Improving Safety and Reducing Overrefusal in Large Language Models with Overgeneration and Preference Optimization

OpenReview · PDF
Submitted: 2025-01-23 · Updated: 2025-07-24
TL;DR

This paper investigates the impact of using advanced language models as teachers on the safety and usefulness of student models and explores preference optimization methods to mitigate overrefusals.

Abstract

Keywords
LLM safety · LLM usefulness · Overrefusal in LLMs · Responsible AI

Reviews and Discussion

Review
Rating: 4

Making LLMs behave in a safe fashion, a major research concern, often comes with unwanted side effects such as overrefusal of prompts that may merely seem unsafe. This paper makes two contributions. First, it shows an improvement when using finetuning data overgenerated from a more advanced teacher LLM. Second, it presents a preference optimization algorithm that further improves performance, reducing overrefusals while maintaining safety.

Update after rebuttal

Because I didn't find any substantive issues with the paper during the review process, my positive assessment of the paper continues.

Questions for Authors

None

Claims and Evidence

Yes, benchmark results across several teacher and student models provide robust evidence for claims regarding improvement and accuracy.

Methods and Evaluation Criteria

Yes, the evaluation builds on existing relevant datasets and uses Llama Guard, which is well established.

Theoretical Claims

The claims of this paper are empirical, not theoretical.

Experimental Design and Analysis

Algorithm 1 appears to be a valid method for generating preference data. The use of multiple student and teacher models, as well as the tuning of the containment threshold and the increase in ASD, all give me confidence in the robustness of the results.
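
[For context, a minimal hedged sketch of what a preference-data generator of this kind might look like; this is not the paper's actual Algorithm 1. The `refusal_score` detector is a toy stand-in, `generate` is any sampling function, and the interpretation of the containment threshold as a refusal cutoff is an assumption for illustration only.]

```python
# Illustrative sketch only -- not the paper's actual Algorithm 1.
# `refusal_score` is a toy stand-in for a refusal detector; `generate(prompt, n)`
# is any function returning n sampled completions for a prompt.
from typing import Callable, List

def refusal_score(completion: str) -> float:
    """Toy proxy: fraction of canned refusal phrases contained in the completion."""
    phrases = ["i cannot", "i can't", "i'm sorry", "as an ai"]
    return sum(p in completion.lower() for p in phrases) / len(phrases)

def build_preference_pairs(prompts: List[str],
                           generate: Callable[[str, int], List[str]],
                           containment_threshold: float = 0.03):
    """For each seemingly toxic (but benign) prompt, sample several completions and
    pair a compliant completion (chosen) with an overrefusing one (rejected)."""
    pairs = []
    for prompt in prompts:
        completions = generate(prompt, 8)  # overgenerate candidate completions
        helpful = [c for c in completions if refusal_score(c) <= containment_threshold]
        refusing = [c for c in completions if refusal_score(c) > containment_threshold]
        if helpful and refusing:
            pairs.append({"prompt": prompt,
                          "chosen": helpful[0],     # preferred: answers the benign request
                          "rejected": refusing[0]}) # dispreferred: overrefusal
    return pairs
```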

Supplementary Material

Yes, the appendix appears complete, and contains the prompt examples I had hoped to see while reading through the main text of the paper.

Relation to Existing Literature

This paper contributes meaningfully to the tradeoff between safety and helpfulness in LLM alignment. It also applies preference optimization (an established technique) in a novel context.

Missing Important References

None

Other Strengths and Weaknesses

While the contributions are novel, the real-world applications are the greatest strength of the paper. Overrefusals cause problems for users of LLMs every day, and mitigating them without impacting safety is valuable.

Other Comments or Suggestions

I would change the 'toxic question example' in the appendix figure 7 to something a little more explicitly toxic.

Update

The authors' updated question is more appropriate and illustrative.

Author Response

We thank the reviewer for their review and comments.

We will answer the points raised individually.

1. (Other Comments Or Suggestions) I would change the 'toxic question example' in the appendix figure 7 to something a little more explicitly toxic.

We thank the reviewer for this suggestion. In our revision, we will include the following example which is more explicitly toxic. It targets personal privacy.

https://drive.google.com/file/d/1X8yfzFXwMTwxgXzXglaiV8cNAqYIBted/view?usp=sharing

Thank you very much for your valuable time and thoughtful review! We hope you would consider increasing your score if we have addressed your suggestion. Please let us know if you have additional comments and we are happy to follow up. Thanks!

Review
Rating: 4

This paper addresses the challenges of balancing safety and usefulness in large language models. It explores the effects of using more advanced teacher models to generate completions for instruction finetuning. Their main contributions include:

  • They show that using more advanced teacher models (e.g. GPT-4o) to overgenerate completions for both general-purpose and toxic prompts during instruction finetuning improves the safety-usefulness trade-off in student models.
  • The paper introduces POROver, a post-finetuning method that complements safety finetuning and reduces the overrefusal rate while maintaining high safety levels.

The authors evaluate their method on multiple model families (Llama-3 and Phi-3) and sizes (3B to 8B) and come to similar conclusions for all models tested.
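
[As a rough illustration of the overgeneration idea summarized above — a hedged sketch, not the authors' exact pipeline — the snippet below samples several candidate completions per prompt from a GPT-4o teacher via the OpenAI chat API and keeps the highest-ranked one for the finetuning set. The `score_completion` ranking is a hypothetical placeholder for the real selection criterion.]

```python
# Hedged sketch of overgenerating teacher completions for instruction finetuning.
# `score_completion` is a hypothetical placeholder for the actual selection criterion.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def score_completion(prompt: str, completion: str) -> float:
    """Placeholder ranking: replace with a usefulness/safety scorer."""
    return float(len(completion))  # toy heuristic only

def overgenerate(prompt: str, n: int = 4, model: str = "gpt-4o") -> str:
    """Sample n candidate completions from the teacher and keep the best-scoring one."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        n=n,
        temperature=0.8,
    )
    candidates = [choice.message.content for choice in response.choices]
    return max(candidates, key=lambda c: score_completion(prompt, c))

# Usage: finetuning_data = [{"prompt": p, "completion": overgenerate(p)} for p in prompts]
```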

Update After Rebuttal

The authors added experimental results for a wider variety of model families, as well as jailbreaking results that strengthened the overall findings.

Questions for Authors

  • The paper focuses on GPT-4o as the teacher model. Have you experimented with other advanced models as teachers, and if so, how do the results compare?
  • How does the computational cost of your approach scale with model size? Do you anticipate any challenges in applying these methods to even larger language models?
  • Your POROver method shows promising results in reducing overrefusal. Have you investigated how this approach performs on more diverse or challenging types of prompts beyond the benchmarks used in the paper? For example, how does it handle ambiguous requests or prompts that require more nuanced ethical reasoning?

Claims and Evidence

Overall, the claims made in the paper are generally supported by extensive experimentation; however, there are a few areas where the evidence could be strengthened:

  • The claim that POROver maintains "comparable safety levels" while increasing usefulness seems to be supported, but the slight decrease in safety could be discussed more explicitly.
  • The generalizability of the results across different model families is demonstrated, but the sample size (two model families) is relatively small. Including results from more diverse model families would strengthen this claim.

Methods and Evaluation Criteria

The use of established benchmarks like AlpacaEval, OR-Bench, and XSTest provides a solid foundation for the evaluations and seems appropriate for what the paper attempts to show. The choice of GPT-4o as the teacher model is completely reasonable, but comparing results with other advanced models (e.g. Claude, Gemini) could provide more insight into how well the approach generalizes.

Theoretical Claims

This paper does not make any significant theoretical claims and is primarily empirical in nature.

Experimental Design and Analysis

The experimental designs appear sound and well executed. They use appropriate statistical measures and isolate the effects of different components during their experimentation. A few observations:

  • The human evaluation on XSTest Seemingly Toxic dataset (Appendix F.1) is a strong addition, validating the automated metrics.
  • I appreciated the analysis of saturation with ASD to show the data efficiency of the approach.

Supplementary Material

I reviewed the supplementary material found in Appendices D, E, and F. I appreciate the extensive experimentation shown by this extra information and that I could find the details of most of their experiments here.

Relation to Existing Literature

This paper builds on several important areas in the field of LLM safety and alignment such as instruction finetuning, safety-utility trade-off, preference optimization methods, and attempting to reduce overrefusal. Getting the desired balance between safety and usefulness in models is an extremely important field of research as models continue to improve and gain new capabilities.

Missing Important References

Not that I'm aware of.

Other Strengths and Weaknesses

Strengths:

  • Extensive experimentation to demonstrate their claims.
  • The proposed methods are practical and show significant improvements over baselines.
  • The release of their generated datasets is valuable for further research in the community.
  • The paper's proposed method POROver can be used with a variety of preference optimization methods.

Weaknesses:

  • While the results are promising, the discussion of the long-term generalization of the proposed methods to real-world tasks could be explored further.
  • While the paper does cover two model families across sizes from 3B to 8B, the sample size is still relatively small. Including a wider range of model architectures and sizes could show the potential generalizability of the proposed approaches.

Other Comments or Suggestions

Typos:

  • On line 103, "and increase its the Not-Overrefusal Rate..."
  • On line 256, "on the same initial model instance instance of reach set..."
  • On line 309, "with a modest reduction ... usefulness" (in usefulness?)
  • Figures 3, 9, and 11 each have a typo where it says "safey" rather than "safety."
  • Table 5's caption says "annoted" (annotated?)

Author Response

We thank the reviewer for their review and comments.

We will answer the points raised individually.

1. (Claims And Evidence) The slight decrease in safety after POROver could be discussed more explicitly.

We thank the reviewer for this suggestion. We believe the slight decrease in safety may stem from our grid search over the containment threshold parameter. The intermediate values we tested (0.01, 0.03, 0.1) correspond to slightly different points along the safety–usefulness curve. This suggests that our current choices for the containment threshold may be suboptimal, and finer or adaptive tuning could further improve results. In future work, we plan to explore automated methods for optimizing the containment threshold to better balance safety and usefulness.

We will add this discussion to our revised manuscript.
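
[For concreteness, a minimal hedged sketch of the kind of threshold selection discussed above: run the optimization once per candidate containment threshold, evaluate safety and usefulness, and keep the most useful model that stays above a safety floor. The safety-floor rule and all function names here are assumptions for illustration, not the authors' actual procedure.]

```python
# Hypothetical sketch of picking a containment threshold along the
# safety-usefulness curve; `train_and_eval` stands in for running preference
# optimization with threshold t and evaluating the resulting model.
def select_threshold(thresholds, train_and_eval, min_safety=0.95):
    """train_and_eval(t) -> (safety, usefulness) for a model optimized with threshold t."""
    best_t, best_usefulness = None, float("-inf")
    for t in thresholds:
        safety, usefulness = train_and_eval(t)
        # keep the most useful model among those above an assumed safety floor
        if safety >= min_safety and usefulness > best_usefulness:
            best_t, best_usefulness = t, usefulness
    return best_t

# e.g., with the candidate values mentioned above:
# best = select_threshold([0.01, 0.03, 0.1], train_and_eval)
```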

2. Including a wider range of model architectures and sizes for student and teacher models could show the potential generalizability of the proposed approaches.

New student models: We have expanded our analysis to include Falcon-7B as an additional student model, increasing the total number of model families to 3. To further broaden the range of student model sizes evaluated, we also include results for Llama-3.2-11B. Therefore, our analysis now covers student models ranging from 3B to 11B in size. We leave the exploration of models larger than 11B to future work, as we did not have access to sufficient compute resources at this time.

Falcon-7B results: (To be included before April 2nd)
Llama-3.2-11B results: (To be included before April 2nd)

New teacher model: We agree that incorporating additional teacher models could provide deeper insight into the generalizability of our methods. Moreover, relying solely on a proprietary teacher model like GPT-4o may limit the broader real-world applicability of our approach and reduce its impact. To address these concerns, we have expanded our analysis to include results using Llama-3-70B, an open-weight model, as the teacher.

Llama-3-70B results: (To be included before April 2nd)

While the exact metric values for the student models vary slightly, all of our main conclusions remain consistent. We believe that including an open-weight teacher model as well as more student models strengthens the robustness of our findings, reduces dependency on proprietary models, and enhances the practical applicability of our methods.

3. While the results are promising, the discussion of the long-term generalization of the proposed methods to real-world tasks could be explored further.

We thank the reviewer for raising this important point. One key aspect of long-term generalization to real-world tasks is robustness to adversarial prompts, as real-world usage often exposes models to unexpected or malicious inputs. To evaluate this, we have added a detailed adversarial robustness analysis. Specifically, we test all of our student models against three adversarial attack methods: Prompt Automatic Iterative Refinement (PAIR), Greedy Coordinate Gradient (GCG), and hand-crafted jailbreaks from Jailbreak Chat (JBC), using the behavioral prompts from the JailbreakBench benchmark.

Jailbreaking results: https://drive.google.com/file/d/13aVd3igdd7JGZ_5PdfQRdcWOkF2Q19UU/view?usp=sharing

For our supervised finetuning approach—i.e., overgeneration with better teacher models—we find that it significantly improves adversarial robustness (measured by attack success rate) against GCG, PAIR, and JBC.

We further show that our preference optimization method does not compromise adversarial robustness across any of the three attack types. This demonstrates that it effectively reduces overrefusal without degrading safety under real-world adversarial conditions.

We think that these observations offer strong empirical support for the real-world reliability and generalization of our methods, especially in adversarial and high-stakes deployment settings.
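
[For readers unfamiliar with the metric, a small hedged sketch of how attack success rate (ASR) per attack method can be tallied from judged outputs. The record format and the `judge_unsafe` safety judge (e.g., something like Llama Guard) are assumptions for illustration, not our actual evaluation code.]

```python
# Hedged sketch: attack success rate (ASR) per attack method.
# Assumed record format: {"attack": "PAIR" | "GCG" | "JBC", "prompt": ..., "response": ...}
# `judge_unsafe(prompt, response) -> bool` stands in for a safety judge such as Llama Guard.
from collections import defaultdict

def attack_success_rates(records, judge_unsafe):
    totals, successes = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["attack"]] += 1
        if judge_unsafe(r["prompt"], r["response"]):  # attack counted as successful
            successes[r["attack"]] += 1
    return {attack: successes[attack] / totals[attack] for attack in totals}
```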

4. (Questions for Authors) How does the computational cost of your approach scale with model size? Do you anticipate any challenges in applying these methods to even larger language models?

Across the model sizes we have examined, we found that training larger models requires more GPU hours, although their convergence times remain similar. We have already included a numerical analysis in Appendix D of our original manuscript. Other challenges associated with larger models include increased GPU memory requirements and the need for more complex code to enable distributed training. We will discuss these challenges in our revised manuscript.

Thank you very much for your valuable time and thoughtful review! We hope you would consider increasing your score if we have addressed your suggestions and concerns. Please let us know if you have additional comments and we are happy to follow up. Thanks!

Reviewer Comment

I appreciate the added discussion points about the level of safety after using POROver, especially the jailbreaking results which would be a beneficial addition to the paper's results.

I acknowledge the discussion of the addition of more model sizes and architectures, but would love to see the results from adding those models to the experiments.

I'm grateful for the response about computational cost and about where to find more information on those metrics.

I appreciate the authors' response and revisions, and after reviewing them, I have decided to keep my original score of 4.

Author Comment

We thank the reviewer for their response and appreciate their approval of our answers to points 1, 3, and 4 in our initial response.

Regarding point 2, i.e., including a wider range of model architectures and sizes for student and teacher models:

New student models' results:

Falcon-3-7B: https://drive.google.com/file/d/1uW52KRIDrEHehRgrY4RV-yAV-R1PrUo6/view?usp=sharing

Llama-3.2-11B: https://drive.google.com/file/d/1NRtRs8fD8kho9EXTQq3VjHYaFeNv3Pu8/view?usp=share_link

As we have stated previously, while the exact metric values for the student models vary, all of our main conclusions remain consistent in both Falcon-3-7B and Llama-3.2-11B. We will add these results to our revised manuscript.

New teacher model results:

Llama-3-70B: https://drive.google.com/file/d/1riqK5Cprl9VycQRov5cCTnmRAVg_2w4B/view?usp=sharing

We note that we used Llama-3.1-8B as the student model for the Llama-3-70B teacher. Llama-3-70B, as an open-weight model, still offers substantial improvements over older teacher models (e.g., GPT-3 or GPT-3.5) and serves as a highly effective teacher for our methods. Thus, all of our main conclusions remain consistent with Llama-3-70B as the teacher. Compared to our experiments using GPT-4o, we observe that GPT-4o—being a less overrefusing model [1]—leads to student models that exhibit lower overrefusal, as expected.

[1] Cui, Justin, et al. "OR-Bench: An Over-Refusal Benchmark for Large Language Models." arXiv preprint arXiv:2405.20947, 2024.

Once again, we believe that including an open-weight teacher model as well as more student models strengthens the robustness of our findings, reduces dependency on proprietary models, and enhances the practical applicability of our methods. We will add these results and discussions to our revised manuscript.

Thank you so much for reviewing our responses and acknowledging them! Also, thank you for the typo reminders; we will fix them in our revised manuscript. We hope you would consider raising your score if you feel we have addressed your remaining suggestions. If you have any further comments, please let us know—we are happy to follow up. Thanks again!

Final Decision

[Because this paper only received two full reviews, I have fully reviewed it as well.]

This paper addresses the important problem of balancing safety and usefulness in LLMs. It contributes the POROver "algorithm," which is an elaborate data generation scheme; the data is then used in conjunction with a preference optimization algorithm to fine-tune the weights of an LLM. The paper is entirely empirical, with extensive experiments that explore the effectiveness of the data pipeline.

While the reviewers generally liked this paper, I find myself more lukewarm: the results seem solid, but it is unclear how generalizable they really are, both across LLM model families and across model sizes. The size question is especially concerning: while I appreciate that the authors lacked the computational resources to test larger models, refusals and other high-order reasoning seem to emerge at scale, so I am concerned that the method may or may not work on the larger models that are central to most applications of LLMs.

Nevertheless, given the importance of the subject and the intellectually interesting approach, I weakly recommend acceptance.