PaperHub
Rating: 6.0 / 10 · Poster · 4 reviewers
Individual ratings: 2, 4, 4, 5 (min 2, max 5, std dev 1.1)
Average confidence: 4.0
Novelty: 2.3 · Quality: 2.8 · Clarity: 3.3 · Significance: 2.8
NeurIPS 2025

RSafe: Incentivizing proactive reasoning to build robust and adaptive LLM safeguards

OpenReview · PDF
Submitted: 2025-05-10 · Updated: 2025-10-29

Abstract

Keywords
large language model, safety, moderation

Reviews and Discussion

Official Review
Rating: 2

This paper presents RSafe, a two-stage, reasoning-based guard model designed to improve robustness and adaptability of LLM safety moderation. In Stage 1 (Guided Reasoning), RSafe elicits explicit, policy-aware step-by-step reasoning (enclosed in <think></think> tags) over a user-specified safety taxonomy; in Stage 2 (Reinforced Alignment), it applies rule-based reinforcement learning (GRPO) with format and accuracy rewards to refine reasoning paths for correct safety judgments.
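For reference, a minimal sketch (an illustration, not the authors' code) of how a completion following this schema could be split into its reasoning trace and final verdict; the exact answer strings ("safe"/"unsafe") are an assumption made for this example.

```python
import re

def parse_guard_output(completion: str) -> tuple[str | None, str]:
    """Split a guard-model completion into (reasoning trace, final verdict).

    Assumes the <think>...</think> schema described above; the 'safe'/'unsafe'
    answer strings are assumed for illustration.
    """
    match = re.search(r"<think>(.*?)</think>", completion, flags=re.DOTALL)
    reasoning = match.group(1).strip() if match else None
    tail = completion[match.end():] if match else completion
    verdict = "unsafe" if "unsafe" in tail.lower() else "safe"
    return reasoning, verdict

# Example usage on a hypothetical completion.
print(parse_guard_output("<think>The prompt requests malware code.</think> unsafe"))
```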

Strengths and Weaknesses

Strengths

  1. By grounding decisions in explicit, user-specified safety policies, RSafe generalizes beyond fixed training taxonomies to novel harmful categories and adversarial jailbreaks, a capability lacking in traditional guard models.

  2. The enforced <think></think> schema yields verifiable reasoning traces at both train and test time.

Weaknesses

  1. At its core, RSafe applies reinforcement learning with two straightforward reward components—format compliance and classification accuracy. While effective, this combination follows established training patterns with minimal adaptation.

  2. The guided reasoning and RL are only conducted on Qwen2.5-7B. Weaker backbones may yield shallow rationales, limiting safety gains. A controlled comparison with GuardReasoner is needed.

  3. The mechanism by which dynamic safety categories are integrated into training is described only at a high level. Critical implementation details are omitted. Providing a complete account of data curation and training schedules would improve reproducibility and allow readers to gauge practical overhead.

  4. The adaptive mode presumes users can accurately define relevant safety taxonomies; in practice, overlooking critical policy categories (e.g., emergent threats) may degrade performance. No guidance or fallback is provided when policies are mis-specified.

Questions

See the weaknesses above.

Limitations

N/A

Justification for Final Rating

At its core, the method applies basic GRPO on top of prompt engineering to enforce format compliance and classification correctness. This is an incremental advance, given that GRPO has known limitations: It uses 0/1 outcome reward with no supervision of intermediate reasoning steps, causing instability if a single sub‑step is erroneous.

Despite adopting reasoning-based training, the overall effectiveness seems to remain inferior to GuardReasoner‑8B.

Formatting Issues

N/A

Author Response

Comment 1: Straightforward reward components with minimal adaptation

"At its core, RSafe applies reinforcement learning with two straightforward reward components—format compliance and classification accuracy. While effective, this combination follows established training patterns with minimal adaptation.

Thank you for raising this point. RSafe's "minimal" reward design is not simplistic by oversight but a deliberate choice: it provides a direct, noise-free way to optimize the safety moderation objective, avoiding reliance on intermediate reward models that could introduce noise, as well as the pitfalls of complex reward designs. More specifically, reinforcement learning (RL) algorithms are:

  1. Highly sensitive to reward noise [1]. Adopting verifiable ground-truth labels avoids the extra stochasticity that would otherwise be introduced by a learned reward model or external classifier;
  2. Prone to reward hacking, where the policy model can exploit flaws in the reward function rather than learning the intended behavior [2]. RSafe's concise verifiable reward surface leaves fewer unintended optima.
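To make this concrete, here is a minimal sketch of such a rule-based reward, assuming a binary safe/unsafe ground-truth label and the <think></think> output format; the reward weights and answer strings are illustrative assumptions rather than the exact implementation.

```python
import re

def rule_based_reward(completion: str, gold_label: str,
                      format_weight: float = 0.5,
                      accuracy_weight: float = 1.0) -> float:
    """Verifiable reward = format compliance + classification accuracy.

    gold_label is the ground-truth 'safe'/'unsafe' annotation, so no learned
    reward model (and none of its noise) enters the training signal.
    """
    # Format term: the completion must contain a well-formed <think>...</think> block.
    format_ok = re.search(r"<think>.*?</think>", completion, flags=re.DOTALL) is not None
    reward = format_weight if format_ok else 0.0

    # Accuracy term: the verdict after the reasoning block must match the label.
    answer = completion.split("</think>")[-1].lower()
    predicted = "unsafe" if "unsafe" in answer else "safe"
    reward += accuracy_weight if predicted == gold_label else 0.0
    return reward
```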

We outline several potential extensions below and explain why we approach them cautiously, ultimately returning to RSafe's straightforward design.

| Candidate reward term | Approach | Potential benefit | Risks / drawbacks |
|---|---|---|---|
| Difficulty-weighted reward | Stratify data according to difficulty levels, with additional bonuses for adversarial prompts. | Enhances exploration on hard examples to improve robustness on edge cases. | Limited coverage may lead to overfitting to a narrow subset of adversarial examples. |
| Policy-reference reward | Grant an additional bonus if the reasoning path of a rollout explicitly references the correct safety policy. | Stronger grounding of reasoning in the correct safety taxonomy. | Many harmful prompts violate multiple safety categories, introducing noise that can hinder moderation performance; the policy model can exploit category frequencies rather than reasoning over the safety context. |
| Calibration reward | Penalize the model when it outputs incorrect answers with high confidence or correct answers with low confidence. | Incentivizes calibrated predictions, mitigating the guard model's hallucinations in out-of-distribution safety domains and thereby improving robustness. | Approaches for obtaining confidence estimates can introduce additional inaccuracies and may lead to reward hacking [3]. |

Comment 2: Reliance on the capabilities of the backbone model

"The guided reasoning and RL are only conducted on Qwen2.5-7B."

Thank you for raising this point. We have added experiments using Qwen2.5-3B and Llama3.1-8B as alternative backbones, with updated results for effectiveness and robustness shown in Table 1 and Table 2. The results confirm that the choice of backbone influences performance: Qwen2.5-3B performs worse than Qwen2.5-7B, while Llama3.1-8B performs slightly worse on effectiveness but achieves higher robustness. These results suggest that RSafe can better leverage stronger backbone models to enhance safety through more robust reasoning capabilities.

Table 1. Performance comparison on safety moderation benchmarks across different backbone models.

| Method | Backbone model | ToxicChat (Acc/F1) | OpenAI Mod (Acc/F1) | Aegis (Acc/F1) | SafeRLHF (Acc/F1) | BeaverTails (Acc/F1) | XSTEST (Acc/F1) | Weighted Avg (Acc/F1) |
|---|---|---|---|---|---|---|---|---|
| RSafe | Qwen2.5-7B-Instruct | 0.957 / 0.728 | 0.817 / 0.764 | 0.834 / 0.849 | 0.914 / 0.921 | 0.814 / 0.832 | 0.970 / 0.816 | 0.904 / 0.889 |
| RSafe | Qwen2.5-3B-Instruct | 0.947 / 0.692 | 0.795 / 0.739 | 0.789 / 0.800 | 0.842 / 0.854 | 0.722 / 0.745 | 0.806 / 0.407 | 0.839 / 0.820 |
| RSafe | Llama3.1-8B | 0.934 / 0.648 | 0.739 / 0.698 | 0.812 / 0.833 | 0.910 / 0.918 | 0.824 / 0.841 | 0.914 / 0.617 | 0.890 / 0.878 |

Table 2. Robustness comparison on WildGuardTest across different backbone models.

| Backbone model | Method | Overall (Acc/F1) | Adversarial (Acc/F1) | Vanilla (Acc/F1) |
|---|---|---|---|---|
| Qwen2.5-7B-Instruct | RSafe | 0.828 / 0.772 | 0.779 / 0.668 | 0.871 / 0.845 |
| Qwen2.5-7B-Instruct | RSafe-adaptive | 0.845 / 0.823 | 0.791 / 0.717 | 0.911 / 0.899 |
| Qwen2.5-3B-Instruct | RSafe | 0.807 / 0.754 | 0.772 / 0.693 | 0.839 / 0.802 |
| Qwen2.5-3B-Instruct | RSafe-adaptive | 0.830 / 0.800 | 0.769 / 0.717 | 0.883 / 0.868 |
| Llama3.1-8B | RSafe | 0.849 / 0.815 | 0.789 / 0.715 | 0.901 / 0.888 |
| Llama3.1-8B | RSafe-adaptive | 0.861 / 0.838 | 0.802 / 0.747 | 0.912 / 0.905 |

Comment 3: Disclosure of training details

"The mechanism by which dynamic safety categories are integrated into training is described only at a high level."

Thank you for raising this issue. During training, we used a default instruction prompt over an in-distribution dataset to teach the model context-aware reasoning based on the given safety policies. At inference, we support dynamic safety categories, allowing users to specify the safety policies their application requires.
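As an illustration of how a user-supplied taxonomy can be injected at inference time, a minimal sketch follows; the template wording and function name are assumptions for this example, not the exact instruction prompt used in the paper.

```python
def build_moderation_prompt(content: str, policies: list[str] | None = None) -> str:
    """Compose a policy-aware moderation prompt.

    `policies` is the user-specified safety taxonomy; passing None corresponds
    to moderating without a category specification, relying on the model's
    internal safety knowledge.
    """
    if policies:
        policy_block = "\n".join(f"- {p}" for p in policies)
        policy_text = "Assess the content against the following safety policies:\n" + policy_block
    else:
        policy_text = "Assess the content against general safety principles."
    return (
        f"{policy_text}\n\n"
        f"Content to moderate:\n{content}\n\n"
        "Reason step by step inside <think></think> tags, then answer 'safe' or 'unsafe'."
    )

# Example: swapping in a domain-specific taxonomy at inference time, without retraining.
prompt = build_moderation_prompt(
    "Describe how to forge a medical prescription.",
    policies=["Medical ethics violations", "Illegal activity"],
)
```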

To improve reproducibility and address practical overhead, we provide detailed training data scale and compute usage in Appendix C.3. In total, we curated a training dataset of approximately 12,000 examples, balanced across harmful and unharmful samples from five safety-related corpora, covering both prompt-level and response-level harmfulness detection tasks across diverse safety scenarios.

For computational resources, we used the VERL codebase on 4×A100 80GB GPUs with a batch size of 128 and a maximum sequence length of 2048 tokens. Each run was trained for 3 epochs with a rollout size of 4, taking approximately 3–4 hours.
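For quick reference, the reported setup can be summarized as follows; the dictionary keys are ad hoc labels for this summary and do not correspond to actual VERL configuration fields.

```python
# Training setup as reported above (key names are illustrative only).
rsafe_training_setup = {
    "codebase": "VERL",
    "algorithm": "GRPO",
    "train_examples": 12_000,      # balanced harmful / unharmful samples
    "batch_size": 128,
    "max_sequence_length": 2048,   # tokens
    "epochs": 3,
    "rollouts_per_prompt": 4,
    "hardware": "4x A100 80GB",
    "wall_clock_hours": "3-4",     # approximate, per run
}
```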

Comment 4: Missing / mis-specified safety taxonomies

"The adaptive mode presumes users can accurately define relevant safety taxonomies; in practice, overlooking critical policy categories (e.g., emergent threats) may degrade performance."

Thanks for bringing up this problem. When users encounter novel safety situations and cannot supply a clear policy, RSafe provides an option to omit the safety taxonomy while still providing strong moderation. Table 4 presents this configuration as "RSafe (w/o category specification)", where it outperforms the baselines and the default RSafe version. RSafe draws on its internal safety self-awareness [5] and the context-aware reasoning capability acquired during training to make flexible and accurate decisions, instead of trying to fit new hazards into specified safety categories that may be ill-suited. These results demonstrate that RSafe remains both robust and protective even when users cannot specify a taxonomy and the scenario lies outside all predefined categories.

Table 4. RSafe's performance without safety category specification compared to its adaptive mode and other baselines on out-of-domain benchmarks featuring novel safety categories.

| Model | WildGuardTest (Acc/F1) | MedSafetyBench (Acc) | CARES base (Acc/F1) | CARES adv (Acc/F1) |
|---|---|---|---|---|
| OpenAI Moderation | 0.583 / 0.122 | 0.006 | 0.241 / 0.069 | 0.220 / 0.018 |
| ShieldGemma-9B | 0.731 / 0.590 | 0.288 | 0.475 / 0.504 | 0.324 / 0.247 |
| LlamaGuard3-8B | 0.822 / 0.763 | 0.644 | 0.646 / 0.711 | 0.382 / 0.354 |
| RSafe | 0.819 / 0.758 | 0.588 | 0.576 / 0.631 | 0.433 / 0.429 |
| RSafe-adaptive | 0.840 / 0.797 | 0.804 | 0.709 / 0.773 | 0.501 / 0.535 |
| RSafe (w/o category specification) | 0.835 / 0.792 | 0.691 | 0.643 / 0.707 | 0.454 / 0.475 |

References

[1] Liang, Xize, et al. "ROPO: Robust Preference Optimization for Large Language Models." (ICML 2024).

[2] Fu, Jiayi, et al. "Reward shaping to mitigate reward hacking in rlhf." (2025).

[3] Mroueh, Youssef. "Reinforcement Learning with Verifiable Rewards: GRPO's Effective Loss, Dynamics, and Success Amplification." (2025).

[4] Lin, Yuping, et al. "Towards understanding jailbreak attacks in llms: A representation space analysis." (2024).

Comment

At its core, the method applies basic GRPO on top of prompt engineering to enforce format compliance and classification correctness. This is an incremental advance, given that GRPO has known limitations: It uses 0/1 outcome reward with no supervision of intermediate reasoning steps, causing instability if a single sub‑step is erroneous.

Despite adopting reasoning-based training, the overall effectiveness seems to remain inferior to GuardReasoner‑8B.

My comment about the model failing when users cannot specify harmful categories was misunderstood; my concern was that the model's reliance on user-specified categories could undermine its moderation capability in practice.

Comment

Thanks for your time and your comment. Regarding your concerns:

  1. The concern that the lack of supervision over intermediate steps might introduce instability does not hold in our setting. Through manual inspection, we repeatedly confirm the consistency between the reasoning path and the final prediction. Moreover, eliminating intermediate supervision mitigates the risks of label noise and reduces annotation burden, ultimately enhancing the robustness of our framework.

  2. RSafe, trained on Qwen2.5-7B with approximately 10k examples, achieves competitive overall effectiveness compared to GuardReasoner, which is trained on Llama3.1-8B with 127k examples. For a controlled comparison, we conducted experiments using the same backbone model and training datasets, with the only difference being the training pipeline. The results, presented in Table 1 and Table 2, demonstrate RSafe's superior performance.

Table 1. Performance comparison on safety moderation benchmarks across different methods.

| Backbone | Method | ToxicChat (Acc/F1) | OpenAI Mod (Acc/F1) | Aegis (Acc/F1) | SafeRLHF (Acc/F1) | BeaverTails (Acc/F1) | XSTEST (Acc/F1) | Weighted Avg (Acc/F1) |
|---|---|---|---|---|---|---|---|---|
| Qwen2.5-7B-Instruct | RSafe | 0.957 / 0.728 | 0.817 / 0.764 | 0.834 / 0.849 | 0.914 / 0.921 | 0.814 / 0.832 | 0.970 / 0.816 | 0.904 / 0.889 |
| Qwen2.5-7B-Instruct | GuardReasoner | 0.937 / 0.665 | 0.792 / 0.730 | 0.819 / 0.837 | 0.901 / 0.911 | 0.820 / 0.842 | 0.899 / 0.579 | 0.887 / 0.827 |

Table 2. Robustness comparison on WildGuardTest across different methods.

| Backbone | Method | Overall (Acc/F1) | Adversarial (Acc/F1) | Vanilla (Acc/F1) |
|---|---|---|---|---|
| Qwen2.5-7B-Instruct | RSafe | 0.828 / 0.772 | 0.779 / 0.668 | 0.871 / 0.845 |
| Qwen2.5-7B-Instruct | RSafe-adaptive | 0.845 / 0.823 | 0.791 / 0.717 | 0.911 / 0.899 |
| Qwen2.5-7B-Instruct | GuardReasoner | 0.822 / 0.782 | 0.772 / 0.673 | 0.875 / 0.859 |

  3. While providing accurate and fine-grained safety categories yields the best results, RSafe remains robust even without such specifications, according to the results shown in Table 4 of the rebuttal response. Deliberately specifying incorrect or contradictory categories falls outside the scope of our intended use and is thus not addressed in this work.
Official Review
Rating: 4

The paper introduces RSafe, a system‑level guard model for LLMs that (i) adds an explicit reason‑then‑predict stage (“Guided Reasoning”) and (ii) aligns this behaviour with a lightweight rule‑based reinforcement‑learning procedure (“Reinforced Alignment”, based on GRPO). This exhibits generalisation to emerging harmful categories and jailbreak attacks (WildGuardTest). At inference time, users can supply arbitrary safety policies; RSafe reasons over the supplied categories without retraining. The rationale is also always returned, enabling interpretability in the guard model.

Strengths and Weaknesses

Strengths: The primary strength is that RSafe addresses, through a unique RL objective, the need for guardrail models that adapt to different safety policies by accepting arbitrary user-supplied safety taxonomies at runtime. This drop-in policy swap is done without retraining, a capability missing in prior guard models. This can generalize to various safety-violation scenarios, such as intellectual property, which are more out-of-distribution. Furthermore, the reasoning is done for each safety policy.

Weaknesses:

  • The experiments do not test the adaptiveness of the policy, despite this being the primary advantage that is proclaimed by RSafe -- see "Questions" and "Limitations" for more experiments I'd like to see done. While there was an evaluation on WildGuardTest, which shows increased performance on out-of-distribution adversarial examples, this is distinct from being able to show that safety policies can change and RSafe can adapt to different safety needs.
  • The experiments shown do not demonstrate a significant performance improvement over other open-source guardrail models.

Questions

The primary claim of the paper is its ability to accept various user-supplied safety taxonomies, allowing developers to include and exclude safety policies at will. This is mentioned as one of the most significant contributions of the paper on page 3. However, the experimental results do not necessarily investigate this claim in depth, beyond finding improved performance on a pre-established adversarial examples dataset.

Many questions I would like to see answered are around providing experimental evidence to strengthen or weaken this claim. Primarily:

  1. One evaluation that would be interesting would be to see how RSafe handle entirely idiosyncratic, edge-case, or otherwise highly specific user‑defined safety rules, particularly when compared with GPT‑4o or Claude.
  2. How does RSafe work when certain safety metrics are excluded (e.g. are users/developers able to exclude copyright if they do not care about this specific risk, and "remove refusals" related to copyright?). Can it delineate and clearly change safety taxonomies compared with other baseline guard options?
  3. Is there possibility to expand the testing of robustness beyond WildGuardTest (Section 4.2), toward evals that demonstrate that it can follow instructions with respect to novel safety criteria?

Both positive and negative results would strengthen my confidence and rating.

Limitations

The main limitation has already been mentioned above -- centering around addressing the claim that the safety policy of RSafe can be dynamically aligned with task-specific requirements.

Justification for Final Rating

The issue I raised in my review -- allowing developers to include and exclude safety policies at will -- was resolved. I am not providing a stronger rating because RSafe's performance in this area is good but not stellar. In the results shown in the rebuttal, RSafe is highly comparable with 4o's performance.

Formatting Issues

N/A

Author Response

Comment 1: Robustness on edge cases / adaptiveness to highly specific safety rules

"One evaluation that would be interesting would be to see how RSafe handle entirely idiosyncratic, edge-case, or otherwise highly specific user‑defined safety rules, particularly when compared with GPT‑4o or Claude.""

Thank you for raising this interesting question. We interpret your suggestion as encompassing two aspects: (1) robustness evaluation under edge-case adversarial prompts, such as those from red-teaming scenarios, and (2) adaptivity evaluation under highly specific user-defined safety rules. We conducted two separate experiments accordingly.

For the first aspect, we focused on edge-case prompts from two red-teaming datasets: JailbreakBench [1] and JailJudge [2]. These datasets include jailbreak attacks (e.g., PAIR [3], GPTFuzzer [4]) and deceptive strategies such as role-playing and hypothetical scenarios, as well as prompts based on real-world risks. To ensure the selected prompts represent genuinely challenging safety risks that are not trivially detectable, we filtered out prompts that GPT-4o refused outright with generic rejections like "I'm sorry, I can't help with...". After filtering, 17 prompts from JailbreakBench and 432 from JailJudge remained.

We then used RSafe and GPT-4o with identical instruction prompts to perform safety moderation. As shown in Table 1, RSafe performed slightly better than GPT-4o on JailJudge but worse on JailbreakBench, likely due to the small sample size in the latter making the difference appear more pronounced. While RSafe's accuracy in detecting safety risks drops compared to its ~0.9 average on in-domain general safety benchmarks (calculated from Table 1 in the original paper), it maintains strong protective capability and matches GPT-4o when used as LLM-as-judge, demonstrating robustness among edge cases.

Table 1. Robustness on edge cases: RSafe vs. GPT-4o.

| Model | JailbreakBench (#17) Acc | JailJudge (#432) Acc |
|---|---|---|
| GPT-4o | 0.882 | 0.682 |
| RSafe | 0.706 | 0.694 |

For the second experiment evaluating adaptiveness, we follow the setup from Aegis 2.0 [5] to synthetically generate prompts that either violate or comply with fine-grained guidelines across three domains: Financial Advice, Legal Advice, and Medical Advice. We construct a dataset with 40 prompts per category, balanced between violating and compliant cases.

As shown in Table 2, both RSafe-adaptive and GPT-4o achieve near-perfect accuracy. This result is expected: the policies are clearly defined, and the synthetic prompts are tightly aligned with the policy framing, making this task primarily a test of instruction-following rather than nuanced reasoning. By contrast, the adversarial prompts in Table 1, drawn from red-teaming attacks, are significantly more challenging and more likely to bypass safety defenses. This highlights the importance of focusing on adversarial/edge-case examples when evaluating robustness under highly specific, user-defined safety scenarios.

Table 2. Adaptiveness on highly user-specific rules: RSafe vs. GPT-4o.

| Model | Financial Advice (#40) (Acc/F1) | Legal Advice (#40) (Acc/F1) | Medical Advice (#40) (Acc/F1) |
|---|---|---|---|
| GPT-4o | 1 / 1 | 1 / 1 | 1 / 1 |
| RSafe-adaptive | 1 / 1 | 0.975 / 0.974 | 0.975 / 0.974 |

Comment 2: Adaptiveness under safety category exclusion

"How does RSafe work when certain safety metrics are excluded""

Thank you for bringing this up. RSafe-adaptive allows users to specify safety policies during inference, enabling the exclusion of certain policies when specific risks are not relevant. Based on your suggestion, we added evaluations to test RSafe's ability to exclude safety policies.

We used three benchmarks with annotated safety taxonomies: Sorry-Bench, HarmBench, and WildGuardTest. We evaluated two settings: (1) RSafe-adaptive with full taxonomy adherence, and (2) RSafe-adaptive (w/ removal), which excludes specified categories. For example, in Sorry-Bench, we removed five subcategories under Hate Speech Generation, leaving 40 subcategories, then examined performance on both the removed and remaining categories. GPT-4o was included for comparison using identical prompts.

As shown in Table 3, excluding categories had no significant impact on remaining category performance. However, both RSafe and GPT-4o showed limited ability to remove refusals in excluded categories.

Investigation revealed that prompts failing to be excluded often violated multiple taxonomies simultaneously. For instance, prompts annotated as hate speech frequently triggered other safety categories due to policy overlaps. Even after removing Hate Speech Generation, RSafe continued flagging these prompts under different taxonomies. Manual inspection across all benchmarks confirmed this overlap was consistently the cause. Successfully excluded prompts violated only the removed category.

The result demonstrates RSafe's policy exclusion capability. However, safety taxonomies are not disjoint, and prompts often violate multiple categories, making complete exclusion non-trivial.

Table 3. Adaptiveness comparison on exclusion tasks for RSafe and GPT-4o.

| Model | SORRY-Bench: Hate Speech Gen (#50) Acc | SORRY-Bench: Others (#400) Acc | HarmBench: Misinformation (#65) Acc | HarmBench: Others (#335) Acc | WildGuardTest: Copyright (#30) Acc | WildGuardTest: Others (#1659) (Acc/F1) |
|---|---|---|---|---|---|---|
| GPT-4o-adaptive | 0.980 | 0.780 | 0.862 | 0.976 | 0.700 | 0.849 / 0.825 |
| GPT-4o-adaptive (w/ removal) | 0.940 | 0.783 | 0.707 | 0.964 | 0.633 | 0.856 / 0.834 |
| RSafe-adaptive | 1.000 | 0.728 | 0.769 | 0.749 | 0.400 | 0.860 / 0.822 |
| RSafe-adaptive (w/ removal) | 1.000 | 0.730 | 0.646 | 0.728 | 0.367 | 0.847 / 0.802 |

Comment 3: Expand robustness evaluation

"Is there possibility to expand the testing of robustness beyond WildGuardTest (Section 4.2), toward evals that demonstrate that it can follow instructions with respect to novel safety criteria?"

Thank you for raising this point. We expanded the original Section 4.2 experiments on WildGuardTest to validate RSafe's adaptivity to novel safety criteria by adding two medical ethics benchmarks: MedSafetyBench and CARES, with CARES including both base and adversarial versions. As shown in Table 3, OpenAI Moderation, ShieldGemma-9B, LlamaGuard3-8B, and RSafe all exhibit significant accuracy drops on these datasets compared to their average performance on general safety benchmarks (results from Table 1 in the original paper). This indicates that MedSafetyBench and CARES present out-of-distribution safety challenges beyond all models' training domains.

As shown in Table 3 below, RSafe outperforms baseline models on both datasets. When guided by specific medical principles at inference (RSafe-adaptive), performance improves notably. Compared to LLM-as-judge (GPT-4o with corresponding ethical guidelines), RSafe-adaptive achieves comparable performance, demonstrating strong adaptiveness and reliability as a safety guardrail model.

Table 3. Adaptiveness comparison of RSafe and baseline models on novel safety topics.

| Guard Model | Weighted Avg (general) Acc | MedSafetyBench Acc | CARES base (Acc/F1) | CARES adv (Acc/F1) |
|---|---|---|---|---|
| OpenAI Moderation | 0.626 | 0.006 | 0.241 / 0.069 | 0.220 / 0.018 |
| ShieldGemma-9B | 0.702 | 0.288 | 0.475 / 0.500 | 0.324 / 0.247 |
| LlamaGuard3-8B | 0.877 | 0.644 | 0.646 / 0.711 | 0.382 / 0.354 |
| RSafe | 0.904 | 0.588 | 0.576 / 0.631 | 0.433 / 0.429 |
| RSafe-adaptive | – | 0.804 | 0.709 / 0.773 | 0.501 / 0.535 |
| LLM-as-judge (GPT-4o) | – | 0.881 | 0.632 / 0.738 | 0.456 / 0.527 |

References

[1] Chao, Patrick, et al. "Jailbreakbench: An open robustness benchmark for jailbreaking large language models." (NIPS 2024).

[2] Liu, Fan, et al. "Jailjudge: A comprehensive jailbreak judge benchmark with multi-agent enhanced explanation evaluation framework." (2024).

[3] Chao, Patrick, et al. "Jailbreaking black box large language models in twenty queries." (2023).

[4] Yu, Jiahao, et al. "Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts." (2023).

[5] Ghosh, Shaona, et al. "Aegis2.0: A diverse ai safety dataset and risks taxonomy for alignment of llm guardrails." (2025).

Comment

Thank you for the detailed rebuttal. I will raise my score.

Comment

Thanks for your time and your acknowledgement.

Official Review
Rating: 4

The paper proposes RSafe – an adaptive safeguard for large language models that first performs policy-guided, safety reasoning and then applies rule-based reinforcement learning to align those reasoning paths with correct safety judgments. This two-stage paradigm enables the model to generalize across different safety principles rather than merely memorize labels, letting it moderate both prompts and responses under user-specified policies. Experiment results demonstrate its effectiveness across different benchmarks compared to baselines such as Llama Guard and ShieldGemma.

Strengths and Weaknesses

Strengths

  1. The paper introduces an adaptive safeguard that couples policy-guided, step-by-step safety reasoning with rule-based policy optimization, producing a method that is both promising and empirically effective.

  2. The manuscript is clear and well structured: the problem statement, algorithmic details, and experimental protocols are presented in a logical sequence, making the work easy to follow.

  3. The evaluation is comprehensive. It spans several model scales, safety benchmarks, and jailbreak settings, and the authors provide both quantitative metrics and qualitative analyses that support their claims.

Weaknesses

  1. The proposed formulation is closely related to the GuardReasoner framework. The principal difference is that GuardReasoner relies on DPO, whereas the present work adopts GRPO with verifiable rewards such as accuracy and format compliance. This change, while sensible, may appear incremental and therefore limits the novelty of the contribution.

  2. In the robustness experiment on unseen safety categories, the “Criminal Planning / Confessions” topic can overlap with categories already covered by Llama-Guard (violent crime) and ShieldGemma (dangerous content). To better demonstrate adaptiveness to truly novel topics, I suggest (i) choosing a domain unrelated to safety—e.g., a keyword-mentioning or mathematics discussion task—and (ii) adding an LLM-as-judge baseline (such as GPT-4o) to evaluate safeguard reliability when judgment is delegated to an external model.

Questions

Please respond to the weaknesses above.

Limitations

N/A

Justification for Final Rating

My major concerns are resolved. However, other reviewers also have questions regarding the adaptiveness of the proposed method, and reviewer A8Ug's point on reasoning-process reward supervision remains valid. Thus, I will keep my score.

Formatting Issues

N/A

Author Response

Comment 1: Comparison between DPO and GRPO

"The proposed formulation is closely related to the GuardReasoner framework. The principal difference is that GuardReasoner relies on DPO, whereas the present work adopts GRPO with verifiable rewards such as accuracy and format compliance. This change, while sensible, may appear incremental and therefore limits the novelty of the contribution."

Thank you for your thoughtful comment. While the transition from DPO to GRPO may initially seem like a natural extension (from pairwise preference learning to verifiable reward optimization) and merely a change of loss function, GRPO in fact introduces several essential capabilities that go beyond this surface-level shift. Specifically, GRPO enables more effective alignment in safety moderation tasks by leveraging (1) verifiable supervision, (2) multi-objective reward integration, and (3) on-policy optimization. These properties make GRPO particularly well-suited for safety moderation, as explained in detail below.

| Aspect | DPO | GRPO | Advantage |
|---|---|---|---|
| Supervision source | Requires reasoning-process data specific to safety moderation tasks | Requires only the ground-truth safety label for the prompt or response | Reduces reliance on reasoning-process data and matches the rule-checkable nature of safety moderation |
| Multi-objective reward | Not natively supported | Supports multiple reward dimensions | Currently the rewards we incorporate are format and accuracy; as specific needs arise, we can add reward dimensions such as fluency, helpfulness, length, and confidence to guide the generation of the reasoning process |
| Robustness to OOD data | Offline, off-policy: relies on offline preference pairs, susceptible to out-of-distribution (OOD) data [4][5] | Online, on-policy: generates training data dynamically through rollouts from the current policy, allowing continuous adaptation and stronger generalization to out-of-distribution safety scenarios [6] | GRPO's on-policy optimization supports dynamic exploration beyond static supervision, which aligns naturally with the demands of safety moderation, where robust generalization to unseen jailbreaks and evolving harms is required |
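For concreteness, a minimal sketch of the group-relative advantage computation at the core of GRPO (generic GRPO, not the authors' exact implementation): each rollout for a prompt is scored with the verifiable reward and standardized against its sampling group, so no separate critic model is needed.

```python
import statistics

def grpo_advantages(group_rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standardize each rollout's reward against its sampling group."""
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]

# Example: 4 rollouts for one prompt, each scored by the format + accuracy reward.
print(grpo_advantages([1.5, 0.5, 1.5, 0.0]))
```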

Comment 2: Demonstrate adaptiveness to truly novel safety topics

"In the robustness experiment on unseen safety categories, the “Criminal Planning / Confessions” topic can overlap with categories already covered by Llama-Guard (violent crime) and ShieldGemma (dangerous content). To better demonstrate adaptiveness to truly novel topics, I suggest (i) choosing a domain unrelated to safety—e.g., a keyword-mentioning or mathematics discussion task—and (ii) adding an LLM-as-judge baseline (such as GPT-4o) to evaluate safeguard reliability when judgment is delegated to an external model."

Thanks for raising this point. The initial adaptiveness evaluation dataset WildGuardTest may overlap with the training data of some baseline guard models, potentially favoring them. To mitigate this, we first evaluated on GSM8K, a math dataset unrelated to safety. As expected, all models, including RSafe and its baselines, correctly identified the content as harmless.

To better assess adaptiveness to novel safety topics, we turned to two medical ethics benchmarks: MedSafetyBench [7] and CARES [8], the latter including both base and adversarial versions. As shown in Table 1, OpenAI Moderation, ShieldGemma-9B, LlamaGuard3-8B, and RSafe all show significant drops in accuracy on these datasets compared to their average performance on general safety benchmarks (from Table 1 in original paper). This indicates that MedSafetyBench and CARES present out-of-distribution safety challenges beyond the training domains of all models.

RSafe outperforms its baselines on both datasets. When guided by specific medical principles at inference (RSafe-adaptive), its performance improves notably. Compared to LLM-as-judge (GPT-4o provided with corresponding ethical guidelines), RSafe-adaptive achieves comparable performance, demonstrating strong adaptiveness and reliability as a safety guardrail model.

Table 1. Adaptiveness comparison of RSafe and baseline models on novel safety topics.

| Guard Model | GSM8K prompt Acc | GSM8K response Acc | Weighted Avg (general) Acc | MedSafetyBench Acc | CARES base (Acc/F1) | CARES adv (Acc/F1) |
|---|---|---|---|---|---|---|
| OpenAI Moderation | 1 | 1 | 0.626 | 0.006 | 0.241 / 0.069 | 0.220 / 0.018 |
| ShieldGemma-9B | 0.998 | 1 | 0.702 | 0.288 | 0.475 / 0.500 | 0.324 / 0.247 |
| LlamaGuard3-8B | 1 | 1 | 0.877 | 0.644 | 0.646 / 0.711 | 0.382 / 0.354 |
| RSafe | 1 | 1 | 0.904 | 0.588 | 0.576 / 0.631 | 0.433 / 0.429 |
| RSafe-adaptive | 1 | 1 | – | 0.804 | 0.709 / 0.773 | 0.501 / 0.535 |
| LLM-as-judge (GPT-4o) | 1 | 1 | – | 0.881 | 0.632 / 0.738 | 0.456 / 0.527 |

References

[1] Liu, Yue, et al. "Guardreasoner: Towards reasoning-based llm safeguards." (2025).

[2] Rafailov, Rafael, et al. "Direct preference optimization: Your language model is secretly a reward model." (NIPS 2023).

[3] Guo, Daya, et al. "Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning." (2025).

[4] Xu, Shusheng, et al. "Is dpo superior to ppo for llm alignment? a comprehensive study." (ICML 2024).

[5] Wu, Junkang, et al. "Towards robust alignment of language models: Distributionally robustifying direct preference optimization." (ICLR 2025)

[6] Wang, Zhichao, et al. "A comprehensive survey of llm alignment techniques: Rlhf, rlaif, ppo, dpo and more." (2024).

[7] Han, Tessa, et al. "Medsafetybench: Evaluating and improving the medical safety of large language models." (NIPS 2024)

[8] Chen, Sijia, et al. "CARES: Comprehensive Evaluation of Safety and Adversarial Robustness in Medical LLMs." (2025).

Comment

Thank you for your response. I will keep my score.

Comment

Thanks for your time and your acknowledgement.

Official Review
Rating: 5

One natural approach to model safety is training/constructing an unsafe input/output filter. This paper proposes training the input/output filter using RL instead of the standard fine-tuning approach utilized by prior works. Furthermore, they enable adaptive classification by allowing users to specify content policies in natural text.

Strengths and Weaknesses

Strengths

  1. The paper framing/motivation is incredibly clear. The writing is also easy to follow.
  2. The results and baselines are clearly described, and the method shows strong gains in the WildGuardTest setting.

Weaknesses

  1. The method lacks some novelty. For example, training an input/output filter and supporting adaptive policies have been introduced in prior works such as LlamaGuard. The main contribution seems to be limited to utilizing RL.
  2. For the WildGuardTest evaluation, don't the guard models also admit adaptive variants by specifying policies? Do these methods match performance?
  3. [Minor] Is it possible to report averaged numbers as well for Table 1 for easier comparison?

Questions

Asked in weaknesses

Limitations

Yes

Formatting Issues

N/A

Author Response

Comment 1: Adaptive safety policy support has been explored in prior work

"The method lacks some novelty. For example, training an input/output filter and supporting adaptive policies has been introduced in prior works such as LlamaGuard. The main contribution seems to be limited to utilizing RL."

Thanks for bringing this up. As Table 1 shows, some widely adopted guard models like ShieldGemma [1] support user-defined safety policies at inference time, enabling adaptation beyond their default coverage. While RSafe also supports user-specified safety policies during inference, it fundamentally differs from traditional guardrail models. Conventional models rely on supervised fine-tuning to fit safety judgments over in-distribution data. In contrast, RSafe learns to reason about safety during training via safety thinking—instead of merely fitting labels, it internalizes how to make context-aware safety decisions based on provided policies. This enables better adaptivity and interpretability.

To compare the adaptivity of RSafe with traditional counterparts, we employ MedSafetyBench [2], a benchmark constructed around nine medical ethics principles that lie outside the default safety policies seen during training for both ShieldGemma and RSafe. The benchmark comprises 1,800 harmful instructions that explicitly violate these principles. As shown in Table 2, both RSafe and ShieldGemma exhibit substantially lower accuracy on this benchmark compared to their in-distribution performance, confirming that MedSafetyBench constitutes a novel, out-of-distribution safety domain for both models.

When we provided the 9 medical principles at inference time (adaptive version), both models improved significantly. However, RSafe-adaptive continued to outperform ShieldGemma-9B-adaptive, indicating that our reasoning-based training paradigm enables stronger adaptiveness to novel safety policies than traditional data-fitting methods. The reasoning trace also provides interpretability for safety decisions.

Table 1. Comparison of RSafe with prevailing guardrail models.

Guard model | Accepts user-specified safety policy | Accepts new safety policy | Explainability
OpenAI Moderation
ShieldGemma-9B
LlamaGuard3-8B
WildGuard-7B
GuardReasoner-8B
RSafe

Table 2. RSafe vs. ShieldGemma on adaptive capability to novel safety policies.

| Guard model | MedSafetyBench (Acc ↑) |
|---|---|
| ShieldGemma-9B | 0.288 |
| ShieldGemma-9B-adaptive | 0.708 |
| RSafe | 0.588 |
| RSafe-adaptive | 0.804 |

Comment 2: Missing adaptive baseline comparison in robustness evaluation

"For the WildGuardTest evaluation, don't the guard models also admit adaptive variants by specifying policies? Do these methods match performance?“

Thanks for bringing this up. Since ShieldGemma is the prevailing guardrail model that also supports injecting user-specific safety policies during inference, we included an adaptive version of ShieldGemma in the WildGuardTest benchmark evaluation. We observed that incorporating custom policies led to significant performance improvement over the default policy. However, its performance remained consistently lower than RSafe's. These results further highlight the superior generalization ability of RSafe's reasoning-based approach when handling out-of-distribution safety scenarios, including adversarial jailbreak scenarios.

Table 3. RSafe vs. ShieldGemma for robustness evaluation on WildGuardTest.

| Guard Model | Overall (Acc/F1) | Vanilla (Acc/F1) | Adversarial (Acc/F1) |
|---|---|---|---|
| ShieldGemma-9B | 0.694 / 0.482 | 0.731 / 0.595 | 0.651 / 0.314 |
| ShieldGemma-9B-adaptive | 0.806 / 0.783 | 0.846 / 0.839 | 0.706 / 0.711 |
| RSafe | 0.828 / 0.772 | 0.871 / 0.845 | 0.779 / 0.668 |
| RSafe-adaptive | 0.845 / 0.823 | 0.911 / 0.899 | 0.791 / 0.717 |

Comment 3: Report average performance for better comparison

"Is it possible to report averaged numbers as well for Table 1 for easier comparison?""

Thank you for pointing this out. We have added the weighted average scores in Table 4 and will include them in the revised paper. As shown in Table 4, RSafe achieves competitive performance on general safety benchmarks compared to leading guardrail models.

Table 4. Weighted average accuracy and F1-score across safety benchmarks, provided as a complement to Table 1 (which presents performance comparisons on prompt and response harmfulness detection tasks) in the original paper.

| Model | Weighted Acc | Weighted Unsafe F1 |
|---|---|---|
| Closed-source APIs | | |
| OpenAI Moderation | 0.626 | 0.214 |
| GPT-4o | 0.644 | 0.460 |
| Open-source models | | |
| ShieldGemma-9B | 0.702 | 0.522 |
| LlamaGuard3-8B | 0.877 | 0.833 |
| WildGuard-7B | 0.893 | 0.884 |
| Open-source (+ reasoning) | | |
| GuardReasoner-8B | 0.899 | 0.889 |
| Our model | | |
| RSafe w/o rl | 0.830 | 0.807 |
| RSafe w/o reason | 0.846 | 0.813 |
| RSafe | 0.904 | 0.889 |

References

[1] Zeng, Wenjun, et al. "Shieldgemma: Generative ai content moderation based on gemma." (2024).

[2] Han, Tessa, et al. "Medsafetybench: Evaluating and improving the medical safety of large language models." (NIPS 2024)

Comment

Thank you for the detailed response and clarifications. I maintain my (positive) rating of 5. It is not lower because there is a method/artifact with comparisons to fair baselines, it is not higher because of the limited novelty and minimal gains.

Comment

Thank you for your comment and your recognition of our work. We present an adaptive safeguard that enables policy-aware, step-by-step safety reasoning, leveraging rule-based reinforcement learning without requiring additional annotated data. Our method is both conceptually promising and empirically effective. The evaluation shows strong performance on in-distribution safety benchmarks and significant improvements over baselines on out-of-distribution novel safety categories.

Final Decision

The paper proposes RSafe, a reasoning-based safeguard that integrates policy-guided reasoning with GRPO alignment for adaptable LLM safety moderation. Three reviewers found the work technically solid, clearly presented, and empirically validated, with strong results on out-of-distribution and adversarial settings; two explicitly recommended acceptance, and one raised their score after rebuttal. The authors strengthened the case with new experiments on medical ethics and red-teaming benchmarks, comparisons against GPT-4o, and alternative backbones, which addressed most concerns about robustness and adaptiveness. The remaining critical review (m19C) judged the contribution incremental and less effective than GuardReasoner, but this view was not shared by the majority. Given the overall positive reception and substantial empirical value despite ongoing debate about novelty, I recommend acceptance.