Dear Reviewer,

We sincerely thank you for your thorough review and insightful comments that will help improve our paper. We appreciate your recognition of our work's empirical strengths and contributions to adversarial robustness. Below we address your concerns in detail.

Thank you for raising this important theoretical concern. Let me explain how our method fundamentally overcomes the limitations of refusal training through a rigorous mathematical analysis. The key insight lies in how our method changes the optimization landscape. Traditional refusal training optimizes a global loss:

As shown in Theorem 1, this leads to an inherent trade-off because the gradients of and often conflict, forcing compromises that affect model utility. In contrast, our method decomposes the problem into two orthogonal optimization objectives through the Prism architecture and this allows us to maintain orthogonality between safety and utility in the representation space. Specifically, when the activator detects potential harm (), the router makes localized decisions:

This selective intervention preserves benign content while surgically removing harmful elements, avoiding the global trade-off. We can formally prove that this approach maintains a mutual information bound: where is a small constant determined by the router's precision. This guarantees that our method preserves utility on benign inputs while achieving safety through targeted intervention rather than global constraints. The empirical results in Tables 1-3 validate this theoretical analysis, which shows our method achieves superior performance on both safety and utility metrics compared to global optimization approaches.

Thank you for suggesting additional baseline comparisons. We would like to clarify that our ablation study (Section 4.2) already includes the token-redaction training only baseline under "w/o Activators". However, we acknowledge that direct comparison with adversarial training methods would be challenging since our approach fundamentally differs in its objective - we aim for selective redaction rather than complete refusal. The goal is to preserve beneficial information while surgically removing harmful content, making it conceptually distinct from traditional adversarial defenses.
We appreciate your attention to practical implementation details. In our implementation, the PRISM structure operates in parallel with the LLM backbone network. Since our method utilizes intermediate layer features, the computation can be completed before the backbone network finishes its forward pass, introducing no additional latency. This parallel processing architecture is a key design feature that maintains efficiency while enabling fine-grained moderation.
Thank you for raising this interesting perspective about potential circumvention using third-party models. While this is an important consideration, we believe it falls outside the scope of test-time defense evaluation. The scenario you describe - using other models to decode redacted content - is fundamentally a different problem from test-time robustness. Users could similarly use alternative models to generate harmful content even when faced with complete refusal. Our focus is on providing effective real-time moderation within our system while maintaining output utility.

In closing, we thank you again for your valuable feedback. We plan to incorporate these suggestions by strengthening our theoretical analysis and clarifying implementation details in the revised version. Your comments have helped us identify areas where we can better communicate our method's contributions and limitations.

Best regards, The Authors