DRAGON: Guard LLM Unlearning in Context via Negative Detection and Reasoning
We propose DRAGON, a lightweight black-box unlearning framework that leverages detection and chain-of-thought reasoning to enforce safe, in-context interventions without modifying the underlying LLM.
Abstract
Reviews and Discussion
The paper presents a modular unlearning framework dubbed DRAGON. DRAGON relies on an "Unlearn Store" and a detection module to determine whether a given input would trigger an output that should be unlearned. Additionally, a corresponding "safety policy" is retrieved. In a second step, a guard model generates CoT instructions on how to handle the input. The original input, the safety policy, and the CoT instructions are forwarded to the model. DRAGON is evaluated on 3 unlearning tasks and is shown to outperform other approaches in most cases. Additionally, the paper introduces new metrics to evaluate unlearning performance.
Strengths and Weaknesses
Strengths:
- The proposed framework is intuitive and modular and can be used with black-box models
- The evaluation considers multiple tasks (sample unlearning, concept unlearning, continual unlearning)
- The proposed framework shows strong performance
Weaknesses:
- Some parts of the papers are not described well and are difficult to follow (see questions below)
Questions
Questions:
- L140: What exactly are the trade-off scores?
- L148: What is the model utility?
- L170: What are "forget prompts"? The term is not mentioned before.
- L177-182: I'm confused as to whether personal information (e.g., names) is stored or not. Does the exact-match calculation not assume the storage of such information?
- How are the safety policies defined?
Other comments:
- L124: don't --> do not
- L108: "the subject to unlearn"
- Figure 1 is not referenced
- L200: "refusal. guideline"
- L285-286: "To" on two lines
Limitations
Yes, the authors discuss limitations to some extent, but miss the following:
- Potential attacks on the detection module
- Increased latency (due to additional steps, and larger context)
- Computational overhead for training guard model
Final Justification
The authors clarified my doubts in the rebuttal.
Formatting Issues
NA
We sincerely appreciate the reviewer’s time and effort in reading our paper and offering thoughtful suggestions.
Q1 & Q2: Trade-off score and model utility.
LLM unlearning involves a trade-off between forgetting effectiveness and model utility, as noted in prior work [1]. We evaluate the performance of unlearning with a trade-off score that accounts for both aspects, for example, using the deviation score on the TOFU dataset.
Model utility refers to the model’s general language capability, such as answering unrelated questions and performing downstream tasks post-unlearning. Our experiments are designed to balance these goals: removing targeted knowledge while preserving overall performance. Detailed definitions can be found in Lines 236–239.
Q3: Forget Prompts. Forget prompts refer to the set of input questions or prompts sampled from the forget dataset D_f, which are used during training to guide the unlearning process. We will revise the manuscript to include a brief definition.
Q4: Exact Match Calculation. Yes, exact match calculations rely on the name or certain keywords. In the entity unlearning setting (TOFU), we store entity names (e.g., personal names) for exact match scores. This does not imply that the Unlearn Store retains or memorizes the entity internally. If privacy constraints prohibit storing exact names, the detection can rely on embedding-based similarity metrics instead. When names are available, exact match can be used; otherwise, alternative metrics are applied.
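For illustration, a minimal sketch of the embedding-based fallback described above could look like the following; the encoder choice, stored entries, and threshold are illustrative assumptions, not the paper's implementation:

```python
# Minimal sketch of an embedding-similarity fallback when exact names cannot be
# stored. Encoder, store entries, and threshold are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# The Unlearn Store keeps paraphrased/representative forget prompts rather than raw PII.
unlearn_store = [
    "Questions about the unlearned author's early life and upbringing",
    "Requests for the unlearned author's list of published works",
]
store_emb = encoder.encode(unlearn_store, convert_to_tensor=True)

def is_forget_prompt(prompt: str, threshold: float = 0.7) -> bool:
    """Flag the prompt as forget-related if it is close to any stored entry."""
    query_emb = encoder.encode(prompt, convert_to_tensor=True)
    sims = util.cos_sim(query_emb, store_emb)  # shape: (1, len(unlearn_store))
    return bool(sims.max() >= threshold)
```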
Q5: How are the safety policies defined?
The safety policies used in our framework are described in Appendix F3. They were initially generated using GPT-4o to cover a broad range of safety criteria and then manually refined for consistency and appropriateness. In practice, these policies are configurable and should be tailored to the specific application or compliance needs of the deployer. Our default policy serves as a general-purpose template, but the framework supports custom policies for more targeted unlearning. For example, in the TOFU dataset, the policy encourages generating fake author names, while in WMDP, it enforces refusal behaviors. This flexibility enables scalable and context-aware unlearning across diverse tasks.
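As a purely illustrative sketch (not the policies from Appendix F3), a configurable policy store might be organized as follows; the category keys and policy text are placeholders chosen for exposition:

```python
# Hypothetical policy store keyed by unlearning category; keys and texts are
# placeholders, not the actual Appendix F3 policies.
SAFETY_POLICIES = {
    "tofu_entity": (
        "If the question concerns an unlearned author, respond with a plausible "
        "but fictitious author name and never reveal the original entity."
    ),
    "wmdp_hazard": (
        "If the question requests hazardous technical knowledge, refuse clearly "
        "and briefly explain why the request cannot be fulfilled."
    ),
    "default": (
        "Avoid revealing any information tied to the unlearning target; prefer "
        "safe, generic responses."
    ),
}

def retrieve_policy(category: str) -> str:
    """Return the deployer-configured policy for the detected category."""
    return SAFETY_POLICIES.get(category, SAFETY_POLICIES["default"])
```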
Typos. Thank you for pointing out these detailed issues. We will correct all noted typos and formatting problems in the revised version.
Limitation 1: Potential attacks on the detection module. To evaluate this, we conducted a series of attack experiments, as shown in Tables 13 and 14 (Appendix), demonstrating that our full framework remains robust under various adversarial conditions.
- To isolate and further analyze the detection module’s resilience, we also conducted dedicated attack experiments focused solely on the detector (Table 1). These include AIM attacks, language-mixing attacks, and typo-based perturbations, as introduced in Appendix D.6. Instead of using Attack Success Rate, we report detection accuracy to directly measure the detector's performance under attack. A higher or comparable accuracy relative to the original setting indicates that the detector is robust to these attacks. Our results confirm that the detection module maintains strong performance even under these common adversarial manipulations.
Table 1: The detection accuracy on TOFU forget dataset under different attacks.
| Attack type | TOFU-10% | TOFU-5% | TOFU-1% |
|---|---|---|---|
| Original | 1.0 | 1.0 | 1.0 |
| AIM Attacks | 1.0 | 1.0 | 1.0 |
| Language mix (2 languages) | 1.0 | 1.0 | 1.0 |
| Language mix (4 languages) | 0.88 | 0.97 | 0.97 |
| Typo attack | 0.97 | 0.98 | 0.97 |
- We also conduct experiments on out-of-distribution (OOD) prompts to evaluate the robustness of the detection module. Rephrased prompts are generated by GPT-4o through paraphrasing the original forget prompts to confuse the detector. Keywords and Short Phrases refer to prompts rewritten using only a minimal set of key terms or fragments. Adversarial prompts include small perturbations such as misspellings, Unicode homoglyphs, or unnatural spacing to evade exact-match detection. Detailed prompt generation instructions will be added in the revised version for reproducibility.
Table 2: Detection accuracy of the TOFU and WMDP detectors on various types of out-of-distribution (O.O.D.) prompts derived from the forget dataset.
| Prompt type | TOFU-forget10 | WMDP (Avg) |
|---|---|---|
| Original | 1.0 | 0.98 |
| Rephrased | 1.0 | 0.96 |
| Keywords and short phrases | 1.0 | 0.97 |
| Adversarial | 0.99 | 0.95 |
- To evaluate detection performance on non-forget-related, out-of-distribution content, we randomly sample 400 prompts each from SimpleQA and Alpaca datasets. These serve as control datasets not subject to unlearning.
Table 3: Detection accuracy of the TOFU and WMDP detectors on unseen, non-forget-related O.O.D. prompts from SimpleQA and Alpaca. (Forget is the positive class)
| General Dataset | TOFU | WMDP |
|---|---|---|
| SimpleQA (400) | 0.01 | 0.11 |
| Alpaca (400) | 0.01 | 0.05 |
On the general set, our detectors correctly classify these prompts as non-forget, exhibiting a low false positive rate. This suggests that the performance of the main LLM on inputs unrelated to the forget set is unlikely to be negatively impacted.
Limitation 2: Increased latency (due to additional steps, and larger context)
We acknowledge that our framework introduces a modest increase in inference-time latency. However, this overhead is minimal and targeted:
- The detection module runs in ~5ms (Table 4), and policy retrieval is nearly instantaneous.
- For non-forget-related prompts, no further processing is triggered, so latency remains equivalent to standard LLM inference.
- For forget-related prompts, where safety and compliance are critical, we argue that an increase in latency is justified, especially when the prompt involves private, harmful, or regulated content that should not be answered.
Additionally, the larger context used for instruction injection contributes to more reliable safeguarding, and we identify future directions like context compression or prompt summarization to further optimize latency.
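To make the cost profile above concrete, here is a schematic of the inference-time flow under the stated assumptions; all helpers are trivial stubs standing in for the real detector, policy store, guard model, and base LLM, and none of this is the authors' actual API:

```python
# Schematic of the gated inference-time flow; all helpers are stubs.
def detect(prompt: str) -> tuple[bool, str]:
    # Stub: the real module combines a scoring model with similarity search.
    return ("author" in prompt.lower(), "tofu_entity")

def retrieve_policy(category: str) -> str:
    return "Do not reveal information about the unlearned entity."

def generate_cot_instruction(prompt: str, policy: str) -> str:
    # Stub: the real guard model produces a reasoning trace aligned with the policy.
    return "Reasoning: the question targets forgotten content; follow the policy."

def base_llm(prompt: str) -> str:
    return f"[base model response to: {prompt[:40]}...]"

def answer(prompt: str) -> str:
    is_forget, category = detect(prompt)            # fast first pass (~5 ms reported)
    if not is_forget:
        return base_llm(prompt)                     # non-forget prompts: no extra latency
    policy = retrieve_policy(category)              # near-instant policy lookup
    cot = generate_cot_instruction(prompt, policy)  # triggered only for forget prompts
    return base_llm(f"{policy}\n{cot}\n\nUser: {prompt}")
```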
Overall, our method is designed to be modular and incrementally extensible, making it suitable for safety-critical and commercial LLM deployment settings where retraining is infeasible but continual unlearning is necessary. We propose a novel and systematic unlearning framework aimed at enhancing prompt-based unlearning, which is a largely underexplored area. Our extensive experiments demonstrate strong effectiveness in both forgetting quality and model utility, validating the practicality and impact of our approach. Future work includes improving the latency during in-context intervention, which may lead to a stronger unlearning method.
Table 4: Per-example latency (in milliseconds) for the detection module and unlearned prompt inference under open-ended generation.
| Datasets | Models | Detection time | Guard Inference (Not including detection) |
|---|---|---|---|
| TOFU-forget10 | Llama2-7B-Chat | 4.63 | 665.71 |
| TOFU-Retain | Llama2-7B-Chat | 4.83 | 42.93 |
Limitation 3: Computational overhead for training guard model. We would like to clarify that:
- We use a relatively small LLM (≤8B) as the guard model, which significantly reduces the computational burden (training takes around 30 to 50 minutes on two A100 GPUs using the Accelerate library, depending on the task).
- Cross-model and cross-phase applicability: Unlike existing training-based unlearning methods [1,2] that require repeated fine-tuning per task, per model, and per unlearning request in the continual unlearning setting, our guard model is trained once and reused across models and unlearning requests. A single trained guard model can generalize to various base models (e.g., LLaMA3-8B-Instruct, Yi-34B-Chat) and even black-box LLMs (as shown in Figure 2) to enforce unlearning behavior. Additionally, it can be reused during continual unlearning, where new forget requests may arrive over time. This “one-time cost, many-time benefit” design improves efficiency and reusability.
- The practical benefits of the guard model far outweigh the computational overhead required to train it. Once trained, the guard model serves as a core component of our framework, effectively unlearning undesirable information. Importantly, the training process is straightforward and stable, consistently yielding the desired behavior (generating reasoning instructions). In contrast, training-based unlearning methods often struggle to achieve a reliable balance between unlearning effectiveness and preserving model utility [2], especially in real-world or continual settings.
[1] LLM Unlearning via Loss Adjustment with Only Forget Data
[2] TOFU: A Task of Fictitious Unlearning for LLMs
I thank the authors for their response. My concerns have been addressed.
Thank you for your follow-up and for taking the time to review our response. We're glad to hear that your concerns have been addressed, and we truly appreciate your constructive feedback throughout the process. Wishing you all the best in your professional and personal endeavors!
This paper introduces DRAGON, a systematic framework for unlearning in LLMs. The framework addresses the challenge of removing private or harmful knowledge without relying on retain data or fine-tuning, which are not always feasible in real-world applications. DRAGON leverages in-context learning and CoT reasoning to ensure safe and accurate unlearning. It introduces a lightweight detection module to identify prompts requiring unlearning and guides the model with reasoning instructions for safe intervention. The paper also proposes new metrics to evaluate unlearning performance and demonstrates the effectiveness of DRAGON across several unlearning tasks.
Strengths and Weaknesses
As far as I know, people prefer to fine-tune the model for unlearning as it can more thoroughly remove undesirable knowledge or behaviors. Could the authors explain why the community much prefers those training-based methods, what their drawbacks are, why researchers still use them, and why training-free methods are much better?
I do not think it is proper to say that it is hard to have the data for maintenance. As a generative model, we can actually make the model generate an unlimited amount of data, which can be used for maintenance. Even though these data might not be the original data, personally, I think they are good enough to preserve model utility.
Some recent references might be useful to mention; forgive me if they have already been mentioned or are irrelevant to the current work. First, the authors mention that it is hard to perform continual unlearning, where the following papers might need to be mentioned and discussed.
- Adaptive Localization of Knowledge Negation for Continual LLM Unlearning
- On Large Language Model Continual Unlearning
- Rethinking machine unlearning for large language model
Moreover, the authors mention the data-limited scenarios, where I think the following papers need to be discussed.
- Remaining-data-free Machine Unlearning by Suppressing Sample Contribution
- Toward Efficient Data-Free Unlearning
- In-Context Unlearning: Language Models as Few Shot Unlearners
- Llm unlearning via loss adjustment with only forget data
The paper seems to have quite a close connection with the previous paper on in-context unlearning. Therefore, more discussion highlighting the new contributions should be included.
Another question is about why the authors want to propose new metrics. The community has already introduced many metrics, and subsequent researchers may not be willing to run even more evaluations in their experiments. It seems that some of the previous works have discussed the drawbacks of existing metrics, such as
- Eight Methods to Evaluate Robust Unlearning in LLMs
- Unlearning with Control: Assessing Real-world Utility for Large Language Model Unlearning
I think the authors should do a similar analysis, highlighting the drawbacks of existing works, then discussing why the newly proposed metrics or evaluation framework can handle these problems. Highlighting why the proposed metrics are important for the community is also interesting to me.
Just a minor suggestion: shall we define the notations of mathbb{P} and h before we use them? Also, it seems that h and M are both used to denote the model; what are their differences?
I agree that the goals of unlearning can be categorized as sample unlearning and concept unlearning. Do the authors have any preference between these two goals, that is to say, is sample unlearning more practical than concept unlearning, or the reverse? Also, the definition of concept unlearning reads oddly to me. By the authors, it is defined by the unlearning dataset. However, the authors also mention that the unlearning dataset is not available. So, it seems that it is impossible to achieve the goal of concept unlearning. Anyway, I think it is not a well-established definition.
In Sec 3.2, the authors mention the drawbacks of existing unlearning metrics without stating what exactly they are. The two works on unlearning evaluation mentioned above might also be helpful here. Also, although new metrics have been proposed, it does not mean that the old metrics are completely useless (if not, the authors should justify it), and they should be used in your experiments.
About the benchmarks, could the authors explain why MUSE are not involved in their experiments?
What is gray-box unlearning? Has it been mentioned in this paper or previous works? What factors ensure that Eq 4 and Eq 5 can lead to robust and reliable detection? Could the authors clarify these points?
I do not think the method used in this paper can be categorized as an unlearning method; rather, it is another strategy to avoid exposing harmful information to users.
Also, although it is free from gradient-based unlearning, it involves additional costs when the model is deployed, i.e., extra inference-time costs.
The novelty of this paper seems to be limited, as some previous works also adopt a similar pipeline, such as GUARD: Generation-time LLM Unlearning via Adaptive Restriction and Detection.
Questions
Mentioned in the Strengths And Weaknesses above.
Limitations
Mentioned in the Strengths And Weaknesses above.
Final Justification
The paper quality and rebuttal are professional. Although I still do not think that post-hoc approaches will contribute to the machine learning community's deeper understanding of LLMs, this is an issue that requires further discussion, and I do not expect everyone to have the same opinion as mine. Therefore, in response to the authors' rebuttal, I have raised my score to 4.
Formatting Issues
NA
Thank you for your valuable and detailed evaluation of our study and we would like to provide some clarification to address your concerns. Due to space constraints, some explanations were abbreviated. We’re happy to provide further clarification if needed.
W1: Why training-free methods are preferred in this paper. Training-based unlearning methods are widely used for their ability to deeply modify model parameters and achieve strong forgetting. However, they come with notable drawbacks: high computational cost, extensive hyperparameter tuning, difficulty preserving model utility [3], and incompatibility with API-based or proprietary models. They also struggle with continual unlearning, as each request typically requires new retraining, and risk catastrophic forgetting if not well-constrained.
In contrast, training-free methods are lightweight and more practical for deployment. They do not require access to model weights and are more scalable under resource constraints or privacy regulations. Recent works increasingly explore both directions, highlighting growing interest in training-free alternatives [1,2]. While we do not argue that training-free methods should replace training-based ones entirely, they offer a complementary, pragmatic solution for real-world use cases where retraining is infeasible or undesirable.
[1] In-context unlearning: Language models as few-shot unlearners.
[2] Large Language Model Unlearning via Embedding-Corrupted Prompts
[3] TOFU: A Task of Fictitious Unlearning for LLMs
W2: Clarification on Data Availability for Maintenance. The core challenge is the lack of access to original forget data in real-world settings (e.g., user deletion requests or expired licenses [4]). Once removed, this data cannot be reused for retraining, forcing training-based methods to operate on already-unlearned models, risking compounded degradation. Our training-free approach avoids this dependency by leveraging a detection module, enabling flexible, repeatable unlearning without accumulating utility loss.
[4] On Large Language Model Continual Unlearning
W3: Related work discussion. While several cited papers are already discussed, we acknowledge that [5–7] were not included and will revise the paper accordingly.
[5] Adaptive Localization of Knowledge Negation for Continual LLM Unlearning
[6] Remaining-data-free Machine Unlearning by Suppressing Sample Contribution
[7] Toward Efficient Data-Free Unlearning
W4: New contributions highlighting. Our work is related to In-Context Unlearning [1], as acknowledged in the Related Work section. However, our approach introduces several key innovations that distinguish it. We propose a robust and effective detection mechanism that combines a trained scoring model with the similarity-based metric as a secondary safeguard. In contrast, prior work typically assumes full access to the forget data [1] and relies on simple keyword matching or output filtering [8], which are less robust and flexible. Moreover, instead of relying on handcrafted demonstrations, we leverage the model's instruction-following ability to inject safety specifications directly and further enhance control using reasoning-based prompts. This allows for a more adaptive and scalable intervention mechanism. These two components enable effective, scalable, and training-free unlearning in both standard and continual settings.
[8] Guardrail Baselines for Unlearning in LLMs
W5: Importance of proposed evaluation metrics. Our goal is not to replace existing metrics but to complement them in real-world scenarios like harmful knowledge and continual unlearning.
- Refusal Quality (RQ) goes beyond refusal rate by evaluating whether the model's refusal is coherent and appropriate, which is an important factor often overlooked in prior works, where refusals can be nonsensical [9].
- DDS and DUS address limitations of static evaluation [4] by tracking the stability and cumulative impact of repeated unlearning over time. It can serve as a diagnostic tool for evaluating and comparing unlearning methods before deployment.
We continue to report standard metrics on TOFU [10], WMDP, and MUSE for comparability. RQ is used only for WMDP, while DDS/DUS are specific to continual unlearning. We will clarify the motivation and incorporate a more thorough discussion of related work in the revision.
[9] ReLearn: Unlearning via Learning for Large Language Models
[10] LUNAR: LLM Unlearning via Neural Activation Redirection
W6: Clarification on Sample and Concept Unlearning. In our work, sample and concept unlearning are used as complementary categories to structure the unlearning problem space, not as mutually exclusive definitions.
- Sample unlearning focuses on removing specific data instances (e.g., TOFU), where the forget set is well-defined.
- Concept unlearning targets broader semantic categories (e.g., harmful or illegal content), where exact instances are unknown, but suppression of a general class is desired.
These settings can also be combined—for example, WMDP involves removing both specific samples and the concepts they represent.
Regarding concept unlearning, we denote the forget signal as a concept set, following [11], and aim to ensure that the unlearned model retains no actionable knowledge for any prompt drawn from that set, which consists of forget prompts generated based on the concept. We will clarify these definitions in the revised version.
[11] When Machine Unlearning Meets Retrieval-Augmented Generation (RAG): Keep Secret or Forget Knowledge?
W7: The MUSE results are in Table 10 in the appendix.
W8: "Gray-box unlearning" refers to settings with access to model outputs (e.g., logits or responses) but not to weights or training data, e.g., GUARD [12]. While the term hasn’t been explicitly formalized in prior unlearning work, we will clarify its use in the revision.
Equations 4 and 5 describe a layered detection design: the scoring model provides a coarse-grained first-pass filter, and the similarity-based score offers secondary verification by comparing prompt embeddings to the curated unlearn store. This design improves detection reliability and accuracy. Table 5 empirically validates the detector’s effectiveness across tasks.
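A rough sketch of such a layered decision is shown below; the thresholds and the exact combination rule are assumptions for illustration and do not reproduce Equations 4 and 5:

```python
# Illustrative two-stage decision in the spirit of the layered design above.
# Thresholds and the combination rule are assumptions, not the paper's Eq. 4-5.
import numpy as np

def layered_detect(score_model_prob: float,
                   prompt_emb: np.ndarray,          # shape (d,)
                   store_embs: np.ndarray,          # shape (n, d)
                   tau_score: float = 0.5,
                   tau_sim: float = 0.75) -> bool:
    """First pass: learned scoring model; second pass: similarity safeguard."""
    if score_model_prob >= tau_score:
        return True  # the scoring model already flags the prompt
    # Secondary verification against the Unlearn Store (cosine similarity),
    # intended to catch prompts the scoring model misses.
    sims = store_embs @ prompt_emb / (
        np.linalg.norm(store_embs, axis=1) * np.linalg.norm(prompt_emb) + 1e-8
    )
    return bool(sims.max() >= tau_sim)
```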
W9: On Whether Our Method Qualifies as Unlearning.
We respectfully disagree with the claim that our method does not constitute LLM unlearning. While some LLM unlearning methods often involve parameter updates via gradient-based training, our approach addresses a complementary and increasingly practical setting [1,2,12] where model weights are fixed, and unlearning must be enforced at inference time, such as in API-based or frozen-model deployments.
Our method explicitly detects and intercepts prompts related to forgotten content, and modifies model behavior to avoid generating responses tied to that information. In doing so, we fulfill the core objective of unlearning: preventing the model from retaining or revealing specific knowledge [13].
[12] GUARD: Generation-time LLM Unlearning via Adaptive Restriction and Detection
[13] Large Language Model Unlearning
W10: Inference-Time Cost Justification. We acknowledge that our approach introduces a modest increase in inference-time cost compared to standard generation. However, both the detection module (Table 1) and the policy retrieval component are computationally inexpensive.
For non-forget-related prompts, the detection module runs once, and no further intervention is triggered. Thus, the inference latency remains effectively the same as standard LLM inference for the vast majority of input queries.
For forget-related prompts, safety becomes the top priority. In such cases, a modest latency increase is acceptable, particularly for sensitive or regulated domains where safety outweighs speed. Moreover, future enhancements like prompt summarization or context compression offer promising directions to further reduce intervention cost.
In summary, our method achieves practical, training-free unlearning with minimal deployment overhead. It is especially suitable for real-world scenarios where model internals are inaccessible (e.g., commercial LLM APIs) and latency constraints must be balanced with safety and compliance.
Table 1: Per-example latency (in milliseconds) for the detection module and unlearned prompt inference under open-ended generation.
| Datasets | Models | Detection time | Guard Inference (Not including detection) |
|---|---|---|---|
| TOFU-forget10 | Llama2-7B-Chat | 4.63 | 665.71 |
| TOFU-Retain | Llama2-7B-Chat | 4.83 | 42.93 |
| WMDP | Zephyr-7B | 237.79 | 1035.16 |
| MMLU | Zephyr-7B | 323.41 | 119.81 |
W11: Comparison with GUARD. Our approach differs fundamentally from GUARD in motivation, design, and deployment feasibility. GUARD uses a trained classifier to filter forbidden tokens during generation, requiring fine-grained control over the decoding process. In contrast, we detect forget-related prompts before generation via a robust detection module that combines a trained scoring model with similarity-based metrics, and intervene using instruction-following with reasoning-based guidance. GUARD assumes access to internal decoding mechanisms, which is infeasible for black-box or API-only LLMs. Our framework is API-compatible and thus more deployable. Additionally, unlike GUARD, we avoid storing forget data by using synthetic negatives or embeddings, improving privacy and enabling continual unlearning. Our method also supports explainable, scalable intervention across multiple models and tasks.
Minor Notation. In Line 110, we use h to denote the conditional probability distribution over tokens, while M denotes the language model as a whole, which generates actual text through decoding.
Dear Reviewer rPZF,
We hope this message finds you well. We have received your “Mandatory Acknowledgement” of our rebuttal and are writing to respectfully follow up to ensure our rebuttal fully addresses your initial concerns. We greatly appreciate the time and thoughtful feedback you have provided, which has helped us strengthen the paper. In our rebuttal, we aimed to directly address each of your points with clarifications (weaknesses 1, 2, 4, 5, 6, 7, 8, 9, 11), additional analyses (weaknesses 3, 10), and supporting results. We hope these additions have resolved the issues you raised. If there are any remaining questions or aspects, we would be more than happy to elaborate promptly. Your confirmation would help us improve the paper further.
Best regards,
Authors
The paper quality and rebuttal are professional. Although I still do not think that post-hoc approaches will contribute to the machine learning community's deeper understanding of LLMs, this is an issue that requires further discussion, and I do not expect everyone to have the same opinion as mine. Therefore, in response to the authors' rebuttal, I have raised my score to 4. Good luck!
Thank you for your thoughtful follow-up and for raising your score.
As LLM unlearning is still a developing area, we believe both training-free and training-based approaches merit further discussion and exploration, as they address complementary and increasingly practical settings.
We value your recognition of our paper’s quality and rebuttal. Your feedback has been invaluable in strengthening the work.
Once again, thank you for your detailed and positive feedback. Wishing you all the best in your professional and personal endeavors!
This paper presents a novel unlearning framework, DRAGON, that integrates a guardrail module and unlearning CoT instruction generation module to effectively unlearn content from black-box LLMs without requiring model fine-tuning. Specifically, it introduces a detection mechanism to identify queries requiring unlearning, along with a CoT instruction generation strategy to intervene contextually. Additionally, the paper proposes three new evaluation metrics: Refusal Quality, Dynamic Deviation Score, and Dynamic Utility Score. Empirical evaluation across sample and concept unlearning tasks demonstrates DRAGON’s state-of-the-art performance in balancing unlearning effectiveness and model utility
Strengths and Weaknesses
Strength:
- The paper introduces a practical unlearning approach that effectively handles proprietary, black-box LLMs.
- Leveraging the inherent instruction-following capability of LLMs, DRAGON significantly mitigates potential catastrophic forgetting and utility degradation common in training-based methods.
- Comprehensive experiments and ablation studies validate the effectiveness and strengths of the approach, consistently achieving state-of-the-art results across various benchmarks.
Weakness:
- The necessity of substantial task-specific adjustments could limit real-world applicability due to the complexity of constructing tailored guardrails and CoT instructions.
- Computational overhead introduced by the generation of guardrail instructions could benefit from a deeper analysis
- The method heavily depends on the reliability and robustness of the detection module and guardrail mechanisms. Additional systematic investigations into the robustness of these components would further strengthen the framework's practicality.
- The motivation for the dynamic deviation score and dynamic utility score is not well justified. The primary goal for model providers is to ensure that forget quality and utility are well preserved during deployment, and they are unlikely to deploy a model that lacks these properties. This reduces the practical motivation for dynamic tracking in unlearning.
Questions
Could the authors provide more details about the MUSE benchmark? Specifically, how is the guardrail constructed, and how are the instructions rewritten? It is interesting to understand how the guardrail can effectively retain the relevant contents with high KnowMem while achieving nearly complete forgetting on the KnowMem forget set.
Limitations
See weakness above
Final Justification
This is a solid work on LLM unlearning. While it may be somewhat engineering-oriented, it significantly improves the metric scores for LLM unlearning, especially outperforming training-based approaches. The rebuttal also addresses most of my concerns, so I am raising my score to 5.
Formatting Issues
No paper formatting issue
We sincerely appreciate the reviewer’s time and effort in reading our paper and offering thoughtful suggestions and constructive feedback.
W1: Task-Specific Adjustments and Real-World Applicability. We clarify that our framework is designed to be modular and reusable, minimizing task-specific overhead in practice.
Tasks can be grouped into broad categories: private, harmful, and copyright-related information, each of which may contain multiple subtasks. For each category, the same detection and guard models can be reused with minimal tuning.
- The Unlearn Store is simple to maintain, as it consists of paraphrased or synthetic forget prompts.
- The scoring model is trained using lightweight text samples and can be quickly adapted to new tasks.
- Guard model training is performed once per category and reused across subtasks to generate CoT instructions.
While guardrails may require some task-specific policy definitions, these can be bootstrapped or automated using an LLM or agent guided by category-level templates. Overall, we propose a scalable, training-free unlearning framework that supports generalization with low maintenance cost compared with training-based unlearning methods, making it suitable for real-world deployment.
W2: Computational overhead. We thank the reviewer for this important comment. The generation of guardrail instructions consists of: (1) fast policy retrieval (<1 ms), and (2) CoT instruction generation, which is triggered only when a forget-related prompt is detected.
While this generation step introduces additional latency, it is selectively triggered for potentially sensitive inputs, where response safety takes precedence over speed. In such cases, slightly longer inference time is often tolerable, especially in real-world applications like compliance-sensitive interfaces, moderation systems, or regulated domains.
Note that the guardrail is applied after detection, so only a small fraction of total queries experience any added cost. For non-forget-related queries (the majority), there is no latency overhead beyond the base model inference (detection module only takes ~5 ms on TOFU dataset).
Finally, there is significant room for optimization. We propose a novel and systematic unlearning framework aimed at enhancing prompt-based unlearning, which is a largely underexplored area. Our extensive experiments demonstrate strong effectiveness in both forgetting quality and model utility, validating the practicality and impact. This is a good start. Techniques such as prompt distillation, trace caching, or summarized CoT generation are promising directions to reduce this cost further, and we plan to explore them in follow-up work. We will clarify this in the revised manuscript.
W3 Robustness of the detection module and the framework.
To evaluate this, we conducted a series of attack experiments, as shown in Tables 13 and 14 (Appendix), demonstrating that our full framework remains robust under various adversarial conditions.
- Robustness of the detection module. To isolate and further analyze the detection module’s resilience, we also conducted dedicated attack experiments focused solely on the detector (Table 1). These include AIM attacks, language-mixing attacks, and typo-based perturbations, as introduced in Appendix D.6. Instead of using Attack Success Rate, we report detection accuracy to directly measure the detector's performance under attack. A higher or comparable accuracy relative to the original setting indicates that the detector is robust to these attacks. Our results confirm that the detection module maintains strong performance even under these common adversarial manipulations.
Table 1: The detection accuracy on TOFU forget dataset under different attacks.
| Attack type | TOFU-10% | TOFU-5% | TOFU-1% |
|---|---|---|---|
| Original | 1.0 | 1.0 | 1.0 |
| AIM Attacks | 1.0 | 1.0 | 1.0 |
| Language mix (2 languages) | 1.0 | 1.0 | 1.0 |
| Language mix (4 languages) | 0.88 | 0.97 | 0.97 |
| Typo attack | 0.97 | 0.98 | 0.97 |
- Robustness to out-of-distribution (OOD) prompts (Forget dataset). We conduct experiments on out-of-distribution (OOD) prompts to evaluate the robustness of the detection module. Rephrased prompts are generated by GPT-4o through paraphrasing the original forget prompts to confuse the detector. Keywords and Short Phrases refer to prompts rewritten using only a minimal set of key terms or fragments. Adversarial prompts include small perturbations such as misspellings, Unicode homoglyphs, or unnatural spacing to evade exact-match detection. Detailed prompt generation instructions will be added in the revised version for reproducibility. In Table 2, the detector module is robust to the generated OOD prompts regarding the forget dataset.
Table 2: Detection accuracy of the TOFU and WMDP detectors on various types of out-of-distribution (O.O.D.) prompts derived from the forget dataset.
| Prompt type | TOFU-forget10 | WMDP (Avg) |
|---|---|---|
| Original | 1.0 | 0.98 |
| Rephrased | 1.0 | 0.96 |
| Keywords and short phrases | 1.0 | 0.97 |
| Adversarial | 0.99 | 0.95 |
- Robustness to out-of-distribution (OOD) prompts (Unseen normal dataset). To evaluate detection performance on non-forget-related, out-of-distribution content, we randomly sample 400 prompts each from SimpleQA and Alpaca datasets. These serve as control datasets not subject to unlearning.
Table 3: Detection accuracy of the TOFU and WMDP detectors on unseen, non-forget-related O.O.D. prompts from SimpleQA and Alpaca. (Forget is the positive class)
| General Dataset | TOFU | WMDP |
|---|---|---|
| SimpleQA (400) | 0.01 | 0.11 |
| Alpaca (400) | 0.01 | 0.05 |
On the general set, our detectors correctly classify these prompts as non-forget, exhibiting a low false positive rate. This suggests that the performance of the main LLM on inputs unrelated to the forget set is unlikely to be negatively impacted. Both Table 2 and Table 3 demonstrate the robustness of our detection module under OOD distribution.
W4: Motivation for Dynamic Deviation Score and Dynamic Utility Score. While model providers aim to preserve both forget quality and utility, this becomes particularly challenging in continual unlearning settings, where new unlearning requests are issued over time. Although the utility degradation from a single unlearning step may appear negligible (e.g., 0.00001), it can accumulate significantly over time (e.g., reaching 0.1), leading to noticeable drops in performance.
The DDS and DUS are designed to measure the stability and cumulative impact of unlearning methods across multiple rounds. These metrics enable researchers and practitioners to identify methods that are robust to continual unlearning and to detect early signs of performance drift before deployment.
Importantly, DDS/DUS do not replace standard metrics like forget accuracy or static utility; rather, they complement them by capturing long-term behavior in realistic deployment settings. They help identify approaches that maintain consistent behavior and avoid error accumulation, which is critical for real-world applications where continual unlearning is necessary.
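As a toy illustration of the accumulation effect (not the paper's DDS/DUS definitions), consider how a seemingly negligible per-round utility drop compounds over sequential requests:

```python
# Toy illustration only: cumulative utility drift over sequential unlearning
# rounds. The numbers are made up and this is NOT the DDS/DUS formula.
utility = [0.620]                 # utility before any unlearning request
per_round_drop = 0.004            # small, seemingly negligible drop per round
for _ in range(25):               # 25 sequential unlearning requests
    utility.append(utility[-1] - per_round_drop)

print(f"cumulative utility loss: {utility[0] - utility[-1]:.3f}")  # 0.100
```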
Q1: Clarification on MUSE Benchmark and Guardrail Construction. Our detection module integrates a learned scoring model that captures high-level prompt features to assess alignment, and a similarity-based metric that computes prompt-to-store sample distances for secondary verification. For the detection module used in MUSE, we first train a chunk-level classifier using forget and retain data split into text segments. To improve generalization, we generate various modified questions (e.g., paraphrased, partial) from this data and train a second, question-aware classifier. These two classifiers form the scoring model, capturing both content- and query-level semantics. Additionally, we build an Unlearn Store that contains summaries of forget content, and use similarity-based matching as a second verification step to further reduce false negatives.
High KnowMem Retention. The dual-filtering mechanism allows the detector to accurately distinguish between forget and retain or non-forget content. This ensures that no intervention is triggered to queries from the retain set, contributing to the high KnowMem on it.
Good KnowMem Forgetting. For prompts identified as forget-related, we extract the relevant policy and generate a reasoning-based CoT trace using the trained guard model. These instructions leverage the LLM’s inherent instruction-following ability to enforce forgetting without retraining.
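For illustration, querying a trained guard model for such a CoT trace could look like the sketch below; the model checkpoint and prompt template are stand-ins, not the released artifacts of this work:

```python
# Sketch of asking a guard model for a CoT instruction. The checkpoint and
# prompt template are stand-ins; this is not the trained guard model itself.
from transformers import pipeline

guard = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-beta")

def cot_instruction(user_prompt: str, policy: str) -> str:
    guard_prompt = (
        "You are a safety guard. Given the policy and the user question, write "
        "step-by-step instructions for answering safely.\n"
        f"Policy: {policy}\nQuestion: {user_prompt}\nInstructions:"
    )
    out = guard(guard_prompt, max_new_tokens=128, do_sample=False,
                return_full_text=False)
    return out[0]["generated_text"].strip()
```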
Thank you for your response. While I still believe this work may be somewhat engineering-oriented, it significantly improves the metric scores for LLM unlearning, notably outperforming training-based approaches. The rebuttal also addresses most of my concerns, so I am raising my score to 5.
Thank you for your thoughtful consideration and for raising the score.
We understand the concern that parts of our work may appear engineering-oriented, and we appreciate the opportunity to clarify our contributions further. Our framework introduces several technically novel components, including a robust detection module that combines a trained scoring model with similarity-based metrics to effectively identify forget prompts, and a reasoning-based intervention mechanism that operates entirely without training the base model. Together, these components enable scalable, training-free unlearning under continual and black-box settings, which are scenarios that we believe are both technically non-trivial and practically relevant.
We truly appreciate your engagement and thoughtful feedback, and we will ensure these aspects are more clearly highlighted in the revised version. Wishing you all the best in your professional and personal endeavors!
DRAGON-Guard tackles the whack-a-mole problem in LLM safety: once a model is released, attackers rapidly paraphrase disallowed text so that static keyword filters or one-shot fine-tunes no longer catch them, while retraining the whole network is prohibitively slow and costly. The authors hypothesise that a light-weight combination of (i) nearest-neighbour retrieval on hidden representations and (ii) a two-minute “surgical” patch fine-tune can make a model forget specific strings without hurting its general skills.
Starting from seed jailbreak prompts, they automatically generate paraphrased attacks in the SNET benchmark to stress-test defences and to populate a vector store that is consulted at inference time. When a query’s hidden state falls inside the learned neighbourhood of a banned concept, a chain-of-thought guard turns the answer into a safe refusal; afterwards, an optional patch updates only a few million parameters so the ban becomes permanent.
On three 7-B-parameter backbones (Llama-2, Mistral and Qwen 1.5) the detector alone already reaches F-scores around 0.90–0.91 on both exact and paraphrased SNET splits (e.g., 0.908/0.909 on SNET-4+), and adding the patch pushes exact-string block rate above 0.95 with only tens-of-milliseconds overhead. Meanwhile, broad utility drops by <2 points on QNLI, TruthfulQA and related suites, well within ordinary training noise, confirming the authors’ claim that the method “preserves general language ability”.
Taken together, the paper delivers the first end-to-end pipeline that can automatically red-team, detect and then surgically forget hundreds of taboo strings in a matter of minutes, offering a pragmatic safety valve for models that must keep pace with ever-evolving misuse tactics.
Strengths and Weaknesses
Strength:
- Instant, training-free unlearning. The guard “forgets” a disallowed prompt simply by inserting or deleting its embedding in the FAISS store; the authors report that even 1k rules can be updated in < 10 seconds on one GPU, with zero model fine-tuning.
- Semantically robust two-stage defence. A retrieval layer first pulls up near-duplicate or paraphrased queries, and a lightweight reasoning LLM (“CoT-Guard”) then judges policy compliance. This design catches sophisticated jailbreak rewrites while avoiding many lexical false positives.
- High effectiveness with minimal utility loss. Across three backbone chat-LLMs the system blocks 90–95% of direct jailbreaks and recalls 85–92% of paraphrased attacks, yet the benign answer quality score (QLT) drops by < 3 points.
Weakness:
- Sensitive to truly novel (OOD) paraphrases. On the OOD evaluation split SNET fl9′, detection metrics fall noticeably relative to the in-distribution splits, confirming that recall drops when attackers stray farther from stored paraphrases.
- Storage and latency grow linearly. Every forgotten prompt adds a new high-dimensional vector and a fresh chain-of-thought template; the paper acknowledges that scaling to millions of rules remains an open problem that will need smarter indexing or compression.
- More engineering than scientific novelty. DRAGON-Guard is positioned as an industry prototype—an integration of existing semantic search and CoT reasoning blocks—rather than a fundamentally new algorithmic idea, as the authors themselves note in their discussion of “practical deployment” rather than theoretical contributions.
Questions
NA
Limitations
NA
Final Justification
Most of the concerns have been addressed.
Formatting Issues
NA
We sincerely appreciate your insightful evaluation of our study. Thanks so much for your positive and valuable review.
W1: Sensitive to truly novel (OOD) paraphrases. We conduct experiments on out-of-distribution (OOD) prompts to evaluate the robustness of the detection module.
In Table 1, the detector module is robust to the generated OOD prompts regarding the forget dataset. Rephrased prompts are generated by GPT-4o through paraphrasing the original forget prompts to confuse the detector. Keywords and Short Phrases refer to prompts rewritten using only a minimal set of key terms or fragments. Adversarial prompts include small perturbations such as misspellings, Unicode homoglyphs, or unnatural spacing to evade exact-match detection. Detailed prompt generation instructions will be added in the revised version for reproducibility.
Table 1: Detection accuracy of the TOFU and WMDP detectors on various types of OOD prompts derived from the forget dataset. (Higher is better)
| Prompt type | TOFU-forget10 | WMDP (Avg) |
|---|---|---|
| Original | 1.0 | 0.98 |
| Rephrased | 1.0 | 0.96 |
| Keywords and short phrases | 1.0 | 0.97 |
| Adversarial | 0.99 | 0.95 |
On the general set (Table 2), our detectors correctly classify these prompts as non-forget, exhibiting a low false positive rate. To evaluate detection performance on non-forget-related, out-of-distribution content, we randomly sample 400 prompts each from SimpleQA and Alpaca datasets. These serve as control datasets not subject to unlearning. The result suggests that the performance of the main LLM on inputs unrelated to the forget set is unlikely to be negatively impacted. Both Table 1 and Table 2 demonstrate the robustness of our detection module under OOD distribution.
Table 2: Detection accuracy of the TOFU and WMDP detectors on unseen, non-forget-related O.O.D. prompts from SimpleQA and Alpaca. (Lower is better)
| General Dataset | TOFU | WMDP |
|---|---|---|
| SimpleQA (400) | 0.01 | 0.11 |
| Alpaca (400) | 0.01 | 0.05 |
W2: Storage and latency grow linearly. We agree that latency may grow with the number of forget prompts during inference. However, in practice, both the storage and inference-time cost are well-controlled, and the incremental overhead per instruction remains modest:
Storage-wise: we do not store all individual forget prompts. Instead, only modified representative prompts or key information are stored in the Unlearn Store before unlearning. This avoids redundancy. Also note that the Unlearn Store serves as a secondary verification for our detection module.
Latency-wise: detection remains fast (~5 ms), and policy retrieval adds negligible cost. CoT generation is only triggered after detection and thus applies to a small fraction of inputs. This ensures that the average-case latency remains low, and high-latency responses are only incurred when safety is critical.
We acknowledge that scaling to millions of rules remains an open challenge. However, our framework is designed to be extensible. In scenarios with large-scale rule sets: The Unlearn Store can be scaled using representative vector selection to facilitate the detection process. The scoring model can be trained on larger rule datasets to generalize across prompt families. For in-context intervention, we can incorporate context compression or virtual tokens to reduce prompt length and memory usage.
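For illustration, an Unlearn Store backed by a vector index could support this kind of incremental rule management as sketched below; the index type, dimensionality, and helper names are assumptions, not the paper's configuration:

```python
# Sketch of an Unlearn Store backed by FAISS, supporting incremental addition
# and retirement of forget rules. Index type and dimensions are illustrative.
import faiss
import numpy as np

dim = 384
index = faiss.IndexIDMap(faiss.IndexFlatIP(dim))   # inner product over normalized vectors

def add_rules(embs: np.ndarray, ids: np.ndarray) -> None:
    """Add representative forget-prompt embeddings (float32) with int64 ids."""
    faiss.normalize_L2(embs)
    index.add_with_ids(embs, ids)

def remove_rules(ids: np.ndarray) -> None:
    """Retire rules, e.g., when an unlearning request is withdrawn."""
    index.remove_ids(ids)

def nearest(query: np.ndarray, k: int = 3):
    """Return (similarities, rule ids) for a (1, dim) float32 query."""
    faiss.normalize_L2(query)
    return index.search(query, k)
```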
Overall, our method is designed to be modular and incrementally extensible, making it suitable for safety-critical and commercial LLM deployment settings where retraining is infeasible but continual unlearning is necessary. We propose a novel and systematic unlearning framework aimed at enhancing prompt-based unlearning, which is a largely underexplored area. Our extensive experiments demonstrate strong effectiveness in both forgetting quality and model utility, validating the practicality and impact of our approach. To further improve scalability and latency, we plan to explore smarter indexing techniques, deduplication, and in-context guardrail compression in future work, which will lead to a stronger unlearning framework.
W3: Novelty Justification. While we position DRAGON as a practical and deployment-friendly unlearning framework, we respectfully argue that our work offers meaningful algorithmic contributions, particularly in the design of the detection and in-context intervention modules.
- Our detection module is more than a simple semantic search. It integrates a learned scoring model that captures high-level prompt features to assess alignment with unlearning objectives, and a similarity-based metric that computes prompt-to-store embedding distances for secondary verification. These two signals are combined into a unified confidence score, enabling robust and adaptive thresholding to handle distributional shifts and paraphrased attacks. This dual-layer structure forms a lightweight yet effective algorithm for scalable and accurate detection. Table 5 in the paper demonstrates the effectiveness of our detection module, which achieves nearly perfect detection performance.
- In-context intervention. Upon detection, we dynamically retrieve relevant safety policies from a curated policy store, ensuring that the response aligns with deployment-specific safety requirements. To enforce unlearning, we leverage the LLM’s instruction-following ability in a CoT-guided intervention strategy: we use rejection sampling to construct a high-quality CoT instruction dataset aligned with safety goals. A trained guard model then generates context-aware reasoning traces, which are prepended to the input to modulate model behavior at inference time. This setup introduces a soft unlearning constraint without modifying the underlying weights, offering a novel prompting-based LLM unlearning strategy that is interpretable, modular, and scalable to black-box settings.
- Our method supports repeatable unlearning across requests, scales to large models (e.g., GPT-4o), and avoids retraining, offering both practical value and methodological novelty.
We believe these contributions offer not only practical impact but also methodological advancements, especially as the field shifts toward real-world unlearning under constraints like user-driven data deletion in the continual unlearning setting and API-only deployments. We will clarify these contributions more explicitly in the revised version.
Thanks for the response with explanation. Most of the concerns have been addressed. The score has been updated accordingly.
Thank you for your follow-up and for updating your score. We appreciate your recognition that most of the concerns have been addressed. Your review has been very helpful in refining our work.
Once again, thank you for your valuable feedback and support. Wishing you all the best in your professional and personal endeavors!
Summary
This paper proposes DRAGON, a lightweight, black-box LLM unlearning framework designed to address critical limitations of existing training-based unlearning methods (e.g., reliance on retain data, model modification). DRAGON leverages two core components: (1) a lightweight detection module that identifies forget-worthy prompts (without needing retain data) by combining a learned scoring model and similarity-based verification against an Unlearn Store of representative prompts; (2) a chain-of-thought (CoT) guard model that injects context-aware safety instructions at inference time to enforce unlearning, without modifying the base LLM’s weights. The paper also introduces novel evaluation metrics—Refusal Quality (RQ), Dynamic Deviation Score (DDS), and Dynamic Utility Score (DUS)—to assess unlearning robustness and cumulative performance in continual unlearning settings. Empirically, DRAGON is validated across three unlearning tasks (sample, concept, continual) on backbones including Llama-2-7B, Mistral, and Qwen 1.5, achieving 90–95% block rates for jailbreak/paraphrased prompts with <2–3 points of utility loss on benchmarks like QNLI and TruthfulQA.
Strengths
- This work has merits of practical training-free & black-box compatibility. DRAGON operates without retraining or accessing base model weights, making it suitable for proprietary/API-based LLMs (a key real-world constraint). Rule updates and reusable components enable fast, scalable deployment.
- The unlearning design is shown to be robust. The detector (semantic search + scoring model) catches paraphrased/jailbreak prompts (F1 scores 0.90–0.91), while the CoT guard model reduces false positives by enforcing policy-aligned responses, addressing the whack-a-mole problem of static filters.
- Extensive experiments (TOFU, WMDP, MUSE benchmarks) confirm DRAGON blocks 90–95% of direct/jailbreak prompts and retains 85–92% recall for paraphrased attacks, with negligible drops in benign answer quality (<3 points) and general task performance.
- Novel evaluation metrics for continual unlearning. RQ (measures coherent refusals), DDS (tracks unlearning stability), and DUS (captures cumulative utility loss) fill gaps in existing metrics, which fail to account for long-term performance drift in real-world deployment.
- Components like the Unlearn Store (stores representative, not all, forget prompts) and category-specific guard models minimize redundancy, supporting practical use cases like user-driven data deletion or compliance-sensitive domains.
Weaknesses
- The approach is somewhat sensitive to novel out-of-distribution (OOD) paraphrases: while the authors demonstrated robustness to rephrased/adversarial prompts (detection accuracy >0.95), detection metrics decline on highly novel OOD splits (e.g., SNET fl9′), indicating vulnerability to advanced paraphrase attacks.
- The technical novelty is limited. DRAGON integrates existing techniques (semantic search, CoT reasoning) rather than introducing fundamental new ideas, making it more engineering-oriented than a paradigm shift in unlearning.
- Though modular, the framework requires task-specific policy definitions (e.g., fake author names for TOFU vs. refusal for WMDP). Automating these policies further would reduce deployment friction.
Discussions during Rebuttal Period
Reviewers raised five key concerns.
- OOD Robustness (dq3Q, SVAA, ZNbo): Reviewers noted detection performance drops on novel paraphrases. Authors responded with experiments on GPT-4o-generated rephrased, keyword-only, and adversarial prompts (Tables 1–2), showing detection accuracy >0.95 for forget-related OOD prompts and <0.11 false positives on non-forget data (SimpleQA/Alpaca). This demonstrated the detector’s resilience to common OOD attacks.
- Latency Overhead (adq3Q, SVAA, rPZF): Reviewers questioned linear scaling. Authors clarified the Unlearn Store uses representative (not all) forget prompts to avoid redundancy; detection runs in ~5ms, and CoT generation is only triggered for high-risk prompts (minimizing average latency). They also proposed future optimizations (indexing, context compression), framing current overhead as acceptable for safety-critical use cases.
- Novelty & Metric Motivation (rPZF, SVAA): Reviewers criticized limited algorithmic novelty and unclear metric value. Authors highlighted the dual-layer detector (scoring model + similarity verification) and dynamic CoT intervention as novel extensions of existing techniques; for metrics, they explained DDS/DUS complement (not replace) standard metrics by tracking cumulative utility loss in continual unlearning—critical for long-term deployment.
- Detector Robustness to Attacks (ZNbo): Reviewers asked about adversarial risks. Authors presented dedicated experiments (AIM attacks, language mixing, typos) showing detection accuracy >0.95 under attacks (Table 1), confirming the detector’s resilience to common perturbations.
- Task-Specific Adjustments (SVAA): Reviewers noted deployment friction from task-specific policies. Authors explained tasks are grouped into categories (private/harmful/copyright), with guard models trained once per category and policies bootstrapped via LLM templates—minimizing task-specific overhead.
Most reviewers updated their scores after the rebuttal, confirming that concerns were adequately addressed. The meta-reviewer weighs these responses as thorough, with empirical evidence supporting claims, reinforcing the paper’s suitability for acceptance.
Decision Justification
DRAGON addresses a critical yet under-solved problem in LLM safety: unlearning for black-box/API-deployed models where training-based methods are infeasible. It has merits of practicality, strong empirical validation, and novel metrics, which directly align with NeurIPS’ focus on both technical rigor and real-world impact. The paper’s weaknesses are mostly incremental and addressable: OOD sensitivity can be mitigated with better paraphrase augmentation, and scalability limits are acknowledged as future work (a reasonable trade-off for current deployment value). Unlike purely theoretical work, DRAGON provides a ready-to-use pipeline for unlearning, which fills a critical gap for industry and researchers working with constrained LLM access. Despite the limitations, the paper’s thorough evaluation, responsiveness to reviewer concerns, and solution to a pressing practical problem make it a valuable addition to the conference.
Nevertheless, the Program Chairs have investigated and determined that this paper cannot be accepted under this year’s Responsible Reviewing Initiative, as at least one of the authors was found to be grossly negligent as a reviewer during the review process.