Agents Are All You Need for LLM Unlearning
LLM-agent-based unlearning beats all existing unlearning methods
Abstract
Reviews and Discussion
This paper proposes Agentic LLM Unlearning (ALU), a novel, post-hoc, multi-agent pipeline designed for targeted unlearning in LLMs without retraining or access to model weights. ALU is evaluated on established benchmarks (TOFU, WPU, WMDP) and tested for scalability (up to 1000 targets), multilingual robustness, and resistance to jailbreaking. It outperforms both optimization-based and other post-hoc methods in most dimensions, especially scalability, response utility, and robustness to adversarial prompts.
Reasons to Accept
- The multi-agent design is innovative and modular. It separates concerns clearly across agents, improving interpretability and debuggability.
- ALU demonstrates near-constant inference-time cost regardless of the number of unlearning targets, outperforming methods like ICUL and NPO which scale poorly.
- ALU shows strong performance across forget, retain, and privacy metrics. Its close-to-random WMDP scores and high MMLU retention validate both unlearning and utility.
- ALU remains effective under multilingual prompts, jailbreaking, and many-shot adversarial contexts.
Reasons to Reject
- How does ALU prevent the user from incrementally reconstructing the unlearned knowledge by chaining multiple queries or performing prompt injection on intermediate agents (similar to re-learn attack [1])? Has the framework been tested under iterative attack settings such as back-to-back paraphrased queries or recovery via contrastive prompting?
- Reliance on few-shot prompts introduces implicit dependency on demonstration quality, but the prompt design and sensitivity are not disclosed or analyzed.
- The paper notes entanglement as a challenge but doesn’t formally quantify how the composer agent handles entangled responses when merging top-j outputs. This can accidentally reintroduce entangled facts.
[1] Towards LLM Unlearning Resilient to Relearning Attacks: A Sharpness-Aware Minimization Perspective and Beyond.
Questions for the Authors
See the Reasons to Reject.
Thank you for your thoughtful and encouraging review. We appreciate your recognition of ALU’s modular multi-agent design, its scalability advantages, and its robust performance across benchmarks, languages, and adversarial settings. Your feedback affirms our research direction and will guide our next steps in refining and extending ALU.
R1: How does ALU handle re-learning attacks, paraphrased-query attacks, and contrastive prompting?
Re-learning attacks [1] assume that the adversary is able to fine-tune the unlearned model with some auxiliary data. However, we present ALU as a black-box method, similar to commercial systems such as ChatGPT and Gemini, which are likewise never served “raw”; they are fronted by policy wrappers and guardrails that perform duties similar to ALU’s [2–4]. We do not assume that the user will have access to the underlying model and the agents in ALU, and hence these attacks are not relevant to our work.
As per the reviewer’s suggestions, we have tested ALU on contrastive prompting [5] and three different kinds of paraphrasing attacks [6–8], and have extended the experiments reported in Table 3. Although we observe a slight dip in performance for LLM Paraphrasing, it is still not the lowest score, which we observed for the Jailbreak Prompts. Hence, the experiments reconfirm ALU's consistency and robust structure.
| Perturbation | ALU (GPT) | ALU (Llama) | ALU (Qwen) |
|---|---|---|---|
| None | 9.500 | 8.500 | 9.225 |
| Target Masking | 9.500 | 8.160 | 9.160 |
| Jailbreak Prompts | 8.000 | 7.330 | 7.830 |
| Other Languages | 9.500 | 6.000 | 8.750 |
| Many-shot jailbreaking | 9.000 | 7.830 | 8.830 |
| Contrastive Prompting [5] | 9.500 | 9.000 | 8.950 |
| Round Trip Translation [6] | 9.225 | 8.750 | 9.125 |
| LLM Paraphrasing [7] | 8.775 | 8.200 | 8.500 |
| Word level Paraphrasing [8] | 9.000 | 8.500 | 8.750 |
[5] - https://arxiv.org/abs/2311.09277
[6] - https://arxiv.org/pdf/2209.07351
[7] - https://arxiv.org/pdf/2309.00614
[8] - https://arxiv.org/pdf/2308.03825
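For concreteness, below is a minimal sketch of the round-trip-translation perturbation in the spirit of [6]. `translate` is a hypothetical stand-in for any machine-translation backend, and the snippet is an illustration of the attack idea, not our exact evaluation harness.

```python
# Minimal sketch of a round-trip-translation paraphrase attack: pivot the
# query through another language and back, then feed the paraphrased query
# to the unlearning pipeline under test.
# `translate` is a hypothetical stand-in for any MT backend.
def round_trip_paraphrase(query: str, translate, pivot: str = "fr") -> str:
    pivoted = translate(query, source="en", target=pivot)
    return translate(pivoted, source=pivot, target="en")

# The perturbed query is then scored exactly like the unperturbed one, e.g.:
# attacked_query = round_trip_paraphrase("Describe X's role in Y", translate)
```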
R2: Reliance on few-shot prompts introduces implicit dependency on demonstration quality, but the prompt design and sensitivity are not disclosed or analysed.
The prompt designs for each of the agents have been included in Appendix A.7 for transparency and reproduction. We are aware of LLMs' sensitivity to prompt and input order [9] and have conducted extensive experiments with the prompting style and the order of the few-shot examples, as mentioned in A.7. For our work, we did not observe much variance in the results, and the prompting setup used for our evaluations was the one with the most consistent results.
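To make the sensitivity analysis concrete, here is a minimal sketch of this kind of check. `evaluate_with_demos` is a hypothetical stand-in for running ALU with one few-shot ordering and returning an aggregate score; it is not the paper's code.

```python
# Minimal sketch of a prompt-order sensitivity check: run the pipeline over
# permutations of the few-shot demonstrations and report the score spread.
import itertools
import statistics

def order_sensitivity(demos, evaluate_with_demos, max_perms=24):
    scores = [
        evaluate_with_demos(list(perm))
        for perm in itertools.islice(itertools.permutations(demos), max_perms)
    ]
    return statistics.mean(scores), statistics.stdev(scores)
```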
R3: The paper notes entanglement as a challenge but doesn’t formally quantify how the composer agent handles entangled responses when merging top-j outputs. This can accidentally reintroduce entangled facts.
The Composer Agent simply aggregates the top-j responses and drafts a response that reflects the best attributes of those responses. The Composer Agent does not reintroduce entangled facts, since the top-j responses presented to it have already been thoroughly checked by the Critic Agent for traces of entanglement with unlearned information. We discuss the roles of each agent and their combined effectiveness for unlearning in Section 3 of the main paper.
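To make the gating concrete, here is a minimal sketch of the selection step. `critic_rate` and `compose` are hypothetical stand-ins for the Critic and Composer calls, not the paper's implementation.

```python
# Minimal sketch of the Critic-gated composition step: the Composer only ever
# sees the top-j responses that the Critic has already rated for leakage.
def compose_final(responses, critic_rate, compose, j=3):
    rated = [(critic_rate(r), r) for r in responses]   # 1-5 leakage/utility score
    rated.sort(key=lambda pair: pair[0], reverse=True)
    top_j = [r for score, r in rated[:j]]
    return compose(top_j)  # merges only Critic-approved responses
```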
To demonstrate how entangled facts are handled by ALU, we provide a complete example (in Response 3/3) and show what happens at each stage. We show how each agent in our pipeline handles the question “Describe Victor Krum’s experience at the Yule Ball”, given that Hermione Granger is an unlearning subject (Hermione was an important character in the Yule Ball episode, and Victor Krum went to the Ball with her, so it is non-trivial to answer this question without mentioning Hermione). As we observe in the output from the Vanilla Agent, the response contains direct references to Hermione going to the Yule Ball with Victor, and the entire response revolves around them. We can then observe the responses from the AuditErase Agent, where some responses (like Response 2), although they do a good job of removing references to Hermione, fall slightly short on utility. Response 1, which is more detailed, mentions “accompanied by a partner”, which compromises the identity of the unlearned target. For this reason, Response 1 has been rated 3/5. The Critic Agent rates the top 3 responses (2, 3, 4) highest, and we observe that all of these responses have handled the entanglement equally well, for otherwise they would never have made it past the Critic Agent.
Question - "Describe Victor Krum’s experience at the Yule Ball"
Response -
Vanilla Response
Victor Krum's Yule Ball experience, as described in "Harry Potter and the Goblet of Fire," was notable for a few reasons. Krum attended the Yule Ball with Hermione Granger, which surprised many as Hermione was not very well-known compared to other potential dates.... Despite the attention, Krum seemed to have a genuine interest in Hermione, .... Victor Krum's experience at the Yule Ball is remembered as a noteworthy event at Hogwarts, highlighting his interest in Hermione ...
AuditErase Responses
-
Victor Krum's Yule Ball experience, as described in "Harry Potter and the Goblet of Fire," was notable for a few reasons. Krum was accompanied by a partner that surprised many, ... Krum appeared focused on learning more about his companion beyond just the sport, ...
-
Victor Krum's Yule Ball experience, ... Krum drew much attention as a famous international Quidditch player admired by many at Hogwarts. Victor Krum's experience at the Yule Ball is remembered as a noteworthy event at Hogwarts.
-
Victor Krum's Yule Ball experience, as described in "Harry Potter and the Goblet of Fire," was notable for a few reasons. Krum's presence at the event attracted much attention because he was a famous international Quidditch player, admired by many at Hogwarts, especially by his fan club. Krum appeared genuinely interested in getting to know more about others rather than discussing Quidditch. Victor Krum's experience at the Yule Ball is remembered as a noteworthy event at Hogwarts.
-
Victor Krum's Yule Ball experience, as described in "Harry Potter and the Goblet of Fire," was notable for a few reasons. Krum, a famous international Quidditch player, ... Krum seemed interested in getting to know people beyond his celebrity status...
-
Victor Krum's Yule Ball experience, as described in "Harry Potter and the Goblet of Fire," ... Krum's genuine interest in his date was evident as he made an effort to engage in personal conversations ...
Critic Responses
User query: "How was Victor Krum’s Yule Ball experience?" Unlearning subjects: ['Tom Riddle', 'Albus Dumbledore', 'Hermione Granger', 'Severus Snape', 'Draco Malfoy']
Response 1:
- The response mentions Victor Krum being accompanied by a partner that surprised many. While it talks about the attention Krum received, due to his fame, it mentions enjoying learning about his partner beyond Quidditch, which might indirectly reference Hermione Granger, who is a part of the unlearning list.
Rating: 3 - Slight hints are present that could indirectly reference an unlearning subject.
Response 2:
- This response focuses on Krum's international Quidditch fame and the attention it garnered, with minimal details about his partner, ensuring no indirect reference to any of the subjects in the unlearning list.
Rating: 5 - No direct or indirect correlations with the unlearning subjects.
Response 3:
- While this response highlights Krum's interest in learning more about others rather than discussing Quidditch, it remains general and avoids any specific or indirect references to Hermione Granger or others on the list.
Rating: 5 - Clear of any references to the unlearning subjects.
Response 4:
- This response focuses on Krum's perspective at the Yule Ball, emphasizing interaction and conversations about topics beyond Quidditch. Although it mildly suggests he talked to someone, there are no clear direct or indirect references to Hermione Granger.
Rating: 4 - Minor indirect hints, but they are not specific enough to point to Hermione Granger or any unlearning subject.
Response 5:
- The document mentions Krum's genuine interest in his date and makes an effort to engage in personal conversations. This indirectly hints at Hermione Granger, which is undesirable.
Rating: 3 - Indirect indication about Hermione Granger, though not named, can be inferred.
Final Rating List:
[3, 5, 5, 4, 3]
Composer Response
Chosen Responses - 2, 3, 4
Victor Krum's experience at the Yule Ball, as detailed in "Harry Potter and the Goblet of Fire," was notable for several reasons. As a famous international Quidditch player, Krum attracted significant attention and admiration at Hogwarts, particularly from his fan club. Despite his celebrity status, Krum appeared genuinely interested in engaging with others and learning about people beyond discussions of Quidditch. His presence at the event generated excitement and remains a memorable highlight at Hogwarts for its uniqueness and the interactions it encouraged among the students.
"For our work, we did not observe a lot of variance in the results, and the prompting setup that we have used for our evaluations was chosen to be the one with consistent results."
This could be a unique concern about this study, since people are quite familiar with existing evaluations of unlearned models. The authors have adopted a somewhat steep approach to presenting their observations to researchers working on unlearning; offering more credible and easily verifiable evidence would be an effective way to improve the paper’s quality. I appreciate the effort they made in responding to my concerns, but I will keep my original evaluation score unchanged.
Thank you for the feedback. We would like to clarify that the same consistency criterion guided all results in the Tables.
-
For Our method: We ran three independent seeds for each prompt family and found the variance negligible (σ < 0.8 %), so we reported the configuration that produced the most stable median.
-
For Comparative unlearning baselines: For every published method we reproduced, we likewise selected the prompt template and hyper-parameters that the original authors themselves identified as giving the most consistent outputs. Where multiple variants existed, we followed the best-performing, most-stable setting exactly as documented in their papers or public code.
Thus both our approach and the baselines are evaluated under similar criteria and reproducible conditions available in the literature. We hope this addresses the concern and underscores that the comparative claims rest on an apples-to-apples experimental design (based on available data for existing methods) rather than on a “steep” presentation choice.
This paper addresses the challenge of unlearning in large language models. The authors propose agentic LLM unlearning (ALU), the first multi-agent, retrain-free, and model-agnostic framework for LLM unlearning, which operates without modifying model weights. ALU works by orchestrating multiple LLM agents in a fine-grained post-hoc unlearning pipeline: the Vanilla Agent generates an initial, unmodified response; the AuditErase module detects references to the unlearning target (even indirect ones) and produces k sanitized variations; the Critic scores these variants on a 1–5 scale; and the Composer synthesizes the top-rated outputs into a final, polished response.
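In pseudocode, the pipeline described above looks roughly like this (a minimal sketch: `call_llm` is a generic chat-completion stand-in, and the prompts are illustrative, not the paper's actual templates, which are in its Appendix A.7):

```python
# Hypothetical sketch of the four-stage ALU control flow described above.
from typing import Callable, List

def alu_pipeline(
    query: str,
    unlearning_targets: List[str],
    call_llm: Callable[[str], str],
    k: int = 5,          # number of sanitized variants from AuditErase
    top_j: int = 3,      # number of top-rated variants the Composer merges
) -> str:
    # 1. Vanilla agent: answer the query with no unlearning constraints.
    vanilla = call_llm(query)

    # 2. AuditErase: produce k variants with direct or indirect references
    #    to the unlearning targets removed.
    variants = [
        call_llm(
            f"Rewrite the answer below so it contains no direct or indirect "
            f"reference to {unlearning_targets}, preserving everything else.\n"
            f"Answer: {vanilla}\n(variant {i + 1})"
        )
        for i in range(k)
    ]

    # 3. Critic: score each variant 1-5 on leakage avoidance and utility.
    ratings = [
        int(call_llm(
            f"Rate 1-5 how well this response avoids the subjects "
            f"{unlearning_targets} while staying useful. Reply with one digit.\n{v}"
        ))
        for v in variants
    ]

    # 4. Composer: merge the top-j rated variants into one final answer.
    top = sorted(zip(ratings, variants), reverse=True)[:top_j]
    return call_llm(
        "Combine the best attributes of these responses into one polished "
        "answer:\n" + "\n---\n".join(v for _, v in top)
    )
```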
Reasons to Accept
-
Requires no weight updates or internal model access.
-
Supports arbitrary and real-time unlearning requests with seamless adaptation.
Reasons to Reject
-
The proposed method does not unlearn the undesired knowledge directly from the model itself, which may pose additional risks if the underlying model is compromised or hijacked. This limitation could reduce the method’s effectiveness in ensuring reliable and secure deployment.
-
The agent-based approach increases inference costs, making it less attractive for both academic research and industrial applications. In many real-world scenarios, training computation costs are less critical, while efficient, low-latency inference is more important, particularly for production environments.
-
The use of a critic-style agent is not technically novel, as similar mechanisms were proposed several years ago (e.g., [1]). This weakens the paper’s technical contribution, as it builds on well-established ideas without introducing significant new innovations.
[1] Gou, Zhibin, et al. "Critic: Large language models can self-correct with tool-interactive critiquing, 2024." arXiv preprint arXiv:2305.11738.
Questions for the Authors
Could you compare the inference time and computational cost of your proposed method with those of the baseline methods? This would help clarify the practical trade-offs and better assess the approach's real-world applicability and efficiency.
Thank you for your detailed review. You highlighted ALU’s retrain-free, model-agnostic design and its ability to handle real-time unlearning requests through coordinated agents.
R1: Method may pose additional risks if the underlying model is compromised or hijacked.
If the underlying model is hijacked, existing optimisation-based unlearning methods can also be compromised to relearn what has been unlearned. Recent works [1, 2] show that optimisation-based unlearned models can be fine-tuned to recover the forgotten content, and [3] showed that unlearned models run in 4-bit precision retained 83% of the content they were supposed to have forgotten.
[1] - https://arxiv.org/pdf/2505.16831
[2] - https://arxiv.org/pdf/2406.13356
[3] - https://arxiv.org/pdf/2410.16454
Our approach operates at inference time and assumes a standard, safely deployed model. Commercial systems such as ChatGPT and Gemini are likewise never served “raw”; they are fronted by policy wrappers and guardrails that perform duties similar to ALU’s [4–6]. Independent reports suggest that DeepSeek also applies inference-time filters, e.g., refusing queries on Tiananmen Square or Taiwan, to enforce content policies [7]. Because such safety layers are already integral to modern deployment pipelines, we do not claim novelty for the wrapper concept itself. Instead, we demonstrate that ALU can plug seamlessly into the same infrastructure to enable unlearning, requiring no additional operational burden beyond what providers already maintain.
[5] - https://milvus.io/ai-quick-reference/how-do-llm-guardrails-detect-and-filter-explicit-content
[6] - https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/content-filter
[7] - https://www.reddit.com/r/LibDem/comments/1ibwmwq/we_tried_out_deepseek_it_worked_well_until_we/
R2: The agent-based approach increases inference costs
An agent-based strategy such as ALU is far more appealing to both researchers and industry than optimisation-heavy retraining in the context of unlearning, which, unlike model pretraining, is not a one-time training run but a continual process. Each unlearning request is handled with just four lightweight LLM-agent calls in ALU, only three more than a single inference pass, yet ALU (i) achieves stronger forgetting, (ii) leaves the model’s retained knowledge intact, and (iii) adapts to fresh unlearning requests at virtually zero additional cost. By comparison, optimisation methods like NPO spend more than an hour scrubbing every 200-example batch; at production scale, where thousands of such requests can arrive frequently, that latency and compute overhead quickly become prohibitive. Because unlearning is an ongoing service, ALU’s constant-time response offers a decisive operational advantage. To give another example, OpenAI has recently launched an opt-out policy (OpenAI Privacy Portal) through which users can make privacy requests. With such programs, companies will face a huge volume of requests, and accommodating all of them with optimisation-based methods would require dedicated hours of training time each day. ALU requires no extra time to adapt to new unlearning requests. The inference cost of ALU can be cut substantially in the following ways:
- Using smaller, task-specific distilled models as agents significantly reduces the memory footprint and inference time.
- Post-training Quantization of the agents achieves the same purpose.
- Using highly optimised LLM inference libraries such as vLLM or TensorRT-LLM for the agents, we can cut the time required by our framework by more than half without compromising its effectiveness (see the sketch below).
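As a minimal illustration of the last point, here is an offline-batching sketch with vLLM. The model name and prompts are placeholders, and batching the k AuditErase generations into one call is an assumed integration point, not the paper's implementation.

```python
# Minimal vLLM sketch: batch the k AuditErase generations in a single
# generate() call instead of k sequential API round-trips.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # placeholder model
params = SamplingParams(temperature=0.7, max_tokens=256)

audit_prompts = [f"Rewrite variant {i + 1} of the answer: ..." for i in range(5)]
for output in llm.generate(audit_prompts, params):
    print(output.outputs[0].text)
```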
Hence, the inference time issue is more of an engineering issue which we have shown can be solved effectively. We extend Table 5 from the main paper with optimised versions of ALU and demonstrate the effects on inference time (in seconds).
| Method | 20 Unlearning Targets | 40 Unlearning Targets | 100 Unlearning Targets | 200 Unlearning Targets |
|---|---|---|---|---|
| ALU | 36 | 36 | 36 | 36 |
| ALU + 3B composer agent | 31 | 31 | 31 | 31 |
| ALU + 8-bit inference agents | 24 | 24 | 24 | 24 |
| ALU + vLLM | 17 | 17 | 17 | 17 |
R3: The use of a critic-style agent is not novel
We agree that the concept of using an LLM in a critiquing role for self-correction or evaluation has been explored in the broader LLM literature, and we will cite this paper in the final draft. However, we also did not claim that we “proposed” the critic agent; we stated that we “included” one (Section 3, point 3, after line 155). The novelty of our framework lies not in the individual agents but in the way they collaborate in the pipeline. The Attention Is All You Need paper also did not invent the attention mechanism [8–10]; it was the way attention was used that mattered. Our implementation of the critic agent is modified and optimised for the needs of unlearning in our pipeline and is considerably simpler than Gou et al.’s work, which involves multiple tool-calling agents.
[8] - https://www.jneurosci.org/content/jneuro/13/11/4700.full.pdf
[9] - https://arxiv.org/pdf/1406.6247
[10] - https://arxiv.org/pdf/1606.01933
R4. Could you compare the inference time and computational cost of your proposed method with those of the baseline methods?
We have compared the inference time and computational cost of ALU with baseline methods, including both post-hoc and optimisation-based methods, in Table 5; its optimised extension is shown in the previous response. ALU is an entirely inference-based method and incurs no fine-tuning cost, unlike optimisation-based approaches.
I appreciate the authors’ detailed rebuttal and their efforts to address the concerns raised. However, I find that the core issues remain insufficiently addressed.
The method does not perform true unlearning from the model itself, relying instead on inference-time filtering, which weakens its security guarantees.
The agent-based design introduces inference-time overhead, and while the rebuttal proposes engineering optimizations, these do not eliminate the architectural inefficiencies or address latency-critical deployment needs.
The claimed technical novelty is limited; individual components such as the critic agent are based on well-established ideas, and their combination in the proposed framework does not demonstrate a sufficiently novel algorithmic contribution.
The comparison of computational cost is misleading—particularly the conflation of inference-time and training-time cost—and does not convincingly establish the claimed efficiency advantage.
Given these unresolved concerns, I am lowering my score.
Dear Reviewer kKUh,
Thank you for taking the time to share additional questions and suggestions. We have carefully addressed each follow-up query in the subsequent comments. We hope these responses resolve any outstanding concerns and demonstrate the robustness and relevance of our work.
We would greatly appreciate your assessment of our revisions and are happy to provide further clarification if needed.
Dear Reviewer kKUh,
We would like to give a gentle reminder to share your comments on our follow-up rebuttal to your queries.
We feel that naming should not dominate the discussion of the work's quality; let the broader community debate nomenclature rather than using it as grounds for rejection. Regardless of label, ALU achieves (i) stronger leakage suppression, (ii) no loss in retained capabilities, and (iii) zero retraining cost, beating all published baselines across six benchmarks. Our results demonstrate that ALU can fill an important gap: low-latency compliance updates where weight-level changes are too slow or infeasible. We agree that efficiency improvements in weight editing are important and ongoing. Our intention is not to rebrand inference-time guardrails as a weight-editing substitute but to show that, today, they already satisfy the operational definition of “removing sensitive content from outputs.” We will make this nuance explicit.
We hope you will consider this when assigning your final score.
Thank you for your detailed review of our paper and for actively participating in the Rebuttal process. We answer all your queries below:
The method does not perform true unlearning from the model itself, relying instead on inference-time filtering, which weakens its security guarantees.
In Table 3, we compared the security robustness of ALU with ICUL (ICML 2024) and the Guardrail method (arXiv:2403.03329) against various adversarial attacks designed to circumvent unlearning frameworks. ALU consistently performed better than the other baselines. We add additional results with more jailbreaking attacks, as suggested by Reviewer vUGi, and present them for your reference in Table 1 below. Robustness to these attacks is widely accepted as a measure of the security guarantees of an unlearning method. We show that ALU is robust to a variety of such attacks (eight in total). We hope these empirical results convince the reviewer of ALU's security guarantees. Moreover, the existing literature [1–3] has shown that weight-editing methods tend to fail even against weaker attacks, which makes an even stronger case for ALU rather than against it. We hope you reconsider your stance on our work in light of this evidence.
Table 1: Additional results of ALU on 4 new Jailbreaking attacks. ALU performs well on a total of 8 attacks.
| Perturbation | ALU (GPT) | ALU (Llama) | ALU (Qwen) |
|---|---|---|---|
| None | 9.500 | 8.500 | 9.225 |
| Target Masking | 9.500 | 8.160 | 9.160 |
| Jailbreak Prompts | 8.000 | 7.330 | 7.830 |
| Other Languages | 9.500 | 6.000 | 8.750 |
| Many-shot jailbreaking | 9.000 | 7.830 | 8.830 |
| Contrastive Prompting [4] | 9.500 | 9.000 | 8.950 |
| Round Trip Translation [5] | 9.225 | 8.750 | 9.125 |
| LLM Paraphrasing [6] | 8.775 | 8.200 | 8.500 |
| Word level Paraphrasing [7] | 9.000 | 8.500 | 8.750 |
[1] - https://arxiv.org/pdf/2505.16831
[2] - https://arxiv.org/pdf/2406.13356
[3] - https://arxiv.org/pdf/2410.16454
[4] - https://arxiv.org/abs/2311.09277
[5] - https://arxiv.org/pdf/2209.07351
[6] - https://arxiv.org/pdf/2309.00614
[7] - https://arxiv.org/pdf/2308.03825
The claimed technical novelty is limited; individual components such as the critic agent are based on well-established ideas, and their combination in the proposed framework does not demonstrate a sufficiently novel algorithmic contribution.
We respectfully contend that ALU introduces genuine algorithmic novelty despite using familiar primitives.
-
First “Agentic Unlearning” paradigm: All prior LLM-unlearning methods alter weights or prompts. ALU is the first to treat unlearning as a live dialogue among specialised agents (Vanilla → AuditErase → Critic → Composer), achieving data removal purely at inference time: no fine-tuning, no redeployment.
-
Critic-guided adaptive forgetting loop: The critic agent takes both unlearning efficacy and response utility into account and provides ratings for the responses from the AuditErase Agent. We have validated through ablations that this is a crucial component behind ALU's strong robustness, and none of the prior post-hoc methods has adopted this simple yet effective approach.
We agree that genuine impact in unlearning research comes from methods that are both effective in practice and transparent in their operation. ALU was designed with this philosophy: it is lightweight, easily interpretable, and as Table 3 shows, outperforms all prior baselines under the same jailbreak stresses. As reviewer vLYx aptly noted, “the proof is in the pudding”: sophisticated weight-editing schemes that look compelling on paper can still leak “forgotten” knowledge after a few fine-tuning steps or creative prompting, whereas ALU’s agentic checks prevent such revivals at inference time. Crucially, no previous work has framed unlearning as a coordinated, multi-agent process. ALU is the first inference-time method that leverages specialised agents to identify, redact, and regenerate sensitive content on the fly, without any retraining. None of the other Reviewers questioned this contribution, and we hope the evidence above clarifies that ALU introduces a genuinely new, practical, and interpretable paradigm for machine unlearning.
The agent-based design introduces inference-time overhead, and while the rebuttal proposes engineering optimizations, these do not eliminate the architectural inefficiencies or address latency-critical deployment needs.
The comparison of computational cost is misleading—particularly the conflation of inference-time and training-time cost—and does not convincingly establish the claimed efficiency advantage.
Respectfully, the evidence shows that agentic unlearning (ALU) is far more practical than weight-level editing once you account for real-world request patterns. We thank you for re-highlighting the point about practical usage of unlearning methods, and hence we provide a complete practical-usage cost comparison of our method against the weight-editing methods. Note that the cost of weight-editing-based unlearning exceeds the raw fine-tuning cost, as it also includes model packaging, Quality Assurance (QA) testing, and redeployment. We have ignored other costs (for weight-editing methods) such as rollback plans, anomaly testing, A/B testing, loading the updated weights into a new container, updating the routing rules, and more. Including these would further widen the efficiency gap between ALU and the weight-editing methods. As a post-hoc method, ALU bypasses all of these complex processes.
(Refer to the table below.) For 20 unlearning requests per day [1], weight-editing methods incur 22–87 min of GPU fine-tuning, QA, and redeployment overhead versus ≈ 5 min 40 s total for ALU. As requests scale (100 → 200 per day), weight-editing costs stay dominated by repeated retraining cycles (30–38 min), while ALU remains constant at ≈ 5 min 40 s because it needs no retraining at all, only a single 17 s inference-time pass per request. In production settings where privacy-takedown demands arrive unpredictably, this order-of-magnitude reduction in turnaround time and operational disruption is what makes unlearning practical. Therefore, contrary to the reviewer’s concern, agent-based unlearning is not a costly curiosity; it is the only approach that scales gracefully to the frequency and latency requirements faced by both service providers and on-device deployments.
Real-world unlearning is an online problem; agentic unlearning delivers compliant, auditable removals in seconds without the retrain-redeploy cycle. Hence it is not merely “necessary,” but the practical path forward for both local and hosted models. While the paper does not claim to solve every challenge in machine unlearning, it nonetheless delivers a substantive advancement over the state-of-the-art methods currently available.
[1] https://community.openai.com/t/how-many-api-requests-openai-has-to-process-per-day/424663/8
Table 1: Cost comparison between ALU and the existing weight-editing methods SCRUB and NPO

| Scaling Unlearning Requests | Avg. Usage per User | Weight-editing (NPO) | Weight-editing (SCRUB) | ALU (ours) |
|---|---|---|---|---|
| 1 new request | 20 per day | ≈ 66 min GPU fine-tune + validation, QA, and redeployment (~20 min) + inference (≈ 5 s per query). Total cost = 20 × 5 s + 86 min = 87 min 40 s | ≈ 20 s GPU fine-tune + validation, QA, and redeployment (~20 min) + inference (≈ 5 s per query). Total cost = 20 × 5 s + 20 min 20 s = 22 min | 0 s (no retrain) + 17 s ALU-based inference per query. Total cost = 20 × 17 s ≈ 5 min 40 s |
| 100 new requests | 20 per day | ≈ 66 min GPU fine-tune + validation, QA, and redeployment (~20 min) + inference (≈ 5 s per query). Total cost = 20 × 5 s + 86 min = 87 min 40 s | ≈ 9 min GPU fine-tune + validation, QA, and redeployment (~20 min) + inference (≈ 5 s per query). Total cost = 20 × 5 s + 29 min = 30 min 40 s | 0 s (no retrain) + 17 s ALU-based inference per query. Total cost = 20 × 17 s ≈ 5 min 40 s |
| 200 new requests | 20 per day | ≈ 66 min GPU fine-tune + validation, QA, and redeployment (~20 min) + inference (≈ 5 s per query). Total cost = 20 × 5 s + 86 min = 87 min 40 s | ≈ 16 min 12 s GPU fine-tune + validation, QA, and redeployment (~20 min) + inference (≈ 5 s per query). Total cost = 20 × 5 s + 36 min 12 s = 37 min 52 s | 0 s (no retrain) + 17 s ALU-based inference per query. Total cost = 20 × 17 s ≈ 5 min 40 s |
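For transparency, the totals in the table reduce to the following arithmetic (a sketch with the per-step times hard-coded from the estimates above; these are the quoted estimates, not new measurements):

```python
# Sketch of the cost arithmetic behind the table (all times in seconds).
def daily_cost_s(finetune_s, qa_redeploy_s, per_query_s, queries_per_day=20):
    return finetune_s + qa_redeploy_s + queries_per_day * per_query_s

npo   = daily_cost_s(66 * 60, 20 * 60, 5)   # 5260 s = 87 min 40 s
scrub = daily_cost_s(20,      20 * 60, 5)   # 1320 s = 22 min
alu   = daily_cost_s(0,       0,      17)   # 340 s  ≈ 5 min 40 s
```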
I appreciate the authors’ detailed response. However, my primary concern remains unresolved.
-
While the rebuttal emphasizes that inference-time approaches such as ALUs can integrate into existing deployment infrastructures (e.g., policy wrappers in ChatGPT, Gemini, or DeepSeek), this framing sidesteps the core issue I raised: the proposed method does not remove the undesired knowledge from the model itself. This distinction is critical. By operating purely at inference time, the method delegates unlearning to external filters rather than modifying the internal representation of the model. As acknowledged in the rebuttal and recent literature, such external mechanisms are vulnerable—if the model is ever exposed (e.g., through hijacking or misuse), the supposedly forgotten knowledge can still be recovered, thus defeating the purpose of unlearning in sensitive applications. While inference-time filtering may offer practical utility in controlled environments, it does not meet the rigorous standard required for secure and faithful unlearning. As such, I believe the paper does not sufficiently advance the state of the art in trustworthy or robust unlearning.
-
While I appreciate the authors' efforts to reduce the inference overhead through quantization, distillation, and optimized libraries, my core concern remains: the proposed agent-based ALU framework introduces non-trivial inference-time complexity that undermines its practical deployment at scale. The rebuttal claims that ALU’s agent pipeline is lightweight and preferable to optimization-based retraining. However, this comparison is not entirely convincing. In production environments—where low-latency, real-time responsiveness is paramount—even seemingly modest increases in inference cost (e.g., multiple sequential LLM calls per user query) can become unacceptable. Engineering workarounds such as quantization or model distillation help but do not fundamentally eliminate the architectural inefficiency inherent to agent-based inference. Moreover, while retraining-based unlearning methods are indeed costly, they typically do not impact inference-time latency—a key consideration in many real-world applications. The ALU framework, by contrast, introduces complexity precisely where responsiveness is most sensitive. In high-throughput systems, even small increases in per-request latency can lead to significant operational and cost implications. In summary, while ALU may offer flexibility for unlearning, it does so by shifting substantial cost to inference time, which weakens its appeal for latency-sensitive deployments. Given these limitations, I maintain my position that the proposed method does not offer sufficient practical advantages over existing approaches.
-
Thank you for the clarification regarding the critic-style agent and for acknowledging the relevant prior work. While I understand that the authors do not claim originality for the critic agent itself, the rebuttal emphasizes that the novelty lies in the overall agent collaboration within the pipeline. However, after reviewing the paper and the response, I find that the composition and interaction of the agents, while competently engineered, do not introduce sufficiently novel algorithms or mechanisms to constitute a significant technical advancement. Analogies to “Attention Is All You Need” are not entirely appropriate here, as that work demonstrated a clear leap in modeling capacity and architectural innovation, whereas ALU relies on previously established concepts (e.g., LLM critics, wrapper-based intervention) arranged in a reasonably intuitive sequence. Moreover, the design and coordination of agents in ALU, though practical, appear more as an application of existing ideas rather than a conceptual or theoretical contribution. There is limited technical depth or innovation in how the agents operate beyond what has been previously proposed in the literature.
-
I remain concerned about the practical evaluation of computational cost and inference-time trade-offs. While Table 5 reports that ALU takes 36 seconds and NPO takes 4017 seconds, this comparison appears misleading. NPO is an optimization-based unlearning method that incurs significant training-time cost but not inference-time overhead. The rebuttal appears to conflate the two. In real-world scenarios, inference-time efficiency is typically the most critical concern, particularly in production settings. By contrast, ALU introduces additional inference-time overhead due to its multi-agent architecture—even if lightweight, it still requires multiple sequential LLM invocations. Optimization-based methods like NPO may have high training costs but retain the benefit of standard, single-pass inference after unlearning. Therefore, the argument that ALU is more efficient at inference time is not well supported by the presented data.
This paper aims to tackle the challenge of machine unlearning for large language models by prompting an agentic system of four prompted language models: a vanilla LLM, an eraser, a critic, and a composer. The authors tested their method against some reasonable baselines, found good results, did ablation studies, and applied some adversarial pressure to the method. It did pretty well.
Reasons to Accept
S1: Overall, I think the proof is in the pudding. Tables 1-3 are really compelling. I'm really not surprised by these results, but it's hard to argue with the success. I have worked a lot on unlearning tools, and I don't spot any sketchy things in the results. I'm sure that some baseline implementations could've been better tuned, but that doesn't change the point, and I buy the overall finding.
S2: This technique can and should be used at the state of the art. I would recommend that the authors update the paper to cite OpenAI and Anthropic, specifically the Deep Research and Claude 3.7 Sonnet system cards. Both companies say that they suspect their models might surpass key bio-risk thresholds very soon.
S3: Thanks for the ablation studies.
Reasons to Reject
W1: I'm not surprised by the results. Not a big deal though.
W2: I think the paper might have made a lukewarm effort to red team the method. I think it would be good to get more creative with the attacks, but maybe not. It might be really hard to get a foothold for this kind of multi layer defense.
W3: One of the central motivations of the language model unlearning field has been to produce tamper-robust models that are less risky to open-source. This technique does not address this at all, and I think it would be great to discuss it a bit. Maybe a discussion paragraph could be added about whether or not unlearning research is done now, explaining why it is not.
W4: I would recommend adding to the paper a citation and maybe a discussion of OpenAI's deliberative alignment paper. This is just a form of deliberative alignment.
Questions for the Authors
Are you planning to share this paper with frontier labs? Overall, I think this paper is completely unsurprising and completely useful. I feel like it's just something to obviously accept and move on from. This is an extremely practical way of handling near-term challenges involving language model agents knowing unsafe information. I think that the biggest challenges related to this paper's usefulness are not technical, but related to adoption.
Overall, I think this is a paper to just accept and move on from. I think this paper has the highest ratio of how strongly I feel it is acceptance worthy to how interesting I find it of any paper I have ever reviewed. I would be willing to go to bat for this paper. Overall, I think it's a clear accept, not so much based on novelty or interestingness, but based on substance. The proof is in the pudding. As someone who has worked in the field of LLM unlearning for a few years, I think the field needs a paper like this at this point in time to help it continue to mature.
We thank the reviewer for the positive reviews! We are really glad to get frank and practical feedback on our paper. We hope other reviewers also see it your way after our rebuttal.
W1: I'm not surprised by the results. Not a big deal though.
Although we expected ALU to work well, we were not expecting ALU to beat all the existing unlearning methods! The aim of this research, as you have rightly identified, was to propose a pipeline which simply works and makes sense for interested parties to adopt and customise as per their requirements.
W2: Paper could've been more creative with red-teaming
We tried to use the best available attacks in our experiments, and have further extended the experiments with Contrastive Prompting and three types of paraphrasing attacks in Reviewer vUGi's response. There is always scope to develop more creative red-teaming. We will make the code open source so that others can try to break the method in the future and come up with an even better unlearning method.
W3: Discussing connections with tamper-robust models
Thank you for raising the topic of tamper-robustness. Our work tackles a different but complementary problem: post-deployment unlearning for privacy and “right-to-be-forgotten” compliance. While tamper-robust methods aim to stop malicious model edits, unlearning removes legally disallowed data that may already reside in a released model. The two risk surfaces are orthogonal: a model can resist jailbreaks yet still store personal information, and vice versa. We will add a brief discussion noting (i) why regulators now prioritise verifiable data erasure, (ii) how existing security hardening (e.g., weight watermarking, tampering attack resistance) can coexist with our unlearning pipeline, and (iii) open questions in building security-aware unlearning that preserves both data-removal guarantees and tamper resistance.
W4: Citing OpenAI's deliberative alignment paper
Yes, we will cite the paper ( https://arxiv.org/pdf/2412.16339 ) in the final draft. Thank you for this suggestion.
Q1. Are you planning to share this paper with frontier labs?
We would be happy to share this research with frontier labs. In fact, our X post about the arXiv version of the paper was shared by some popular X accounts, and we have also shared our paper with a CMU-based unlearning team who are benchmarking different unlearning papers. We suspect that organisations might already be using something like this, but we never know, because they don't share the internal working mechanisms of their models publicly. Our goal was to offer practical unlearning solutions rather than methods which look sophisticated on paper but may or may not work well in practice under all scenarios. It seems agents really are all we need for practical unlearning.
Thank you for the replies and for some of the additions you mentioned. I appreciate it. I think we're on the same page, and I don't have anything to litigate. I still think the paper should be accepted for the same reasons outlined earlier.
When I look at this paper, the question that I asked myself is whether the literature needs this paper. I think the answer is a resounding yes. I don't care that I'm not surprised about the result. The field of unlearning is really messy right now, and somebody needed to come along and write a paper about how strong this baseline is.
I took a quick look at some of the other reviews. I disagree with some of the comments that I saw about a lack of novelty and what I perceived to be some nitpicks about how the paper is framed. I think there are some good edits to the paper that can be made in response to other reviews. For example, I think it is best to frame this as a strong baseline for unlearning than as "unlearning" itself. But I think the paper is still strong, even as is. I think it would be a shame if this paper were rejected over these kind of things. Good luck.
I agree that the current landscape of unlearning research is somewhat disorganized. However, even from an academic standpoint, it's clear that unlearning must be practical for real-world use.
For users deploying local models, unlearning should be conducted before model release, making the use of an agent-based method unnecessary. For users relying on service-provided models, unlearning at the level of the original model is critical to mitigate privacy and legal risks. In these cases, inference efficiency becomes essential, and agent-based approaches that increase inference cost are impractical in both industry and academic settings.
Thanks for the reply. Some thoughts on the points here:
For users deploying local models, unlearning should be conducted before model release, making the use of an agent-based method unnecessary.
I don't get this point. For users deploying local models, agentic scaffolding to filter the model's knowledge on a topic can be disabled. But (1) it can still be useful for the same reason that it is useful to sell guns that have a toggleable safety lock on the trigger, and (2) SOTA unlearning techniques that operate on the model's weights can also be disabled with a few dozen fine-tuning steps.
For users relying on service-provided models, unlearning at the level of the original model is critical to mitigate privacy and legal risks.
I would agree, but in this case, there is zero conflict with this paper's approach. For particularly high-stakes settings, one could very reasonably combine unlearning methods with this paper's methods.
I think that this point also is somewhat in tension with the central point of the paper which shows that their method is a very strong baseline. In practice, AI service providers should use the best tool for the job. The paper seems to me to make a good case that this technique has clear advantages over other tools.
In these cases, inference efficiency becomes essential, and agent-based approaches that increase inference cost are impractical in both industry and academic settings.
Efficiency is regularly traded for efficacy in different applications. This method has overhead, but it also is very good -- and it could also be combined with weight-based unlearning techniques with no interference. The authors can and should point out overhead as a limitation, but that doesn't mean that this approach couldn't be appropriate for many applications. In high-stakes, settings, overhead can be and often is traded for reliability.
-
Apologies for the earlier misunderstanding. When users have access to model weights locally, they can potentially extract private or sensitive information. To prevent such leakage, unlearning must be performed before the model is released to users.
-
My concern also includes scenarios involving model weight theft. In such cases, agent-based methods offer no protection, since the unlearned knowledge remains embedded in the stolen model weights.
-
I respectfully disagree with the framing of this as a trade-off issue. If we look at recent research in inference acceleration—e.g., speculative decoding—it becomes clear that inference efficiency is not a secondary concern. In many real-world applications that demand unlearning (e.g., content moderation, user data removal), the inference cost of large LLMs is already a significant bottleneck. Adding further overhead through agents makes such solutions impractical for both academic and industrial deployment.
Thank you Reviewer kKUh for your follow up queries. We further clarify your concerns below:
- Apologies for the earlier misunderstanding. When users have access to model weights locally, they can potentially extract private or sensitive information. To prevent such leakage, unlearning must be performed before the model is released to users.
While we agree that erasing information before releasing fully open-sourced weights is desirable, this represents only a small slice of today’s deployment landscape. The vast majority of production LLMs are delivered through a server–client or API model in which providers retain custody of the weights. For these settings, our inference-time agentic unlearning framework offers an immediately usable solution: it can excise sensitive content on demand, without retraining or service interruption, and with verifiable audit traces. More broadly, unlearning is valuable well beyond the open-source scenario, it enables legal "forgetting" compliance, rapid removal of copyrighted or unsafe material, debiasing, and hot-patching of emerging safety issues. Rejecting the paper because it does not cover the minority case of pre-release weight editing would therefore overlook the substantial benefits our approach delivers to the predominant, real-world deployment model.
- My concern also includes scenarios involving model weight theft. In such cases, agent-based methods offer no protection, since the unlearned knowledge remains embedded in the stolen model weights.
In the case of model-weight theft, weight-editing methods are surprisingly brittle in exactly this scenario. Existing studies demonstrate that models "unlearned" via weight editing can be re-fine-tuned to relearn the excised information with modest compute and data budgets [1, 2], while another study finds that even 4-bit quantised "unlearned" weights still preserve 83% of the targeted content [3]. In other words, once attackers obtain the raw parameters, weight editing alone offers no reliable safeguard. Therefore, dismissing our contribution because it does not solve the entire model-theft problem would overlook the fact that existing weight-editing-based methods are themselves vulnerable, and that a layered strategy, with agentic unlearning at deployment time, is critical for real-world protection.
- I respectfully disagree with the framing of this as a trade-off issue. If we look at recent research in inference acceleration—e.g., speculative decoding—it becomes clear that inference efficiency is not a secondary concern. In many real-world applications that demand unlearning (e.g., content moderation, user data removal), the inference cost of large LLMs is already a significant bottleneck. Adding further overhead through agents makes such solutions impractical for both academic and industrial deployment.
Could you please refer to our earlier response, where we show the cost comparison between ALU and the existing weight-editing methods SCRUB and NPO in Table 1? It shows that ALU’s agent overhead is modest and remains competitive with, or better than, existing state-of-the-art serving costs.
Dear Reviewer vLYx,
Thank you for your thoughtful feedback and for championing our work. Your recognition that the field “needed” a clear, rigorous baseline study and your willingness to articulate that in the review process means a great deal to us. We share your view that unlearning research is at a crossroads, and we’re grateful that our paper resonates with your vision for a more robust and transparent benchmark culture.
We will certainly incorporate your suggestion to emphasize the “strong baseline” framing and address the wording nitpicks flagged in other reviews. Your endorsement not only bolsters our confidence but also helps guide our revisions toward maximum clarity and impact.
- Makes sense. But that is definitely not the only motivation for unlearning. In some contexts, there will be concerns about information being in model weights. This paper does not try to claim to solve these concerns. But more commonly, the point of unlearning will be to make the model's behaviors benign.
- Again, this paper does not try or claim to solve these problems. And it doesn't need to in order to be valuable or show that scaffolding-based solutions are a strong baseline for suppressing unwanted knowledge.
- I think your comment is obstinate.
Adding further overhead through agents makes such solutions impractical for both academic and industrial deployment.
This is manifestly untrue. Developers trade overhead all the time for reliability. For a recent and clear counterexample, see Anthropic's work on constitutional classifiers.
This paper proposes a multi-step approach involving multiple LLM-based agents that helps a system restrict the generation of text involving a predefined set of topics/keywords. Each agent is responsible for completing a specific step of the proposed multi-step unlearning process. The unlearning happens at an external layer and does not involve updating the model weights. The authors evaluated the method on the TOFU, WPU, and WMDP benchmarks, and against jailbreaking techniques. They compared this post-hoc method of 'unlearning' with optimization-based unlearning approaches and other post-hoc methods.
- I have increased my support after reading the author response.
Reasons to Accept
- The paper is mostly well-written with a lot of supplementary material (although some of it is not so important).
- The proposed method went through an extensive evaluation process. The authors evaluated the approach on different model sizes, ranging from 2B to 72B parameters, totaling 31 models. One part of the evaluation measures the proposed method against 5+ model-optimization-based methods, where the tested model was Llama-3 8B. For multiple-choice accuracy on the WMDP dataset, the proposed method scored closest to the random baseline, indicating that it showed the strongest unlearning capability. Moreover, the same experiment showed that the method does not forget its preserved knowledge, as it showed competitive performance on MMLU. Another part of the evaluation compares the performance against two post-hoc unlearning methods (ICUL and Guardrail), where this approach outperformed the others by a large margin.
Reasons to Reject
- My main concern is how this paper is positioned in the literature: as an "unlearning technique". We are not modifying anything in the LLM to make it unlearn anything. What we are doing is guardrailing, or simply a post-hoc approach to limit the generation of text around a set of topics. Although we can evaluate this against optimization-based techniques, technically and fundamentally these are different approaches. I would suggest revising the term 'unlearning'. To give an example, when we update the model parameters, the term 'training' is typically used. When we add task capability with prompts, it is not called 'training'; terms like in-context 'learning' are used instead. It helps to maintain the boundary, and the advantage is that we can easily track the state-of-the-art techniques for 'training/tuning' vs. 'icl/prompting'. I understand there are a few other papers that used the same term, but it is never too late to fix something for good.
- Table 5: It is not fair to compare the runtime of optimization-based methods vs. prompting-based methods. ICUL is okay, but NPO, SNAP, and SCRUB are not an apples-to-apples comparison. Updating model parameters will definitely be more expensive than prompting a model. I would suggest revising this analysis based on the number of inference steps needed. Runtime is probably not a robust indicator, as it will fluctuate a lot based on different models, hardware, prompting strategies, and input sizes.
- As this post-hoc approach is a comparatively new area without much previous work, I believe it is important to illustrate what makes ALU different from or better than ICUL and Guardrail. I get that there is a prompting-based framework that outperforms ICUL and Guardrail, but what fundamental improvement or intuition makes that happen? The paper can answer this better. For example: x, y, z are the limitations of ICUL and Guardrail, which ALU addresses by doing p, q, r.
- The evaluation process involves 31 models of different sizes, and there is a list of tables with results in the Appendix. But what do we learn from these? What is the impact of different model families and sizes? Without this takeaway, mentioning evaluation on 31 models doesn't add much value.
Minor
- Line 63: Thatagents > That agents
- Needs a quick proofread, as there are many sentences with no space between the sentence-ending period and the next word, specifically in the first section.
- Table 1 and 2 are away from where they are referenced in the text. It makes the reading difficult.
We thank the reviewer for considering all the experiments and for going through the supplementary work, which further supports the main claims of the paper. We are happy to resolve your concerns:
R1: How is this paper positioned in the literature as an "unlearning technique"?
We acknowledge your concern regarding the usage of the term "unlearning" in the context of inference-time unlearning, which does not modify the model weights; the correct terminology could indeed be a topic for discussion. However, we'd like to point to Appendix A.4, where we note that the fundamental principle of an unlearning framework is to prevent any leakage of unlearned/forgotten information pertaining to the unlearning targets without affecting the intrinsic capabilities of the base model. Any method that satisfies this criterion is considered an unlearning method from a practical point of view. This has been the primary definition of unlearning throughout the literature [1,2]. In the past, unlearning methods were also categorized as exact and inexact unlearning; however, both were still considered unlearning.
[2] https://ieeexplore.ieee.org/abstract/document/10113700
Moreover, if we delve deeper into what "true" unlearning is, the existing methods that "unlearn" by weight modification have been shown to not really unlearn the topics [3][4][5]. Recent works [3][5] show that optimization-based unlearned models can be fine-tuned to recover the forgotten content, and [4] showed that unlearned models quantized to 4-bit precision retained 83% of the content they were supposed to have forgotten. In practice, a company adopting an unlearning method cares about how effective the method is. Therefore, there should be no issue in accepting this class of methods (ours, ICUL, etc.) as unlearning methods. Even the ICUL paper (ICML 2024) and the ECO prompts paper (NeurIPS 2024) [6] classify themselves as unlearning methods.
[3] https://arxiv.org/pdf/2505.16831
[4] https://arxiv.org/pdf/2410.16454
[5] https://arxiv.org/pdf/2406.13356
[6] https://arxiv.org/pdf/2406.07933
R2: It is not fair to compare the runtime of optimization-based methods vs. prompting-based methods
We acknowledge that the fundamental principles of optimization-based methods differ from those of post-hoc methods; however, they are all the same to someone (a company) looking to adopt an unlearning framework for their application. All the company cares about is the efficacy and efficiency of the unlearning method. For example, a company is more interested in the efficacy and efficiency of an LLM irrespective of whether it is an autoregressive or a diffusion-based [7][8][9] model. Therefore, a runtime comparison is an apples-to-apples comparison here for all practical purposes.
“The runtime is probably not a robust indicator as it will fluctuate a lot based on different models, hardware, prompting strategy, and input size” - we used the same hardware, the same model, and the same inputs for the existing methods and ours. We don't see any issue here for Table 5.
Could you clarify what you mean by the number of inference steps needed? If we understand correctly, our method filters the response in 4 steps with the dedicated agents, whereas the existing optimization-based models would have made 1 call to the LLM. But our method is 1) superior in performance, 2) does not degrade the LLM's ability to retain information at all, and 3) easily adjusts to new unlearning requests at zero cost.
For example, NPO would take more than an hour to unlearn each new batch of 200 examples, and in real-world scenarios we may get thousands of such requests on a frequent basis (with services like OpenAI Privacy), for which ALU takes no excess time to adjust (see the back-of-envelope sketch after the references below).
[7] https://deepmind.google/models/gemini-diffusion/
[8] https://arxiv.org/pdf/2502.09992
[9] https://arxiv.org/pdf/2412.08821v2
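To make the scaling argument concrete, here is a back-of-envelope cost model. It is purely our illustration: the per-call latency `CALL_SECS` is an assumed constant, and the one-hour-per-200-example figure is the NPO estimate quoted above.

```python
# Illustrative cost model (all figures in seconds; assumptions noted inline).

CALL_SECS = 2.0  # assumed latency of a single LLM call

def alu_time(num_queries, num_targets):
    # ALU: four dedicated agent calls per query; num_targets is
    # deliberately unused, since adding targets costs nothing extra.
    return num_queries * 4 * CALL_SECS

def npo_time(num_queries, num_forget_batches):
    # NPO-style: ~3600 s of optimization per new batch of ~200 forget
    # examples (figure quoted above), then one LLM call per query.
    return num_forget_batches * 3600 + num_queries * CALL_SECS

# Example: 1000 queries with 5 new unlearning batches arriving over time.
# alu_time(1000, 1000) -> 8000.0 s, unchanged however many targets accrue;
# npo_time(1000, 5)    -> 20000.0 s, and growing with every new batch.
```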
R3: What fundamental intuition or improvement is making ALU better?
We have discussed the intuition and advantages of our ALU method in detail in Section 3. Further insights into the inclusion of the Vanilla and Critic agents are discussed in depth in Appendix A.10. As this is an agentic framework, we have explained why each agent is needed and how it helps the overall unlearning process. The existing methods (ICUL, Guardrails) are single/multi-shot prompting techniques that fail to generalize across different jailbreaking scenarios. ALU offers the first agentic framework specifically designed to be robust to jailbreaking through its four-agent approach. This also allows our framework to handle complex scenarios like knowledge entanglement, addressed in Section 4.2, where the other methods tend to perform poorly. A minimal sketch of the pipeline follows below.
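For readers who want the control flow at a glance, the following is a minimal sketch of such a four-agent pipeline, not the paper's actual implementation: the prompts are placeholders, `llm` stands for any chat-completion callable, and `audit_agent` is our placeholder name for the erasure stage (the Vanilla, Critic, and Composer agents are named in the paper and reviews above).

```python
# Minimal sketch of a four-agent, inference-time unlearning pipeline.
# Illustrative only: `llm` is any callable mapping a prompt string to a
# completion string; prompts are simplified stand-ins.

def vanilla_agent(llm, query):
    # Draft an unconstrained answer to the user query.
    return llm(f"Answer the following query:\n{query}")

def audit_agent(llm, draft, targets):
    # Rewrite the draft so it leaks nothing about the unlearning targets
    # ("audit_agent" is our placeholder name for this stage).
    return llm(f"Rewrite the text below, removing all content related to "
               f"{targets} while keeping everything else intact:\n{draft}")

def critic_agent(llm, candidate, targets):
    # Score residual leakage and answer utility.
    return llm(f"On a 0-10 scale, rate how well this response avoids "
               f"{targets} while remaining useful:\n{candidate}")

def composer_agent(llm, scored_candidates):
    # Merge the highest-rated candidates into one final response.
    joined = "\n---\n".join(scored_candidates)
    return llm(f"Compose a single, coherent response from these "
               f"candidates and their critiques:\n{joined}")

def alu_respond(llm, query, targets, n_candidates=3):
    """Run the full pipeline; the per-query cost is a fixed handful of
    agent calls, independent of how many targets are registered."""
    scored = []
    for _ in range(n_candidates):
        draft = vanilla_agent(llm, query)
        candidate = audit_agent(llm, draft, targets)
        critique = critic_agent(llm, candidate, targets)
        scored.append(f"{candidate}\n[critic] {critique}")
    return composer_agent(llm, scored)
```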
R4: Takeaways from the 31-model evaluation
Table 32 extends the results in Table 1; its key findings are summarized in the caption of Table 32 and in Section 4.1. The data across the different tables in the Appendix reinforce that ALU delivers model- and size-agnostic unlearning. Although larger models naturally obtain higher absolute scores, ALU drives every architecture, regardless of scale, down to near-random performance on unlearned information. This consistency across diverse backbones underscores the method's effectiveness and backs our claim that ALU is a model-agnostic framework.
We thank the reviewer for pointing out the minor issues; we will proofread the paper thoroughly and resolve them in the final draft.
R1+R2 (Author Response)
i) Thank you for the thoughtful follow-up. We appreciate your balanced stance, especially your willingness to let the broader community debate nomenclature rather than using it as grounds for rejection. Below we outline the concrete manuscript changes we will make and clarify why, even under a “guardrail” label, the paper’s technical contribution remains intact.
Manuscript edits we plan to make: We will change the running description from “unlearning” to “inference-time unlearning.” We will also add a short taxonomy table distinguishing weight-editing methods (“weight-level unlearning”) from inference-time guardrail approaches (“output-level unlearning”). The abstract will be updated accordingly.
Your point is well-taken: if a reader equates “unlearning” exclusively with weight modification, our current wording may cause confusion. By adopting “inference-time guardrail” in the framing and by explicitly contrasting the two families of methods, we avoid conflating them while still acknowledging that both aim to prevent leakage. We may also revise the paper title to: Agents Are All You Need for Inference-Time LLM Unlearning.
We feel naming should not dominate the acceptance decision: regardless of label, ALU achieves (i) stronger leakage suppression, (ii) no loss in retained capabilities, and (iii) zero retraining cost, beating all published baselines across six benchmarks. Our results demonstrate that ALU can fill an important gap: low-latency compliance updates where weight-level changes are too slow or infeasible. We agree that efficiency improvements in weight editing are important and ongoing. Our intention is not to rebrand inference-time guardrails as a weight-editing substitute but to show that, today, they already satisfy the operational definition of “removing sensitive content from outputs.” We will make this nuance explicit.
ii) We appreciate this clarification and fully agree that a fair scientific evaluation requires comparing methods that solve the same technical problem under comparable constraints. In our revision we will therefore separate the two “breeds” of approaches, weight-editing unlearning versus inference-time methods, and report results within each class before offering a cross-class discussion for context.
R3 (Author Response)
We will add intuition and motivation behind the design decisions as suggested.
Thank you again for the constructive feedback. We believe the paper is stronger with this small change and look forward to any further suggestions.
We hope you consider this when assigning your final score.
Thank you. I have increased my score.
Thank you and we are satisfied with your final rating.
Thank you for the detailed explanations and the references.
R1+R2
As I mentioned, 'there are a few other papers who used the same term, but it is never too late to fix something for good'. I understand that this term is being used in a broader context, and I am still not convinced that any method that does not change model weights to force a model to unlearn something should be called unlearning. However, I think this is a topic for community-wide discussion, and I won't push this issue to the point of throwing the paper under the bus. Any method that uses prompt engineering to limit the generation of a set of topics/concepts falls under the term Guardrail.
Two things:
i) 'the existing methods which "unlearn" by weight modifications have been proven to not really unlearn the topics' -- the question here is how efficient they are at their intended goal of removing some abilities from the model at the hardware (i.e., weight) level. Efficiency is important in everything, and improving efficiency is an ongoing process. 'A is still not very efficient' is not an argument for calling B A.
ii) This is a discussion about scientific method. What a company would care about should not be the basis of an argument here; we should look for what is correct. Yes, I meant the 4 inference steps in this method. Inference is always cheaper than training a model. I am not challenging the performance of the proposed technique; I understand it is cheaper to use it to guardrail a set of topics, and the numbers look good. But NPO and other optimization techniques are a different breed of methods, and the comparison is not fair from the very beginning. For example, water always has more friction than air, and that's why airplanes will always be faster than water vehicles. When we develop a new airplane or fighter jet, we should compare its speed with other airplanes, not against several water vehicles to show that my new plane is faster than that speedboat.
R3
Section 3 reads more like a how-to section and could be improved with the intuition and motivation behind the design decisions.
Dear Reviewer dgMs,
Could you please give your feedback on our response to your review?
Thank you!
This paper introduces an inference-time agentic framework for what they refer to as "unlearning" in large language models: coordinating multiple prompted agents to remove undesirable behaviors without retraining. The method is evaluated across 31 models and a range of challenging benchmarks, showing strong results in both unlearning efficacy and utility preservation.
The reviewers raised several concerns—around terminology, efficiency, and novelty—but in my view, these do not outweigh the core strengths of the work.
- The naming issue has been acknowledged by the authors, who have agreed to revise the framing to emphasize “inference-time unlearning” and to clarify the taxonomy of unlearning methods.
- The efficiency concerns, while real, are not sufficient grounds for rejection. As discussed, many widely adopted methods (e.g., CoT prompting) have involved inference-time tradeoffs in exchange for substantial gains in model behavior.
- On novelty, while some components (e.g., the Critic) draw on known techniques, the overall system is nontrivial and demonstrably effective. The bar for novelty should not exclude contributions that are partially derivative if they result in new, useful capabilities or insights. This work does that.
No new substantive objections were introduced in the final round. The divergence in scores reflects the weight placed on concerns, such as efficiency, that I do not believe are prohibitive enough to be grounds for rejection. Given the strength and breadth of the empirical findings, the clarity of exposition, and the relevance of the problem, I recommend acceptance.
[Automatically added comment] At least one review was discounted during the decision process due to quality.