PaperHub
Overall rating: 7.0 / 10 (Poster; 4 reviewers; min 6, max 8, std. dev. 0.7)
Individual ratings: 7, 7, 6, 8
Confidence: 3.0
COLM 2025

The Unlearning Mirage: A Dynamic Framework for Evaluating LLM Unlearning

Submitted: 2025-03-20 | Updated: 2025-08-26
TL;DR

A Dynamic Framework for Evaluating LLM Unlearning

Abstract

Keywords
Unlearning evaluation, Multi-hop reasoning

Reviews and Discussion

Official Review
Rating: 7

The paper proposes a framework to automatically construct a benchmark for assessing unlearning efficiency. The framework elicits knowledge from the baseline large language model to construct a knowledge graph. Using the knowledge graph's information, such as aliases and relations, the framework then constructs benchmarks to assess unlearning efficiency.

By testing a large number of methods on the benchmark, the authors show that the automatically constructed benchmark agrees highly with high-quality manual construction. The framework also enables the discovery of new issues for unlearning.

Reasons to Accept

The paper clearly lays out its motivation and contributions.

The extensive experiments shown in Table 2 and the appendix demonstrate the viability of the method. Not only does the method show its efficiency compared with existing benchmarks, but the paper also shows that such automatic construction might reveal new issues for unlearning research, which gives the paper potentially good impact.

The paper discusses the limitations of the proposed approach transparently and insightfully. The automatic construction is limited by the elicitation step, which can point to future research directions.

Reasons to Reject

I personally found Table 2 and the corresponding description in Section 5.2 a bit confusing. After reading carefully, I believe the results are produced under their constructed benchmark but seeded with RWKU entities. If this is true, it needs to be clarified in several places:

  • Table 2 caption: the results are then not on the RWKU benchmark.
  • Line 250, as an example: these are not results for RWKU.

The authors have more results in the Appendix, and I think the authors could refer to them more prominently for curious readers.

Comment

We thank you for your thoughtful evaluation and for recognizing the clarity of our motivation, the strength of our experimental results, and the potential impact of our automatically constructed evaluation method for unlearning.

Regarding the description of Table 2 and Section 5.2, you are correct. Table 2 reports results generated using our constructed benchmark, seeded with RWKU entities, not the original RWKU benchmark itself. We agree that this distinction should be conveyed more clearly.

We will make the following changes in the revision:

  • Table 2 caption will be updated to be explicit and state that the evaluation is conducted using our dynamic benchmark instantiated with RWKU entities.
  • Line 250 and surrounding text in Section 5.2 will be revised to avoid any ambiguity and to clearly distinguish between evaluations on our benchmark vs. original RWKU.
  • We will add explicit references to Appendix Tables 4 to 7, which report results on both Phi-4-mini and on original static benchmarks like RWKU and TOFU, for readers seeking a broader view.

We appreciate your suggestion and will clarify the above points to improve overall readability.

Official Review
Rating: 7

This work introduces a dynamic framework for benchmarking LLM unlearning. It begins by extracting seed knowledge from the model and constructs a graph-based expansion to discover related information. The proposed benchmark is then used to evaluate existing unlearning methods, showing comparable coverage and effectiveness to static evaluation methods. Moreover, it reveals new failure modes, particularly in multi-hop reasoning settings.

Reasons to Accept

  1. The paper is well-written, clearly structured, and easy to follow.

  2. This work presents a unified, dynamic evaluation framework for model unlearning that demonstrates flexibility and adaptability across different query types and reasoning complexities.

Reasons to Reject

  1. The evaluation is primarily conducted on the Llama 3.3 8B model, limiting the generalizability of the findings.

  2. Since the benchmark relies on knowledge extracted from each model prior to unlearning, comparisons across models become difficult. Each model is evaluated on a potentially different set of knowledge. This raises concerns about the consistency and fairness of cross-model comparisons, as variations in the constructed knowledge graphs may lead to differences in evaluation difficulty rather than reflecting true differences in unlearning performance.

Comment

We thank you for the positive assessment of our work and the recognition of our dynamic, model-specific evaluation approach. We are glad you found our results both rigorous and impactful in revealing residual knowledge missed by static benchmarks. We address each of your concerns below:

Feedback1: The evaluation is primarily conducted on the Llama 3.3 8B model, limiting the generalizability of the findings.

We appreciate your concerns regarding generalizability. Our primary experiments use LLaMA 3.1-8B due to its wide adoption in unlearning research. We also evaluate our benchmark on Phi-4-mini-Instruct (Table 4) and IBM Granite-3.2-8B-Instruct (see below) to ensure that the evaluation is effective in comparing unlearning methods regardless of the model architecture. Across models, we observe consistent relative performance rankings of unlearning methods when compared to previous evaluation methods (e.g., Spearman’s ρ > 0.8 for IBM Granite-3.2-8B-Instruct), supporting the generalizability of our framework.

| Method | Multi-hop 1-hop | Multi-hop 2-hop | Multi-hop 3-hop | Ret. Criteria 1-fact away | Ret. Criteria 2-facts away | Ret. Criteria Rel. Ret. | Combined Multihop Forget Score | Combined Avg. Retain | Overall Score |
|---|---|---|---|---|---|---|---|---|---|
| Target Model | 98.2 | 96.9 | 80.4 | 98.0 | 97.3 | 97.5 | 91.8 | 97.6 | 15.1 |
| ICL | 13.1 | 17.6 | 27.9 | 35.6 | 50.3 | 91.8 | 19.5 | 59.2 | 68.2 |
| GA | 18.2 | 24.7 | 30.1 | 43.1 | 60.8 | 54.2 | 24.3 | 52.7 | 62.1 |
| GDR | 22.6 | 26.9 | 30.6 | 72.2 | 69.1 | 74.1 | 26.7 | 71.8 | 72.5 |
| GKL | 22.8 | 27.9 | 31.4 | 72.6 | 69.7 | 75.2 | 27.4 | 72.5 | 72.6 |
| DPO | 21.4 | 29.4 | 33.3 | 51.1 | 56.5 | 56.2 | 28.0 | 54.6 | 62.1 |
| DPOD | 24.3 | 33.7 | 34.5 | 63.2 | 69.2 | 81.4 | 30.8 | 71.3 | 70.2 |
| DPOKL | 27.5 | 31.9 | 35.0 | 64.7 | 66.5 | 81.8 | 31.5 | 71.0 | 69.7 |
| NPO | 14.4 | 24.1 | 29.6 | 48.7 | 59.3 | 59.2 | 22.7 | 55.7 | 64.8 |
| NPOD | 17.2 | 23.7 | 30.2 | 68.7 | 72.6 | 82.6 | 23.7 | 74.6 | 75.5 |
| NPOKL | 16.1 | 21.1 | 30.0 | 68.2 | 70.2 | 83.5 | 22.4 | 74.0 | 75.7 |
| ULD | 12.6 | 19.1 | 26.9 | 72.7 | 76.6 | 85.4 | 19.5 | 78.2 | 79.3 |
| TV | 26.8 | 45.9 | 52.6 | 75.4 | 79.5 | 89.2 | 41.8 | 81.4 | 67.9 |
| Avg. | 19.8 | 27.2 | 32.7 | 61.4 | 66.7 | 76.2 | 26.5 | 68.1 | 70.1 |

Spearman’s rank correlation with previous metrics: 0.86, 0.83
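For completeness, a minimal sketch of how such a rank-agreement check can be computed is shown below. The per-method forget scores are taken from the table above, but the "previous metric" values are hypothetical placeholders for illustration only; this is not our exact evaluation code.

```python
# Minimal sketch of a ranking-agreement check between our forget score and a
# previous metric. Forget scores come from the table above; previous_metric
# values are hypothetical placeholders, not real benchmark numbers.
from scipy.stats import spearmanr

methods = ["ICL", "GA", "GDR", "GKL", "DPO", "NPO", "TV"]
ours_forget = [19.5, 24.3, 26.7, 27.4, 28.0, 22.7, 41.8]       # Combined Multihop Forget Score
previous_metric = [21.0, 25.1, 27.0, 27.9, 29.3, 23.5, 40.2]   # hypothetical previous-benchmark scores

rho, p_value = spearmanr(ours_forget, previous_metric)
print(f"Spearman's rho = {rho:.2f} (p = {p_value:.3f})")
```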

Feedback2: Since the benchmark relies on knowledge extracted from each model prior to unlearning, comparisons across models become difficult. This raises concerns about the consistency and fairness of cross-model comparisons, as variations in the constructed knowledge graphs may lead to differences in evaluation difficulty rather than reflecting true differences in unlearning performance.

Importantly, our benchmark evaluates each model relative to its own knowledge, i.e., we only include a probe if the pre-unlearning model answers it correctly. This ensures that we are testing whether a model has truly forgotten something it previously knew, rather than penalizing it for never having encoded the information in the first place. This approach directly addresses the issue you raised: models differ in their training data, architecture, and knowledge cutoff dates, making a fixed set of static probes unfair and uninformative for cross-model comparisons.

By contrast, our dynamic, model-informed method avoids this issue. It adapts to what each model actually encodes before unlearning, providing a fairer and more accurate assessment of forgetting. In this sense, the concern about inconsistent knowledge graphs highlights a key strength of our approach: static benchmarks impose a one-size-fits-all test that fails to account for model-specific prior knowledge, whereas our method ensures evaluations are tailored, valid, and comparable within each model. In the revision, we will motivate how the goal of unlearning evaluation is not comparison across models but rather comparison across different unlearning methods on the same model, and emphasize how our design accounts for differences in model training data and knowledge cutoffs, ensuring that evaluations reflect forgetting rather than missing prior knowledge.

Comment

Thank you for providing the additional analysis and explanation. I think they address my main concerns. I've updated my score to a 7.

Comment

Hi,

The discussion period ends tomorrow. Please take time to review the authors' response and update your review accordingly.

Thanks!

Official Review
Rating: 6
  • This paper introduces a dynamic benchmark to evaluate unlearning robustness in LLMs.
  • The core idea is to construct a knowledge graph by querying the target model before unlearning, capturing what the model “knows” about a specific entity.
    • This involves querying the model to extract facts and build a graph of single-hop relations.
    • The graph is then expanded to include multi-hop paths and aliases.
    • Using this graph, the framework generates various probing questions—single-hop, multi-hop, and alias-based—to evaluate whether unlearning was successful.
  • The key insight is that while most current unlearning methods pass basic single-hop tests, they often fail under multi-hop or alias-based queries—indicating that the knowledge isn't fully removed.

Reasons to Accept

  • The proposed benchmark is more rigorous than existing ones. Most prior work evaluates unlearning using static, single-hop questions, which are easy for current methods to pass. This framework exposes brittleness by generating multi-hop and alias-based queries that surface residual knowledge.
  • The framework is dynamic and model-specific, meaning it constructs the benchmark by extracting knowledge directly from the model. This avoids the need for manual construction or external LLMs like GPT-4, and helps ensure that the probing queries are aligned with what the model actually encoded pre-unlearning.
  • The empirical results are thorough. The authors show that their metric aligns with existing benchmarks on method ranking, but also captures failure cases that others miss (e.g., multi-hop alias queries where residual knowledge is still accessible).

Reasons to Reject

  • The graph construction cost is not well discussed. Since the graph is built by repeatedly querying the model and expanding nodes, how many model calls are needed per entity? This affects scalability, especially if this method is to be used in practice. There's mention of API budget and decay factor, but no concrete numbers or analysis of resource usage.
  • There’s a potential issue with using model outputs as ground truth for what should be forgotten. For instance, prompting (A, wrote, B) may succeed, but the reverse (B, was written by, A) may fail even before unlearning. This asymmetry (reversal curse) can introduce noise into the evaluation: are we really measuring unlearning failure, or just limits in how the model responds to different phrasings?
  • It’s not entirely clear whether multi-hop unlearning is practically necessary. Is this a real-world issue, or is it mainly an academic edge case? If we're constructing synthetic multi-hop queries, we may just end up with unlearning methods that overfit to synthetic query styles without actually improving robustness in realistic settings. Some discussion around the practical importance of this failure mode would be helpful.

Questions for the Authors

  • What is the “LLaMA-3.3-Instruct 8B” model? To my knowledge, Meta has only released 8B models on LLaMA-3.1. Is 3.3 an internal/private version, or a typo?
Comment

We thank you for the clear summary and positive assessment of our work. We appreciate your recognition of our dynamic, model-specific evaluation approach and the importance of probing with multi-hop and alias-based queries. We are glad you found our results both rigorous and impactful in revealing residual knowledge missed by static benchmarks. We address your concerns below:

Feedback1: The graph construction cost is not well discussed. Since the graph is built by repeatedly querying the model and expanding nodes, how many model calls are needed per entity? This affects scalability, especially if this method is to be used in practice. There's mention of API budget and decay factor, but no concrete numbers or analysis of resource usage.

We thank you for highlighting this important consideration regarding scalability in real-world deployments. We note that the knowledge graph needs to be constructed only once per model and knowledge domain (seed entity) pair. Once built, it can be reused to evaluate multiple unlearning methods, making the associated API cost and model calls a one-time overhead rather than a recurring burden.

We agree that a more detailed breakdown of model call usage during graph construction would improve clarity. In the revised version (Section 4.1, Appendix A.6), we will provide a comprehensive analysis of the API calls incurred under varying graph depths and decay factors.

Specifically, each node in the graph requires a minimum of three LLM queries: one for entity elicitation, one for atomic fact extraction, and one for alias resolution. Since the graph is expanded via a breadth-first search with an exponential decay factor (α), the number of nodes, and consequently the number of model calls, grows sub-exponentially according to Equations 1 and 2.

For example, under a decay factor of α = 0.8, we empirically observe that the average number of nodes per seed entity in the RWKU dataset at depths 1, 2, and 3 is approximately 57.6, 103.7, and 140.5, respectively. This results in a total model call count ranging from 228 to 1,942 per entity, depending on graph density, alias resolution needs, and retries due to API response exceptions. We will include these statistics in the appendix and expand our discussion to cover practical scalability strategies such as batching, early pruning of low-relevance nodes, and our implementation of parallelized graph construction.
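To make the accounting above concrete, the sketch below computes a lower bound on per-entity model calls from the empirical node counts, assuming only the minimum of three queries per node; it is an illustrative estimate, not our exact implementation, and actual counts are higher once graph density, alias resolution needs, and retries are included.

```python
# Lower-bound sketch of per-entity LLM calls during graph construction.
# Assumes only the minimum of three queries per node stated above (entity
# elicitation, atomic fact extraction, alias resolution); real counts also
# depend on graph density, alias resolution needs, and API retries.

MIN_QUERIES_PER_NODE = 3

def min_calls(avg_nodes: float) -> int:
    """Lower bound on LLM calls for a graph with `avg_nodes` nodes."""
    return round(avg_nodes * MIN_QUERIES_PER_NODE)

# Empirical average node counts per RWKU seed entity at alpha = 0.8 (from this response).
for depth, nodes in [(1, 57.6), (2, 103.7), (3, 140.5)]:
    print(f"depth {depth}: >= {min_calls(nodes)} model calls per seed entity")
```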

Feedback2: There’s a potential issue with using model outputs as ground truth for what should be forgotten. For instance, prompting (A, wrote, B) may succeed, but the reverse (B, was written by, A) may fail even before unlearning. This asymmetry (reversal curse) can introduce noise into the evaluation: are we really measuring unlearning failure, or just limits in how the model responds to different phrasings?

Thanks for raising this point. While such surface-form asymmetries can exist, we do not view them as a major confound in our evaluation. First, we include only variant probes (forward, reverse, and alias-based) that are correctly answered by the pre-unlearning model. This ensures that post-unlearning failures reflect forgetting quality independent of phrasing artifacts. More importantly, we believe that regardless of phrasing, an effective unlearning method should prevent any elicitation of the forgotten knowledge. If a model can still retrieve the target fact under a different surface form, this signals an unlearning failure. We will clarify both our filtering protocol and this conceptual stance in the revision.
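A minimal sketch of this filtering step is shown below; `model_answers_correctly` and the probe fields are hypothetical stand-ins for whatever answer checking the evaluation pipeline uses, not our actual code.

```python
# Sketch of the pre-unlearning filtering protocol described above: a probe
# (forward, reverse, or alias-based) is kept only if the pre-unlearning model
# already answers it correctly, so post-unlearning failures reflect forgetting
# rather than phrasing limits. `model_answers_correctly` is a hypothetical checker.

def filter_probes(probes, pre_unlearning_model, model_answers_correctly):
    kept = []
    for probe in probes:  # each probe carries a question and its expected answer
        if model_answers_correctly(pre_unlearning_model, probe["question"], probe["answer"]):
            kept.append(probe)
    return kept
```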

Comment

Feedback3: It’s not entirely clear whether multi-hop unlearning is practically necessary. Is this a real-world issue, or is it mainly an academic edge case? If we're constructing synthetic multi-hop queries, we may just end up with unlearning methods that overfit to synthetic query styles without actually improving robustness in realistic settings. Some discussion around the practical importance of this failure mode would be helpful.

We believe multi-hop unlearning is not a theoretical edge case but a practical requirement. In real-world deployments, users engage with LLMs via natural, indirect, or compositional queries, whether through search assistants, RAG pipelines, or conversational interfaces. For instance, a user who forgets a book title may ask, “Who wrote the book whose main character is Katniss Everdeen?” instead of directly asking about The Hunger Games.

Our experiments demonstrate that even when direct (single-hop) queries show effective unlearning, residual knowledge often remains accessible through slightly rephrased or multi-hop questions, exposing the brittleness of current unlearning methods. From the standpoint of an end user who wants a model to forget their private information, how a question is phrased is immaterial; what matters is that the information is actually forgotten. If a model can retrieve sensitive or copyrighted content under paraphrased or multi-step reasoning, the unlearning mechanism has failed in its real-world obligation. Thus, we believe users, regulators, and stakeholders care about outcome-level guarantees, not phrasing-specific ones. Our method reflects this risk by constructing multi-hop and alias-based probes directly from the model’s own knowledge structure, avoiding arbitrary synthetic templates. We will clarify this motivation and real-world relevance more explicitly in the revised paper.
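To illustrate how such indirect queries arise from a knowledge-graph path, here is a toy sketch; the templates and triples are ours, for illustration only, and do not represent the paper's actual probe-generation pipeline.

```python
# Toy illustration of turning a 2-hop knowledge-graph path into an indirect probe;
# templates and triples are illustrative, not the paper's generation pipeline.

def two_hop_question(triple_inner, triple_outer):
    """Compose a question that reaches the target fact via an intermediate entity."""
    _book, rel_in, obj_in = triple_inner   # e.g. (book, "has main character", character)
    _author, rel_out, _ = triple_outer     # e.g. (author, "wrote", book)
    return f"Who {rel_out} the book whose {rel_in.replace('has ', '')} is {obj_in}?"

print(two_hop_question(
    ("The Hunger Games", "has main character", "Katniss Everdeen"),
    ("Suzanne Collins", "wrote", "The Hunger Games"),
))
# -> "Who wrote the book whose main character is Katniss Everdeen?"
```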

Feedback4: What is the “LLaMA-3.3-Instruct 8B” model? To my knowledge, Meta has only released 8B models on LLaMA-3.1. Is 3.3 an internal/private version, or a typo?

Response: Thanks for catching the typo. Our experiments are based on the LLaMA-3.1-Instruct 8B model. We will correct this error in the paper.

Comment

Thank you for your response. I’m satisfied with your clarification and will maintain my positive assessment.

Official Review
Rating: 8

The paper proposes an evaluation framework for LLM unlearning that is claimed to be superior to existing unlearning evaluations. The focus is on dynamically creating multi-hop queries that can retrieve information missed by existing benchmarks. The multi-hop queries are created by probing the LLM's internal knowledge about unlearning targets before unlearning and constructing knowledge graphs. These graphs can be created automatically, without manual data curation.

Reasons to Accept

The paper provides strong reasoning and supporting arguments as to why multi-hop queries might expose flaws in existing unlearning methods. Furthermore, it provides a framework to automatically probe LLMs to create multi-hop queries. It then conducts an extensive evaluation of the most popular unlearning methods to show their unlearning performance. There is also a detailed discussion of how the proposed benchmark compares to existing benchmarks like TOFU and MUSE. The limitations section is extremely transparent and provides ideas on how to drive unlearning research forward.

Reasons to Reject

No strong reasons for rejection.

Comment

We thank you for your thoughtful summary and positive feedback on our submission. We are glad that you found our dynamic evaluation framework for LLM unlearning to offer strong motivation, methodological rigor, and a transparent discussion of limitations. Your recognition of our framework’s ability to automatically generate multi-hop queries and uncover new failure modes, while maintaining coverage and ranking consistency with established benchmarks, is much appreciated. We are happy to work with you to further improve the paper.

Comment

Hi,

The discussion period ends tomorrow. Please take time to review the authors' response and update your review accordingly.

Thanks!

Comment

I have nothing more to add in the discussion period.

Final Decision

The reviewers unanimously found this paper to be a valuable and timely contribution to the growing literature on LLM unlearning, praising its clear motivation, rigorous methodology, and thorough experiments (scores = 8, 7, 7, 6). They especially appreciated the dynamic, model-specific benchmark that uncovers, via multi-hop and alias-based probes, residual knowledge that static tests miss, while still aligning well with prior evaluations. All critical points raised during discussion (graph-construction cost, asymmetry in prompting, real-world relevance of multi-hop queries) were addressed convincingly. I think this is an impactful paper that should be accepted.