PaperHub
6.8/10
Poster · 4 reviewers
Ratings: 4, 5, 5, 3 (min 3, max 5, std 0.8)
Confidence: 3.8
Novelty 2.5 · Quality 2.3 · Clarity 2.8 · Significance 2.5
NeurIPS 2025

Elastic Robust Unlearning of Specific Knowledge in Large Language Models

OpenReview · PDF
Submitted: 2025-05-06 · Updated: 2025-10-29
TL;DR

A novel LLM unlearning optimization framework, namely Elastic Robust Unlearning (ERU), to efficiently and robustly remove specific knowledge from LLMs.

Abstract

Keywords
LLM Unlearning; Preference Optimization; Unlearning Robustness

Reviews and Discussion

Review (Rating: 4)

This paper proposes Elastic Robust Unlearning (ERU), an optimization framework for robust and effective unlearning in Large Language Models. ERU introduces Elastic Reward Setting to balance reference-based and reference-free reward signals, providing flexible and adaptive optimization during unlearning. It utilizes Refusal Feature Ablation, which simulates worst-case perturbations during unlearning to defend against knowledge recovery preemptively. The paper formulates unlearning as a max-min optimization problem, where the inner loop simulates adversarial conditions, and the outer loop removes harmful knowledge.

Strengths and Weaknesses

Strengths:

  1. ERU models the unlearning process as a robust optimization problem and uses an Elastic Reward Setting to overcome the rigidity of prior reward formulations.

Weaknesses:

  1. The proposed method incorporates a simplified adversarial training component, following a prior approach. However, it does not compare against more advanced or state-of-the-art adversarial training techniques beyond AdvNPO to justify the choice of Refusal Feature Ablation over alternatives. Such comparisons are important for isolating the benefits of integrating adversarial mechanisms into the framework. Additionally, training time is typically not the primary concern in most machine unlearning scenarios.

  2. The current ablation study lacks a variant that isolates the contribution of the Elastic Reward component. Evaluating the model's performance with only Elastic Reward (and without other components) would help quantify its individual impact and clarify the source of improvements.

  3. The paper does not report statistical significance tests to support claims of performance improvement. Without such analysis, it is difficult to determine whether the observed gains are meaningful or within the margin of variability.

Questions

Could the authors compare the proposed method against other sophisticated adversarial training techniques under the same Elastic Reward settings? This would help isolate the value of the framework's adversarial component and demonstrate whether its integration offers a meaningful improvement.

Could the authors conduct an ablation study that evaluates the method using only the Elastic Reward component? This would clarify its individual contribution to overall performance and better support the design choices.

Could the authors perform statistical significance tests to demonstrate that the proposed method yields significant improvements in unlearning effectiveness while preserving model utility as claimed? This would strengthen the empirical claims and enhance the credibility of the reported gains.

Limitations

NA

Final Justification

The responses address my concerns related to ablation studies and statistical significance. Hence, I increase my scores.

Formatting Issues

NA

Author Response

Dear reviewer pWXE,

Thank you for your positive assessment of the novelty and motivation of our work. We hope to address your concerns in our reply below.

Weaknesses and Questions:

Weakness (1) and Question (1): Could the authors compare the proposed method against other sophisticated adversarial training techniques under the same Elastic Reward settings? [...]

Thank you for bringing this to our attention! We supplement a comparison of the proposed method against Input-space Adversarial Training (IAT) [1], Continuous Adversarial Training (CAT) [2], and Latent-space Adversarial Training (LAT) [3] under the same Elastic Reward settings. The experimental results are shown along three dimensions as follows:

Unlearning Effectiveness

| Method | RWKU FB↓ | RWKU QA↓ | RWKU AA↓ | MUSE-News VerbMem↓ | MUSE-News KnowMem↓ | MUSE-News PrivLeak (∈ [−5%, 5%]) | TOFU Forget05-FQ↑ | TOFU Forget10-FQ↑ | WMDP AccBio↓ | WMDP AccCyber↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| EU-IAT | 34.7 | 32.2 | 30.5 | 15.7 | 14.8 | 27.8 | 0.58 | 0.32 | 29.5 | 31.5 |
| EU-CAT | 33.5 | 29.4 | 28.7 | 13.1 | 17.8 | 34.6 | 0.62 | 0.41 | 29.2 | 31.2 |
| EU-LAT | 32.8 | 29.6 | 25.8 | 12.5 | 11.8 | 31.5 | 0.71 | 0.44 | 27.3 | 29.5 |
| ERU | 29.2 | 27.1 | 25.5 | 10.4 | 9.2 | 12.3 | 0.73 | 0.48 | 24.8 | 28.4 |

Utility Preservation

| Method | RWKU Rea↑ | RWKU Tru↑ | RWKU Fac↑ | RWKU Flu↑ | MUSE-News KnowMem↑ | MMLU Accuracy↑ |
|---|---|---|---|---|---|---|
| EU-IAT | 24.3 | 28.9 | 37.5 | 685.4 | 36.9 | 40.8 |
| EU-CAT | 25.2 | 29.4 | 39.2 | 697.3 | 40.8 | 42.5 |
| EU-LAT | 24.8 | 28.7 | 38.8 | 688.5 | 41.6 | 45.7 |
| ERU | 26.2 | 30.5 | 40.5 | 708.8 | 43.2 | 50.6 |

Unlearning Robustness (Performance recovery of various methods on the WMDP-Bio after fine-tuning with different numbers of retain set samples.)

| Method | 0 samples | 5 samples | 10 samples | 50 samples | 100 samples | 250 samples | 500 samples | 1000 samples |
|---|---|---|---|---|---|---|---|---|
| EU-IAT | 29.5 | 33.5 | 34.7 | 35.4 | 36.2 | 37.9 | 38.6 | 39.3 |
| EU-CAT | 29.2 | 33.8 | 34.6 | 34.9 | 35.0 | 35.8 | 37.6 | 38.4 |
| EU-LAT | 27.3 | 29.5 | 30.6 | 32.8 | 33.6 | 34.2 | 34.5 | 34.7 |
| ERU | 24.8 | 25.1 | 26.2 | 28.7 | 27.8 | 30.6 | 31.0 | 30.7 |

Experimental results show that ERU combined with RFAT significantly outperforms the variants combined with the other adversarial training techniques. In the tables above, the best method is bolded and the second-best is highlighted.


Weakness (2) and Question (2) : Could the authors conduct an ablation study that evaluates the method using only the Elastic Reward component? [...]

Thank you for your valuable suggestions. In fact, we have already conducted the relevant discussion in Appendix F.3 of the paper, where we remove the key components of ERU to analyze the specific impact of each component on performance. The case you mention, "evaluating the method using only the Elastic Reward component", corresponds to the state after removing refusal feature adversarial training (RFAT) from ERU. The specific results are in the following table (consistent with Table 8 in Appendix F.3 of the paper):

| Method | RWKU FB↓ (Effectiveness) | RWKU QA↓ (Effectiveness) | RWKU AA↓ (Effectiveness) | MMLU Accuracy↑ (Utility) | WMDP AccBio↓ (Robustness) | WMDP AccCyber↓ (Robustness) |
|---|---|---|---|---|---|---|
| Original | 51.9 | 46.8 | 57.5 | 58.5 | 63.2 | 42.8 |
| ERU | 29.2 | 27.1 | 25.5 | 50.6 | 24.8 | 28.4 |
| w/o RFAT (only EU) | 29.6 | 28.4 | 35.8 | 52.3 | 51.4 | 37.9 |

We realize that this content may not have been emphasized enough in the main text, which raised your question. To make the presentation clearer, we plan to move this analysis from the appendix into the main text so that readers can follow it more easily.


Weakness (3) and Question (3) : Could the authors perform statistical significance tests to demonstrate that the proposed method yields significant improvements in unlearning effectiveness while preserving model utility as claimed? [...]

We sincerely appreciate the reviewer's valuable suggestion. To rigorously evaluate the performance differences between our proposed method and the baseline methods, we conducted statistical significance tests on the metrics in the three evaluation dimensions of unlearning performance. A statistically significant result, indicated by a p-value less than 0.05, would confirm that the performance improvement of our proposed method is meaningful and consistent. To compare the metric distributions, we employ the Wilcoxon signed-rank test. We use bootstrapping to generate multiple samples from the original dataset through resampling with replacement. For each bootstrap sample, we compute the metrics for both the proposed and baseline methods, resulting in a distribution of metric values for each method. After conducting the significance analysis, all p-values for the two models (LLaMA-2-7B-Chat, LLaMA-3-8B-Instruct) across the four datasets (RWKU, MUSE-News, TOFU, WMDP) are well below 0.05 when comparing each of our proposed methods against the baseline methods. The results of the statistical significance tests are as follows:
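For concreteness, a minimal sketch of this bootstrap-plus-Wilcoxon procedure is given below. This is illustrative code only, not the exact evaluation scripts: the array names, the synthetic placeholder scores, and the use of a simple mean as the per-sample metric are our assumptions.

```python
# Sketch: paired bootstrap resampling followed by a Wilcoxon signed-rank test.
import numpy as np
from scipy.stats import wilcoxon

def bootstrap_metric(per_example_scores, n_boot=1000, seed=0):
    """Resample examples with replacement; return the metric (here: mean) per bootstrap sample."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(per_example_scores)
    n = len(scores)
    return np.array([scores[rng.integers(0, n, n)].mean() for _ in range(n_boot)])

# Placeholder per-example metric values for the proposed method and a baseline
# (in practice these would come from the benchmark evaluation on the same examples).
rng_demo = np.random.default_rng(1)
scores_ours = rng_demo.normal(0.25, 0.05, size=200)
scores_baseline = rng_demo.normal(0.30, 0.05, size=200)

# Using the same bootstrap seed keeps the resampled example sets aligned, so the
# two metric distributions can be compared as paired samples.
boot_ours = bootstrap_metric(scores_ours, seed=0)
boot_baseline = bootstrap_metric(scores_baseline, seed=0)

stat, p_value = wilcoxon(boot_ours, boot_baseline)
print(f"Wilcoxon signed-rank p-value: {p_value:.2e}")
```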

Unlearning Effectiveness

| Metrics | RWKU FB | RWKU QA | RWKU AA | MUSE-News VerbMem | MUSE-News KnowMem | MUSE-News PrivLeak | TOFU Forget05-FQ | TOFU Forget10-FQ | WMDP AccBio | WMDP AccCyber |
|---|---|---|---|---|---|---|---|---|---|---|
| p-value (LLaMA-2-7B-Chat) | 2.8e-3 | 1.4e-3 | 3.8e-3 | 4.2e-4 | 9.6e-5 | 7.3e-3 | 1.9e-2 | 3.1e-2 | 2.1e-4 | 1.7e-3 |
| p-value (LLaMA-3-8B-Instruct) | 3.4e-3 | 1.6e-2 | 5.8e-3 | 2.4e-3 | 7.2e-4 | 2.4e-3 | 2.5e-2 | 1.3e-2 | 6.5e-4 | 5.8e-3 |

Utility Preservation

| Metrics | RWKU Rea | RWKU Tru | RWKU Fac | RWKU Flu | MUSE-News KnowMem | MMLU Accuracy |
|---|---|---|---|---|---|---|
| p-value (LLaMA-2-7B-Chat) | 2.7e-3 | 8.9e-4 | 1.8e-3 | 3.1e-3 | 1.2e-2 | 1.3e-3 |
| p-value (LLaMA-3-8B-Instruct) | 1.8e-3 | 3.6e-3 | 1.9e-3 | 3.5e-4 | 2.2e-3 | 5.3e-4 |

Unlearning Robustness

| Metrics | RWKU FB | RWKU QA | RWKU AA | WMDP AccBio | WMDP AccCyber |
|---|---|---|---|---|---|
| p-value (LLaMA-2-7B-Chat) | 1.7e-2 | 6.2e-3 | 1.9e-4 | 2.3e-3 | 2.2e-3 |
| p-value (LLaMA-3-8B-Instruct) | 1.6e-3 | 9.2e-4 | 8.5e-4 | 1.3e-2 | 7.2e-3 |

Concluding remarks.

We would be grateful if you could let us know whether our explanations have addressed your concerns. Please let us know if you have any other questions or concerns.

References.

[1] Zou, et al. (2023) Universal and Transferable Adversarial Attacks on Aligned Language Models.

[2] Xhonneux, et al. (2024) Efficient Adversarial Training in LLMs with Continuous Attacks.

[3] Sheshadri, et al. (2024) Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs.

Comment

The responses address my concerns. Hence, I increase my scores.

Comment

Dear reviewer pWXE,

Thanks for taking the time to review our work; we have carefully considered your comments and made every effort to respond to your concerns.

If you have any further questions or require additional clarification, please kindly let us know.

Best regards.

Review (Rating: 5)

This paper proposes Elastic Robust Unlearning (ERU), a novel framework for improving the effectiveness and robustness of large language model (LLM) unlearning. Addressing limitations in existing preference optimization (PO)-based methods, ERU introduces two key innovations: an elastic reward mechanism that enhances unlearning flexibility, and refusal feature ablation, which induces targeted failure modes to boost robustness against unlearned knowledge relearning. Experimental results demonstrate that ERU achieves superior unlearning performance while preserving model utility, outperforming prior methods in a number of benchmark datasets and tasks.

Strengths and Weaknesses

Strengths

  1. This work provides a detailed analysis of the shortcomings of existing LLM unlearning approaches.

  2. The proposed methods smartly balance the weights of reference-based and reference-free policy optimization during unlearning.

  3. Adopting the refusal feature ablation can greatly reduce the computation cost compared to using regular adversarial training to prevent unlearned knowledge relearning.

  4. The paper is well-organized, and it is easy for the readers to capture the main ideas.

Weaknesses

  1. The refusal feature ablation (RFA) is directly adopted to enhance the robustness of elastic policy optimization. Dedicated design is required for the scenarios of LLM unlearning. For example, the D_harmful and D_harmless can be constructed by the forget or retained datasets. In real-world scenarios, unlearning may be employed to delete outdated information and users' private information.

  2. More details of balancing the inner and outer optimization should be provided. For max-min bi-level optimization, the optimization pace impacts the final performance a lot.

  3. There is no ablation study of detaching the refusal feature ablation from ERU. The balancing effect between reference-based and reference-free optimization offered by the elastic reward margin needs validation.

  4. The robustness of defending against adaptive attacks should be discussed and validated. That is to say, if the adversaries have the prior knowledge that the target LLM has been protected by RFA and they adopt some dedicated measures to break up the protection before conducting unlearned knowledge relearning, how will ERU perform, how long, and how strong will the unlearning effect last?

Questions

Please see the above weaknesses.

Limitations

yes

Final Justification

After careful reading of the authors' response, my concerns have been addressed. The authors should incorporate the additional experiments and analysis into the future revision. I am glad to raise my rating to Accept.

Formatting Issues

None

Author Response

Dear reviewer uCL3,

Thank you for your constructive comments; they are valuable for improving our paper. In the following, we address your concerns one by one.

Weaknesses and Questions:

Weakness (1): The refusal feature ablation (RFA) is directly adopted to enhance the robustness of elastic policy optimization. Dedicated design is required for the scenarios of LLM unlearning. [...]

Thank you for your valuable suggestions. In our current design, our D_harmful and D_harmless are respectively sampled from AdvBench and Alpaca, which is based on the existing practice in the study of refusal features [1]. This choice stems from the primary goal of our RFAT, which is to simulate the worst-case adversarial attacks designed to bypass the model's security mechanisms and trigger harmful outputs.
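For context, the general recipe from Arditi et al. [1] extracts a refusal direction as a difference of mean activations between harmful and harmless prompts and then projects it out. A minimal sketch of that recipe, which such a D_harmful / D_harmless construction would plug into, is below; the tensor names and shapes are our assumptions, not the paper's implementation.

```python
# Sketch: difference-in-means refusal direction and its ablation (Arditi et al., 2024 style).
# `hidden_harmful` / `hidden_harmless` are assumed tensors of shape [num_prompts, hidden_dim],
# e.g. the residual-stream activation at the last token of each prompt at a chosen layer.
import torch

def refusal_direction(hidden_harmful: torch.Tensor, hidden_harmless: torch.Tensor) -> torch.Tensor:
    direction = hidden_harmful.mean(dim=0) - hidden_harmless.mean(dim=0)
    return direction / direction.norm()  # unit vector

def ablate_refusal_feature(hidden: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    # Remove the component of each hidden state along the refusal direction: h' = h - (h · r) r
    coeff = hidden @ direction
    return hidden - coeff.unsqueeze(-1) * direction
```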

We think your suggestion is very valuable. Constructing the D_harmful set directly from the forget set (D_f) and the D_harmless set from the retain set (D_r) is a persuasive and conceptually elegant idea that is particularly suitable for the unlearning task. It aligns well with the core objective of this paper: we want the model to "refuse" to output or use the knowledge in D_f while retaining the knowledge in D_r and responding to it normally. We verify this suggestion with the following supplementary experiments:

Unlearning Effectiveness

| Method | RWKU FB↓ | RWKU QA↓ | RWKU AA↓ | MUSE-News VerbMem↓ | MUSE-News KnowMem↓ | MUSE-News PrivLeak | TOFU Forget05-FQ↑ | TOFU Forget10-FQ↑ | WMDP AccBio↓ | WMDP AccCyber↓ | WMDP AccChem↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ERU | 29.2 | 27.1 | 25.5 | 10.4 | 9.2 | 12.3 | 0.73 | 0.48 | 24.8 | 28.4 | 27.2 |
| ERU (sample from D_f and D_r) | 28.4 | 26.7 | 25.3 | 9.8 | 9.1 | 11.5 | 0.85 | 0.49 | 24.4 | 27.5 | 26.3 |

Utility Preservation

| Method | RWKU Rea↑ | RWKU Tru↑ | RWKU Fac↑ | RWKU Flu↑ | MUSE-News KnowMem↑ | MMLU Accuracy↑ | TOFU-Forget05 Probability↑ | TOFU-Forget05 ROUGE↑ | TOFU-Forget10 Probability↑ | TOFU-Forget10 ROUGE↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| ERU | 26.2 | 30.5 | 40.5 | 708.8 | 43.2 | 50.6 | 0.59 | 0.56 | 0.74 | 0.53 |
| ERU (sample from D_f and D_r) | 26.4 | 31.0 | 40.3 | 711.5 | 46.1 | 51.8 | 0.63 | 0.56 | 0.75 | 0.55 |

Unlearning Robustness

| Method | 0 samples | 5 samples | 10 samples | 50 samples | 100 samples | 250 samples | 500 samples | 1000 samples |
|---|---|---|---|---|---|---|---|---|
| ERU | 24.8 | 25.1 | 26.2 | 28.7 | 27.8 | 30.6 | 31.0 | 30.7 |
| ERU (sample from D_f and D_r) | 24.4 | 24.7 | 25.8 | 27.9 | 28.2 | 29.7 | 29.6 | 30.1 |

Experimental results show that, for each benchmark, constructing D_harmful and D_harmless by sampling from its forget set and retain set respectively further improves ERU across Unlearning Effectiveness, Utility Preservation, and Unlearning Robustness. The best method is bolded, and the second-best is highlighted.

Weakness (2): More details of balancing the inner and outer optimization should be provided. For max-min bi-level optimization, the optimization pace impacts the final performance a lot.

Thank you for pointing this out! To provide more details on balancing the inner and outer optimization, we supplement the following experiments here. Specifically, we explore the influence of the number of inner optimization steps. We follow the same experimental configurations as in the paper, but instead vary the number of inner optimization steps. The experimental results are shown in the following table:

| Method | RWKU FB↓ (Effectiveness) | RWKU QA↓ (Effectiveness) | RWKU AA↓ (Effectiveness) | MMLU Accuracy↑ (Utility) |
|---|---|---|---|---|
| ERU (Step=1) | 29.6 | 28.3 | 30.4 | 45.3 |
| ERU (Step=2) | 29.4 | 27.8 | 28.3 | 45.6 |
| ERU (Step=3) | 29.1 | 27.5 | 26.6 | 45.9 |
| ERU (Step=4) | 28.9 | 27.0 | 26.2 | 46.8 |
| ERU (Step=5) | 29.1 | 27.1 | 25.3 | 48.5 |
| ERU (Step=6) | 29.2 | 27.1 | 25.5 | 50.6 |
| ERU (Step=7) | 30.5 | 28.5 | 25.8 | 45.8 |
| ERU (Step=8) | 30.2 | 28.2 | 26.9 | 47.4 |

From the table, we can see that as the number of inner optimization steps increases, the forget-set metrics first decrease, i.e., unlearning effectiveness improves (which is what we hope to see), and then rise again. For utility preservation, the trend is the opposite: it first improves and then declines. In addition, for unlearning robustness, although a more thorough inner loop simulates stronger adversaries and yields a larger robustness gain, it greatly increases the burden of the outer loop and lengthens training. We therefore conclude that both insufficient and excessive inner optimization steps are detrimental to unlearning performance. To best balance ERU's performance, we set the number of inner optimization steps to 6 in the experiments of the paper.
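To make the role of the inner-step count concrete, here is a small, self-contained toy of a max-min (bi-level) loop in the same spirit. It is purely illustrative: a linear model, a synthetic target, and an input-space perturbation stand in for ERU's actual inner RFA simulation, and all names are ours.

```python
# Toy max-min loop: inner ascent steps strengthen an adversarial perturbation,
# the outer step updates the model under that worst case.
import torch

torch.manual_seed(0)
model = torch.nn.Linear(16, 1)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
x = torch.randn(64, 16)          # stand-in "forget" inputs
y = torch.zeros(64, 1)           # toy target after unlearning
loss_fn = torch.nn.MSELoss()
inner_steps, eps = 6, 0.1        # 6 inner steps, matching the setting chosen above

for outer_step in range(100):
    # Inner maximization: a few ascent steps on a perturbation, simulating the adversary.
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(inner_steps):
        inner_loss = loss_fn(model(x + delta), y)
        grad, = torch.autograd.grad(inner_loss, delta)
        with torch.no_grad():
            delta += eps * grad.sign()

    # Outer minimization: one parameter update under the worst-case perturbation found.
    opt.zero_grad()
    loss_fn(model(x + delta.detach()), y).backward()
    opt.step()
```

More inner steps give a stronger simulated adversary but make each outer update more expensive, which is the trade-off reflected in the table above.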

Weakness (3): There is no ablation study of detaching the refusal feature ablation from ERU. The balancing effect [...]

We understand your concern. In fact, we have already conducted the relevant discussion in Appendix F.3 of the paper, where we remove the key components of ERU to analyze the specific impact of each component on performance, including an ablation that detaches refusal feature ablation from ERU. The specific results are in the following table (consistent with Table 8 in Appendix F.3):

| Method | RWKU FB↓ (Effectiveness) | RWKU QA↓ (Effectiveness) | RWKU AA↓ (Effectiveness) | MMLU Accuracy↑ (Utility) | WMDP AccBio↓ (Robustness) | WMDP AccCyber↓ (Robustness) |
|---|---|---|---|---|---|---|
| Original | 51.9 | 46.8 | 57.5 | 58.5 | 63.2 | 42.8 |
| ERU | 29.2 | 27.1 | 25.5 | 50.6 | 24.8 | 28.4 |
| w/o RFAT | 29.6 | 28.4 | 35.8 | 52.3 | 51.4 | 37.9 |

Weakness (4): The robustness of defending against adaptive attacks should be discussed and validated. [...]

Thank you for this very valuable suggestion. To address your concerns, we add the following discussion of robustness against adaptive attacks. Following your phrasing of adversaries who "adopt some dedicated measures to break up the protection", we systematically weakened RFAT in the following two ways to simulate different degrees of damage to this mechanism by adaptive attacks.

Firstly, the paper mentions that we perform RFA with probability p to approximate the different degrees of adversarial perturbation the model encounters during training. Reducing p therefore decreases the chance that the model is exposed to the "worst-case perturbation", weakening the robustness gain. In contrast to the paper, where p is set to 0.5, here we set p to 0.4, 0.3, and 0.2, respectively, to weaken RFAT.

In addition, in our paper, following the research of Yu et al., we applied RFA to the last 75% of layers (layers [8,32]) of the model to obtain the most stable fine-tuning results. Changing the layers where RFA is applied therefore also weakens the robustness gain of RFAT. In the following experiments, we set the RFA application layers to [12,32], [16,32], [20,32], and [24,32], respectively.
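As a concrete, simplified illustration of what "applying RFA with probability p over a range of layers" can look like, here is a self-contained sketch in which a stack of linear layers stands in for the transformer blocks. The hook logic, the names, and the random placeholder direction are our assumptions, not the paper's implementation (in practice the direction would come from the D_harmful / D_harmless estimation sketched earlier in this reply).

```python
# Sketch: stochastic refusal-feature ablation over a chosen layer range via forward hooks.
import torch

hidden_dim, num_layers = 32, 8
blocks = torch.nn.ModuleList([torch.nn.Linear(hidden_dim, hidden_dim) for _ in range(num_layers)])
refusal_dir = torch.nn.functional.normalize(torch.randn(hidden_dim), dim=0)  # placeholder direction

def make_rfa_hook(direction: torch.Tensor, p: float):
    def hook(module, inputs, output):
        # With probability p, remove the component of this layer's output along the refusal direction.
        if torch.rand(()).item() < p:
            coeff = output @ direction
            return output - coeff.unsqueeze(-1) * direction
        return output
    return hook

rfa_layers = range(2, num_layers)   # e.g. the last 75% of layers, analogous to layers [8, 32]
handles = [blocks[i].register_forward_hook(make_rfa_hook(refusal_dir, p=0.5)) for i in rfa_layers]

h = torch.randn(4, hidden_dim)
for block in blocks:                # forward pass with stochastic ablation applied
    h = torch.relu(block(h))

for handle in handles:              # remove hooks when done
    handle.remove()
```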

Consistent with the paper, we discuss the performance recovery of the various settings on WMDP-Bio after fine-tuning with different numbers of retain-set samples. The experimental results are shown in the following table:

| Method | 0 samples | 5 samples | 10 samples | 50 samples | 100 samples | 250 samples | 500 samples | 1000 samples |
|---|---|---|---|---|---|---|---|---|
| ERU (p=0.4) | 24.8 | 26.8 | 27.2 | 29.3 | 29.8 | 31.6 | 32.1 | 32.3 |
| ERU (p=0.3) | 24.6 | 28.6 | 29.5 | 30.2 | 31.8 | 34.3 | 36.5 | 38.8 |
| ERU (p=0.2) | 24.9 | 29.5 | 34.0 | 36.2 | 37.4 | 38.4 | 41.7 | 42.5 |
| ERU (RFA layers [12,32]) | 24.6 | 26.3 | 28.4 | 31.7 | 31.5 | 31.4 | 33.5 | 34.2 |
| ERU (RFA layers [16,32]) | 24.9 | 28.0 | 30.2 | 31.4 | 33.2 | 35.6 | 35.2 | 37.1 |
| ERU (RFA layers [20,32]) | 24.7 | 29.8 | 30.8 | 33.3 | 36.6 | 38.4 | 39.2 | 38.6 |
| ERU (RFA layers [24,32]) | 24.4 | 30.4 | 33.4 | 37.5 | 37.2 | 40.5 | 41.7 | 41.9 |
| ERU | 24.8 | 25.1 | 26.2 | 28.7 | 27.8 | 30.6 | 31.0 | 30.7 |

The experimental results in the table show that ERU still maintains a considerable degree of unlearning robustness even when dedicated measures weaken the RFA protection to varying degrees.

Concluding remarks.

We would be grateful if you could let us know whether our explanations have addressed your concerns. Please let us know if you have any other questions or concerns.

References.

[1] Arditi, et al. (2024) Refusal in Language Models Is Mediated by a Single Direction.

Comment

Dear reviewer uCL3,

Thanks for taking the time to review our work; we have carefully considered your comments and made every effort to respond to your concerns.

If you have any further questions or require additional clarification, please kindly let us know.

Best regards.

Comment

After careful reading of the authors' response, my concerns have been addressed. The authors should incorporate the additional experiments and analysis into the future revision. I am glad to raise my rating to Accept.

Review (Rating: 5)

This paper introduces Elastic Robust Unlearning (ERU), a novel framework designed to remove specific knowledge from LLMs more effectively and more robustly than PO-based unlearning algorithms like DPO, NPO, or the variants of NPO. They have two key designs: 1. the elastic reward setting, which uses a reference value that interpolates between the uniform distribution and the original model's output. This is in contrast with the rigid reward setting in prior works that either use a uniform distribution as the reference value or use the original model's prediction probability. 2. They apply a refusal feature ablation (RFA)-based adversarial training procedure during the training process to simulate adversarial removal of the refusal feature in the model's hidden activations and formulate robust unlearning as a max-min optimization problem. The experiments on multiple unlearning benchmarks, such as TOFU, WMDP, and MUSE, show that ERU outperforms PO-based algorithms by a good margin.

Strengths and Weaknesses

Strengths:

  1. The paper evaluates ERU across multiple axes—effectiveness, utility preservation, and robustness. Experimental results are favorable across multiple benchmarks compared to the most widely used methods, such as GA and NPO, or their variants. This is convincing and shows the effectiveness of the designed algorithm.

  2. The combination of the RFA-based adversarial training is novel and improved the robustness of unlearning algorithms. Incorporating RFA into the unlearning loop is a clever way to approximate worst-case adversarial perturbations without costly inner-loop PGD. By randomly ablating learned refusal features in the residual streams with probability p, ERU achieves robustness comparable to latent-space adversarial training but at a lower computational cost.

Weaknesses and Questions:

  1. The novelty of this elastic reward setting: Given that ERU’s elastic reward formulation collapses to the rigid reference (only using uniform distribution or only using reference models) when alpha = 0 or alpha = 1, why should elastic and rigid reward methods be considered fundamentally distinct classes rather than simply two points on the same spectrum? In particular, since every rigid scheme can be recovered by fixing alpha at an extreme, what principled argument or theoretical insight supports treating ERU as a qualitatively new paradigm?

  2. You did some ablation studies about the effects of two main components of ERU: the elastic reward design and the RFA-based adversarial training in the appendix, but this ablation is only on the RWKU dataset for comparing the effectiveness. I wonder how they perform on the other three datasets.

  3. Figure 6 (PS: It does not have a caption) shows the effect of alpha and shows that setting alpha away from zero will improve the performance. It seems that the optimal alpha highly depends on the model (what about across datasets? What are the optimal alphas?). Is there any way to explain this? Do you have any more intuition about what the hyperparameter alpha is and what it controls? And most importantly, in what case would you expect a larger optimal alpha?

Questions

See the Strength and Weakness Section.

Limitations

See the Strength and Weakness Section.

Final Justification

Thanks to the authors for their detailed responses. I read your response and really appreciate it, especially for clarifying your contribution in more detail and providing more ablation studies to show the strength of the method, so I raise my scores. I also read the other reviewers' responses and found them helpful.

Formatting Issues

/

Author Response

Dear reviewer NnGx,

Thank you for your positive evaluation of the novelty of our work and the comprehensiveness of our experiments. We are pleased to address your concerns.

Weaknesses and Questions:

Weakness and Question (1): The novelty of this elastic reward setting [...] why should elastic and rigid reward methods be considered fundamentally distinct classes rather than simply two points on the same spectrum? [...] what principled argument or theoretical insight supports treating ERU as a qualitatively new paradigm?

Thank you for your profound insights into the elastic reward setting in ERU. You have correctly noticed that when α = 0, the elastic reward setting of ERU reduces to the reference-free reward. However, we need to clarify that when α = 1, the elastic reward setting of ERU does not completely collapse into the reference-based reward: the influence of the uniform distribution U(y|x) still exists (Equation 13 of the paper).

Therefore, what we want to emphasize is that the elastic reward setting of ERU is not merely a simple hyperparameter interpolation. Instead, it starts from the limitations of the rigid reward setting (Appendix B) and addresses them by dynamically balancing the complementary advantages of the reference-based and reference-free rewards (Appendix D.2). The elastic reward setting is precisely the new category we define to distinguish it from the rigid setting. ERU turns the rigid reward setting into a continuous, adjustable spectrum in which α controls the relative influence of the reference and reference-free components, a flexibility that was lacking in previous work.


Weakness and Question (2): You did some ablation studies about the effects of two main components of ERU [...] I wonder how they perform on the other three datasets.

Thank you for your insightful observation regarding the scope of our ablation studies. We appreciate the opportunity to address this limitation and provide comprehensive cross-dataset ablation results below:

Ablation results of Unlearning Effectiveness

| Method | RWKU FB↓ | RWKU QA↓ | RWKU AA↓ | MUSE-News VerbMem↓ | MUSE-News KnowMem↓ | MUSE-News PrivLeak | TOFU Forget05-FQ↑ | TOFU Forget10-FQ↑ | WMDP AccBio↓ | WMDP AccCyber↓ | WMDP AccChem↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Original | 51.9 | 46.8 | 57.5 | 58.3 | 63.7 | -99.8 | 3.2e-16 | 2.1e-19 | 63.2 | 42.8 | 52.4 |
| ERU | 29.2 | 27.1 | 25.5 | 10.4 | 9.2 | 12.3 | 0.73 | 0.48 | 24.8 | 28.4 | 27.2 |
| w/o RFAT | 29.6 | 28.4 | 35.8 | 10.2 | 9.3 | 21.7 | 0.71 | 0.49 | 26.5 | 28.3 | 29.8 |
| w/o ERM | 33.4 | 31.2 | 32.5 | 12.3 | 11.1 | 18.4 | 0.64 | 0.44 | 28.2 | 29.4 | 30.5 |

Ablation results of Utility Preservation

| Method | RWKU Rea↑ | RWKU Tru↑ | RWKU Fac↑ | RWKU Flu↑ | MUSE-News KnowMem↑ | MMLU Accuracy↑ | TOFU-Forget05 Probability↑ | TOFU-Forget05 ROUGE↑ | TOFU-Forget10 Probability↑ | TOFU-Forget10 ROUGE↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| Original | 26.9 | 30.4 | 41.5 | 704.2 | 55.2 | 58.5 | 0.99 | 0.98 | 0.99 | 0.98 |
| ERU | 26.2 | 30.5 | 40.5 | 708.8 | 43.2 | 50.6 | 0.59 | 0.56 | 0.74 | 0.53 |
| w/o RFAT | 27.1 | 32.1 | 41.5 | 710.5 | 44.3 | 52.3 | 0.62 | 0.57 | 0.76 | 0.54 |
| w/o ERM | 25.7 | 29.8 | 40.2 | 709.8 | 42.5 | 51.5 | 0.57 | 0.56 | 0.72 | 0.53 |

Ablation results of Unlearning Robustness

| Method | RWKU FB↓ | RWKU QA↓ | RWKU AA↓ | WMDP AccBio↓ | WMDP AccCyber↓ |
|---|---|---|---|---|---|
| Original | 51.9 | 46.8 | 57.5 | 63.2 | 42.8 |
| ERU | 29.2 | 27.1 | 25.5 | 24.8 | 28.4 |
| w/o RFAT | 44.4 | 39.6 | 46.8 | 51.4 | 37.9 |
| w/o ERM | 29.8 | 28.6 | 26.1 | 26.5 | 30.4 |

Analyzing the three tables above, we can see that removing either of the two core components of ERU leads to significant differences across the three dimensions of unlearning (Unlearning Effectiveness, Utility Preservation, Unlearning Robustness). The best method is bolded, and the second-best is highlighted.

Specifically, removing RFAT does not significantly affect unlearning effectiveness and slightly improves utility preservation (Rea: 26.2 → 27.1, KnowMem: 43.2 → 44.3), but it causes a near collapse of robustness (FB: 29.2 → 44.4, AccBio: 24.8 → 51.4). This indicates an inherent tension between adversarial training and model utility, and striking a balance between them is the key to constructing a robust unlearning mechanism.

Removing the elastic reward setting directly weakens unlearning effectiveness (FB: 29.2 → 33.4, QA: 27.1 → 31.2), confirming the core supporting role of this component for unlearning effectiveness. At the same time, because early-stage gradient weight smoothing fails without it (see Appendix D.2 for details), its removal also impairs utility preservation.


Weakness and Question (2.1): Figure 6 (PS: It does not have a caption) shows the effect of alpha and shows that setting alpha away from zero will improve the performance.

Thank you for the reminder. To clarify, Figure 6 illustrates the impact of different values of α (the hyperparameter controlling the influence of the reference model in our elastic reward setting) on unlearning effectiveness across two LLMs: LLaMA-2-7B-Chat and LLaMA-3-8B-Instruct.


Weakness and Question (2.2): It seems that the optimal alpha highly depends on the model (what about across datasets? What are the optimal alpha?) Is there anyway to explain this?

Your observation that the optimal α depends on the model is correct and consistent with our findings. We explain this below.

According to the loss function of NPO (Equation 5 in the paper), as unlearning proceeds in the expected direction, the prediction probability of the current model π_θ for the knowledge to be forgotten keeps decreasing and gradually deviates from the reference model π_ref's prediction probability for this knowledge. This is the basic principle by which the reference-based reward drives unlearning. Correspondingly, the reference-free reward (Equation 12 in the paper) rigidly replaces the "ruler" role of the reference model with the uniform distribution U(y|x).
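For reference, the NPO objective we refer to has the following standard form (as introduced in the NPO paper by Zhang et al., 2024; the notation of Equation 5 in our paper may differ slightly):

$$
\mathcal{L}_{\mathrm{NPO}}(\theta)\;=\;\frac{2}{\beta}\,\mathbb{E}_{(x,y)\sim \mathcal{D}_f}\!\left[\log\!\left(1+\left(\frac{\pi_{\theta}(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}\right)^{\beta}\right)\right],
$$

which decreases as π_θ(y|x) falls below π_ref(y|x) on the forget set, matching the "ruler" behaviour described above.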

The parameter α in our proposed elastic reward setting is precisely the valve that balances and regulates these two "rulers". When α decreases, the elastic reward setting tends towards the reference-free reward; conversely, it tends towards the reference-based reward. Therefore, if users sufficiently trust the reference model and believe it can assign a large enough prediction probability to the forget knowledge during unlearning, it is beneficial to increase α to lean towards the reference-based reward. This is why, in Figure 6(b) of the paper, unlearning effectiveness improves (lower accuracy on the forget set) as α increases. Conversely, when the reference model is weaker (LLaMA-2-7B-Chat vs. LLaMA-3-8B-Instruct), it is necessary to rely less on the reference model and lean more towards the reference-free reward (Figure 6(a) in the paper).

In addition, your concern about the influence of different datasets on the optimal α is also worth delving into. We have supplemented the experiments on multiple benchmarks as follows:

LLaMA-2-7B-Chat

| Setting | RWKU FB↓ | RWKU QA↓ | RWKU AA↓ | TOFU Forget05-FQ↑ | TOFU Forget10-FQ↑ |
|---|---|---|---|---|---|
| α=0.00 | 33.8 | 31.1 | 35.6 | 0.92 | 0.47 |
| α=0.02 | 32.0 | 30.8 | 31.8 | 0.85 | 0.45 |
| α=0.05 | 29.2 | 27.1 | 25.5 | 0.73 | 0.48 |
| α=0.10 | 30.9 | 27.4 | 27.3 | 0.69 | 0.46 |
| α=0.15 | 31.3 | 28.1 | 28.5 | 0.68 | 0.42 |
| α=0.20 | 31.5 | 28.2 | 26.7 | 0.65 | 0.45 |

LLaMA-3-8B-Instruct

| Setting | RWKU FB↓ | RWKU QA↓ | RWKU AA↓ | TOFU Forget05-FQ↑ | TOFU Forget10-FQ↑ |
|---|---|---|---|---|---|
| α=0.00 | 33.1 | 23.4 | 23.1 | 0.62 | 0.49 |
| α=0.02 | 33.4 | 22.6 | 21.6 | 0.59 | 0.48 |
| α=0.05 | 32.2 | 22.8 | 20.9 | 0.68 | 0.50 |
| α=0.10 | 32.5 | 22.4 | 19.2 | 0.73 | 0.49 |
| α=0.15 | 31.8 | 21.6 | 19.4 | 0.78 | 0.52 |
| α=0.20 | 31.2 | 21.1 | 18.5 | 0.84 | 0.56 |

From the tables, we can see that the trends on most datasets are largely consistent with the above analysis. However, there are differences in specific cases (such as LLaMA-2-7B-Chat on RWKU), which may be because the model assigns a high prediction probability to the forget knowledge in this dataset. We therefore conclude that although both the model and the dataset influence the optimal value of α, it essentially depends on how well the model predicts the forget knowledge.


Weakness and Question (2.3): Do you have any more intuition about what the hyperparameter alpha is and what it controls? And the most importantly, in what case would you expect a larger optimal alpha?

As we replied in Weakness and Question (2.2), when the reference model can output a relatively high prediction probability for the forget knowledge, α can be correspondingly increased. Conversely, if the reference model π_ref cannot provide a large enough prediction probability for the forget knowledge, and the ratio between the prediction probabilities of the current model π_θ and the reference model π_ref becomes too large, the ERU loss function (see Equation 16 of the paper) will be difficult to converge, so α should be reduced to strengthen the effect of the uniform distribution U(y|x). In addition, an overly large α causes model utility to drop rapidly in the early stage of unlearning, as in NPO, resulting in over-unlearning. Based on these considerations, we suggest setting α within the small range [0, 0.2], which has been verified in the experiments of the paper.

Concluding remarks.

We would be grateful if you could let us know whether our explanations have addressed your concerns. Please let us know if you have any other questions or concerns.

Comment

Dear reviewer NnGx,

Thanks for taking the time to review our work; we have carefully considered your comments and made every effort to respond to your concerns.

If you have any further questions or require additional clarification, please kindly let us know.

Best regards.

Review (Rating: 3)

This paper proposes Elastic Robust Unlearning. This method combines an "elastic reward" which regularizes the unlearning algorithm by adding a parameter to vary the influence of the reference model, with the technique of refusal feature adversarial training to make the model more robust to relearning and jailbreaking attacks.

Strengths and Weaknesses

Strengths:

  • The method comprehensively covers unlearning effectiveness as well as robustness, incorporating recent techniques to improve performance on both.
  • The evaluation thoroughly covers several benchmarks that are currently state of the art.
  • The proposed method of an "elastic" reward is interesting and intuitive, to interpolate between the reference-based and reference-free setting.

Weaknesses:

  • The results seem generally inconclusive. ERU does not do uniformly better than other methods across benchmarks, and as such it seems to be another point on the forget/retain tradeoff curve rather than pushing the Pareto frontier forward.
  • The elastic reward is interesting but the reasoning for it seems fairly heuristic. NPO itself is a version of regularized gradient ascent, and the elastic reward seems to be an alternative regularizer that incorporates the reference model. The arguments for why this should lead to better unlearning seem to be based on a fairly hand-wavy explanation of training dynamics (Appendix B). As such it's hard to predict when this method will be effective.
  • Some results seem cherry picked: why is ERU regularized not included in Table 1? Why is WMDP evaluated only on bio and cyber, but not chemistry? Why is TOFU not included in Table 2? Additionally, presenting forget and retain in separate tables rather than side by side makes it difficult to determine which methods perform the best.

Questions

  • Why is TOFU not included in the utility analysis in 4.4? It does not seem to appear in the appendix either.
  • The ablation study in table 8 is a useful study but the results are presented in a confusing way. Why is unlearning effectiveness evaluated on a different dataset compared to unlearning robustness? Ideally the ablation would be on a single consistent benchmark.
  • How do the regularized ERU models (in table 2) perform on the forget tasks (table 1)? This is key to understanding the forget/retain tradeoff - otherwise the comparisons are apples and oranges.

Limitations

Yes

Final Justification

I appreciate the authors' response. Due to the concerns listed in my responses, I have maintained my score.

Formatting Issues

n/a

Author Response

Dear reviewer gzkG,

Thank you for a thoughtful and constructive review. We are pleased to hear your positive assessment of the novelty of our work and that you find our "elastic" reward interesting and intuitive. We hope to address your concerns and questions in our response below.

Weaknesses:

Weakness (1) : The results seem generally inconclusive. ERU does not do uniformly better than other methods across benchmarks [...]

Most baseline methods (except AdvNPO) only address the two-dimensional trade-off between unlearning effectiveness and model utility (e.g., GA sacrifices utility for unlearning effectiveness). By jointly optimizing three key dimensions, namely unlearning effectiveness, utility preservation, and unlearning robustness, ERU achieves a coordinated improvement among them and advances the Pareto frontier. ERU has clear advantages in each dimension:

Unlearning Effectiveness: The results in Table 1 of the paper show that ERU significantly outperforms the second-best method on three of the four datasets (RWKU, TOFU, WMDP). Only on the MUSE-News dataset is it inferior to GA; however, it should be noted that GA achieves this advantage at the cost of a significant reduction in model utility.

Utility Preservation: The results in Table 2 of the paper show that the original (unregularized) ERU outperforms the baseline unlearning methods without regularization on most evaluation metrics, and on some metrics it even approaches or exceeds the regularization-enhanced methods. Taking the KnowMem metric as an example, ERU's score (43.2) surpasses GA_KLR (41.8) and NPO_GDR (40.5) and is close to NPO_KLR (46.4). In particular, when ERU also incorporates a regularizer, its utility preservation improves markedly, as shown by the notable gains of ERU_KLR on MMLU and ERU_GDR on MUSE-News.

Unlearning Robustness: As shown in Figures 2 and 3 of the paper, ERU significantly outperforms the second-best method in unlearning robustness. For example, under a relearning attack that fine-tunes on 1,000 retain-set samples, ERU still maintains 83% of its unlearning performance, while the second-best method, AdvNPO, maintains only 72%.

Furthermore, in terms of time efficiency, ERU saves approximately half of the training time compared to the advNPO method, which also falls within the category of adversarial training.

Weakness (2) : The elastic reward is interesting but the reasoning for it seems fairly heuristic. NPO itself is a version of regularized gradient ascent [...]

The elastic reward we propose is not a heuristic design but a principled solution to the limitations of the rigid reward setting (Appendix B): the reference-based reward can lead to instability in early training, while the reference-free reward loses instance-specific signals. We therefore redefine the reference model as a joint reference model, unifying the reference-based and reference-free rewards (Equation 13). Substituting it into the NPO objective yields the EU objective (Appendix D.1). From the EU objective we derive the adaptive smoothing weight through gradient analysis (Equation 37) and theoretically show that it avoids the failure of early-stage gradient weight smoothing (Appendix D.2).

Weakness (3.1) : Some results seem cherry picked: why is ERU regularized not included in Table 1? Why is WMDP evaluated only on bio and cyber, but not chemistry?

Thank you for your valuable suggestions! All experiments follow the established protocols of the RWKU, MUSE, TOFU, and WMDP benchmarks. Since our experimental setup on WMDP followed SimNPO [1], we only covered bio and cyber. To address your concerns, we now supplement the regularized ERU results and the chemistry results on the WMDP benchmark as follows:

| Method | RWKU FB↓ | RWKU QA↓ | RWKU AA↓ | MUSE-News VerbMem↓ | MUSE-News KnowMem↓ | MUSE-News PrivLeak | TOFU Forget05-FQ↑ | TOFU Forget10-FQ↑ | WMDP AccBio↓ | WMDP AccCyber↓ | WMDP AccChem↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Original | 51.9 | 46.8 | 57.5 | 58.3 | 63.7 | -99.8 | 3.2e-16 | 2.1e-19 | 63.2 | 42.8 | 52.4 |
| DPO | 38.9 | 40.7 | 41.5 | 33.2 | 37.2 | 109.6 | 1.2e-4 | 3.5e-7 | 28.9 | 33.5 | 35.2 |
| IDK | 40.5 | 40.6 | 45.4 | 35.6 | 39.1 | 104.3 | 4e-5 | 5e-8 | 29.3 | 34.2 | 34.8 |
| GA | 44.5 | 39.6 | 47.3 | 0.0 | 0.0 | 20.8 | 0.05 | 8.1e-10 | 37.4 | 30.1 | 31.4 |
| GradDiff | 46.4 | 42.2 | 48.6 | 25.9 | 31.0 | 105.3 | 0.09 | 7.9e-3 | 38.6 | 33.5 | 32.7 |
| GA_KLR | 46.8 | 41.4 | 44.3 | 27.4 | 58.6 | -51.6 | 0.11 | 3.4e-5 | 37.9 | 33.2 | 33.5 |
| NPO | 33.6 | 31.3 | 32.8 | 10.8 | 13.4 | 30.4 | 0.66 | 0.19 | 29.6 | 32.7 | 31.8 |
| NPO_GDR | 34.8 | 34.7 | 38.1 | 13.2 | 48.6 | 101.3 | 0.44 | 0.24 | 31.8 | 33.0 | 32.2 |
| NPO_KLR | 37.6 | 34.5 | 38.5 | 16.6 | 38.6 | -56.7 | 0.43 | 0.17 | 32.4 | 32.9 | 32.4 |
| SimNPO | 34.2 | 31.8 | 37.5 | 12.6 | 11.3 | 14.9 | 0.97 | 0.45 | 28.6 | 29.8 | 31.7 |
| AdvNPO | 35.9 | 33.2 | 25.2 | 13.7 | 12.8 | 24.6 | 0.63 | 0.26 | 29.9 | 33.2 | 32.8 |
| ERU | 29.2 | 27.1 | 25.5 | 10.4 | 9.2 | 12.3 | 0.73 | 0.48 | 24.8 | 28.4 | 27.2 |
| ERU_GDR | 29.8 | 28.6 | 25.9 | 11.4 | 10.8 | 14.1 | 0.69 | 0.45 | 25.7 | 29.4 | 28.6 |
| ERU_KLR | 31.4 | 27.9 | 27.5 | 11.6 | 11.2 | 15.3 | 0.67 | 0.42 | 27.5 | 30.2 | 28.2 |

From the table, we can see that although the regularizer slightly hurts ERU's unlearning effectiveness (a trend consistent with the other methods), the regularized variants still maintain good performance. The best method is bolded, and the second-best is highlighted.

Weakness (3.2) : Why is TOFU not included in Table 2?

Thank you for pointing out this oversight. We now supplement the experimental results of the TOFU dataset in the utility preservation dimension as follows:

| Method | RWKU Rea↑ | RWKU Tru↑ | RWKU Fac↑ | RWKU Flu↑ | MUSE-News KnowMem↑ | MMLU Accuracy↑ | TOFU-Forget05 Probability↑ | TOFU-Forget05 ROUGE↑ | TOFU-Forget10 Probability↑ | TOFU-Forget10 ROUGE↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| Original | 26.9 | 30.4 | 41.5 | 704.2 | 55.2 | 58.5 | 0.99 | 0.98 | 0.99 | 0.98 |
| DPO | 26.4 | 25.2 | 32.4 | 710.6 | 32.8 | 46.8 | 0.74 | 0.53 | 0.76 | 0.54 |
| IDK | 26.8 | 27.9 | 36.7 | 712.5 | 37.3 | 50.2 | 0.76 | 0.55 | 0.78 | 0.55 |
| GA | 25.8 | 30.7 | 40.2 | 707.6 | 0.0 | 48.3 | 0.00 | 0.00 | 0.00 | 0.00 |
| GradDiff | 24.8 | 30.4 | 41.1 | 707.5 | 27.3 | 51.2 | 0.49 | 0.42 | 0.57 | 0.48 |
| GA_KLR | 26.2 | 29.8 | 40.6 | 708.3 | 41.8 | 51.8 | 0.48 | 0.44 | 0.53 | 0.49 |
| NPO | 26.2 | 30.5 | 41.1 | 694.6 | 27.5 | 47.6 | 0.51 | 0.47 | 0.46 | 0.44 |
| NPO_GDR | 26.5 | 30.4 | 40.8 | 705.2 | 40.5 | 51.7 | 0.56 | 0.55 | 0.65 | 0.53 |
| NPO_KLR | 26.3 | 31.2 | 40.9 | 703.8 | 46.4 | 50.5 | 0.56 | 0.54 | 0.71 | 0.55 |
| SimNPO | 26.3 | 29.4 | 40.5 | 691.3 | 43.5 | 50.2 | 0.56 | 0.54 | 0.72 | 0.53 |
| AdvNPO | 24.3 | 26.5 | 39.8 | 672.8 | 24.3 | 41.2 | 0.48 | 0.46 | 0.46 | 0.45 |
| ERU | 26.2 | 30.5 | 40.5 | 708.8 | 43.2 | 50.6 | 0.59 | 0.56 | 0.74 | 0.53 |
| ERU_GDR | 26.1 | 30.7 | 41.2 | 707.5 | 47.2 | 52.1 | 0.72 | 0.57 | 0.79 | 0.56 |
| ERU_KLR | 26.6 | 30.8 | 40.9 | 708.6 | 44.2 | 53.4 | 0.74 | 0.57 | 0.78 | 0.55 |

From the table, we can see that the regularizer consistently enhances ERU's model utility, and the regularized variants outperform the second-best methods. This conclusion is consistent with the description in the paper.

Weakness (3.3): Additionally, presenting forget and retain in separate tables rather than side by side makes it difficult to determine which methods perform the best.

Since our original intention was to discuss the performance of the unlearning method along the three dimensions separately, the corresponding experimental tables and figures are also presented separately to avoid confusion. To present the advantages of ERU more clearly, we will include the combined results in the appendix.

Questions:

Question (1): Why is TOFU not included in the utility analysis in 4.4? It does not seem to appear in the appendix either.

Please refer to our response to Weakness (3.2).

Question (2): The ablation study in table 8 is a useful study but the results are presented in a confusing way. [...]

Given that in the discussion of unlearning robustness in Section 4.3 we demonstrated the robustness of each method by evaluating the change in unlearning effectiveness on WMDP, we followed that metric here. We apologize for any confusion this may have caused. To address your concerns, we now supplement the experimental results of unlearning robustness on RWKU as follows:

| Method | RWKU FB↓ (Effectiveness) | RWKU QA↓ (Effectiveness) | RWKU AA↓ (Effectiveness) | RWKU FB↓ (Robustness) | RWKU QA↓ (Robustness) | RWKU AA↓ (Robustness) | MMLU Accuracy↑ (Utility) |
|---|---|---|---|---|---|---|---|
| Original | 51.9 | 46.8 | 57.5 | 51.9 | 46.8 | 57.5 | 58.5 |
| ERU | 29.2 | 27.1 | 25.5 | 29.2 | 27.1 | 25.5 | 50.6 |
| w/o RFAT | 29.6 | 28.4 | 35.8 | 44.4 (15.2↑) | 39.6 (12.5↑) | 46.8 (21.3↑) | 52.3 |
| w/o ERM | 33.4 | 31.2 | 32.5 | 29.8 | 28.6 | 26.1 | 51.5 |

From the table, we can see that after removing refusal feature adversarial training (RFAT), the unlearning robustness of the unlearned model is significantly impaired: after the relearning attack, the forget-set metrics rise sharply relative to ERU (FB: 29.2 → 44.4, QA: 27.1 → 39.6, AA: 25.5 → 46.8), which highlights the importance of RFAT.

Question (3): How do the regularized ERU models (in table 2) perform on the forget tasks (table 1)? This is key to understanding the forget/retain tradeoff [...]

Please refer to our responses to Weakness (3.1), which reports the forget-task results of the regularized ERU models, and Weakness (3.3).

Concluding remarks.

We would be grateful if you could let us know whether our explanations have addressed your concerns. Please let us know if you have any other questions or concerns.

References.

[1] Fan , et al. (2024) Simplicity prevails: Rethinking negative preference optimization for llm unlearning.

Comment

Dear reviewer gzkG,

Thanks for taking the time to review our work; we have carefully considered your comments and made every effort to respond to your concerns.

If you have any further questions or require additional clarification, please kindly let us know.

Best regards.

Comment

I appreciate the authors' work to report more extensive evaluation results. As of now, I will prefer to maintain my score.

In particular, while the authors do analyze training dynamics and the impact on model utility, it is still unclear to me why this algorithm achieves the point in the tradeoff space that it does. There are hints toward an understanding in Appendix B and lines 630-633 of the paper but these are still far from a clear explanation of how the improved training dynamics lead to more effective unlearning.

I do however appreciate the empirical effectiveness of the method and note that other reviewers have raised their scores. I am happy to discuss further during the reviewer-AC period and defer to the greater consensus.

Comment

Dear reviewer gzkG,

We sincerely thank the reviewer for their constructive feedback and acknowledgment of ERU’s empirical strengths. We are very willing to clarify how the design of ERU achieves its performance trade-offs.

Why can ERU achieve a good unlearning-utility trade-off

Unlearning methods with the rigid reward setting have demonstrated their unlearning effectiveness [1][2]. We inherit this unlearning effectiveness through the elastic reward setting (Equation 23) while seeking the smallest possible utility loss. The elastic reward setting of ERU specifically alleviates the over-unlearning phenomenon that occurs in the early stage of training (Appendix B), avoids excessive loss of model utility early in unlearning, and does not completely discard the role of the reference model. Through the theoretical derivation of the gradient analysis (Appendix D.2) and the experimental validation of ERU (Table 1 and Table 2), we confirm its ability to achieve a good trade-off.

A clearer understanding of Appendix B and D.2

In Appendix B, we analyze, through gradient analysis (Figure 4), why previous methods suffer from low model utility, and our design strives to avoid the situation in Equation 22:

$$W^{\mathrm{init}}_{\theta}(x, y)=\frac{2\,\pi_{\theta}^{\beta}(y \mid x)}{\pi_{\theta}^{\beta}(y \mid x)+\pi_{\mathrm{ref}}^{\beta}(y \mid x)}\approx 1 \qquad \text{(Equation 22)}$$

Equation 37 in Appendix D.2 shows that the elastic reward setting ensures the gradient weight is no longer approximately 1 in the early training period but instead depends on the response length. The ablation experiment in Appendix F.3 further indicates that removing the elastic reward leads to the failure of the overall trade-off.

We hope this structured explanation can clarify how the design of ERU achieves good performance in the trade-off space. We're happy to further discuss any remaining questions during the discussion period.

[1] Fan , et al. (2024) Simplicity prevails: Rethinking negative preference optimization for llm unlearning.

[2] Zhang , et al. (2024) Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning.

评论

Dear reviewer gzkG,

Thank you for the valuable discussion of our work; we have made every effort to respond to your concerns.

As the discussion period is coming to an end, if you have any further questions or need additional clarification, please let us know.

Best regards,

The authors of Paper 8095

Final Decision

This paper introduces elastic robust unlearning (ERU), with two main contributions: (1) an elastic reward setting that interpolates between reference-based and reference-free rewards, and (2) refusal feature ablation and adversarial training. The problem of robust unlearning in LLMs is timely and important, and the design of ERU is well-motivated. Empirical evaluations are extensive, covering multiple benchmarks, and additional ablations and significance tests provided in the rebuttal further strengthened the case. Several reviewers highlighted the strong empirical results.

On the weakness side, the main contention lies in the novelty and theoretical justification of the elastic reward setting. During the discussion, one reviewer remained unconvinced, viewing it as little more than a heuristic interpolation between existing rigid schemes and finding the analysis insufficiently compelling, leading to a borderline stance. In contrast, other reviewers, while less active in discussion, evaluated the work more positively. The AC considers the combination of the elastic reward with refusal feature ablation to constitute a meaningful contribution overall, although the novelty concerns are valid; e.g., the AC found the following work relevant to robust unlearning: Fan et al., "Towards LLM unlearning resilient to relearning attacks: A sharpness-aware minimization perspective and beyond," arXiv preprint arXiv:2502.05374 (2025).

Balancing these perspectives, I find the empirical strength of ERU outweighs the novelty concern. This work makes a useful contribution to the growing area of machine unlearning in LLMs.