From Dormant to Deleted: Tamper-Resistant Unlearning Through Weight-Space Regularization
We highlight the susceptibility of existing unlearning methods to relearning attacks and analyze the characteristics of robust methods by leveraging the weight-space perspective.
Abstract
Reviews and Discussion
The paper presents an analysis of relearning attacks. In these attacks, a model has been forced to "forget" certain information it has learned, and the goal is to recover the model's lost capability. This can be done with none, or only a subset, of the examples that were removed from the model. The paper shows that in some cases, even with none of the forgotten examples, accuracy can be recovered. The subsequent analysis indicates that weight-space regularization, forcing the unlearned model weights to be far from the pre-trained model, can make the model more resistant to relearning attacks. The paper then proposes a method based on weight-space regularization which performs better than prior art.
Strengths and Weaknesses
Overall I like this paper.
The presented analysis was fascinating: the idea that unlearned example accuracy can be recovered with only an orthogonal set of data is surprising and a very nice finding. That information by itself is extremely useful and the new method adds to the contributions of the paper. According to the analysis presented in the paper, this proposed method is effective at solving some of the issues identified with prior art.
Furthermore, I see the potential for real-world impact. With models being released frequently by companies these days, I can imagine a scenario where a model was trained on sensitive, illegal, or privacy-violating data. The influence of that data could be "scrubbed" in a later release of the model using a forgetting algorithm. However, if that information is easily recoverable by an attacker, there could be serious consequences.
The analysis does have some limitations, however. The biggest issue is with the dataset. CIFAR-10/100 are very simple and may have some unusual properties (see https://arxiv.org/abs/2111.00861 for example). While the paper shows analysis on multiple models, it does not consider data outside of CIFAR. The analysis needs to include a more complex dataset to be convincing. It was also not clear to me how many training iterations were being used for each model; I think it may be important to ensure that the models which were trained from scratch used the same number of total iterations as the models which were fine-tuned on the retain set. Lastly, and this is touched on in the introduction, the models here are very simple and may not be representative of modern LLM behavior, confounding as LLMs are to study.
So my final thought is that there is definitely something good here but it may be too early to accept it as is without further analysis.
Questions
- Evaluate on more datasets (like ImageNet) or justify why that wouldn't change the conclusion (may require another round of submission)
- Are the training iterations always fixed?
- How can we extend this to modern models and tasks?
Limitations
yes
Justification for Final Rating
After discussing this paper with the authors and reading their rebuttals and the other reviews, I am increasing my rating to accept. I think the authors did a good job of clarifying the intent to study relearning in a controlled environment, and I still find the analysis that models are able to re-learn with none of the original data to be a significant finding. While I stand by my primary issue (that the results are not directly applicable to real-world models), I am now convinced that this is not a major issue since the analysis itself is still useful. I think this paper could be the starting point of an expansion of the authors' definition of the re-learning problem in follow-up work, which will eventually lead us to an understanding of relearning in modern multi-modal LLMs. Since the potential future impact is high and the findings are interesting (and there are no methodological flaws that I could find), I think this paper should be published at NeurIPS 2025.
Formatting Issues
- The "retrain" terminology was a little confusing: "retrain from scratch" vs. "fine-tuning" various models. Could "retrain" be referred to explicitly as "trained-from-scratch" or "trained-only-on-retain" (or something else)?
- In the Fig. 2 caption, I recommend replacing "abscissa" and "ordinate" with "x-" and "y-axis". This is a critical figure for the paper and I think modern readers may not be as familiar with the terminology (this would also be consistent with the other figures)
We are very thankful to the reviewer for their valuable comments. We are happy that the reviewer found our analysis fascinating, and we are looking forward to discussing more with the reviewer in light of our rebuttal and new experiments. We address the concerns and answer the questions raised by the reviewer below.
Evaluate on more datasets (like ImageNet) or justify why that wouldn’t change the conclusion (may require another round of submission)
Indeed, we focused only on simple vision datasets with smaller-scale architectures (mainly motivated by prior work which focused on CIFAR-10 and CIFAR-100 as the canonical datasets for unlearning in vision classifiers). Using these smaller-scale datasets allowed us to answer some fundamental questions about relearning attacks: why do they occur and how can we address them. This is a different direction that complements work on relearning attacks carried out on LLMs (where the possible exploration is limited due to the sheer cost associated with such experiments, and where the complex setup makes it difficult to draw clear conclusions).
However, we fully agree with the reviewer that showing results on only those specific datasets was limiting. Based on this suggestion, we present additional results here on the well-known Imagenette dataset [1] (a subset of ImageNet) to showcase the generality of our findings on more complex and higher-resolution images. Similar to other results in the main paper, we focused on using atypical examples for the forget set. The results for three different unlearning methods are presented below (note that tuning hyperparameters for all baselines is difficult in such a tight timeline; hence, we refrain from reporting sub-optimal numbers for other baselines – we will add all results in the next revision of the paper).
| Model | Label | Test Acc (%) | Forget Set Acc (%) |
|---|---|---|---|
| unlearned | Retrain from scratch | 87.34 | 67.53 |
| unlearned | SCRUB | 85.81 | 85.71 |
| unlearned | Weight Distortion | 83.16 | 74.03 |
| unlearned | Weight Dist Reg | 83.72 | 84.42 |
| --- | --- | --- | --- |
| 0 relearn | Retrain from scratch | 86.40 | 63.53 |
| 0 relearn | SCRUB | 86.47 | 93.40 |
| 0 relearn | Weight Distortion | 84.71 | 78.86 |
| 0 relearn | Weight Dist Reg | 84.31 | 73.87 |
| --- | --- | --- | --- |
| 10 relearn | Retrain from scratch | 86.73 | 64.16 |
| 10 relearn | SCRUB | 86.65 | 94.21 |
| 10 relearn | Weight Distortion | 84.62 | 81.09 |
| 10 relearn | Weight Dist Reg | 84.83 | 75.77 |
We see that the forget set accuracy of SCRUB shoots up upon relearning, even with 0 relearning examples, in line with our findings on CIFAR. On the other hand, variants of our tamper-resistant framework, weight distortion, and weight dist reg, similar to our previous results, are more resistant against these relearning attacks. These evaluations highlight that our results are general and hold across different datasets.
[1] https://github.com/fastai/imagenette
Are the training iterations always fixed?
Thanks for checking this. Yes, we fixed the number of training iterations in order to discount any effects of (un)learning efficiency for different methods. Hence, we assigned a large pretraining, unlearning, and relearning budget. In our response to reviewer JQcR, we show that some methods are more computationally efficient and can get away with a significantly smaller training budget.
How can we extend this to modern models and tasks?
Susceptibility to relearning has already been analyzed for language models as we list extensively in the paper. The fact that relearning is observed across these different models is a clear sign that it is a more general phenomenon, beyond a single model/task.
The focus of our paper is to design a controlled study to gain a deeper understanding of causes and remedies for vulnerability to relearning. Based on this, we have identified weight space distances as an important factor affecting the robustness of different unlearning algorithms and proposed different unlearning objectives to directly address this.
We consider it exciting future work to experiment with our unlearning objectives in LLMs too for increasing their robustness. But there is no doubt, based on the existing literature, that this phenomenon does also occur in LLMs. Hence, we believe that these results should directly extend to modern models and tasks.
The "retrain" terminology was a little confusing: "retrain from scratch" vs. "fine-tuning" various models.
Regarding terminology, thanks for raising this potentially confusing issue. We reserve the term “retrain from scratch” to refer to the ideal unlearning method: training from scratch on only the retain set (guaranteeing by construction that the unlearned model is not influenced by the forget set). This is denoted as the black star in our plots and is the reference point for how high or low we expect the forget set accuracy to be under ideal unlearning. This is an important reference point for studying relearning attacks: if the forget set accuracy of even this ideal reference point (black star) shoots up upon relearning attacks, then a higher forget set accuracy upon relearning is not due to imperfect unlearning in the first place. What we observe in our plots though, is that the forget set accuracy of various approximate unlearning algorithms shoots up much higher than the black star upon relearning attacks, allowing us to safely conclude that their susceptibility to relearning is due to not having fully erased the knowledge of the forget set during their unlearning phase. On the other hand, we use the term “finetuning” to refer to further post processing of unlearned (or attempted-to-be-unlearned) models.
We attempted to lay out the terminology clearly in Section 3 but we agree with the suggestion of ensuring we always refer to retrain-from-scratch as such, rather than shortening to “retrain”, to ensure clarity. Thank you for the suggestion.
In the Fig 2. captions i recommend replacing "abscissa" and "ordinate" with "x-" and "y-axis".
Thank you for the very useful suggestion. We will update the Fig. 2 caption to make it consistent with all other figures.
—
Overall, we really appreciate the points raised by the reviewer. We also feel that the addition of a new dataset is very valuable for highlighting the generality of our results, particularly on high-resolution images, and that the clarifications and feedback about the writing will improve the presentation of our work. We believe that we have addressed all concerns and questions raised, and we are looking forward to hearing back from the reviewer.
I took a look at the other reviews and author responses and I really like the new additions that are provided here
First I want to address reviewer JnT6.
There were some really good points about prior work and the positioning of this paper, but I think the responses are quite convincing. In particular, I think the important contribution of this paper is in discarding many of the confounding elements of LLM unlearning analysis and designing a completely controlled experiment. While it's true that unlearning in the context of LLMs is the most important application, I am convinced by the results presented both in the paper and in the responses here that various metrics, and indeed the specific LLM task, are likely masking a more severe unlearning effect. I want to specifically highlight the Imagenette results in this response as well as the MIA results in the JnT6 response as examples of this. In the former case, it is clear that the effect persists in more complex scenarios, and in the latter case, where MIA would suggest no unlearning, there is clearly unlearning happening. What I would look for in future work on this is some way to design experiments that test the effect in a similarly isolated way on LLMs and larger/multimodal datasets.
Next I think the efficiency point from JQcR was also important and the new table is partially addressing the concern there. What I think I am seeing there (please feel free to clarify) is what the forget set accuracy is after some number of unlearning epochs. Does this also include the time taken to retrain on the retain set and what the final forget set accuracy is?
We really appreciate and are thankful to the reviewer for taking the time to go through not just our responses to their own review, but to other reviewers too. We really appreciate the thoughtful discussion. We certainly believe that adding those results during rebuttal made our case significantly stronger, and we thank the reviewer again for nudging us to expand our experiments.
What I think I am seeing there (please feel free to clarify) is what the forget set accuracy is after some number of unlearning epochs. Does this also include the time taken to retrain on the retain set and what the final forget set accuracy is?
Yes, the table shows the accuracy of the unlearned model after some number of unlearning epochs (in the rows with model=’unlearned’) and the final accuracy after running the relearning attack on each of those unlearned models (in the corresponding rows with model=’0 relearn’). The number of epochs for the relearning attack (i.e. finetuning the unlearned model on the retain set) is always fixed to 100 (for this table and all results in our paper).
So, for example, in the row “unlearned; catastrophic_forgetting; 3”, we show the forget and test accuracies of the unlearned model obtained by 3 epochs of catastrophic forgetting. In the corresponding row “0 relearn; catastrophic_forgetting; 3”, we show the forget and test accuracies of the model obtained by running the relearning attack on top of the unlearned model obtained by 3 epochs of catastrophic forgetting. We hope this clarifies and we are happy to discuss more.
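To make the attack protocol fully explicit, below is a minimal sketch of the relearning attack for the 0-relearning-examples case: fine-tune a copy of the unlearned model on the retain set only and measure the forget set accuracy afterwards. The optimizer, learning rate, and the helper `accuracy_fn` are illustrative assumptions; only the retain-set-only fine-tuning for a fixed number of epochs reflects the protocol described above.

```python
import copy
import torch

def relearning_attack(unlearned_model, retain_loader, forget_loader, accuracy_fn,
                      epochs=100, lr=0.01):
    """Sketch of the relearning attack: fine-tune the unlearned model on the
    retain set only (no forget examples) and report forget-set accuracy."""
    model = copy.deepcopy(unlearned_model)
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    ce = torch.nn.CrossEntropyLoss()

    for _ in range(epochs):
        model.train()
        for x, y in retain_loader:
            opt.zero_grad()
            ce(model(x), y).backward()
            opt.step()

    # accuracy_fn is an assumed helper that evaluates the model on a loader
    return accuracy_fn(model, forget_loader)
```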
These results complement our initial results in interesting ways. For instance, we see that “catastrophic forgetting” requires a large number of epochs to even seemingly unlearn (forget set accuracy is high for small numbers of epochs), and subsequently, to be robust to relearning attacks. On the other hand, for weight distortion, we see that even after just a single unlearning epoch, the forget set accuracy gets to about 50%. Furthermore, even with just a single epoch, this is not just providing an illusion of unlearning, but rather, truly unlearning as indicated by the accuracy after relearning attack (model=’0 relearn’). The fixed large unlearning budget we chose for our initial experiments was to ensure that methods that are slow in unlearning are not penalized, since our goal here was to study robustness. But we will add these experiments and discussion to the paper as they reveal additional nuances.
We are happy to discuss further. Once more, we are grateful for the reviewer’s time, support and thoughtful discussions.
OK thanks for the clarification on that table, I think the explanation makes a lot of sense and I agree that the new results are quite interesting.
I definitely think the efficiency angle pointed out by JQcR is important since (paraphrasing the discussion) it shows not only where prior unlearning methods were ineffective but that unlearning is worth exploring since it can be more efficient than from-scratch training.
So to conclude the discussion I like the expanded results that came out of the review process and I appreciated the additional explanations of the experimental setup particularly as part of the response to JnT6. Pending reviewer discussion I plan to raise my rating to vote for accepting the paper.
We are thankful to the reviewer for their constructive feedback during both the review and discussion periods. We are very pleased to hear that the reviewer found our additional experiments useful and appreciate their support for the acceptance of our work.
This paper argues that existing studies on relearning attacks have been limited to large language models (LLMs), and that a more precise analysis requires investigation in well-defined settings such as vision classification models. In response, the paper presents the first analysis of relearning attacks targeting vision classification models that have undergone unlearning. Specifically, it shows that when a model is unlearned using existing methods on atypical instances, its performance on the forget set can be substantially recovered by fine-tuning the model using only the retain set. The authors analyze the cause of this phenomenon from a weight-space perspective and, based on this insight, propose new methods to enhance the tamper-resistance.
Strengths and Weaknesses
Strengths
- This paper is the first to formally define the problem of relearning attacks in the context of vision classification tasks, and the motivation for addressing relearning attacks in this setting is generally clear.
- The problem definition and experimental setup for evaluating relearning attacks are well-designed, with appropriate methodologies and baselines selected for comparison.
- The empirical finding that the performance on the forget set can be recovered solely by fine-tuning the unlearned model on the retain set is particularly intriguing and provides a novel insight into the limitations of existing unlearning methods.
Weaknesses
This paper is generally strong in its problem formulation and experimental design. It also makes a novel and interesting empirical observation: that performance on the forget set, which includes atypical instances, can be recovered solely by fine-tuning on the retain set. However, there are several important issues that should be addressed:
- Definition of atypical Instances: The paper lacks a clear explanation of how typical and atypical instances are distinguished. While it mentions that the score from [1] is used, it does not specify the exact threshold applied to categorize samples discretely. Moreover, it is unclear how many images are considered atypical and what proportion they represent out of the total. Providing such information would help readers better understand the experimental setup.
[1] Ziheng Jiang, Chiyuan Zhang, Kunal Talwar, and Michael C Mozer. Characterizing structural regularities of labeled data in overparameterized models. arXiv preprint arXiv:2002.03206, 2020.
- Analysis of Weight Dist Reg: The proposed method, Weight Dist Reg, is reasonably derived from the analysis of weight distances and does show improved tamper-resistance. However, it exhibits noticeably lower test accuracy compared to other methods. This suggests that the model’s weights change substantially, leading to performance degradation on the retain set. The authors should discuss why this trade-off occurs and propose directions to address this issue in future work.
- Figure Readability and Alignment: The overall readability of the figures needs improvement. Specific issues include:
- In Figure 1, the legend at the top assigns circular markers to all unlearning methods, but the plot legend only labels "Relearned," making the meaning of the markers ambiguous.
- In Figure 2, it is not clear what differentiates the top and bottom rows. The main text only describes the bottom plot, leaving the top plot unexplained.
- Figures 2, 3, 4, and 5 appear far from the text where they are discussed. For example, Figure 2 appears on page 5 but is not mentioned until page 7, which disrupts the flow of reading.
While these limitations do not undermine the main contribution of the paper (i.e., the discovery of the vulnerability in existing unlearning methods), they are important issues that require revision. Taking all of this into consideration, I give this paper a borderline accept recommendation.
Questions
While reading this paper, I had the following questions, which are closely related to the weaknesses already mentioned:
- What exactly is the criterion used to define atypical instances? If this criterion were changed (e.g., by modifying the threshold that separates typical and atypical samples), would the analysis and conclusions presented in the paper still hold?
- What is the exact reason behind the drop in test accuracy observed with Weight Dist Reg? Additionally, is there a potential way to improve tamper-resistance without sacrificing test accuracy?
Limitations
yes
Justification for Final Rating
The authors provided clear clarifications on the definition of atypical instances, the trade-off in Weight Dist Reg, and figure readability. Their response adequately addressed my concerns, so I am satisfied with the current version and will maintain my borderline accept recommendation.
Formatting Issues
There is no concern regarding the formatting of the paper.
We are very thankful to the reviewer for their valuable and encouraging comments. We address the concerns and answer the questions raised by the reviewer below.
Definition of atypical Instances
Thank you for highlighting this. The details about the selection of typical and atypical examples are important for understanding the overall results; we will make sure to add them to the paper comprehensively.
We used the consistency scores from Jiang et al. (2020), which provide a notion of an example's typicality. For the forget set, we selected the instances with the highest (typical) or lowest (atypical) consistency scores, while using all the remaining examples as the retain set. Hence, we didn't need a particular threshold; rather, we sorted the examples by their scores and took the top-k/bottom-k as the typical/atypical examples. We picked these instances from the entire dataset for class-agnostic unlearning, and from only the target class for class-specific unlearning. We will update the manuscript to highlight all these details for curious readers.
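For concreteness, here is a minimal sketch of this selection procedure, assuming precomputed per-example consistency scores and labels as NumPy arrays (the function and argument names are illustrative, not taken from our codebase):

```python
import numpy as np

def select_forget_set(c_scores, labels, k, mode="atypical", target_class=None):
    """Pick forget-set indices by sorting examples on their consistency scores.

    c_scores:      per-example consistency (typicality) scores, shape (N,)
    labels:        per-example class labels, shape (N,)
    k:             number of forget examples to select
    mode:          "typical" -> highest scores, "atypical" -> lowest scores
    target_class:  if given, restrict selection to that class (class-specific
                   unlearning); otherwise select across all classes
                   (class-agnostic unlearning)
    """
    candidates = np.arange(len(c_scores))
    if target_class is not None:
        candidates = candidates[labels == target_class]

    order = np.argsort(c_scores[candidates])          # ascending: most atypical first
    chosen = order[:k] if mode == "atypical" else order[-k:]
    forget_idx = candidates[chosen]

    retain_idx = np.setdiff1d(np.arange(len(c_scores)), forget_idx)
    return forget_idx, retain_idx
```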
Analysis of weight dist reg, drop in test accuracy and ideas for future work to improve it.
This is a great point, thank you for raising it.
As we note in the discussion section, and as the reviewer pointed out, our results reveal an interesting trade-off between robustness to relearning attacks and test accuracy. It is natural to expect that methods that add noise to the weights (in an attempt to push the unlearned model further away from the pretrained model) destroy useful information in the model that is valuable for generalization. In general, the pretrained model has good generalization capabilities (strong accuracy on the test set); therefore, moving away from it risks losing some of those capabilities in favor of increasing robustness to relearning. This is related to a fundamental trade-off in unlearning: removing unwanted information while maintaining permissible knowledge and good model utility (e.g., accuracy on the retain and test sets). We will discuss this further in the updated paper.
Moreover, the reviewer’s comment prompted us to think about actionable paths for addressing this and obtaining a better trade-off between generalization/utility and robustness. We outline some ideas here that future work can pursue.
Rather than adding noise to all parameters of the model, we can consider doing this selectively on only parameters that are determined to be disproportionately important for the forget set compared to the retain set (using criticality criteria, such as the one proposed in the SSD unlearning method [1] – such techniques are essential when considering extremely large models such as language models). More broadly, we can try to find a better balance between retaining desired knowledge of the pretrained model while moving away from it more selectively. This can be operationalized by freezing a subset of (cleverly chosen) parameters during the unlearning phase.
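A minimal sketch of this selective-distortion idea is shown below. The squared-gradient importance measure, the ratio threshold, and all names are illustrative assumptions for discussion, not the SSD criterion [1] or the method used in our paper:

```python
import torch

def selective_weight_distortion(model, forget_loader, retain_loader, loss_fn,
                                noise_std=0.05, ratio_threshold=2.0):
    """Sketch: perturb only parameters whose gradient-based importance for the
    forget set clearly dominates their importance for the retain set."""

    def importance(loader):
        # Accumulate squared gradients as a simple per-parameter importance proxy.
        imp = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
        for x, y in loader:
            model.zero_grad()
            loss_fn(model(x), y).backward()
            for n, p in model.named_parameters():
                if p.grad is not None:
                    imp[n] += p.grad.detach() ** 2
        return imp

    imp_forget = importance(forget_loader)
    imp_retain = importance(retain_loader)

    with torch.no_grad():
        for n, p in model.named_parameters():
            # Perturb only where the parameter matters much more for the forget set.
            mask = imp_forget[n] > ratio_threshold * (imp_retain[n] + 1e-12)
            p.add_(mask * noise_std * torch.randn_like(p))
```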
We would love to hear the reviewer’s thoughts on this and keep discussing.
[1] Foster, J., Schoepf, S. and Brintrup, A., 2024, March. Fast machine unlearning without retraining through selective synaptic dampening. In Proceedings of the AAAI conference on artificial intelligence (Vol. 38, No. 11, pp. 12043-12051).
Figure readability and alignment.
Thank you for these suggestions which we have adopted in the paper. Regarding Fig 2, we do actually describe top vs bottom. The caption says: “The forget set is comprised of atypical examples (from the ‘airplane’ class i.e., sub-class unlearning for the top row and all classes i.e., class-agnostic unlearning in the bottom row”. We will however work on improving readability, e.g. by writing “top” and “bottom” in bold to make it easier to spot. If the reviewer has other suggestions, we would be happy to apply them.
—
Overall, we believe that we have addressed all comments and questions brought up by the reviewer and we’re curious if any other concerns remain. If the reviewer is satisfied, we would be really grateful if the reviewer can show stronger support for acceptance by increasing their score.
Thank you for the detailed response. I am satisfied with your clarifications and will keep my initial score.
We are glad to hear that the reviewer is satisfied with the clarifications. We would appreciate if the reviewer could let us know any remaining concerns that they have so that we have an opportunity to address them.
This paper proposes a defense method against the relearning/tampering attack in machine unlearning. The authors find that keeping the model farther away from its original weights makes unlearning more robust. They introduce a weight-space regularization technique that pushes the model away in weight space, improving resistance to relearning on vision benchmarks.
Strengths and Weaknesses
Strengths
- The motivation and methodology are written clearly and are easy to follow.
- The proposed weight-space regularization method is intuitive and empirically effective.
Weaknesses
- The method is evaluated only on vision datasets (CIFAR datasets and ResNet architectures); it remains unclear how well it generalizes to other domains or larger models.
- The unlearning problem can be unequivocally solved by training from scratch, so the computational overhead introduced by the regularizer is especially important. If the cost of unlearning exceeds that of training from scratch, the proposed approach may lose practical value.
Questions
- The paper reports that the proposed unlearning method takes 2.5 hours, while training from scratch takes about 3 hours. Given that training from scratch guarantees complete forgetting, could the authors clarify why the proposed method would be preferable in practice when its computational cost is comparable?
- Do the authors have any insights or preliminary results on how well this method would generalize to other domains?
Limitations
yes
Justification for Final Rating
Based on the authors' response during the rebuttal phase, I have revised my assessment of the paper's significance, raising my score from 2 to 3. I now believe this paper will bring moderate impact to the machine unlearning sub-area of AI research, addressing meaningful challenges in this emerging field.
Formatting Issues
No major formatting issues noticed.
We are very thankful to the reviewer for their valuable and encouraging comments. We address the concerns raised by the reviewer below.
The method is evaluated only on vision datasets (CIFAR datasets and ResNet architectures)
That’s a great remark. We indeed leveraged simple vision datasets with smaller-scale architectures for our exploration. This was by design, to allow us to answer some fundamental questions about relearning attacks: why do they occur and how can we address them. This is a different direction that complements work on relearning attacks that were carried out in LLMs.
However, we agree with the reviewer that presenting results on an additional, larger dataset would more convincingly showcase the generality of our findings. Based on this suggestion, we present additional results here on the well-known Imagenette dataset [1] (a subset of ImageNet). Similar to other results in the main paper, we focused on using atypical examples for the forget set. The results for three different unlearning methods are presented below (note that tuning hyperparameters for all baselines is difficult in such a tight timeline; hence, we refrain from reporting sub-optimal numbers for other baselines – we will add all results in the next revision of the paper).
| Model | Label | Test Acc (%) | Forget Set Acc (%) |
|---|---|---|---|
| unlearned | Retrain from scratch | 87.34 | 67.53 |
| unlearned | SCRUB | 85.81 | 85.71 |
| unlearned | Weight Distortion | 83.16 | 74.03 |
| unlearned | Weight Dist Reg | 83.72 | 84.42 |
| --- | --- | --- | --- |
| 0 relearn | Retrain from scratch | 86.40 | 63.53 |
| 0 relearn | SCRUB | 86.47 | 93.40 |
| 0 relearn | Weight Distortion | 84.71 | 78.86 |
| 0 relearn | Weight Dist Reg | 84.31 | 73.87 |
| --- | --- | --- | --- |
| 10 relearn | Retrain from scratch | 86.73 | 64.16 |
| 10 relearn | SCRUB | 86.65 | 94.21 |
| 10 relearn | Weight Distortion | 84.62 | 81.09 |
| 10 relearn | Weight Dist Reg | 84.83 | 75.77 |
We see that the forget set accuracy of SCRUB shoots up upon relearning, even with 0 relearning examples, and further with an increasing number of relearning examples, in line with our findings on CIFAR. On the other hand, variants of our tamper-resistant framework (weight distortion and weight dist reg) are, similar to our previous results, more resistant against these relearning attacks.
[1] https://github.com/fastai/imagenette
The unlearning problem can be unequivocally solved by training from scratch, so the computational overhead introduced by the regularizer is especially important. If the cost of unlearning exceeds that of training from scratch, the proposed approach may lose practical value.
Thank you for raising this. We would like to highlight that our regularizer isn't more expensive than other methods commonly employed for approximate unlearning (which typically rely on two different forward passes through the network), while being significantly more efficient at the unlearning task itself: see our response below, where we can unlearn effectively with just a single epoch, compared to catastrophic forgetting, which benefits significantly from a larger number of unlearning epochs.
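To illustrate why the regularizer adds little overhead, here is a minimal sketch of a weight-distance-regularized unlearning step, under the assumption that the objective combines a retain-set loss with a term that pushes the weights away from the pretrained model; the exact objective, the coefficient `lam`, and all names are illustrative rather than our precise formulation:

```python
import torch

def unlearning_step(model, pretrained_params, retain_batch, loss_fn, optimizer, lam=0.1):
    """One sketched update: fit the retain set while pushing the weights away
    from the pretrained model. The distance term needs no extra forward pass."""
    x, y = retain_batch
    retain_loss = loss_fn(model(x), y)

    # Squared L2 distance to the (frozen) pretrained weights.
    dist = sum(((p - p0) ** 2).sum()
               for p, p0 in zip(model.parameters(), pretrained_params))

    # Subtracting the distance encourages moving away from the pretrained
    # weights; in practice this term would be capped or annealed.
    loss = retain_loss - lam * dist

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```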
The paper reports that the proposed unlearning method takes 2.5 hours, while training from scratch takes about 3 hours. Given that training from scratch guarantees complete forgetting, could the authors clarify why the proposed method would be preferable in practice when its computational cost is comparable?
That’s a great question, and an important observation. We gave all methods a very large training budget in order to ensure that they have the best chance of succeeding at unlearning. However, note that not all methods are equally effective at the task. To make this concrete, we present below the unlearning/relearning performance of the model for different numbers of unlearning epochs. For catastrophic forgetting, we see that the forget set accuracy goes down with an increasing number of unlearning epochs, as the method relies only on weight decay for unlearning. For weight distortion, on the other hand, we see that the forget set accuracy starts quite low even after a single epoch, and doesn't benefit much from even 100 unlearning epochs (which is the consistent budget we used in all our previous experiments for comparison). We also report the accuracy after the relearning attack (using 0 relearning examples; rows with larger numbers of relearning examples were removed from the table as the results were very similar) in order to establish that the unlearning in this case was truly successful. This highlights that our methods are particularly efficient at unlearning. The large unlearning budget we chose (which, as the reviewer correctly pointed out, is infeasible in practice since the cost approaches that of retraining from scratch) ensures that methods that are slow at unlearning are not penalized. Analyzing the efficiency of unlearning is an interesting question in its own right, and worth investigating in more detail separately.
| Model | Method | Unlearning Epochs | Test Acc (%) | Forget Set Acc (%) |
|---|---|---|---|---|
| unlearned | catastrophic_forgetting | 1 | 93.79 | 100.00 |
| unlearned | catastrophic_forgetting | 3 | 93.42 | 100.00 |
| unlearned | catastrophic_forgetting | 10 | 92.52 | 94.50 |
| unlearned | catastrophic_forgetting | 30 | 91.68 | 77.75 |
| unlearned | catastrophic_forgetting | 50 | 90.25 | 60.00 |
| unlearned | catastrophic_forgetting | 100 | 88.23 | 41.75 |
| unlearned | weight_distortion | 1 | 84.49 | 43.50 |
| unlearned | weight_distortion | 3 | 86.21 | 49.00 |
| unlearned | weight_distortion | 10 | 86.27 | 52.75 |
| unlearned | weight_distortion | 30 | 86.29 | 53.25 |
| unlearned | weight_distortion | 50 | 86.31 | 50.50 |
| unlearned | weight_distortion | 100 | 85.97 | 45.75 |
| 0 relearn | catastrophic_forgetting | 1 | 93.71 | 99.99 |
| 0 relearn | catastrophic_forgetting | 3 | 93.65 | 99.99 |
| 0 relearn | catastrophic_forgetting | 10 | 93.37 | 99.28 |
| 0 relearn | catastrophic_forgetting | 30 | 93.22 | 93.63 |
| 0 relearn | catastrophic_forgetting | 50 | 93.08 | 82.16 |
| 0 relearn | catastrophic_forgetting | 100 | 92.80 | 66.14 |
| 0 relearn | weight_distortion | 1 | 90.14 | 61.07 |
| 0 relearn | weight_distortion | 3 | 90.14 | 61.66 |
| 0 relearn | weight_distortion | 10 | 89.86 | 61.38 |
| 0 relearn | weight_distortion | 30 | 89.62 | 60.69 |
| 0 relearn | weight_distortion | 50 | 89.44 | 59.64 |
| 0 relearn | weight_distortion | 100 | 89.64 | 58.59 |
Do the authors have any insights or preliminary results on how well this method would generalize to other domains?
Susceptibility to relearning has already been analyzed for language models as we list extensively in the paper. The fact that relearning is observed across these different models is a clear sign that it is a more general phenomenon, beyond a single modality/domain.
The focus of our paper is to design a controlled study to gain a deeper understanding of causes and remedies for vulnerability to relearning. Based on this, we have identified weight space distances as an important factor affecting the robustness of different unlearning algorithms and proposed different unlearning objectives to directly address this.
We consider it exciting future work to experiment with our unlearning objectives in LLMs too for increasing their robustness. But there is no doubt, based on the existing literature, that this phenomenon does also occur in LLMs, making this an important and impactful direction.
—
Overall, we believe that we have addressed all comments and questions brought up by the reviewer and we’re curious if any other concerns remain. If the reviewer is satisfied, we would be really grateful if the reviewer can show stronger support for acceptance by increasing their score.
Thank you for your responses. After carefully reviewing your rebuttal and the clarifications provided to other reviewers, I find that my questions and concerns have been adequately addressed. I appreciate the additional context and explanations you've provided.
We are grateful to the reviewer for taking the time to go through our responses, including the ones that we wrote to other reviewers. We are also grateful to the reviewer for their constructive comments, which we believe further improved the quality of our work. We are pleased to hear that the reviewers' concerns and questions have been adequately addressed.
The paper studies the relearning attack phenomenon in example-level image classifier unlearning, where fine-tuning an unlearned model causes previously forgotten knowledge to re-emerge, even when the fine-tuning data contains no examples from the forget set. The authors analyze this phenomenon through the lens of L2 distances and linear mode connectivity in weight space, and find that models farther from the original pretrained model tend to be more robust to such attacks. Building on this finding, the paper proposes novel unlearning methods that involve either randomly distorting the model weights or directly maximizing the weight-wise distance from the original model. These methods are shown empirically to better resist relearning attacks.
Strengths and Weaknesses
Strengths
- [S1] Interesting direction. The finding that fine-tuning an unlearned model on seemingly benign data can revive forgotten knowledge is an important concern in unlearning, given how easy such attacks are to perform. Developing methods that are resistant to this type of attack is of growing interest to the unlearning research community.
- [S2] Strong analyses. The empirical studies involving L2 distance and linear mode connectivity clearly support the paper’s core claims, and could become practical tools for evaluating post-unlearning models in the future.
Weaknesses
- [W1] Narrow novelty and scope. As mentioned in Lines 38-40, the vulnerability of unlearned models to relearning attacks has already been studied in the context of LLM unlearning. While the authors justify a focus on image classification by mentioning the difficulty of "drawing a clear line between forbidden and permissible knowledge" (Lines 45-46), this justification is not very compelling. In fact, existing LLM unlearning benchmarks do provide well-defined forget/retain splits and interpretable performance metrics [A, B, C]. Given that unlearning is arguably more pressing for LLMs and diffusion models, due to their generative nature and the high cost of pretraining, the contribution here feels somewhat narrow. Prior work has already shown that unlearned knowledge in LLMs can re-emerge under quantization, and that robustness can be improved by increasing weight distance [D].
- [W2] Evaluation limited to classification accuracy. Another major limitation is that all evaluations are based on classification accuracy. The study does not consider Membership Inference Attack (MIA) accuracy, which is a standard and widely accepted metric in the image classifier unlearning literature [E, F]. This makes it difficult to fully assess how well the proposed methods achieve the goal of unlearning.
- [W3] Confusing structure. The structure of the paper, particularly from the experimental section onward, is difficult to follow and would benefit from a major revision. The proposed methods (Weight Distortion, Weight Distance Regularization, and CBFT) are only detailed near the end, even though their results are discussed in earlier sections. Additionally, significant results currently buried in the conclusion should be moved up to the relevant sections. A more effective structure might present the preliminary and analytical findings first, then introduce the proposed methods, followed by the full experimental results. Section 4.1, which over-explains the baselines, could be substantially shortened to make room for a clearer explanation of the authors’ own methods.
[A] TOFU: A Task of Fictitious Unlearning for LLMs. COLM 2024.
[B] The WMDP Benchmark: Measuring and Reducing Malicious Use with Unlearning. ICML 2024.
[C] RWKU: Benchmarking Real-World Knowledge Unlearning for Large Language Models. NeurIPS 2024.
[D] Catastrophic Failure of LLM Unlearning via Quantization. ICLR 2025.
[E] Towards Unbounded Machine Unlearning. NeurIPS 2023.
[F] SalUn: Empowering Machine Unlearning via Gradient-based Weight Saliency in Both Image Classification and Generation. ICLR 2024.
Questions
- [Q1] I agree with the authors that distinguishing every single method in Figures 1 and 2 may not be necessary, but I still believe each method should be distinguishable for the sake of full readability. That being said, could those figures be revised to clarify the different methods? One suggestion would be to use colors together with different marker shapes for existing vs. new methods?
Limitations
The authors have adequately addressed limitations of their work.
Justification for Final Rating
Thank you, authors, for sharing further experimental results and insights. The results convincingly resolve most of my prior concerns. Though I am still a bit concerned about the paper's applicability, especially since unlearning is much more crucial for larger models trained on larger datasets, I feel the paper's current positioning makes sense and would be of good interest to unlearning researchers. Hence, I am leaning more towards an accept and raise my score.
Formatting Issues
N/A
We are very thankful to the reviewer for their valuable and detailed comments. We address the concerns raised by the reviewer below.
Response to W1 (novelty and scope)
Although recovery of forgotten knowledge has been demonstrated in the LLM unlearning literature, we believe our work is the first to show, under controlled conditions, that fine-tuning on only the retain set can undo unlearning. Furthermore, the strength of this effect is quite striking, yielding almost complete recovery of the forget set in some cases (Figure 1). We’ve cited all relevant works we could find, including many works in the LLM setting, but we still consider our analysis to be novel and complementary to those interesting works.
Specifically, LLMs are not ideal for experimentation for two reasons: (i) computational cost, and (ii) it’s difficult to isolate factors at play, which limits conclusions about the causes and remedies of vulnerability to relearning. To elaborate:
- The goal of unlearning in LLMs is vaguely defined since the model defines a joint probability over the entire sequence (materialized via conditional probabilities of the next token conditioned on all preceding tokens). This means paraphrasing of text complicates evaluation since we would like to minimize the joint probability of any arbitrary paraphrasing of the text to be forgotten. In the case of classification, the goal is readily quantified.
- Powerful LLMs can in many cases acquire knowledge “anew” rapidly, even just at in-context learning time, by synthesizing existing knowledge in novel ways. This makes it harder to separate knowledge acquired by relearning from knowledge acquired/synthesized “on the fly”.
- The pretraining → finetuning methodology in LLMs further complicates matters. Having pretrained on a vast corpus of text leads to ambiguity about what knowledge was acquired at pretraining time versus at finetuning time, and when the knowledge was acquired can make a difference for unlearning/relearning dynamics, complicating the analysis.
To eliminate the above complicating factors, we study unlearning (and relearning) of a specific set of examples in a model that is trained (from random weights) on a dataset including those examples. In our setup (small models, no prior pretraining phase), we are also able to compute and compare against the oracle unlearning method of retraining from scratch without the forget set, and study relearning in that model as a reference point for how much information can be acquired “anew” about the forget set in a model that was truly never trained on that forget set before.
Our analysis in these controlled conditions is the first to show that, beyond doubt, relearning attacks succeeding are a failure of unlearning. Our analysis also led us to clearly explain this phenomenon from a weight space perspective and propose new methods to address it, both of which are important contributions that also showcase how fruitful it can be to conduct experiments in our setting.
We are grateful to the reviewer for highlighting reference D, which we weren't aware of at the time of submission. We found that our vision models aren't vulnerable to these quantization attacks: the forget set accuracy stays at the same level as in the full-precision unlearned model regardless of the number of bits, until the precision drops to around 8 bits, at which point the model collapses. Thus, the spontaneous-recovery phenomenon we explore is distinct from the recovery obtained via quantization. Due to space constraints, we show a subsample of the quantitative results below:
| Unlearning Method | Bits | Test Acc (%) | Forget Set Acc (%) |
|---|---|---|---|
| Catastrophic Forgetting | 2 | 10.00 | 0.00 |
| Catastrophic Forgetting | 4 | 10.00 | 0.00 |
| Catastrophic Forgetting | 8 | 13.19 | 0.00 |
| Catastrophic Forgetting | 16 | 88.23 | 39.60 |
| Catastrophic Forgetting | Full | 88.23 | 42.60 |
| Weight Dist Reg | 2 | 10.00 | 0.00 |
| Weight Dist Reg | 4 | 10.06 | 0.00 |
| Weight Dist Reg | 8 | 78.75 | 27.40 |
| Weight Dist Reg | 16 | 82.25 | 31.20 |
| Weight Dist Reg | Full | 82.26 | 32.20 |
We have two hypotheses for why they are not effective in our setup: (i) Language models consist of billions of parameters (compared to a few million in our case); hence, the unlearning-induced weight deltas are really small, and quantization easily knocks the weights back to their previous pretrained values, which isn't the case in our simple vision models. (ii) This problem is further exacerbated for models that are trained with quantization in mind: such models can immediately latch onto the previous values, which were regularized to be easily quantizable.
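For reference, here is a minimal sketch of the kind of post-hoc weight quantization such an attack applies; the symmetric per-tensor scheme shown here is an illustrative assumption, not necessarily the exact protocol of reference D or of our evaluation:

```python
import torch

def quantize_weights(model, bits):
    """Sketch: round every weight tensor to a symmetric uniform grid with the
    given bit-width, using one scale per tensor."""
    qmax = 2 ** (bits - 1) - 1
    with torch.no_grad():
        for p in model.parameters():
            scale = p.abs().max() / qmax
            if scale == 0:
                continue
            p.copy_(torch.clamp(torch.round(p / scale), -qmax - 1, qmax) * scale)
    return model
```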
Furthermore, D showed that small weight distances (between the pretrained and unlearned model) are the cause of quantization attacks. We instead show this for relearning attacks for the first time. Given that not all models suffer from quantization attacks (an interesting finding in and of itself), it is important to understand how weight space distances affect each type of attack and this was not previously understood. Finally, the approach taken in D for increasing the distance between the unlearned and pretrained model is to increase the learning rate (in existing unlearning objectives/algorithms). While this is a very valuable first step in the exploration, the authors themselves admit that their framework is “highly sensitive to hyperparameter selection, leading to an unstable unlearned model”. Instead, we propose unlearning objectives that are designed to handle this natively, and we study the trade-offs of these new unlearning algorithms compared to the old ones.
Based on the above, we hope the reviewer agrees that our work is complementary to D. We added these new results and discussion of D to our paper.
Overall: our study of relearning in vision classifiers complements the existing findings in the LLM unlearning literature and is the first study to (i) show beyond doubt that the relearning phenomenon is due to imperfect unlearning (rather than due to other factors that remained entangled in LLM studies); (ii) show that relearning is possible without access to any examples from the forget set; (iii) show that vulnerability to relearning is strongly correlated with weight space distances between the pretrained and unlearned models, complementing D that shows this for quantization but not for relearning attacks (this is important since, as discussed, not all models that suffer from relearning attacks also suffer from quantization attacks); (iv) derive more robust algorithmic principles, going above and beyond prior work in D that tried to increase weight space distance by increasing learning rates, leading to an unstable system that is highly sensitive to hyperparameter selection.
We consider it important future work to translate these advancements to other modalities, architectures, and different sizes of models and datasets. We have added results on the higher-resolution Imagenette dataset, which further validate the generality of our findings; please see our response to reviewer JQcR.
We hope that we have convinced the reviewer about the novelty and significance of our work, but we would be very happy to discuss this more during the upcoming reviewer-author discussion phase.
Response to W2 (MIA attacks)
We are thankful to the reviewer for the useful suggestion regarding MIA attacks. Based on the reviewer’s feedback, we present the results for MIA below. We chose the MIA from Kurmanji et al. (reference E) according to the reviewer’s suggestion. We observe that, as is the case in our initial findings with the forget set accuracy metric, this MIA can “fool us” into thinking that unlearning successfully erased all traces of unlearned data, even in cases where we know that the unlearned model is prone to relearning attacks. We will add these results in the appendix.
| Unlearning Method | MIA acc |
|---|---|
| l1_sparse | 54.2 |
| weight_dropout | 54.8 |
| weight_dist_reg | 48.4 |
| ssd | 53.8 |
| weight_attenuation | 53.6 |
| weight_distortion | 53.2 |
| alt_scrub | 51.6 |
| circuit_breakers | 55.0 |
| alt_gradient_ascent | 54.2 |
| random_relabeling | 94.8 |
| catastrophic_forgetting | 54.2 |
| retrain from scratch | 52.2 |
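For clarity, here is a minimal sketch of a loss-based membership-inference evaluation in the spirit of the above; the balanced split, the logistic-regression attacker, and all names are illustrative assumptions rather than the exact attack of reference E:

```python
import numpy as np
import torch
import torch.nn.functional as F
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def mia_accuracy(model, forget_loader, test_loader):
    """Sketch of a loss-based MIA: can a simple classifier tell forget-set
    examples ("members") apart from held-out test examples ("non-members")
    using only the unlearned model's per-example loss?"""
    def losses(loader):
        out = []
        model.eval()
        with torch.no_grad():
            for x, y in loader:
                out.append(F.cross_entropy(model(x), y, reduction="none"))
        return torch.cat(out).cpu().numpy()

    lf, lt = losses(forget_loader), losses(test_loader)
    n = min(len(lf), len(lt))                      # balance the two classes
    X = np.concatenate([lf[:n], lt[:n]]).reshape(-1, 1)
    y = np.concatenate([np.ones(n), np.zeros(n)])
    return cross_val_score(LogisticRegression(), X, y, cv=5).mean()
```

An accuracy near 50% means the attacker cannot distinguish forget-set examples from held-out data; as noted above, this can happen even for unlearned models that remain vulnerable to relearning.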
Confusing structure
Thanks for raising this. We believe that this structure offers greater compactness compared to presenting “preliminary results” first without our proposed methods, and then repeating those plots later with the addition of more methods. We would be happy to hear any suggestions from the reviewer regarding how this can be more effectively communicated without replication.
Q: use colors together with different marker shapes for existing vs. new methods?
Thank you for the useful suggestion. We will add all the results in a tabular form in the appendix for curious readers who are interested in looking at precise values. Please let us know if this would be helpful.
—
Overall, we really appreciate the points raised by the reviewer as we feel these are important discussions to have, and reflecting these in the paper has strengthened our contribution. We also feel that the addition of the quantization and MIA results is very valuable. We are looking forward to hearing any further thoughts from the reviewer and discussing more.
Dear reviewer,
We are very thankful for your useful comments on our paper. We have attempted to provide important clarifications and arguments about the novelty and scope of our work, which we believe address your primary concern. Based on your comments, we have also added new results on membership-inference attacks, which highlight that susceptibility to relearning attacks cannot be directly predicted from susceptibility to membership inference; hence, relearning attacks are a more powerful tool for the evaluation of unlearning methods. Furthermore, we showed that quantization attacks that work on language models don't directly translate to our setup, whereas relearning attacks succeed in both cases. These results highlight the novel aspects of our work.
We are encouraged that reviewer 22p1 found our responses on these points convincing. We hope these revisions and the arguments provided in our detailed response help to address your concerns. As the discussion period is concluding, we wanted to ensure we had the opportunity to address any further questions you might have.
We would be grateful for your feedback.
This paper revisits the idea of the re-learning attack on unlearned models that was proposed in prior work, mainly in the context of LLMs. The authors instead focus on a more controlled experimental setting in image classification. The reviewers acknowledge that even though the experimental setup is far from the real-world use cases of unlearning, the results here shed light on our understanding of unlearning methods. The authors have also actively engaged in the discussion with reviewers and addressed many of the questions and concerns.