PaperHub

Overall: 5.8/10 · Rejected · 4 reviewers
Ratings: 8, 6, 6, 3 (min 3, max 8, std. dev. 1.8)
Confidence: 3.3 · Correctness: 2.8 · Contribution: 2.5 · Presentation: 3.0

ICLR 2025

Low Compute Unlearning via Sparse Representations

OpenReview · PDF
Submitted: 2024-09-27 · Updated: 2025-02-05
TL;DR

We propose a low compute unlearning solution for neural networks which involves use of a specific type of discrete bottlenecks in the model architecture

Keywords

Sparse Representations · Discrete Bottlenecks · Model Editing · Unlearning

Reviews and Discussion

Review (Rating: 8)

This paper proposes a mechanism, based on prior work on the discrete key-value bottleneck (DKVB), for tackling the unlearning problem: how to adjust a model such that it no longer performs well on particular training examples.

The DKVB is an approach for learning from non-stationary data distributions (e.g. for continual learning). A fixed, pre-trained encoder is used to generate input-dependent queries for a set of key-value lookup tables. The keys are initialized to cover the full representational space of the encoder and thereafter fixed. The values are used to feed a decoder (which is non-parametric average pooling in this work), and the values can be learned. Since each query maps to its top-k keys and only the values associated with those keys are used for decoding, masking key-values belonging to particular samples allows DKVB to be a useful approach for unlearning.
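To make the mechanism concrete, here is a minimal PyTorch sketch of a DKVB-style bottleneck with key masking (an editorial illustration, not the authors' implementation; all names and sizes are hypothetical):

```python
import torch

class DKVBSketch(torch.nn.Module):
    """Toy discrete key-value bottleneck: frozen keys, learnable values."""
    def __init__(self, num_pairs=4096, dim=512, num_classes=10, top_k=8):
        super().__init__()
        self.keys = torch.nn.Parameter(torch.randn(num_pairs, dim), requires_grad=False)
        self.values = torch.nn.Parameter(torch.zeros(num_pairs, num_classes))
        self.top_k = top_k
        # Mask of "alive" key-value pairs; unlearning flips entries to False.
        self.register_buffer("alive", torch.ones(num_pairs, dtype=torch.bool))

    def forward(self, queries):                             # queries: (B, dim) from the frozen encoder
        dist = torch.cdist(queries, self.keys)              # (B, num_pairs) Euclidean distances
        dist = dist.masked_fill(~self.alive, float("inf"))  # removed pairs are never selected
        idx = dist.topk(self.top_k, largest=False).indices  # indices of the top-k nearest keys
        return self.values[idx].mean(dim=1)                 # non-parametric average-pooling decoder

    def unlearn_keys(self, key_ids):
        """Mask the given key-value pairs out of the codebook (no gradient steps)."""
        self.alive[key_ids] = False
```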

Strengths

Overall, this is a well-written and motivated paper. The prior work is discussed and the authors successfully position their work as original and significant for tackling the problem in a compute-efficient manner. The ideas presented are intuitive and well-positioned to address the unlearning problem. The results appear promising and the evaluation is correct.

Weaknesses

  • The authors focus on image classification tasks where the decoder has only one primary output, but for many large-scale pre-trained models such as LLMs, the backbone serves many downstream tasks. Presumably, the decoder will need to produce outputs that apply equally well to all tasks. Whether this is a limitation or an issue of scaling is not discussed.

  • The input representation of the encoder seems crucial for separating forget keys from retain keys, but this work has limited discussion and evaluation on how to choose a backbone.

Questions

  1. The bottleneck design necessarily limits the DKVB output to a smaller subset of downstream tasks than the backbone. Do you envision separate DKVBs for different tasks? How does this compare to an approach that directly modifies the backbone for unlearning?
  2. While the evaluation includes two different backbones, there is limited discussion of how a backbone's representations affect the DKVB. Do you have recommendations for which layer in the original model should be used as the encoder? Should it always be the full backbone?
  3. Related to Question 2, what if you need to forget specific samples (instead of whole classes)? Or have samples in the retain and forget sets with very similar input representations? How does DKVB fare in these settings? (There appears to be a little discussion about these limitations, but it might be more helpful to contextualize them w.r.t. the DKVB operation.)
Comment

We would like to thank the reviewer for taking the time to carefully go through our paper and provide constructive feedback. We attempt to address some of the concerns and questions raised by the reviewer.

Scaling to general-purpose models

The direction regarding general-purpose models is indeed very interesting to us but we have saved it for future work. Generally, as long as the keys are well separated, the training objective used to train the values of the DKVB and the following decoder is very flexible. Typical training objectives for general-purpose models and transferable representations, such as contrastive losses or next-token prediction, can equally be used in the case of models with a DKVB.

Choice of the pre-trained encoder and which layer should be used as the encoding layer

The working of the DKVB indeed relies on the availability of meaningful representations which are then mapped to the keys in the bottleneck. Hence, as long as an encoder can provide meaningful representations, it can be used for the DKVB, as shown in [1]. The quality of the trained bottleneck will depend on how well the encoder can separate distinct features locally.

The encoder does not always need to be a full backbone. In fact, in some cases, it is beneficial to use higher-level representations. As an example, in the case of ViTs, the final cls token is usually used as the representation for downstream tasks, whereas for ResNet models, outputs of the 3rd or 4th last layers are usually used as representations for downstream tasks, since these representations are more high-dimensional and "general", and thus more useful for downstream tasks, as compared to the very last layer which is usually over-optimized for the pretraining task (i.e. supervised training on ImageNet). The same representations can be used for DKVB as well.
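To illustrate these representational choices, here is a hedged torchvision sketch (the specific layer, weights, and reshaping are editorial assumptions, not necessarily the paper's exact setup):

```python
import torch
import torchvision.models as tvm
from torchvision.models.feature_extraction import create_feature_extractor

# Assumed setup: use an intermediate ResNet-50 stage ("layer3" here) rather than
# the final, task-specialized layer as the frozen encoder feeding the DKVB.
resnet = tvm.resnet50(weights=tvm.ResNet50_Weights.IMAGENET1K_V2).eval()
encoder = create_feature_extractor(resnet, return_nodes={"layer3": "feat"})

with torch.no_grad():
    feat = encoder(torch.randn(1, 3, 224, 224))["feat"]  # (1, 1024, 14, 14)
queries = feat.flatten(2).transpose(1, 2)                # (1, 196, 1024) per-position queries
```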

Section 6 in [1] discusses the choice of pre-trained encoders in detail.

Separate DKVBs for separate tasks

We do not require separate DKVBs for separate tasks. In our experiments, the outputs of the DKVB are constrained to work well for one task because the objective used for training the models (negative log-likelihood) is used to train models for one task, namely multiclass classification. As explained in the Scaling to general purpose models section of our response, the outputs of the DKVB can be trained to be more general purpose (similar to the backbone) by using appropriate training objectives.

Instance specific unlearning

As mentioned in Section 6, and elaborated further in the second paragraph of Appendix A.6, the current structure of the DKVB is not suitable for instance-specific unlearning. This is because the keys corresponding to different examples belonging to the same class are located close to each other. As a result, removing only the activations corresponding to a specific example would not make much of a difference, since for the same example other activations located close to the removed ones can be selected and will decode into the same class. We identify this as a limitation of the current framework and leave it to future work to develop DKVB-style bottlenecks that support instance-specific unlearning.

References

[1] Träuble et al., 2023. Discrete Key-Value Bottleneck

Comment

Thank you for answering my questions. While I agree that there exists no theoretical limitation to applying the proposed approach to tasks other than image classification, it is not obvious to me that it should work well or scale favorably. I suggest mentioning this in the limitations or considerations for future work.

Comment

Thank you for the suggestion. We will add this as a limitation in the updated version of the paper!

Review (Rating: 6)

The paper proposes a low-compute machine unlearning technique using a discrete representational bottleneck to efficiently erase knowledge of a forget set from trained models with minimal impact on overall performance. Evaluated on datasets like CIFAR-10, CIFAR-100, LACUNA-100, and ImageNet-1k, the method matches or outperforms SCRUB—a state-of-the-art unlearning approach—while incurring negligible computational cost.

Strengths

  • The paper is well-structured, presenting thorough results and analysis.
  • The proposed technique demonstrates effective unlearning performance with minimal computational cost, validated across multiple datasets.

Weaknesses

  • The work appears to be primarily built upon the DKVB model but applied in a reverse manner. The authors should more explicitly highlight their unique contributions and clarify how their approach differs from DKVB.

  • The study focuses only on ViT-B/32 and ResNet-50 models. It's unclear why these particular models were chosen and whether the findings generalize to other architectures.

Questions

  • The paper states that the key initialization is done on the training dataset. Is it sufficient to use only a few batches for initialization, or does it require processing the entire training dataset? If it's the latter, wouldn't this be computationally expensive?

  • In the experiments with ImageNet-pretrained ResNet-50 backbones, the retain set test accuracy decreases when the forget set test accuracy approaches zero (Figure 2), whereas in other settings, the retain set accuracy remains almost constant. Could the authors provide an explanation?

  • Unclear symbols on page 14: There are two "¿" symbols on page 14 whose meaning is not clear. Could the authors clarify their significance?

Comment

We would like to thank the reviewer for taking the time to carefully go through our paper and provide constructive feedback. We attempt to address the concerns and questions raised by the reviewer below.

More explicit outlining of the contributions

We thank the reviewer for this suggestion. We will add a paragraph specifying our contributions in the introduction of the paper and update it before the end of the discussion period.

Choice of backbones

We use a subset of the backbone architectures used in [1], to remain close to the existing literature and prior research. Pretrained ResNet-50 and ViT-B/32 are well known for giving high-dimensional, general-purpose representations. However, the framework of DKVB, and with it unlearning in DKVB, can be adapted to any of the standardly used large-scale pretrained backbones. We would like to refer the reviewer to Figures 3 and 4 of [1], where the authors demonstrate the effectiveness of DKVB on top of other backbones as well.

Key Initializations on Training Dataset

We conjecture that the key initialization may be done on smaller batches of training data, as long as the batches represent each class involved sufficiently. There would be a tradeoff between the amount of data used to initialize the keys and how well the keys are separated. However, key initialization is simply done through Exponential Moving Averages (EMA), which requires forward passes but no backpropagation, and is thus a fairly inexpensive procedure even when done on the entire ImageNet-1k dataset (which consists of ~1.28 million training examples).
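The precise EMA update is specified in [1]; the following is only a rough editorial sketch of the idea, with hypothetical names, to emphasize that it uses forward passes only:

```python
import torch

@torch.no_grad()
def ema_init_keys(encoder, loader, num_pairs=4096, dim=512, decay=0.99):
    """Sketch of EMA-style key initialization: each key drifts toward the mean of
    the encoder representations that select it. No backpropagation is involved."""
    keys = torch.randn(num_pairs, dim)
    for images, _ in loader:
        q = encoder(images)                       # (B, dim) frozen-encoder representations
        nearest = torch.cdist(q, keys).argmin(1)  # closest key index per sample
        for j in nearest.unique():
            batch_mean = q[nearest == j].mean(0)
            keys[j] = decay * keys[j] + (1 - decay) * batch_mean
    return keys
```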

Additionally, the key initialization does not necessarily have to be done on the training dataset. As done in the experiments in [1], the keys can also be initialized on other datasets that are similar to the training dataset in case the training data is unavailable (e.g. in online settings). However, initializing keys on the training data will lead to the best (spatial) separation of keys. Since we do not presume any constraints on the availability of training data in our experimental setup, we initialize the keys on the training data.

Retain set test accuracy dropping towards the end in ImageNet

This occurrence may be due to the keys not being as well separated in ResNet as they are in ViT. That is, there is more overlap between the keys, or top-k combinations of keys, corresponding to the retain classes and the forget class in the DKVB in the case of ResNet as compared to ViT. Hence, when the majority of the keys corresponding to the forget class are removed from the bottleneck toward the end of unlearning in ResNet-50, some keys corresponding to the retain classes are also removed, leading to a drop in performance on the retain classes. How well the keys are separated in the DKVB depends on the quality of the representations coming from the backbone. The representations extracted from ViT-B/32 are better than those extracted from ResNet-50, which can also be seen in the fact that the performance of linear probes trained on top of ResNet-50 is significantly worse than that of linear probes trained on top of ViT-B/32 across all the datasets (see Table 3 in Appendix A.1).
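One hypothetical way to probe this explanation (an editorial diagnostic, not an experiment from the paper) is to record which keys each class selects and measure the overlap between the forget class and a retain class:

```python
import torch

@torch.no_grad()
def keys_used_by_class(encoder, bottleneck, loader, target_class):
    """Collect the set of top-k key indices selected by examples of one class."""
    used = set()
    for images, labels in loader:
        mask = labels == target_class
        if mask.any():
            dist = torch.cdist(encoder(images[mask]), bottleneck.keys)
            idx = dist.topk(bottleneck.top_k, largest=False).indices
            used.update(idx.flatten().tolist())
    return used

# High overlap predicts retain-class damage when the forget class's keys are removed:
# overlap = len(keys_forget & keys_retain) / len(keys_retain)
```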

Overall, the reliance of DKVB on the quality of the representations incoming from the backbones also means that the method will naturally achieve better performance as and when better backbones are released in the future.

Unidentified symbols on page 14

We thank the reviewer for pointing this out. The "¿" symbols are supposed to be greater-than (>) signs. Another typo is "8 examples", which should read "8 classes". Further, we would also like to clarify that for the purpose of this experiment, we consider a > 5% change in the retain set performance a "significant deterioration". We will add this clarification and fix the typos in the next update to the paper (before the end of the discussion period).

References

[1] Träuble et al., 2023. Discrete Key-Value Bottleneck.

Comment

Thanks for your response. I would like to maintain my positive score.

Review (Rating: 6)
  • This paper presents unlearning with the DKVB (Discrete Key-Value Bottleneck) and studies the class unlearning problem
  • It uses the DKVB architecture, which uses a pre-trained encoder, followed by discrete key quantization, then maps these keys to values, and then trains a linear decoder on these values
  • In order to unlearn a class, it records the most commonly used key/value pairs selected by that class, then removes them from the codebook to unlearn them. (Note there is a difference between the two methods of unlearning with DKVB, via examples vs. via activations, but this is the broad summary)
  • The method is able to unlearn a class with very little effect on the classification accuracy of the remaining classes
  • There is a particular emphasis on the low compute cost compared to existing unlearning approaches, as this method does not require access to the original data (in the case of unlearning via activations), in comparison to other methods which do require the original data

Strengths

  • I believe not requiring access to the original data is very important, and this strength should perhaps be stressed more in the paper
  • The method requires very little compute compared to existing methods
  • Presentation is straightforward and easy to follow

Weaknesses

  • I think that the presentation of table 1, particularly the bolding is misleading. While it is true that the DKVB unlearning results in the smallest change in the accuracy of the retain set, wouldn't one prefer a performance increase on the retain set, as it no longer needs to predict that last class?

Questions

  • Is the class logit for the forget class removed entirely for the experiments in Table 1?
  • Is it possible to produce plots similar to Figures 2 and 3 for one of the baseline unlearning methods? Just the smaller datasets such as CIFAR-10/100 would be sufficient
  • I think more discussion on multi-class unlearning would be useful. For example, the figures in Appendix A.3 (Figure 5) could perhaps be moved into the main text, and comparisons to one of the baseline methods would give a better characterization of the limits of the approach. At least based on my intuition, this approach could fail if there are many classes being unlearned and "too many" keys/values are removed from the codebook. Can the authors provide any insight on this?
Comment

We would like to thank the reviewer for taking the time to carefully go through our paper and provide constructive feedback. We attempt to address some of the concerns and questions raised by the reviewer.

Ambiguity in bolding in Table 1

We thought that unlearning the forget class while not changing the performance on the retain classes might be the most useful scenario: after unlearning a model with forget examples, a practitioner might not know the new performance on the other classes if there are no original test examples available. Therefore the practitioner might be interested in having the least change of behaviour on the other classes. Nevertheless, we see your argument as well, and we are happy to remove the bolding in the updated version.

Class logit corresponding to the forget class in Table 1

We do not remove any of the logits from the output layer of the models. Instead, the intermediate keys present in the DKVB are removed. The output layer still contains all 10, 100, 100, and 1000 logits in the cases of CIFAR-10, CIFAR-100, LACUNA-100, and ImageNet-1k respectively. Thus, during inference, the output logit corresponding to the forget class may be predicted on very rare occasions. However, any such occasion would contribute towards a drop in the accuracy on the retain classes, which, as can be seen from the empirical results, is negligible, and in fact negative (a negative drop in accuracy => an overall increase in accuracy).

Plots similar to Figure 2 and Figure 3

We are happy to work on these plots and will add them to the paper (before the end of the discussion period).

More discussion regarding Multi-Class Unlearning

We thank the reviewer for the suggestion. We will try to move the section to the main paper, if possible to do so without exceeding the page limit. Or at the very least, we will add a reference to Appendix A.3 in the main paper. We are also in the process of running the same experiment for one of the baselines (SCRUB [1]) as suggested by the reviewer.

Since each class uses a combination of values, which are in turn selected using a combination of keys, different combinations may have non-zero intersections in terms of the keys involved. Thus, if a significant number of classes are to be unlearnt (i.e. a significant number of keys are being removed), we agree that our approach might fail. It remains to determine this "breaking point", i.e., the number of classes to be unlearnt for the proposed framework to take a significantly larger hit on the retain set accuracy as compared to the baselines.

References

[1] Kurmanji et al., 2023. Towards Unbounded Machine Unlearning.

Comment

Dear reviewer,
We have updated the manuscript with your suggestions (and suggestions from other reviewers).
We would be happy to further discuss the questions and concerns that you might have.

Comment

Dear reviewer,
We have tried to address most of your concerns in our rebuttal and updated the manuscript accordingly. If you have any further concerns, please let us know and we would be very happy to engage in further discussion to clarify them.

If we have addressed most of your concerns, we would appreciate it if you could kindly reconsider your assessment as the discussion period nears its end.

Review (Rating: 3)

This paper studies machine unlearning for models with discrete key-value bottlenecks (DKVB), an information bottleneck proposed recently, specifically in the setting where compute is limited. The use of this bottleneck facilitates the proposed methods, which identify which keys are necessary for predictions of a) groups of samples, and b) certain classes. The methods are zero-shot, and their effectiveness at unlearning in this setting is demonstrated empirically.

Strengths

The paper has several strengths:

  • The proposed methods appear to be highly effective at the task of model unlearning (based on the empirical results presented in Section 5).
  • The proposed methods are significantly more compute-efficient than existing methods (SCRUB), while attaining similar levels of accuracy.

Weaknesses

The paper has several weaknesses:

Clarity of Exposition

  • The paper is tricky to parse since no equations representing any of the operations (either describing the mechanics of the DKVB or the unlearning strategy) are provided. This adds ambiguity that should be avoided. Another mathematical detail that is important for readers is the exponential moving average used for key initialization. I recommend the authors provide these formal details, at the very least in the appendix.
  • There are some issues with imprecise language as well, which I recommend the authors address in subsequent revisions, as they affect the readability of the paper. Two examples that stood out to me were:
    • For instance, on lines 223-224, the phrase "This ensures that the representations are distributed sparsely enough" is unclear. It would help significantly if the authors elaborated upon it in a precise fashion.
    • Similarly, in lines 238-239, the authors write "Technically, this approach requires access to the original training data corresponding to the forget class. However, it is also possible to carry out this procedure with a proxy dataset that has been sampled from a distribution close enough to that of the forget set." This is somewhat contradictory, and should be clarified.
  • Typos: there are a few typos in this work. For instance: "sufficent" -> "sufficient", "ther specific subset" -> "the specific subset".

Experimental Results

There are some points in the experimental section that could be made more clear.

  • The experimental results in Section 5.2 (for different numbers of samples) should be tabulated neatly and legibly. As is, it is slightly confusing and tricky to parse.
  • In Appendix A.7, it's stated that for the ResNet50 model, either the third last or fourth last layers are used in the encoder backbone. It's unclear why these particular layers were chosen for this task.

Questions

I have a few questions about this work.

  1. On lines 179-182, it's stated that "Although one could argue that our methods lean towards a weak unlearning strategy–given the pre-trained backbone might retain some information about the forget set–our approach deviate from the strict definition of weak unlearning as outlined by Xu et al. (2023)." This is a major concern, particularly if you take an encoder and a DKVB trained on ImageNet. Moreover, this contradicts later points in the paper where the authors consistently refer to "complete unlearning" (i.e. lines 427-429, 460-463). If the goal is to unlearn a class (for instance) totally, isn't removing that class' information from the encoder as well crucial to the success of the method?

  2. On lines 253-255, the authors state "However, using one approach over the other may be more practical or even necessary, depending on the task at hand." Could the authors elaborate more upon this point? For what tasks would one use UvA over UvE (and vice-versa)?

  3. On lines 257-259, the authors state "Moreover, the requirement of access to original training data of the forget class can also be circumvented under appropriate assumptions". Could the authors elaborate upon those assumptions? Moreover, since this is a strong claim made by the authors, is there any theoretical or empirical justification for it?

Comment

We thank the reviewer for their time and for providing useful feedback. We attempt to address some of the concerns and questions raised by the reviewer below.

Lack of mathematical equations

We thank the reviewer for these useful suggestions. We are working on adding the relevant mathematical details in the appendix. We plan to finish updating the paper as soon as possible (within the next few days, before the end of the discussion period).

Imprecise language

Line 223 - 224

After being forward propagated through the pre-trained encoder, the representations of the input are mapped to the top-k closest keys in the information bottleneck. Importantly, the keys of the information bottleneck are not modified during training (as a result of no propagation of gradients between the keys and values in the DKVB during training). Therefore, it becomes important to initialize the keys such that they cover the feature space of the encoder as broadly as possible. This initialization helps the model represent different concepts effectively. We acknowledge that the sentence pointed out by the reviewer might have been unintentionally ambiguous and we will replace it with a more elaborate explanation as described above in an updated version of the paper, in the next few days (before the end of the discussion period).

Circumventing the requirement of access to the training data of the forget class

The main condition for unlearning the forget class in the bottleneck is that the keys closest to the encoder representations of the forget set examples are removed. One way to do this is by recording which keys get selected for the forget class examples, and subsequently removing them from the bottleneck. This is the approach we employ in our experiments. However, as mentioned in Lines 239-242, in the absence of the forget set, the same could also be done by passing examples not directly belonging to the forget set but drawn from a distribution that is close enough to it. This will result in approximately the same set of keys being selected as if the examples had belonged to the forget set.

Typos

Thanks to the reviewer for pointing these typos out. We will fix them in the next update to the paper.

Using representations from intermediate layers of ResNet-50

Using representations from intermediate layers of a pre-trained ResNet is standard practice, since these representations are much richer in terms of representing the information present in the original input. Additionally, we selected these layers to follow the representational choices of [3], which cover a broad range of representations and architectures to show that the approach works across different representational scales and formats. The very last layer of a ResNet is usually over-optimized for the pretraining task (supervised multiclass classification on ImageNet in this case) and hence does not yield rich "general" representations. [3] discusses these representational choices further.

Complete Unlearning and Weak Unlearning

Models consisting of Discrete Key-Value Bottlenecks typically involve a pre-trained backbone, with the DKVB plugged on top, followed by a decoder. While training such a model for any given task, the pre-trained encoder remains frozen, while the values of the DKVB and the parameters of the decoder can change. At the end of training on a particular task, the parameters of the DKVB (keys and, more importantly, values) therefore encode the information about the given task. This is analogous to how the final few layers of a pre-trained model are finetuned while adapting it to a specific task. During the unlearning process, we remove the relevant bottleneck parameters, which removes much of the information about the task, while in the standard finetuning setting, the forget set information would be masked via the final few layers. Hence, our approach may be classified as a Weak Unlearning approach as defined in [1].

While the parameters of a model that was retrained on only the retain class and our DKVB will not be identical, both models attempt to approximate the same output distribution, which does not contain any information about the forget set anymore.

Comment

We acknowledge that it might appear to be a limitation that the parameters of the pre-trained encoder are not altered. If someone were to take the pre-trained encoder from the unlearnt model, they might be able to extract some information about the forget class from the (remaining) parameters of the encoder. However, since the representations resulting from the (remaining) parameters would be high-dimensional and very "general", extracting information about the forget class from these parameters would be very difficult without any sort of finetuning of these representations on the forget class data. Additionally, in our experiments, we ensure that this is also the case for all the baselines, where we make sure that the parameters of the encoder are frozen and only the parameters of the linear layer are modified during unlearning. This ensures an apples-to-apples comparison in terms of compute. In cases where more compute is available / allowed, one could come up with ways of unlearning the forget class from the encoder as well. One such way could be combining UvA / UvE in the DKVB with a gradient-based unlearning approach (such as SCRUB [2]) on the encoder. The baseline in such a case would be using the same gradient-based approach on the baseline models (i.e. pre-trained encoder + linear layer), where all the parameters, including those of the pre-trained encoder, are trainable. Even then, the approach consisting of a combination of UvA / UvE and a gradient-based approach for the encoder would be more computationally efficient.

Regarding complete unlearning, we would like to clarify that we define complete unlearning in terms of the performance of the unlearnt model with DKVB on the forget class, and not in terms of whether the parameters of the model are similar to those of a retrained model (which is the basis of the concepts of strong and weak unlearning). We define complete unlearning to have occurred if the unlearnt model has 0% accuracy (in the context of the multiclass classification task) on the forget class. Complete unlearning may or may not be necessary, depending on the purpose of unlearning. For example, it is not necessary, and even undesirable, in the case of Membership Inference Attacks (MIAs). The proposed framework allows the extent of unlearning to be controlled via the N_a and N_e parameters.

Choice between UvA and UvE

The choice between UvA and UvE could depend on the context and the nature of the task. For example, UvA is useful in cases where the training data cannot be accessed after training is over. In such cases, one could cache all the activations corresponding to different classes during the last epoch of training. Then, upon a requirement to unlearn, the required number of activations could be removed to unlearn the forget class. UvE cannot be used in such cases: even though the activations corresponding to only N_e examples of the forget set could be cached, the parameter N_e might not be known at training time.
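A minimal sketch of the caching scheme described above (hypothetical helper names; `alive` is the key mask from the earlier editorial DKVB sketch):

```python
import torch
from collections import Counter

usage = {}  # class label -> Counter over key indices selected during the last epoch

def record_usage(labels, topk_idx):
    """Run during the final training epoch: cache which keys each class activates."""
    for y, idx in zip(labels.tolist(), topk_idx.tolist()):
        usage.setdefault(y, Counter()).update(idx)

def unlearn_via_activations(bottleneck, forget_class, n_a):
    """Later, even without the training data: drop the n_a most-used keys of the class."""
    top_keys = [k for k, _ in usage[forget_class].most_common(n_a)]
    bottleneck.alive[torch.tensor(top_keys)] = False
```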

References

[1] Xu et al., 2023. Machine Unlearning: A Survey
[2] Kurmanji et al., 2023. Towards Unbounded Machine Unlearning
[3] Träuble et al., 2023. Discrete Key-Value Bottleneck

Comment

Thanks to the authors for the detailed response. However, several concerns still remain unanswered.

  1. The updated manuscript promised by the authors, with typos fixed and precise mathematical details added, has not yet been uploaded. Note that this impacts the reproducibility of this work, especially since your source code does not seem to be mentioned in the manuscript.

  2. Unlearning by simply modifying the final linear layer/DKVB is not suitable for unlearning. If this were acceptable (assuming the task is classification, for instance), one could simply set the row of the matrix corresponding to the 'forget class' to 0, which would ensure the model never predicts that class (a sketch of this baseline appears after this list). But in such cases, the encoder representations are still highly useful - one could simply replace the final linear layer (it might not even need backpropagation to train, since the features are rich!), and obtain a reasonably accurate model. To reiterate: unlearning only the final layers (either linear or DKVB), and not the encoder, does not change/remove representations learned on the forget class data, and is clearly contrary to works such as [Jia et al, 2023], [Kurmanji et al, 2023], [Golatkar et al, 2020], and [Izzo et al, 2021] among others.

  3. In your rebuttal, you state that "However, since the representations resulting from the (remaining) parameters would be high-dimensional and very "general", extracting information about the forget class from these parameters would be very difficult without any sort of finetuning of these representations on the forget class data." I don't believe this to be the case - for the most part, representations generated by latter layers (close to the output) are well known to be rich (see [Zeiler and Fergus, 2013], for instance).

  4. Upon a more careful examination of your work, I also noticed that your experimental slate may be limited. In your experiments, you perform unlearning on the worst class - where the effectiveness is high. However, while the reduction in forget class accuracy may be high, the drop for other classes, particularly those upon which the model predicts poorly, may also be commensurately high (as forgetting, say, class 3 in CIFAR-10 may also cause forgetting in class 2). As such, perhaps a better experimental comparison would be to compare the average drop in forget/retain class accuracies over all classes, as is the case in works such as [Jia et al, 2023] and [Kodge et al, 2024].
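For reference, the trivial masking baseline the reviewer describes in point 2 could be sketched as follows (an editorial illustration assuming a plain linear classification head):

```python
import torch

@torch.no_grad()
def mask_forget_logit(head: torch.nn.Linear, forget_class: int):
    """Zero the forget class's weight row and push its bias very low so the
    class can never attain the maximum logit; other classes are untouched."""
    head.weight[forget_class].zero_()
    head.bias[forget_class] = -1e9
```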

Comment

Dear reviewer,
Thank you for your comments. We have updated the manuscript with mathematical and algorithmic formulations for EMA and the proposed approaches as promised.

Unlearning by modifying the final layer
We would like to present two arguments against this perspective.

  • First, we argue that the proposed approach does not always necessarily mean that we are modifying the final layer. In the presented experiments, the DKVB acts as a final layer because there is no need for a parametric decoder (by virtue of the datasets being relatively easy). However, in the presence of a parametric decoder, the DKVB would be more like an intermediate component. Further, even in the setting considered in the experiments, we would like to argue that the proposed approaches are not the same as simply setting the final row of the matrix to zero. In the latter case, one could still use the representations of the penultimate layer (without requiring any sort of gradient-based adaptation) to extract information about the forget class. In the case of the proposed approach, however, we are removing intermediate parameters, which, in the case of more complex and realistic tasks, may indeed contain more concentrated information about the forget sets.

As an example, consider a model containing a DKVB consisting of A) a pre-trained encoder, followed by B) a DKVB, followed by C) a parametric decoder. The idea is to factorize the information in the parameters of the DKVB such that the information corresponding to the forget set is localized. This information can then be directly intervened upon and removed without any gradient-based training. Such a localization of information is often not the case in other models, and thus it becomes necessary to use some form of gradient-based training to remove that information.

  • The second argument we would like to put forward is that even in the baselines considered, we ensure that the pretrained backbone is frozen. If we wanted to ensure unlearning in the backbone as well, the unlearning process could be broken down into two steps: A) unlearning in the DKVB by one of the proposed approaches and B) separately unlearning the remaining pretrained encoder using one of the baseline approaches. In such a case, the compute required would still be lower than the compute required for applying one of the baseline unlearning approaches to a full (pre-trained encoder + linear layer) model.

Higher level representations
We acknowledge that the layers leading up to the final layers of a large pre-trained model are rich with respect to datasets such as CIFAR-10, CIFAR-100, LACUNA-100 or ImageNet-1k (e.g., the zero-shot performance of ViT-B/32 on CIFAR-100 is 65.1%). Our explanation in the previous section is independent of that, i.e., it does not depend on whether or not the encoder contains rich information about the task.

Point about limited experimental slate
We would like to clarify that we unlearn the class best learned by the model. We decided to do this following the intuition that the most performative class should be the most difficult to forget, i.e., we would require the highest values of N_a and N_e in order to unlearn this class completely. However, we also discuss an experiment in Appendix A.5 (Table 5), where we choose the forget class randomly for each seed. We run this experiment over 5 seeds and find that even in that case the damage to the retain classes is minimal.

Please let us know if you have further questions or concerns. We would be very happy to discuss further.

Comment

Dear reviewer,
We hope we have addressed most of your concerns in our rebuttal. If you have any further questions or concerns, please let us know and we would be happy to engage in further discussion to address them.

If most of your concerns have been addressed, we would appreciate it if you could kindly reconsider your score as the discussion period nears its end.

Comment

I thank the authors for the response. However, I still have several concerns:

  1. Thank you for stating some equations in Appendix A.8. However, while you have expressed the key initialization strategy formally, the mechanics of your unlearning approach should also be stated (as I requested in my initial response). Furthermore, there are serious issues with respect to Algorithms 1 and 2.
  • In both Algorithms, "components of the DKVB" is listed. Rather than have this in the Algorithm, it should be defined outside, as it isn't a part of the algorithm! The algorithm should clearly state the inputs, outputs, and the mechanics of how the inputs are converted to the outputs (that part looks a bit better).
  • Also, the function argsort, for instance, is not an input to the algorithm. It is a function, which should be defined prior to presenting the pseudocode. In a similar vein, if the distance function $d(e, k)$ is just the Euclidean distance, you can simply write $\|e - k\|_2$, which is almost universally understood to be the 2-norm.
  • On a similar note, when you report a range of indices like $j \in [0, N-1]$, you're saying that $j$ lies in the closed interval between 0 and $N-1$. Your notation requires much more precision.
  • There are other serious issues with imprecision in the presentation of the algorithms. This presentation is below the standard required at ICLR.
  1. "We acknowledge that the layers leading up to the final layers of a large pre-trained model are rich with respect to datasets such as CIFAR-10, CIFAR-100, LACUNA-100 or ImageNet-1k (for eg. zero-shot performance of ViT-B/32) on CIFAR-100 is 65.1%). " Since this is the case, what is to stop me from removing the DKVB (or linear layer) and extracting the information from the frozen encoder? There are works such as "Dreaming to Distill: Data-Free Knowledge Transfer via DeepInversion" by Yin et al (CVPR 2020), which accomplishes precisely that.

  3. As regards the point entitled "Unlearning by modifying the final layer": the same argument from my previous response holds. What is to stop me from swapping out the final layer with something else, on top of the (frozen) encoder? To make unlearning effective, it is ultimately insufficient to leave the encoder frozen. How can the claim that the model has "forgotten" the dataset be true, if I can simply replace the DKVB/linear layer (e.g., with a one-vs-many Fisher discriminant)? As I said earlier, this is functionally equivalent to setting the row of the linear layer's weight matrix corresponding to the forget class to zero - this will achieve 0% accuracy on the forget class, with no impact on the remaining classes.

  4. DKVBs as intermediate layers: I don't think it's appropriate to hypothesize about settings that you have not conducted experiments on. If you are making the claim that your unlearning approach can be applied to settings where the DKVB is an intermediate layer, this must be supported either by experiments, or by rigorous theory.

  5. The reason I asked about the best vs the worst class (in your work, you conduct unlearning on the best performing class) is that the worst performing classes are those that are hard to distinguish from other classes. For instance, in CIFAR-10, on the model you've used, classes 2 and 4 are the "hardest" classes. If (hypothetically) those two classes are similar, then forgetting class 2 will likely lead to information about class 4 being forgotten as well. This is why most works in this space (including those I mentioned in my previous response) show the average unlearning performance over all the classes. In fact, the claim that "the most performative class should be the most difficult to forget, i.e., we would require the highest values of N_a and N_e in order to unlearn this class completely" should be supported with either experimental validation or rigorous theory.

Comment

We thank the reviewer for their response and for elaborating upon their concerns. We would also like to thank you for your suggestions on improving the algorithmic formulations.

1.) We have incorporated your suggestions and rewritten the section on algorithmic formulations. You can find a copy of the rewritten version at this link: https://purple-mella-37.tiiny.site/

2.) Argument to simply use the frozen encoder: We would like to start addressing this concern by posing the following scenario:

Imagine we have a model pre-trained on large amounts of data such that it acts as a good general encoder, e.g., the ResNet-50 pre-trained on ImageNet that we use in our experiments. Now, assume a held-out class Z, whose data the model did not see while training. Technically, the model we trained would be an oracle point of comparison for the task of "unlearning class Z" from a model which was trained on the ImageNet classes + class Z. However, we can argue that even if we take the model that we trained (which has not seen data belonging to Z) and extract features of examples belonging to Z from the very first layer of the model (essentially some form of down-projection of the image into spatial feature maps), we can still reconstruct a lot of information about class Z (or examples thereof) from those features.

Thus, we can always reconstruct some level of information from an intermediate layer representation. Using the layers occurring just before the DKVB is an arbitrary choice. Given access to a model at the level of being able to intervene on specific layers, some amount of information about a given class can always be reconstructed, even if the model is a perfectly unlearnt model with respect to that class.

Further, such a level of access to the model, wherein the end user can intervene on specific layers, is not always a given in scenarios where unlearning is usually required, for example in model deployment settings. In fact, one might even argue that more often than not, the end user does not have access to the model at all, but only to its final outputs. One such scenario is where an ML model has been deployed by a company and the end user has access to the model only through API calls. In such a scenario, the company must be able to very quickly remove identified unwanted classes without having to retrain the entire model from scratch over and over again. In such a pipeline, it also doesn't matter if the information is still present in some intermediate layer. Thus, the proposed approaches would be very effective in such use cases. We are interested in the "final classification".

3.) Equating the proposed approaches to setting the final layer matrix row corresponding to the forget class to zero: The proposed approaches cannot be compared to setting the row corresponding to the forget class in the matrix of the final layer of a monolithic model to zero. The DKVB gives rise to factorized representations, which are then, in our experiments, average pooled before the softmax; thus, the final layer is average pooling. This technicality is important to point out because it gives rise to several advantages of the DKVB over setting a row of the final layer matrix to zero.

First, in the proposed approaches, it is possible to control the "extent" of unlearning by controlling the parameters N_a and N_e. This is due to the fact that the representations are factorized and can be intervened upon directly as well as individually. This cannot be done by just setting a row of the final layer matrix to zero; to unlearn in the final layer only up to a certain extent, we would need some gradient-based approach. These are exactly the baselines that we use in our experiments.

Comment

Second, the proposed approaches also have the potential to allow unlearning in cases where a specific subset of a class needs to be unlearnt. For example, in the classification task, assume the scenario where we have a class A ("birds") but no separate classes for the subtypes (mockingbirds and penguins) in the classification model. That is, both "mockingbirds" and "penguins" were labeled under the same class "birds". Now, we want to make sure that penguins are not labeled as "birds" anymore (although they were classified as birds in the initial training), i.e. we want to unlearn "penguins". In the DKVB, penguin images will map to a different location of the embedding space than "mockingbirds" or other standard birds. We could then remove all the key-value pairs for penguins, which are in reality further away from other "bird" embeddings than another class. This way, the model still predicts "birds" for all the classical birds, but penguins are unlearned. With the existing framework, whether or not the subset analogous to "penguins" in the above scenario is actually mapped to a region of the representational manifold far from the complementary subset of the parent class would vary on a case-by-case basis and depend on the semantic relationship between the subset and the parent class. However, future work may build upon the proposed framework to always ensure such a separation of representations within a class, e.g., by using specific auxiliary objectives. While this is future work (as discussed in Section 6 of the paper), we feel it is important to discuss it to highlight the potential of the proposed approaches. Something like this would not be possible in the baseline where we simply set a row of the final layer matrix to 0.

4.) In order to check whether or not the effectiveness of the proposed approaches depends on the specific class being unlearnt, we ran an experiment where we conducted complete unlearning for each of the 10 classes of CIFAR-10 in a model consisting of a DKVB on top of a ViT-B/32 backbone. We average the retain class test accuracy change at complete unlearning over all 10 classes and report it below. From the standard deviation, we can clearly see that the effectiveness of the proposed approaches does not depend on which class is being unlearnt.

| Approach | Retain Class Test Accuracy Change (%) | Forget Class Test Accuracy Change (%) |
| --- | --- | --- |
| UvA | 0.9365 ± 0.19 | -100 ± 0 |
| UvE | 0.9921 ± 0.61 | -100 ± 0 |
Comment

We thank all the reviewers for their very useful suggestions. We would like to bring to your notice that we have updated the manuscript by incorporating the suggestions given by the reviewers. All the modifications are marked in blue text. Below is a summary of all the changes:

  • Added mathematical and algorithmic formulation for EMA and the proposed approaches in Appendix A.8
  • Elaborated upon the applicability of the approach in the case where training data cannot be accessed after the training (Section 4, Paragraph 4)
  • Added a line on the applicability of the approach on tasks other than image classification in Section 6. (Limitations and Future Work)
  • Added results on multiclass unlearning using a baseline (SCRUB) in Appendix A.4
  • Added a plot similar to Figures 2 and 3 for a baseline (SCRUB) in Appendix A.3
  • Added a paragraph outlining the main contributions of the paper at the end of Section 1
  • Changed the bolding scheme in Table 1 to highlight the approaches that achieve the maximum retain class test accuracy, rather than the minimum damage with respect to the original accuracy on the retain classes.
  • Added a reference to the multiclass unlearning experiments (Appendix A.4) in the main paper (in Section 5.2).

We would be happy to engage in further discussions to clarify any concerns and questions that the reviewers have.

AC Meta-Review

The paper studies the problem of class unlearning, where the goal is to remove the information about one (or a set of) classes from the model. The authors approach this problem by training a discrete key-value bottleneck (DKVB) model on top of a frozen encoder. The DKVB is a decoder model that makes predictions based on similarity to a frozen set of keys, which are selected in a non-parametric manner. The authors then remove some of the keys that are most used in predicting the target class. The authors propose two variations of the method: unlearning via examples and via activations. Across a variety of image tasks, the authors show that the proposed method can reduce the performance on the target class to 0% while preserving performance on the remaining classes.

Strengths:

  • The proposed method can unlearn classes at a much lower cost than competing methods
  • In theory, the proposed method is applicable to unlearning subsets of classes, and general sets independent of classes
  • The authors show results on a range of datasets and architectures

Weaknesses:

  • In my opinion, the core weakness of the paper has been identified by reviewer X8bn: it is possible to remove a class (or a set of classes) from a classifier by simply masking the logits. This approach would completely “unlearn” the forget class, while improving the performance on retain classes. The experiments presented in the paper do not differentiate the proposed approach from this simple masking. In theory, the proposed method can work in more general settings, and I believe that should be the core of the experimental evaluation, but the authors only show results for class unlearning.
  • The method does not remove information about the class from intermediate representations.
  • The clarity of presentation should be improved. Specifically, the algorithm should be explained more clearly and precisely in the main text. Currently, the description, including Figure 1, is quite hard to follow.

Decision recommendation: based on the above, I recommend rejecting the paper. In my opinion, the evaluation should focus on the more challenging setting of unlearning groups that do not align with classes. Also, the fact that information about the forget set is still contained in the representations should be addressed.

Additional Comments on the Reviewer Discussion

The authors responded to the reviewer comments. The concerns of reviewer X8bn were not fully addressed by the rebuttal, specifically the issue that information about the forget set is still present in the model, and that similar results could be obtained with a simple modification to the last layer of the model. The authors made some general arguments but did not provide empirical results to differentiate their proposed method from this simple baseline.

Final Decision

Reject