Data Attribution for Text-to-Image Models by Unlearning Synthesized Images
We identify influential training images that make a synthesized image possible, using our proposed attribution method by unlearning a synthesized image.
Abstract
Reviews and Discussion
The authors propose a new approach for data attribution for text-to-image models that utilizes machine unlearning. They then perform multiple experiments to demonstrate that the proposed method is competitive with other methods.
Strengths
- The paper is generally well-written, clearly structured, and easy to follow.
- The problem of data attribution in text-to-image models is interesting.
- Multiple experiments are conducted to verify the viewpoints.
Weaknesses
- This is not really a weakness, but I wonder if the leave-K-out model can still generate a target image just under a different prompt. Perhaps a metric that measures how hard it is to perform Textual Inversion [1] can be used?
References: [1] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, Daniel Cohen-Or, “An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion”
Questions
- See weaknesses.
Limitations
- The authors have adequately addressed the limitations.
Thanks for the feedback and suggestions.
Can Textual Inversion be used for evaluation?
In general response G3, we verify that Textual Inversion is not an ideal choice to check whether a concept is represented by leave-K-out models.
The paper discusses how we can identify influential examples by unlearning the synthesized images in text-to-image generation models. Unlearning synthesized images leads to an increase in the loss for the training examples most influential to the generation of those synthesized images. The paper relies on a well-known Fisher-information-based regularization in order to avoid catastrophic forgetting.
The paper evaluates image generation quality in two ways:
- Retraining the model from scratch after removing the influential examples from the training set. Poor synthetic image generation quality is seen as a proxy for identifying the most influential training examples.
- Using ground truth attribution provided in a Customized Model Benchmark.
Strengths
- The authors did a great job telling the story, motivating the problem and discussing related work.
- The evaluation metrics and benchmarks are well-described. The authors show the effectiveness of their approach both qualitatively and quantitatively.
Weaknesses
- It is unclear how robust the synthetic image generation for a given caption is. It would be good to estimate the sensitivity of the generation quality to subtle changes in the caption.
- Overall theoretical justification is not very clear. It is not clear how unlearning affects the overall utility of the model.
- Figure 2 mentions that their approach is qualitatively better than DINO and JourneyTRAK, but how can we tell that it is quantitatively better? It might be that the proposed method is better for some examples and worse for others. In the implementation details section, line 212, the paper suddenly starts mentioning DDPM loss without explaining what it is and how it is applied in their method. In the methods section there is no mention of it; instead the paper discusses a different loss, the EWC loss. This is a bit confusing.
- In Figures 3 and 2, DINO is listed as a baseline against which the authors evaluated the proposed method. It is, however, neither an influence-function approximation nor an unlearning method. It seems that the authors used it for image similarity. At the same time, they also compare against JourneyTRAK, which is an influence-approximation approach. This is a bit confusing.
- Figure 4: where exactly are the error bars in the plot?
Questions
- Since \hat{z} is not part of the training dataset, how do we know that its effect is unlearned appropriately from the model with the elastic weight consolidation loss?
- How sensitive is \hat{z} to subtle changes in the caption?
- Have you considered comparing your work against DataInf: https://arxiv.org/abs/2310.00902?
Limitations
The paper discusses the limitations of the work.
Thanks for the helpful suggestions and comments.
Sensitivity to subtle changes in captions
We clarify that our work focuses on finding training images that influence a generated image, not how changes in captions affect generation quality. From our experience, the model is not very sensitive to subtle caption changes (e.g., misspellings).
Theoretical Justification
In Appendix A, we show the theoretical connection between influence functions and our approach, both of which make approximations toward characterizing the effect of removing a training point and evaluating its effect on a synthesized image (or vice versa). As described in L165-173 of the main text, a “perfect” attribution is intractable, since it requires a combinatorial search over sets of influential images, and our method and influence functions both serve as “proxy” solutions to this. We validate our method through the leave-K-out evaluation, showing our "proxy" is more effective than baselines such as influence functions. We will clarify this in the text.
How unlearning affects the overall utility of the model
We would like to clarify that we use unlearning to attribute images generated from the original model. Hence, the unlearned model will not be used to generate images for end users but instead only serve as a part of the attribution algorithm.
In addition, in our general response G1, we analyze and report the effectiveness of our unlearning method.
How can we tell that an attribution algorithm is quantitatively better?
We provide quantitative evaluation in Figure 4 of the main text and Figs. 7 and 8 of the supplementary, showing our method consistently outperforms various baselines.
Mention of DDPM loss
DDPM loss is introduced by Ho et al. [3] and is the standard loss used to train diffusion models. This corresponds to the loss term in Eq. 4 of the main text. We will clarify this, include a detailed description of DDPM loss, and add a citation in the revision.
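For reference, the standard (simplified) DDPM objective from Ho et al. [3] is sketched below in generic notation; the exact symbols used in Eq. 4 of our paper may differ.

```latex
% Standard (simplified) DDPM objective from Ho et al. [3]; c is the text conditioning,
% \epsilon_\theta the denoising network, and \bar{\alpha}_t the noise schedule.
\mathcal{L}_{\text{DDPM}}(\theta) \;=\;
  \mathbb{E}_{x_0,\, c,\, t,\, \epsilon \sim \mathcal{N}(0, I)}
  \Big[ \big\lVert \epsilon - \epsilon_\theta\big(x_t,\, t,\, c\big) \big\rVert_2^2 \Big],
\qquad
x_t \;=\; \sqrt{\bar{\alpha}_t}\, x_0 \;+\; \sqrt{1-\bar{\alpha}_t}\, \epsilon.
```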
Using DINO as a baseline
For completeness, we indeed use an array of strong baselines, including feature similarity, as is standard practice [1, 2]. Our method outperforms similarity-based baselines in most cases (Figs. 4, 7, and 8).
Error bars in Figure 4
We show standard errors in the top plot in Figure 4, but the error bars are very small and negligible. To clarify this, following reviewer ZhUS’s suggestion, we will report a table that includes the standard errors in our revision (see Tab. 1 in our response PDF, which summarizes Figs 4, 7, 8).
How do we know if the synthesized content is unlearned appropriately?
Please see G1 in our general response, where we quantitatively verify that the target image is forgotten, while other information is retained.
Comparison to DataInf
As suggested, below we compare our work against DataInf, evaluating performance with the metrics proposed in Sec. 5.1 of the main text: the leave-K-out model's (1) loss change and (2) deviation of generation.
| | Loss Change (K=500) | Loss Change (K=1000) | Loss Change (K=4000) | MSE (K=500) | MSE (K=1000) | MSE (K=4000) | CLIP (K=500) | CLIP (K=1000) | CLIP (K=4000) |
|---|---|---|---|---|---|---|---|---|---|
| DataInf | 0.0032±0.0001 | 0.0034±0.0002 | 0.0036±0.0002 | 0.038±0.004 | 0.039±0.005 | 0.045±0.005 | 0.78±0.02 | 0.78±0.02 | 0.77±0.02 |
| Ours | 0.0051±0.0003 | 0.006±0.0005 | 0.0087±0.0006 | 0.054±0.006 | 0.055±0.004 | 0.059±0.005 | 0.75±0.02 | 0.69±0.02 | 0.62±0.02 |
Our method significantly outperforms DataInf quantitatively. We show a qualitative example in Fig. 4 of the response PDF. DataInf often attributes images that are not visually similar, making them less likely to be influential to the synthesized images. We believe DataInf performs poorly because it is designed for LoRA fine-tuned models, rather than text-to-image models trained from scratch. According to DataInf’s paper, the method uses a matrix inverse approximation to compute the influence function efficiently, but this approach results in a pessimistic error bound. While DataInf’s authors mentioned that the error is more tolerable in LoRA fine-tuning, this error may not be acceptable for data attribution in text-to-image models trained from scratch.
We will add results and discussion in the revision. Thanks for the reference. Due to time constraints, we average over 20 synthesized image queries here, but will show full results with 110 queries in the revision.
Citations
[1] Wang et al. Evaluating Data Attribution for Text-to-Image Models.
[2] Singla et al. A Simple and Efficient Baseline for Data Attribution on Images.
[3] Ho et al. Denoising Diffusion Probabilistic Models.
Thank you authors for the detailed response.
I have a question while looking at Table 1 in the PDF. Is this table for 1 generated image, or is this an aggregate for multiple generated images?
How is the domain of those generated images identified? How do we ensure that our evaluation of generated images is representative?
How does the generation change if we say: "bus", "white bus" vs "big white bus" for K-leave out test in figure 1?
In Figure 1, are the related images the ones that are being removed during the leave-K-out experiment? If not, then it is not clear which images are left out for the leave-K-out experiment.
Is it possible to include textual captions for generation in the visual results?
Thanks for your quick response.
- Table 1 is consistent with the practice in the paper, as described in L236. The results are aggregated over 110 generated images.
- The generated images' prompts are from the MSCOCO validation set. This ensures that generated images are representative of the training set (also MSCOCO).
- The generated images don’t change much when we change “bus” in the caption to “white bus” or “big white bus”.
- No, only the attributed training images from the target are removed (see top row of Fig. 2, main paper). The “related images” are generated images, from captions similar to the target prompt. As seen in the figure, the leave-K-out model successfully removes the target image while preserving the related images. Please see general response G2 for more detail.
- Yes, we will include captions in our revision, similar to Figs. 2 and 3 in the main paper.
Thank you for the clarification! If related images are the ones generated based on captions similar to the target, it is important to show those captions. Also, how is "similar" measured here?
Since the authors mostly answered my questions I will increase the score but I encourage the authors to incorporate the clarifications and suggestions made in the rebuttal.
We generate related images using the most similar 100 captions retrieved from the MSCOCO val set using CLIP's text encoder.
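For concreteness, a minimal sketch of this retrieval step, assuming the Hugging Face transformers CLIP text encoder; the checkpoint name and the short candidate list below are illustrative placeholders rather than our exact pipeline.

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

# Illustrative inputs; in our setup the candidates are all MSCOCO val captions.
target_caption = "a big white bus parked on the street"
candidate_captions = [
    "a white bus driving down a city road",
    "a plate of food on a wooden table",
    "two skiers going down a snowy slope",
]

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def embed_texts(texts):
    tokens = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    feats = model.get_text_features(**tokens)
    return feats / feats.norm(dim=-1, keepdim=True)  # unit norm -> cosine similarity

target_feat = embed_texts([target_caption])       # (1, d)
cand_feats = embed_texts(candidate_captions)      # (N, d)
sims = (cand_feats @ target_feat.T).squeeze(-1)   # cosine similarity per candidate

k = min(100, len(candidate_captions))
top_idx = sims.topk(k).indices                    # captions used to generate "related images"
print([candidate_captions[i] for i in top_idx.tolist()])
```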
Also, thanks for your comments and updated scores. Yes, we will incorporate the clarifications and suggestions.
This paper proposes a novel method for data attribution in text-to-image diffusion models. The key idea is to unlearn a synthesized image by optimizing the model to increase its loss on that image, while using elastic weight consolidation to avoid catastrophic forgetting. The authors then identify influential training images by measuring which ones have the largest increase in loss after this unlearning process. The method is evaluated through rigorous counterfactual experiments on MSCOCO, where models are retrained after removing the identified influential images. It outperforms baselines like influence functions and feature matching approaches. The method is also tested on a benchmark for attributing customized models.
Strengths
- The paper addresses the issue of data attribution, which is important for understanding the behavior of generative models and the contribution of training data.
- The paper proposes an innovative and effective approach to data attribution that outperforms existing methods on counterfactual evaluations.
- The authors conduct extensive experiments, including computationally intensive retraining, to thoroughly validate the effectiveness of the proposed method.
Weaknesses
- The evaluations are limited to medium-scale datasets like MSCOCO. It's unclear how well the approach would scale to massive datasets such as LAION, which are used to train state-of-the-art text-to-image models.
- The method is computationally expensive, requiring many forward passes over the training set to estimate losses. This may limit scalability to large datasets.
Questions
- How does the performance of the unlearning method vary with different sizes and types of datasets?
- How does the unlearning method perform in terms of efficiency compared with other methods?
Limitations
Yes
Thanks for the helpful suggestions and feedback.
Efficiency comparison with other methods
Following the reviewer’s suggestion, we compare our method’s efficiency with other methods. To proceed with the efficiency analysis, we provide a brief overview of each type of method.
- Our unlearning method: we obtain a model that unlearns the synthesized image query and check the loss increases for each training image after unlearning.
- TRAK, JourneyTRAK: precompute randomly-projected loss gradients for each training image. Given a synthesized image query, these methods first obtain its randomly-projected loss gradient and then match it against the training gradients using the influence function. For good performance, both methods require running the influence function on multiple pre-trained models (e.g., 20 for MSCOCO).
- Feature similarity: standard image retrieval pipeline. Precompute features from the training set. Given a synthesized image query, we compute feature similarity for the entire database.
We compare the efficiency of each method.
Feature similarity is the most run-time efficient since obtaining features is faster than obtaining losses or gradients from a generative model. However, feature similarity doesn’t leverage knowledge of the generative model. Our method outperforms various feature similarity methods (Fig. 4 of the main paper).
Our method is more efficient than TRAK/JourneyTRAK in precomputation, both runtime- and storage-wise. Our method only requires computing and storing loss values of training images from a single model. In contrast, TRAK and JourneyTRAK require pre-training extra models (e.g., 20) from scratch and precomputing and storing loss gradients of training images from those models.
Our method is less efficient when estimating attribution scores. TRAK and JourneyTRAK obtain attribution scores by taking a dot product between the synthesized image's gradient and the stored training gradient features, and they average such dot-product scores across the 20 pretrained models. On the other hand, as acknowledged in our limitations (Section 6 of the main text), although our model unlearns efficiently (e.g., only a 1-step update for MSCOCO), obtaining our attribution score involves estimating the loss on the training set, which is less efficient than a dot-product search. Trade-off-wise, our method has a low storage requirement at the cost of higher runtime.
Our main objective is to push the envelope on the difficult challenge of attribution performance. Improving the computational efficiency of attribution is a challenge shared across the community [1, 2], and we leave it to future work.
Scaling up data attribution to massive datasets
In the community, scaling data attribution to massive datasets remains a huge challenge in terms of both algorithm and evaluation [1, 2]. Our solution to mitigate this issue is to:
- Provide the gold-standard leave-K-out evaluation on a moderately-sized dataset (MSCOCO)
- Test on a Customized Model Benchmark, focusing on attributing personalized/customized large-scale text-to-image models.
While we don't have the tech-giant-level resources to run full LAION-scale experiments, the MSCOCO evaluation is already the most computationally extensive evaluation in this space.
Citations
[1] Park et al. TRAK: Attributing Model Behavior at Scale.
[2] Grosse et al. Studying Large Language Model Generalization with Influence Functions.
Thank you for your response. The authors have addressed my concerns. Based on their rebuttal, I have decided to upgrade my evaluation to the weak accept.
Thanks for increasing the rating. We will incorporate the feedback in our revision.
Authors propose a method for identifying training images that need to be removed from the training set of a generative model to prevent a single specific “bad” (undesired) output from occurring in its output. Authors propose to directly unlearn the synthesized “bad” image, evaluate how this unlearning changes the training loss across training examples, and retrain the model omitting highly affected training examples - instead of evaluating how including each training example affects the “bad” output. Authors found that the resulting approach worked best when combined with prior techniques for preventing catastrophic forgetting in classifier unlearning (e.g. optimizing only a set of weights and performing unlearning steps using Newton-Raphson iterations on the unlearned example using the original Fisher matrix). Authors show that retraining generators on pruned training sets indeed prevents the retrained generator from generating the undesired image - more specifically, authors show that the loss on the undesired output increases (more so compared to baseline methods) and that rerunning the retrained generator with the same random seed and text conditioning indeed does not produce the “bad” example.
Strengths
While at first the result might seem trivial (since authors both optimize and measure the same metric - “the model loss”), the finding is in fact very much non-trivial, since authors retrain the model while pruning images that were affected most by unlearning the problematic output - it might have happened that removing these images did not prevent the output from occurring - so the reported findings are indeed both interesting and valuable for the community (potentially beyond unlearning work).
Authors provide a random baseline and all provided results are significantly different from random (confidence intervals seem unnecessary).
Authors provide qualitative evaluation (Fig 3) suggesting that the desired effect is indeed observed in practice and not only in metrics.
Overall, the paper is well written and does a very good job of motivating both the problem and the proposed solution. Authors provide an extensive literature review that manages to be helpful even to a person without prior experience with unlearning (me).
Weaknesses
While I enjoyed reading the first half of the paper, I have serious concerns regarding the quality of the results section. While authors provide many qualitative results in the main paper, the burden of making sense of quantitative results (e.g. comparing them to ablations) is not only entirely on the reader, but is made worse by a complete lack of any raw numbers for cross-comparison (e.g. no tables - only barplots w\o numbers), and by putting the vast majority of quantitative results into the supplementary without providing that much more context there. For example, Figure 4 (top) - colors are very similar and lines are very close, making results not really legible. Given that performance is evaluated only at three data points, I see no reason why this cannot be a table - that would aid both reproducibility and comprehension (e.g. cross-comparison). I also do not quite understand the point of reporting the "equivalent to X random points” metric - it is never even explicitly discussed in the main paper. All results for the MSE and CLIP metrics and all ablations (also no tables, only bar plots without numbers) are in the supplementary, but even there not much additional context for interpreting these results is provided.
Some claims are discussed but never qualitatively verified - e.g. authors claim that the proposed technique is specifically designed to not lead to catastrophic forgetting, but I could not find any evidence that would support that. Moreover, from Figure 3 it might seem like the model not only forgot how to generate this specific bus, but might have forgotten how to generate all buses - I am not entirely sure if that is the intended behavior. I would appreciate a qualitative confirmation that the model did not just “forget everything”.
I have minor concerns regarding the delta G metric. While Georgiev et al. [12] supposedly already showed in an accepted (workshop) submission that two models trained independently on the same or similar datasets generate similar images when primed with the same noise input - justifying the existence of metric (3) - I’d appreciate some results confirming that this is indeed the case for these models in these experiments. For example, if authors showed at least qualitatively that removing images most influential for generating a particular “bad” bedroom from the training set does not in any way affect the generation of unrelated images (e.g. an image of mountains), that would help. Or, even better, by performing textual inversion on the “undesired” input after removing the images that affect it the most, to show that it can no longer be represented by the model. Otherwise, there is no way to tell if a particular “bad generation” just “migrated” elsewhere.
Another minor concern: since authors performed multiple gradient update steps, I wonder whether other methods can also be used with multiple update steps.
Questions
In addition to the weaknesses listed above, here are some questions/concerns I had while reading the paper:
Figure 1 - it is not said explicitly, but after several attempts I figured out that the “bad” output that we want to remove in this figure is “bedrooms”, so it makes sense that the method assigned high loss (bottom orange bars) and “picked” bedroom images; the fact that only a single image (the mountain) is not picked by the method and remains in the training set is confusing. It might be worth keeping only two bedrooms, adding a couple more images with low loss, and somehow highlighting that “bedrooms are picked for removal”, not “only a single mountain is picked for keeping”.
L130 vs L104 - it is not clear why you need to introduce \tilde{\theta} if you already have \theta, especially given that in (2) the RHS appears to not depend on \theta (w\o tilde) and L157 also uses \theta (w\o tilde), so the use of the tilde there is somewhat inconsistent and confusing; looking at L483-499 I think I understand now that the implied difference between the two is “optimal” vs “updated”, but I think this distinction is somewhat confusing in the main paper
L132 - \epsilon is never introduced at this point, so this is a bit confusing
L137-139 - trained from the same random noise or generated from the same random noise? worth rewording
L168-172 - these lines appear to not follow from the previous paragraphs introducing the notation for the attribution algorithm since it already provides per-sample scores (and therefore does not require 2^K evaluations); or not, and each tau(\hat z, z_i) is assumed to potentially involve iterating over all subsets of the training set (w\ and w\o z_i) - if so, please elaborate
Figure 2 - “Notably, our method better matches the poses of the buses (considering random flips during training) and the poses and enumeration of skiers.” - I would not say that it is very apparent from the figure.
Limitations
Authors do mention some limitations of their work.
Thanks for the thorough comments and suggestions.
Results Presentation
Tables. Thanks for the suggestion. We include a table for baseline comparison in Tab. 1 of the response PDF, which corresponds to results in Figs. 4, 7, and 8 of the main text. We will include this, along with a similar table for ablation studies, in our revision.
More context for supplemental results. We will discuss our ablation studies more in the main text in our revision. To elaborate on the findings reported in Appendix C.1, we find that using a subset of weights to unlearn leads to better attribution in general. We test three weight subset selection schemes (Attn, Cross Attn, Cross Attn KV), all of which outperform using all weights. Among them, updating Cross Attn KV performs the best, consistent with findings from model customization [1,2] and unlearning [3].
In addition, we discussed the deviation of generated output (delta G metric) in Section 5.1 of the main text (L271-278). Due to the space limit, we included the corresponding figures in the Appendix. We are happy to move some related figures (Figs. 7 or 8) to the main text in the revision.
“Equivalent to X random points” metric. We introduced this metric in L229 of the main text to convert DDPM loss changes (which are not intuitive to interpret) into a budget of images (which is more understandable). For example, removing 500 images (0.4% of the dataset) predicted by our approach causes the same performance drop as randomly removing around half of the dataset. We will clarify this in the revision.
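As an illustration of how the conversion can be read off, here is a minimal sketch with hypothetical numbers; `random_sizes` and `random_loss_change` stand in for the measured loss changes when removing random subsets of increasing size.

```python
import numpy as np

# Hypothetical calibration curve: loss change on the synthesized image after retraining
# without N *randomly chosen* training images (values below are placeholders).
random_sizes = np.array([500, 10_000, 60_000, 118_000])        # images removed
random_loss_change = np.array([0.0005, 0.0015, 0.0050, 0.0080])

# Hypothetical loss change when removing K=500 images ranked by the attribution method.
ours_loss_change = 0.0051

# "Equivalent to X random points": invert the random-removal curve at our loss change.
equivalent_random = np.interp(ours_loss_change, random_loss_change, random_sizes)
print(f"Removing 500 attributed images ~ removing {equivalent_random:,.0f} random images")
```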
Evidence for preventing catastrophic forgetting
Please see G1 in our general response, where we quantitatively verify that the target image is forgotten, while other information is retained.
Did the leave-K-out model “forget everything” about buses?
The leave-K-out model “forgets” the specific query bus, while retaining other buses and other concepts. Fig. 1 of the response PDF shows a qualitative study of this. We also report a quantitative study of this in the general response G2.
Justification of delta G metric
In general response G2, we follow the suggestion and study whether leave-K-out models affect the generation of unrelated images.
In general response G3, we also verify that Textual Inversion cannot faithfully reconstruct the images, therefore making it a less ideal choice to check whether a concept is represented by leave-K-out models.
Can other methods use multiple update steps?
Existing influence function methods for text-to-image models are based on a local linear approximation. The approximation is based on a single, infinitesimal network update, so it is not compatible with multiple update steps.
Feature similarity methods leverage fixed features, so network updates will not apply to these methods.
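For context, these influence-function baselines build on the standard first-order (Koh & Liang-style) approximation around the trained parameters \theta^*, sketched below; because it linearizes the loss at \theta^*, it corresponds to a single infinitesimal update, which is why multiple update steps do not carry over (TRAK/JourneyTRAK additionally use random projections and ensembling).

```latex
% Standard first-order influence approximation (Koh & Liang style) around the
% trained parameters \theta^*; TRAK/JourneyTRAK add random projections and ensembling.
\tau(\hat{z}, z_i) \;\approx\;
  -\,\nabla_\theta \mathcal{L}(\hat{z};\theta^*)^{\top} \, H^{-1} \,
      \nabla_\theta \mathcal{L}(z_i;\theta^*),
\qquad
H \;=\; \frac{1}{n}\sum_{j=1}^{n} \nabla_\theta^{2}\,\mathcal{L}(z_j;\theta^*).
```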
Other clarifications
Fig. 1 images. We will include more non-bedroom images in Fig. 1 in the revision.
The use of \tilde{\theta}. Yes, \tilde{\theta} refers to the “optimal” pretrained model. Since it is a constant term in the EWC regularization loss (Eq. 4), we use a different term to separate it from \theta, the “updated” parameters for unlearning. We will make the notation more consistent in the revision and rename it for more clarity.
Introducing \epsilon. In L132, \epsilon is the noise map introduced in L105. We will clarify this by mentioning that it is the noise map again in L132.
Generated from the same random noise. In L137-139, the images are generated from the same random noise. We will rephrase the sentence as follows:
“Georgiev et al. find that images generated from the same random noise have little variations, even when they are generated by two independently trained diffusion models on the same dataset.”
Beginning of method section. To resolve the confusion about L168-172, we clarify our formulation as follows:
- If we had infinite compute and a fixed budget of K images, we could search for every possible subset of K images and train models from scratch. If removing the set of K images leads to the most “forgetting” of the synthesized image, this set should be the most influential set.
- Of course, the above is impractical, so we simplify the problem by individually estimating the influence of each training point.
- One way to estimate the influence of a training image is to obtain a model that unlearns the training image. However, for attribution, it is expensive to run unlearning on every single training image.
- To resolve this issue, we instead apply unlearning to the synthesized image and then assess how effectively each training image is also forgotten as a result. It is much faster as we only need to run unlearning once.
We will reorganize the beginning of the method section to make this clear.
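To make this concrete, below is a simplified PyTorch-style sketch of the procedure. It assumes a hypothetical `diffusion_loss(model, image, caption)` helper returning the per-example DDPM loss, and it uses plain gradient ascent with a diagonal-Fisher EWC penalty in place of the Newton updates on the Cross-Attn KV subset described in the paper, so it illustrates the idea rather than our exact implementation.

```python
import copy
import torch

def unlearn_synthesized(pretrained_model, fisher, z_hat, caption,
                        steps=1, lr=1e-4, ewc_weight=1.0):
    """Increase the loss on the synthesized image z_hat while an EWC penalty
    (diagonal Fisher) keeps the parameters close to the pretrained model."""
    model = copy.deepcopy(pretrained_model)
    theta_star = {n: p.detach().clone() for n, p in model.named_parameters()}
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        loss_target = diffusion_loss(model, z_hat, caption)   # assumed differentiable helper
        ewc = sum((fisher[n] * (p - theta_star[n]) ** 2).sum()
                  for n, p in model.named_parameters())
        objective = -loss_target + ewc_weight * ewc           # maximize target loss, stay close
        opt.zero_grad()
        objective.backward()
        opt.step()
    return model

@torch.no_grad()
def attribution_scores(pretrained_model, unlearned_model, train_set):
    """Score each training example by how much its loss increased after unlearning;
    higher scores mean more influential for the synthesized image."""
    scores = []
    for image, caption in train_set:
        before = diffusion_loss(pretrained_model, image, caption)
        after = diffusion_loss(unlearned_model, image, caption)
        scores.append(float(after - before))
    return scores
```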
Citations
[1] Kumari et al. Multi-Concept Customization of Text-to-Image Diffusion.
[2] Tewel et al. Key-Locked Rank One Editing for Text-to-Image Personalization.
[3] Kumari et al. Ablating Concepts in Text-to-Image Diffusion Models.
I appreciate the effort to explain the original motivation (for metrics used, etc.) and I agree with it. I also find Table 1 in the rebuttal pdf much more convincing (and thank you for error bars).
Having read the rebuttal message, pdf, and rebuttals for other reviews, I think the final paper would be greatly improved by the addition of results and formatting changes (format results as tables) proposed during review. I think Figure 1 in the rebuttal also mostly addresses my concerns re G delta metric (shows that other images are indeed reconstructed almost perfectly confirming claims of Georgiev et al). For completeness, I'd appreciate Related/Other MSE/CLIP measurement for other baselines (if images are generated and stored, this should not be difficult)? And (later) I would encourage authors to put in more work into Textual Inversion experiments - people report much better reconstruction quality than what was provided in the rebuttal Fig 3.
With the addition of the (many) quantitative and qualitative evaluations added in the rebuttal, this submission shapes up into a convincing, well-motivated work. My only concern is that with that many changes, a significant rewrite of the second half of the paper will take place - and that will not be peer-reviewed. Given that the authors did a good job with both 1) the first half of the original submission and 2) provided convincing results in this rebuttal - I tend to believe that they will be able to rewrite the second half of the paper incorporating all the feedback and experiments provided above.
In the light of the previous paragraph, I increased my final rating to Accept, but I encourage authors to do a thorough reword of the second half of the paper.
Thanks for increasing the rating. We will incorporate the feedback in our revision and reword/reorganize the second half of the paper.
We thank the reviewers for their helpful comments. We are happy that reviewers found that our paper motivated the problem well (ZhUS, UBLR), provided an extensive literature review (ZhUS, UBLR), proposed interesting findings (ZhUS, YhLa), and conducted extensive experiments (8wwb, YhLa). We will provide clarifications to questions shared across reviewers here.
(G1) Effectiveness in Unlearning Synthesized Images (ZhUS, UBLR)
Our attribution method relies on unlearning synthesized images, making it crucial to have an unlearning algorithm that effectively removes these images without forgetting other concepts. Following reviewers ZhUS and UBLR’s suggestions, we analyze the performance of our unlearning algorithm itself and ablate our design choices.
We construct experiments by unlearning a target synthesized image and evaluating:
- unlearning the target image: We measure the deviation of the regenerated image from the original model’s output—the greater the deviation, the better.
- retaining other concepts: We generate 99 images using different text prompts and evaluate their deviations from the original model’s output—the smaller the deviation, the better.
We measure these deviations using mean square error (MSE) and CLIP similarity. We evaluate across 40 target images, with text prompts sampled from the MSCOCO validation set.
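For reference, a minimal sketch of the two deviation measures, assuming the generations are PIL images and using a Hugging Face CLIP image encoder; the checkpoint name is illustrative and may differ from the feature extractor we actually use.

```python
import numpy as np
import torch
from transformers import CLIPModel, CLIPImageProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")

def mse_deviation(img_a, img_b):
    """Pixel-wise deviation between two generated PIL images (higher = more forgotten)."""
    a = np.asarray(img_a, dtype=np.float32) / 255.0
    b = np.asarray(img_b, dtype=np.float32) / 255.0
    return float(np.mean((a - b) ** 2))

@torch.no_grad()
def clip_similarity(img_a, img_b):
    """Cosine similarity between CLIP image features (lower = more forgotten)."""
    inputs = processor(images=[img_a, img_b], return_tensors="pt")
    feats = clip.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return float(feats[0] @ feats[1])
```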
We compare to the following ablations:
- SGD refers to replacing our method’s Newton update steps (Eq. 5 in the main text) with the naive baseline mentioned in L174 of the main text, where we run SGD steps to maximize the target loss without EWC regularization.
- Full weight refers to running our Newton update steps on all of the weights instead of Cross Attn KV.
The following table, along with Fig. 2 of the response PDF, shows the comparison.
| | Target MSE (↑) | Target CLIP (↓) | Other MSE (↓) | Other CLIP (↑) |
|---|---|---|---|---|
| SGD | 0.081±0.003 | 0.67±0.01 | 0.033±0.0004 | 0.83±0.002 |
| Full weight | 0.086±0.005 | 0.7±0.01 | 0.039±0.001 | 0.86±0.002 |
| Ours | 0.093±0.004 | 0.65±0.01 | 0.022±0.0004 | 0.89±0.002 |
As shown in the figure and the table, both our regularization and weight subset optimization help unlearn the target image more effectively, without forgetting other concepts.
(G2) Do leave-K-out models forget other images? (ZhUS)
In Sec. 5 of the main paper, we show that leave-K-out models forget how to generate target synthesized image queries. Reviewer ZhUS raises an interesting question about whether these models forget unrelated images, too. Our findings show that the answer is no: leave-K-out models forget only the specific concepts while retaining others.
Following reviewer ZhUS’s suggestion, we study how much leave-K-out model’s generation deviates from those of the original model in three categories:
- Target images: the attributed synthesized image. Leave-K-out models should forget these—the greater the deviation, the better.
- Related images: images synthesized by captions similar to the target prompt. We obtain the most similar 100 captions from the MSCOCO val set using CLIP’s text encoder. Leave-K-out models should not forget all of them—the smaller the deviation, the better.
- Other images: images of unrelated concepts. Prompts are 99 different captions selected from the MSCOCO val set. Leave-K-out models should not forget these—the smaller the deviation, the better.
In Fig. 1 of the response PDF, we find that the leave-K-out model “forgets” the query bus image specifically while retaining other buses and other concepts.
Similar to G1, we quantitatively measure deviations using mean square error (MSE) and CLIP similarity. We evaluate 40 pairs of target images and leave-K-out models. The following table shows the results.
| | Target MSE (↑) | Target CLIP (↓) | Related MSE (↓) | Related CLIP (↑) | Other MSE (↓) | Other CLIP (↑) |
|---|---|---|---|---|---|---|
| K=500 | 0.054±0.004 | 0.719±0.0123 | 0.039±0.0003 | 0.862±0.0009 | 0.041±0.0003 | 0.788±0.0014 |
| K=1000 | 0.058±0.0043 | 0.675±0.0136 | 0.041±0.0003 | 0.855±0.0009 | 0.041±0.0003 | 0.788±0.0014 |
| K=4000 | 0.06±0.0036 | 0.612±0.0136 | 0.046±0.0004 | 0.831±0.0011 | 0.041±0.0003 | 0.787±0.0014 |
We find that target images have larger MSE and lower CLIP similarity than related images and other images. Also, as the number of removed influential images (K) increases, the target image error increases rapidly while other images stay almost the same. Interestingly, related images’ errors increase with larger K, but the errors are still much smaller than those of target images. This can be due to the fact that as K increases, the group of influential images can start affecting other related concepts.
(G3) Running Textual Inversion for Leave-K-Out Models? (ZhUS, YhLa)
Both reviewers ZhUS and YhLa suggest using Textual Inversion to check whether leave-K-out models forget the target concept. In fact, we find that Textual Inversion cannot reconstruct a given image faithfully. In Fig. 3 of the response PDF, we run Textual Inversion on the original model, and the inverted results are variations of the reference images instead of faithful reconstructions. This finding coincides with the actual goal of Textual Inversion: generate variations of a user-provided image.
Hence, we believe that deviation of generation (delta G) is a suitable choice for evaluation. In the previous response (G2), we further analyze and validate this choice.
The paper received positive reviews with scores of 7, 6, 6, and 6. All reviewers praised its strengths and recommended acceptance, contingent on the authors implementing the changes outlined in the rebuttal. The meta-reviewers considered it for a spotlight presentation, but after further discussion and comparison with other submissions, we decided that a poster presentation is most appropriate. Congratulations on the strong work!