Déjà Vu Memorization in Vision–Language Models
Abstract
We show that vision-language models like CLIP can memorize the objects present in the training images, and we evaluate different mitigation strategies to minimize this memorization.
Reviews and Discussion
This work studies training data memorisation in Vision-Language Models (VLMs). The paper focuses on contrastive learning with OpenCLIP, using a private Shutterstock dataset and a filtered LAION-50M, with evaluation on ImageNet. The paper proposes a method and metrics to measure déjà vu memorization and, in addition, explores mitigation strategies. The paper concludes by stating it has demonstrated the presence of memorization.
Strengths
This paper is well-written, explores an important issue, and proposes both new ways of measuring the issue and ways of mitigating it. Many experiments are run, addressing many of the questions this work raises.
Weaknesses
W1: The main weakness of this work is that it does not sufficiently distinguish memorisation from learning. On line 162 the paper states: "If no memorization occurs, models fA and fB should be interchangeable and hence this gap is zero." While one would expect that two models trained on sufficiently large data would eventually converge, it is unlikely that the models differing before this point is purely due to memorisation.
Two models trained on the first 500 and the second 500 classes of ImageNet-1K will have significantly different performance on in-distribution versus out-of-distribution samples, yet, memorisation does not seem like the most likely explanation for this.
Similarly, with two models trained on the same dataset with the exception of a single image-text pair, there may be a difference in performance. This paper attributes this to memorisation of that image-text pair; however, I'd argue this could also be explained by learning, depending on the size of the original datasets and the 'importance' of the held-out data point. For instance, if the held-out data point is the only image-text pair to contain the 'cat' concept, then the loss may be very high for this data point and, as such, it influences the network weights disproportionately. As the dataset size grows, it becomes more likely that most concepts in the test set will have been seen during training, and thus the importance of any individual data point decreases.
While memorisation may also be at play here, it is difficult to disentangle it from learning without a proper discussion of what distinguishes the two and how this may be seen in the experiments. In particular, if one takes the view that learning is compression of data points, the boundary with memorisation becomes very blurry.
W2: A second, but lesser, weakness is that the mitigation experiments do not explicitly address the multi-modal nature of VLMs. Section 5.4 discusses that images are already augmented, and an additional text masking strategy is proposed to match this. Yet, based on this parallel, it would then seem logical that data augmentation of images also prevents memorisation, which is not explored. On the other hand, given how strongly the proposed memorisation testing strategy depends on the text prompts, it makes sense that the measured metrics drop; but this does not exclude image memorisation within the VLM, which may be unaffected by this augmentation approach.
W3: Two minor points: 1) Figure 1 is rather challenging to decipher before having thoroughly studied the text, and afterwards the added value is minimal. Consider removing/updating this figure. 2) Line 233 discusses 'the adversary', which is not clear within the context of the paper.
Questions
The main question concerns W1: how does this work distinguish between memorisation and learning, and to what extent can the two be disentangled when interpreting the results?
If this point can be addressed I would switch to a more positive score, as I do think the work is interesting, but as long as it leaves open this alternative explanation I will recommend a reject.
Limitations
Not applicable.
Thank you for a detailed review of our paper and for raising important questions. Below we would like to clarify some of the key points raised in the review.
Distinguishing between memorization and learning
To clarify, we believe a model can memorize and generalize (or learn) at the same time. This can happen at a subpopulation level, where the model memorizes rare concepts and generalizes to common concepts, or even at a sample level, where memorization is required for learning rare concepts as theorized in Feldman (2020) [1].
Our notion of deja vu memorization is meant to go beyond this, and instead examine when a model that is trained on an image with a broad and generic caption, memorizes many small details about the associated image when given the caption. As a concrete example, in Figure 1 in the paper, the caption is “healthy fruits have vitamin C and D”. A well-generalized model will associate this caption with diverse fruits, which is exactly what model B does. In contrast, model A, which is trained on this image-caption pair, associates this caption with (almost) the same fruits that are in the image. In other words, we define deja vu memorization as what can be inferred about the training image from its caption beyond simple correlations, which can happen through both learning and memorization in the traditional sense.
Multi-modal defenses, such as impact of image augmentation, are not explored
While we agree that, in theory, having neither image augmentation nor text augmentation would lead to worse memorization, practical CLIP-style models use image augmentation by default (but not text augmentation), and hence we take that as our undefended baseline. We also note that the CLIP training objective interacts across modalities only via image-text alignment and as such does not specifically promote image-image alignment or text-text alignment. The only way the model can memorize image features is therefore via text features, which is why we explore text augmentation, something that is not done by default in these CLIP-style models.
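For concreteness, here is a minimal sketch of the kind of on-the-fly text-token masking described above; the masking rate, the mask token, and the whitespace tokenization are illustrative assumptions rather than the exact training configuration used in the paper.

```python
import random

def mask_caption(caption: str, mask_rate: float = 0.3, mask_token: str = "<mask>") -> str:
    """Randomly replace a fraction of caption tokens with a mask token.

    Illustrative sketch only: the actual tokenizer, mask token, and masking
    rate used during training may differ (tokens could also simply be dropped).
    """
    tokens = caption.split()
    return " ".join(mask_token if random.random() < mask_rate else t for t in tokens)

# Each epoch sees a different masked view of the same caption, which weakens
# the text pathway through which image details could be memorized.
print(mask_caption("healthy fruits have vitamin C and D", mask_rate=0.3))
```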
Other points raised
- Regarding the comment “Consider removing/updating this figure.”, we will simplify Figure 1 to make it more interpretable.
[1] Vitaly Feldman. "Does learning require memorization? a short tale about a long tail." In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pp. 954-959. 2020.
We thank the reviewer for their valuable time. We would like to know if our rebuttal has answered the questions, and would be happy to discuss further if the reviewer has any other concerns.
Thank you for the additional explanation; however, for me this raises a number of additional questions. Firstly, how is a "broad and generic caption" defined, and secondly, how are "small details about the associated image" defined? I guess within the context of the paper, these small details are the unique objects? Given that, this seems limited in scope, as this may well extend beyond objects to aspects such as colour, texture, or composition, which are not studied.
The example in Figure 1 is rather unclear even with the additional explanation, as it seems to imply that the results from model B are preferred. Yet, if there is such a notion as "healthy fruits" (i.e., some fruits would not be healthy), then it makes sense that the model returns those fruits for which it has seen data indicating that they are indeed healthy. In other words, if the caption had been "healthy foods have vitamin C and D", then it seems appropriate that the model is biased towards returning foods it has seen described as healthy rather than diverse foods which may not be healthy (e.g., hamburgers or hotdogs), which still seems grounded in 'simple correlations'.
Other reviewers have brought up the notion of overfitting - does the deja vu memorisation phenomenon discussed imply overfitting?
We thank the reviewer for following up on the rebuttal. We would like to further clarify the points raised.
- Regarding “broad and generic caption”: The notion of “generic” is relative to how much the caption explains the image. A more specific caption (such as the manually written captions in the COCO dataset) would describe every little detail of the image, such as what objects are present, what action or event is depicted, etc. In the internet-scraped large-scale datasets that VLMs use, the captions are often very generic, in that they do not describe the image in high detail. The fact that captions do not capture everything in the images is well known in the VLM community [1], [2]. This phenomenon allows for the deja vu memorization that we study in our work.
- Regarding “extending beyond objects”: Since these datasets have no annotations, we rely on an open-source annotator to obtain ground-truth object annotations. While our approach is also applicable to more detailed annotations that go beyond objects, this is not in the scope of this work. Even in this setting, the prior state-of-the-art memorization approach [3] only considers a single object label per image (ImageNet), and none of the prior works consider (a) a multimodal setting, (b) large training set sizes, and (c) multiple objects per image.
- Regarding “example in Figure 1”: Please note that the point of Fig. 1 is not to evaluate the relevance of image retrieval. The only point we are trying to make is that although the caption typically contains strictly less information than the image, the model can recover extraneous details about the image through memorizing the caption.
- Regarding “Other reviewers have brought up the notion of overfitting - does the deja vu memorisation phenomenon discussed imply overfitting?”: To reiterate, deja vu memorization measures overfitting at a more granular level (i.e., not just a binary overfit/not-overfit judgment), as we articulated in the previous response. We will include a detailed discussion about this in the paper.
[1] Sachit Menon, Ishaan Chandratreya, and Carl Vondrick. "Task Bias in Contrastive Vision-Language Models." International Journal of Computer Vision, 132:1-15, 2023. doi:10.1007/s11263-023-01945-0.
[2] Samuel Lavoie, Polina Kirichenko, Mark Ibrahim, Mahmoud Assran, Andrew Gordon Wilson, Aaron Courville, and Nicolas Ballas. "Modeling caption diversity in contrastive vision-language pretraining." arXiv preprint arXiv:2405.00740 (2024).
[3] Casey Meehan, Florian Bordes, Pascal Vincent, Kamalika Chaudhuri, and Chuan Guo. “Do ssl models have déjà vu? a case of unintended memorization in self-supervised learning.”, NeurIPS, 2023.
Thank you for the additional explanation, I believe that the definitions of the phenomenon investigated within the scope of this work and how this influences models/results could be sharpened still. Yet, I see the merits of the work and will increase my rating to a 4.
This paper investigates the issue of overfitting to pre-training data in Vision Language Models (VLMs) like CLIP. The authors conduct a comprehensive set of experiments focusing on text-to-image retrieval to evaluate this phenomenon. Their findings indicate that VLMs often memorize the data encountered during pre-training, which can impact the performance of downstream applications. However, the experiments also reveal that this problem diminishes as the scale of pre-training data increases.
Overall, I think this paper is important for the community since it tries to understand the limitations of popular multimodal foundation models like CLIP. It does have certain limitations, but I think it is a novel and thorough evaluation aimed at understanding CLIP through the lens of its pretraining data.
Strengths
- The paper is well written and easy to follow. The metrics and methodology used are properly explained.
- Vision Language Models (VLMs) like CLIP have become crucial for downstream applications, including multimodal chatbots and text-to-image generation systems. Evaluating these models for their limitations is essential to improve their foundational capabilities. The authors assess these models by examining their tendency to memorize training data, a well-documented issue in traditional machine learning models, providing a strong motivation for this study.
- The experiments carried out make sense, and the authors also evaluate a few methods to mitigate the issue and present empirical results for each method tried.
Weaknesses
- While the authors evaluate four mitigation strategies, none of them effectively address the identified problem. It would have been beneficial to see a strategy that not only mitigates the problem but also enhances the model's utility.
- I believe that Figure 6 should be included in the main paper. Since the authors discuss it in detail, having it in the main paper would make it easier to follow and understand.
Questions
- Recent work has shown that the pretraining data of OpenCLIP models suffers from long-tail issues [1]. I would like to know if the authors evaluated this problem in a semantic sense as well? For example, for concepts that are rare in the pretraining data, does the model tend to memorise these concepts more?
[1] Parashar et al. (2024). The Neglected Tails in Vision-Language Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Limitations
- Although the authors conduct experiments to evaluate the impact of data scale, such as scaling from 10M to 50M image-caption pairs, they do not perform orthogonal experiments to assess the impact of model size. Evaluating larger models, such as the ViT-L/14, would have been a natural next step and could have provided valuable insights.
- I disagree with the assertion that VLMs are typically pretrained on the scale of 10M image-caption pairs. For instance, the first CLIP model was pretrained on 400M image-caption pairs, and subsequent models have used even larger datasets. Although conducting experiments on such large datasets may be challenging, evaluating this problem at the more common pretraining data scale would have been more relevant, especially since the authors state that the problem diminishes as the data scale increases.
We thank the reviewer for their detailed review and for acknowledging the importance of our work. Below we respond to some of the points raised.
Long-tail issues of pretraining data
Similar to Parashar et al. (2024), our work also explores memorization related to the pretraining data of OpenCLIP models, although our goals are orthogonal to theirs. While they study long-tail effects in terms of class labels, we explore long tails in terms of image-text pairs: our long tail consists of the cases where the text captions are not very descriptive of the images and are thus memorized more by the models.
Other points raised
- Regarding “It would have been beneficial to see a strategy that not only mitigates the problem but also enhances the model's utility.”: We find an inherent trade-off between privacy and utility, similar to prior works [1-3] in the membership inference and attribute inference literature, and thus it is difficult to find a mitigation that uniformly achieves both high utility and low memorization across all downstream tasks. That said, it is possible to improve utility on some tasks, as shown in the new ARO benchmark results included in Figure 1 of the attached pdf (please see the global rebuttal section for more details). Among all the mitigations that we explore, our text masking achieves the best trade-offs, and it even improves accuracy on the COCO ordering task, which requires compositional reasoning and ordering ability.
- Regarding “I believe that Figure 6 should be included in the main paper.”: Thank you for pointing this out; we will be happy to move the figure to the main paper.
[1] Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. Membership inference attacks against machine learning models. In IEEE Symposium on Security and Privacy, 2017.
[2] Samuel Yeom, Irene Giacomelli, Matt Fredrikson, and Somesh Jha. Privacy risk in machine learning: Analyzing the connection to overfitting. In IEEE Computer Security Foundations Symposium, 2018.
[3] Yunhui Long, Vincent Bindschaedler, and Carl A. Gunter. Towards measuring membership privacy. arXiv:1712.09136, 2017.
Thanks for your response, I will stay with my rating of 7.
We thank the reviewer for their valuable time. We would be happy to discuss further if the reviewer has any other concerns.
This paper explores the concept of training data memorization within vision-language models (VLMs). The authors introduce a method to measure the degree of memorization by analyzing the fraction of ground-truth objects in an image that can be predicted from its text description. The study reveals a significant level of memorization, which is evaluated at both sample and population levels. The findings indicate that OpenCLIP retains information about individual objects from the training images beyond what can be inferred from correlations or image captions. Additionally, the paper demonstrates that text randomization can reduce memorization with only a moderate impact on the model's performance in downstream tasks.
Strengths
- Pioneering Approach: The paper addresses a critical and complex issue by proposing a novel method to measure memorization in VLMs. This method, despite its imperfections, lays the groundwork for future research, offering a baseline for further refinement and extension. This work opens up new avenues for exploring memorization in multimodal settings, which is a non-trivial task.
Weaknesses
- Lack of Interpretability: The major weakness of this work is the lack of interpretability of the proposed method. The proposed method is complex and not straightforward in its measurement of memorization. The reliance on an external object detector introduces additional biases and imperfections, complicating the interpretation of the results. For instance, the meaning of a PRG score of 0.17 is unclear, and the authors should provide guidance on interpreting these metrics.
- Absence of Baseline Comparisons: The paper would benefit from comparisons with simple baselines to contextualize the proposed method's effectiveness. Although identifying suitable baselines is challenging, their inclusion could strengthen the validity of the findings.
- Clarity Issues: Some aspects of the paper are difficult to understand. Figure 1 contains too much information and lacks a clear structure, making it hard to follow. Additionally, the results section is laborious to read due to the excessive use of acronyms.
Questions
- I understand that memorisation is an issue with generative models as they can regurgitate training data during inference time. But why is memorization important in non-generative models? What are the potential risks or drawbacks? I see that it could negatively impact its predictions, but is there something else?
- Would reproducing the experiments with multiple pairs of fA and fB trained with different random seeds yield more robust results, or do the authors believe this to be unnecessary?
- This paper tackles a challenging and important problem in the field of vision-language models. While the proposed method has limitations, its novelty and potential for future research make it a valuable contribution. This is why I am giving it a borderline accept. The main areas for improvement are enhancing the interpretability of the method (or convincing me that it is interpretable), including baseline comparisons (if possible), and improving the clarity of the presentation. With these improvements, the paper could make a stronger case for acceptance.
[edit: Based on the rebuttal and the other reviews, I decided to increase my rating to 'weak accept'.]
Limitations
Limitations are addressed in the paper.
Thank you for your detailed review and for raising important questions that have helped us better shape our paper. Below we respond to the key points raised in the review.
Lack of interpretability of metrics
Our memorization metrics are built bottom-up from our notion of deja vu memorization for VLMs. As a motivating example, consider the use of CLIP in a cross-modal retrieval task, where images are retrieved from a web-scale database given text. We wish to capture the degree of surprise in the retrieval result when the model memorizes training captions, i.e., how many objects can the model recover beyond dataset-level correlation? This prompted us to use an object detector to provide ground-truth annotations for measuring the precision and recall of object recovery. At the sample level, we compute the precision and recall of object retrieval for the target and reference models. A positive gap corresponds to the target model memorizing the training sample, and the magnitude of the gap indicates the degree of memorization. At the population level, we compute the fraction of training samples where the sample-level precision and recall gaps are positive, and we report these aggregate statistics as PPG and PRG, respectively.
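To illustrate how these quantities fit together, below is a minimal sketch of the sample-level gaps and one plausible population-level aggregation; the function names are hypothetical, and the exact statistic reported as PPG/PRG in the paper may differ from this simplification.

```python
from typing import List, Set, Tuple

def object_precision_recall(recovered: Set[str], ground_truth: Set[str]) -> Tuple[float, float]:
    """Precision/recall of objects recovered from a caption, against the
    detector-provided ground-truth objects of the associated training image."""
    if not recovered or not ground_truth:
        return 0.0, 0.0
    hits = len(recovered & ground_truth)
    return hits / len(recovered), hits / len(ground_truth)

def population_gaps(target_objs: List[Set[str]],
                    reference_objs: List[Set[str]],
                    ground_truths: List[Set[str]]) -> Tuple[float, float]:
    """Fraction of training samples where the target model (trained on the
    sample) recovers objects with a strictly positive precision/recall gap
    over the reference model (not trained on it). The paper's PPG/PRG are
    aggregates of this kind; the exact normalization may differ."""
    n = len(ground_truths)
    prec_pos = rec_pos = 0
    for t_objs, r_objs, gt in zip(target_objs, reference_objs, ground_truths):
        p_t, r_t = object_precision_recall(t_objs, gt)
        p_r, r_r = object_precision_recall(r_objs, gt)
        prec_pos += p_t > p_r
        rec_pos += r_t > r_r
    return prec_pos / n, rec_pos / n
```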
While there will always be some bias when using object detectors, human or automated, this bias should not affect our evaluation when considering the gap between the two models. This is because the object detector is not trained on the same training set as the VLM, hence any incurred bias should be independent of the trained VLMs.
Absence of baselines
We note that this is the first work on VLM memorization and as such there are no baselines. The closest baseline could be the image-only deja vu attack of [1], but in their setting there is only one object per image, whereas here we have multiple objects so their method is not applicable here. For their method to work, we would need a background crop that does not contain the foreground object. In our data set, due to the presence of multiple objects, most of the images would not have any meaningful background crops.
Clarity of the paper
We will improve the readability of the paper by simplifying Figure 1 and clearly establishing the acronyms before the results section. Further, we will reiterate the acronyms (e.g., PPG and PRG) in the results section to remind readers what those metrics mean and how to interpret their values.
Other points raised
- Regarding “why is memorization important in non-generative models?”: VLMs such as CLIP are often used in cross-modal retrieval tasks, e.g., retrieving relevant images given text. This use case closely resembles the VLM deja vu test that we propose: given a training text caption, retrieve images from the public set (a minimal sketch of this retrieval step follows this list). In other words, the deja vu score measures the degree of surprise in the retrieved images as a result of memorization. As a hypothetical example, suppose a training image contains a specific person in a specific place, and the caption is the person's name. If the model, given this caption, can predict the background location, then that might leak information. Beyond cross-modal retrieval, CLIP is also used in text-to-image generative models such as Stable Diffusion to provide text conditioning. Since our metric is meant to be more general, we did not test this use case explicitly, but we believe memorization in the CLIP model can manifest as surprise in the generated images as well.
- Regarding “Would reproducing the experiments with multiple pairs of fA and fB trained with different random seeds yield more robust results, or do the authors believe this to be unnecessary?”: Thank you for the suggestion! We ran an additional experiment where we repeated our experiments for 4 runs, each time using a different seed and training fA and fB from scratch, and found the PPG and PRG metrics to be largely unchanged. For predicting top-10 labels with 100 NNs, both the average PPG and PRG values across the 4 runs are 0.066 ± 0.001. This result suggests that having only two models fA and fB can be sufficient for measuring memorization of VLMs.
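As referenced in the first point above, here is a minimal sketch of the caption-to-image retrieval step underlying the deja vu test, written against the open_clip API; the checkpoint name and the value of k are illustrative assumptions, and in the paper the target and reference models are trained from scratch rather than loaded from an off-the-shelf release.

```python
import torch
import open_clip

# Hypothetical off-the-shelf checkpoint, used here purely for illustration.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

@torch.no_grad()
def retrieve_top_k(caption: str, public_image_embeddings: torch.Tensor, k: int = 100) -> torch.Tensor:
    """Return indices of the k public-set images closest to the caption embedding.

    `public_image_embeddings` is assumed to be an (N, d) tensor of L2-normalized
    embeddings precomputed with `model.encode_image` over the public image set.
    """
    text_emb = model.encode_text(tokenizer([caption]))
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    sims = text_emb @ public_image_embeddings.T
    return sims.topk(k, dim=-1).indices.squeeze(0)

# The objects detected in the retrieved images (by an external annotator) are
# then compared against the ground-truth objects of the training image paired
# with this caption, separately for the target and reference models.
```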
[1] Casey Meehan, Florian Bordes, Pascal Vincent, Kamalika Chaudhuri, and Chuan Guo. “Do ssl models have déjà vu? a case of unintended memorization in self-supervised learning.”, NeurIPS, 2023.
Thanks for the rebuttal.
The clarification about the interpretability of the metric was very useful, and should be added to the paper.
Based on the rebuttal and the other reviews, I decided to increase my rating to 'weak accept'.
We are happy that our clarification was useful; we will add it to the paper. Thank you for raising the score.
The paper proposes a methodology to measure memorization in vision-language models (VLMs). These measurements are based on the fraction of ground-truth objects in an image that can be predicted from its text description. The authors also explore different mitigation strategies.
Strengths
- The methodology is novel and useful in evaluating if the model is overfitted to the training data.
- The paper shows extensive evaluation on both population and sample-level memorization. The ablation studies on mitigation methods are comprehensive.
- The paper is well-structured and clearly written, explaining the methodology and results effectively.
Weaknesses
See Questions section.
Questions
- Have you considered combining multiple mitigation approaches? For instance, setting both weight decay and text masking rate to 0.3 could potentially yield complementary benefits.
- Have you conducted experiments to compare the performance of models trained with different mitigation approaches on other tasks, such as retrieval or compositional reasoning benchmarks?
Limitations
The authors have clearly addressed limitations in their paper.
We thank the reviewer for raising important questions, which have helped us improve our paper.
Combining mitigations
The main contribution of our work is a metric for evaluating memorization of VLMs. The ablation studies are done to validate the effect of key parameters and are not meant to be comprehensive. We do hope future work will adopt our metric to evaluate other mitigations that are either novel or combine several existing mitigation techniques.
Performance of mitigations on other benchmarks
CLIP-style models have been shown to behave like bags-of-words by Yuksekgonul et al. (ICLR 2023) [1] and as such do not perform well on compositional reasoning (ARO) tasks; nevertheless, we include the ARO benchmarks in Figure 1 of the attached pdf (please see the global rebuttal section for more details). As shown in the new results, our text masking strategy gives the best utility trade-offs.
[1] Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, and James Zou. "When and why vision-language models behave like bags-of-words, and what to do about it?." In The Eleventh International Conference on Learning Representations. 2023.
We thank the reviewer for their valuable time. We would like to know if our rebuttal has answered the questions, and would be happy to discuss further if the reviewer has any other concerns.
Thanks to the authors for the response. I have no other concerns.
We thank all the reviewers for their thoughtful comments and for raising important questions. Here we include new benchmark results across different compositional reasoning tasks for the various models we test in our paper. We include the answers to other questions and points raised in the individual responses to each review.
Additional benchmarks
As per Reviewer GQcH’s request, we have included additional experimental results comparing the performance of various models on the compositional reasoning (ARO) benchmarks of Yuksekgonul et al. (ICLR 2023) [1]. The results can be found in Figure 1 in the attached one-page pdf. We also show the impact of various mitigation strategies on these benchmarks. As shown in the new results, our text masking strategy gives the best utility trade-offs. Text masking even boosts performance on some reasoning tasks such as COCO ordering. We believe this could be due to the regularization effect of the mitigation, which avoids overfitting on specific text tokens, thereby making the model less likely to behave like a bag-of-words [1].
We will include these new results in the revision.
[1] Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, and James Zou. "When and why vision-language models behave like bags-of-words, and what to do about it?." In The Eleventh International Conference on Learning Representations. 2023.
This work investigates memorization in CLIP-like encoder-encoder VLMs. The paper proposes a new method for measuring memorization, that relies on retrieving images using the text-embeddings and comparing the retrieved images with the query image. A high fraction of co-occurring objects that are not obvious from the caption indicates a high degree of memorization.
The work received scores of 6, 7, 6, 4. While the reviewer-author discussion cleared up some concerns and 7Gny did raise their score, 7Gny has raised important points that should be discussed clearly as limitations in the final paper. The AC recommends acceptance with these caveats.