PaperHub
Rating: 6.0 / 10 (Poster; 3 reviewers; min 5, max 7, std 0.8)
Scores: 5, 7, 6
Confidence: 3.3
Correctness: 3.0 · Contribution: 2.7 · Presentation: 3.0
NeurIPS 2024

Measuring Dejavu Memorization Efficiently

OpenReview · PDF
Submitted: 2024-05-15 · Updated: 2024-11-06
TL;DR

The deja vu memorization test measures training data memorization, but is inefficient because it involves training another similar model. This work provides a simpler, more efficient way to carry out the test.

Abstract

Keywords
memorization, privacy

Reviews and Discussion

Official Review
Rating: 5

This paper proposes a method to measure memorization in representation learning models without the need to train two separate models. The proposed method aims to address the computational challenges and practical limitations of the existing déjà vu method by using simpler alternative approaches to quantify dataset-level correlations.

Strengths

  • The proposed method achieves similar aggregate results in measuring memorization as the traditional two-model setup.
  • The use of a single model and simpler classifiers (like Naive Bayes and ResNet50) to approximate dataset-level correlations is interesting, but also a limitation (see the Weaknesses section).

Weaknesses

  • The reliance on simpler models like Naive Bayes classifiers might limit the method's accuracy in certain complex scenarios.

  • There is a risk that the simpler models used for approximating dataset-level correlations might overfit their training data, which could skew the results.

  • The reference models used to approximate dataset-level correlations might introduce their own biases, affecting the accuracy of the memorization measurement.

  • The method provides an aggregate measure of memorization but lacks detailed insights into the types of memorized information. A more granular analysis could help in understanding the nature and implications of memorization in different models.

Questions

See the Weaknesses section.

Limitations

See the Weaknesses section.

Author Rebuttal

Thank you for reviewing our paper and raising those important questions!

Weakness 1 - Sample-level granular analysis

In our paper we obtain dataset-level statistics by aggregating sample-level results. Appendix C.1 provides examples for vision models and C.2 for vision language models. We ran additional experiments and shared the results in the 1-page PDF and in the global rebuttal. Please check the global rebuttal section.

Weakness 2 - The reliance on simpler models like Naive Bayes classifiers might limit the method's accuracy in certain complex scenarios

This is a good point. In our setting, Naive Bayes provided sufficiently good results on the ImageNet dataset. Overall, whether we use Naive Bayes or ResNet, we observe similar memorization patterns. Our experiments show that many of these observations (for instance, lower memorization in pre-trained models compared to models trained on subsets of the dataset) also hold for Vision Language Models (VLMs).

Weakness 3 - A risk that the simpler models used for approximating dataset-level correlations might overfit their training data, which could skew the results.

Yes, this can happen. We used early stopping to avoid overfitting. Apart from that, we retrained the models multiple times with different random seeds and observed small variance across five training runs.

Since Naive Bayes is a deterministic classifier, we used a bootstrapping approach to estimate the variance. We randomly sampled half of the examples per class and retrained the Naive Bayes classifiers on those subsets. We repeated the experiment 5 times to estimate the mean and the variance.
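To make the bootstrapping concrete, below is a minimal sketch in Python. It assumes generic feature/label arrays, uses scikit-learn's GaussianNB as a stand-in for our Naive Bayes variant, and takes a placeholder `score_fn` for the memorization statistic; it is illustrative, not our exact implementation.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def bootstrap_nb_variance(X, y, score_fn, n_runs=5, seed=0):
    """Estimate mean/std of a memorization statistic by retraining Naive Bayes
    on random half-per-class subsets (the classifier is deterministic, so we
    resample the training data instead of varying the seed)."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_runs):
        idx = []
        for c in np.unique(y):
            cls_idx = np.flatnonzero(y == c)
            idx.extend(rng.choice(cls_idx, size=len(cls_idx) // 2, replace=False))
        idx = np.asarray(idx)
        clf = GaussianNB().fit(X[idx], y[idx])
        scores.append(score_fn(clf))  # e.g. the dataset-level memorization score
    return float(np.mean(scores)), float(np.std(scores))
```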

The results are presented in the table below:

| Correlation Model | Mean  | Std. |
|-------------------|-------|------|
| VICReg + KNN      | 7.53  | 0.26 |
| Barlow T. + KNN   | 7.16  | 0.18 |
| DINO + KNN        | 6.33  | 0.04 |
| ResNet            | 7.5   | 0.27 |
| NB w/ Top-1 CA    | 4.5   | 0.08 |
| NB w/ Top-2 CA    | 6.93  | 0.12 |
| NB w/ Top-5 CA    | 8.799 | 0.21 |

The table above provides the error bars for the plot in Figure 2a of our paper. Due to limited space in the 1-page rebuttal PDF, we present the results as a table.

Weakness 4 - Biases in the reference models

Yes, this can be the case; however, we observe similar dataset-, class-, and sample-level memorization across different types of correlation models (Naive Bayes and ResNet) and different types of representation learning models, such as Vision Language Models and vision representation models. Our goal in this paper is to show that we can effectively replace the two-model déjà vu memorization test with a one-model test. We leave a detailed analysis of biases in the reference models for future work.

Comment

Thank you very much for the thoughtful review feedback! Let us know if you have any questions regarding the rebuttal.

Official Review
Rating: 7

This paper proposes methods to gauge the ability of a model to memorize its training data.

Strengths

  1. Originality: to my knowledge, this is an original work with novel results.
  2. Clarity: The writing is clear.
  3. Significance: The memorization of foundation models is critical to study; these methods and findings are significant to the community.

Weaknesses

  1. The results hinge on the smaller classifier not memorizing training data. The authors list this as a limitation, but some empirical result convincing the reader that this is in fact unlikely would strengthen the argument.
  2. The comparison in Section 4.1.2 is slightly confusing (question below).
  3. A concluding ranking or thought about commonly used models is missing (question below).

Questions

  1. In Section 4.1.2, I understand that the authors make the observation that pre-trained models memorize less than the same models trained on a subset of the data, and they offer as an explanation that the pre-trained models have lower test error and a larger training set. Can the authors clarify here what they mean? To me, if a model memorizes less and generalizes better I don't see any evidence that one is the cause of the other; these might both be the result of some other variable. Maybe the answer is as simple as a larger training set leading to both phenomena, but I'd then like to see a test where the training routine (pre-training or not) is held constant and memorization is studied as it varies with training-set size. In any case, some more clarity is needed here.
  2. I feel a paper with this many results could be stronger if it ended with a bit more advice or conclusion. Do the authors have a clear ranking of models/techniques that a practitioner should use to avoid memorization or a set of available models that a user should use if they want a model with less memorization? These points would better motivate the paper as a whole. I think section 4.2.2 aimed to do this but it left me with these questions, so a clearer more direct bit of actionable advice to the reader would be appreciated.

Limitations

Yes.

Author Rebuttal

Thank you for the detailed review of our paper and the questions!

Question 1 - Clarification on Section 4.1.2

Vision Representation

In our work we are not stating that there is a causal relationship between training-set size and memorization. We hypothesize that pre-trained models memorize less due to larger training sets. [1] shows in Figure 4 that increasing the size of the training data does not always increase the memorization score; this result, however, is reported on smaller subsets of the training data. Retraining the model on larger subsets is computationally expensive and we have not run that experiment. Our goal was to show the effectiveness of our approach for pre-trained models, not to retrain new models.

VLM

For VLMs, previous work [2] (which requires training two models, a target and a reference, for its memorization test) did run an experiment with different dataset sizes and concluded that a larger training set leads to lower population-level memorization. Our contribution is to show that their method can be applied with just one model, but our conclusion regarding dataset size remains the same as theirs.

Question 2 - Advice and a stronger conclusion for the work

Vision Representation

  • Our key contribution is to propose effective one-model tests that can be used for pre-trained vision models and are comparable to the two-model tests in [1]. We do not rank the models based on the amount of memorized data.
  • Nonetheless, we observe that DINO memorizes the least amount of data compared to VICReg and Barlow Twins.
  • We plan to release a list of ImageNet examples with large dataset-level correlation scores, i.e., examples for which the foreground can be inferred easily from the background crop. This can help vision researchers better estimate whether their models memorize certain examples or make correct predictions due to strong dataset-level correlations.

VLM

  • Our key contribution is a method to evaluate déjà vu memorization with only one model, by utilizing an off-the-shelf pre-trained language model as a reference model. With this setup we show that the prior work of [2] can be extended to quantify the memorization of pre-trained models, provided we have access to the training set. We show that one such pre-trained model (the OpenCLIP checkpoint pre-trained on YFCC15M) suffers from déjà vu memorization, similar to the models we train from scratch on a 40M subset of the Shutterstock dataset.
  • We hope our evaluation methodology helps future researchers and ML practitioners evaluate memorization in different models and compare which models are safer for deployment. While mitigation strategies are not the focus of the paper, existing mitigation results from [2] can be extrapolated to our setting, as our setup mimics theirs except that we do not need to train a second (reference) model from scratch.

Weakness 1 - smaller classifier not memorizing training data

[1] shows that supervised classifiers memorize significantly less than self-supervised models. Apart from that, we ensured that Grounded-SAM, the tool used to annotate background images for Naive Bayes, did not rely on the ImageNet dataset during training.

We'd be happy to make those clarifications in the paper.

[1] Casey Meehan, Florian Bordes, Pascal Vincent, Kamalika Chaudhuri, and Chuan Guo. “Do ssl models have déjà vu? a case of unintended memorization in self-supervised learning.”, NeurIPS, 2023.

[2] Jayaraman, Bargav, Chuan Guo, and Kamalika Chaudhuri. "Déjà Vu Memorization in Vision-Language Models." arXiv preprint arXiv:2402.02103 (2024).

Comment

Thanks for the response to my points. My concerns have been addressed and I'll raise my score from 6 to 7.

Comment

Thank you for increasing the score. We will incorporate your suggestions.

Official Review
Rating: 6

The paper introduces a method to measure memorization in representation learning models without the need for training multiple models. Previous déjà vu memorization estimation methods require two models to estimate dataset-level correlations and memorization, which is computationally expensive. The authors propose a simplified approach using a single model, with alternative methods to estimate dataset-level correlations. This approach is validated on various image representation learning and vision-language models, showing that the simplified method yields similar results to the traditional two-model approach.

Strengths

  1. The authors present a simplified method for measuring memorization that avoids the need for training multiple models.
  2. The method is validated across multiple datasets and models, including ImageNet-trained models and CLIP models, showing consistent results with the traditional approach.
  3. The proposed method can be applied to both image-only and vision-language representation learning models.
  4. The paper is well-written and organized, with clear explanations of the methods, experiments, and results.

Weaknesses

  1. The method builds on existing ideas of dataset-level correlation estimation and déjà vu memorization, which have been previously explored in the literature. This reduces the novelty of the contributions.
  2. The concern that background annotation models like Grounded-SAM used in the paper may not have disjoint sets is important. If the sets were not disjoint, it could affect the results by introducing unintended correlations between the training and testing sets. The paper should address whether this was considered and controlled for in the experiments.
  3. The results are reported for the entire dataset, but memorization metrics are more meaningful at the sample level. The lack of sample-level memorization scores limits the granularity and insights of the analysis. Including sample-level scores would provide a clearer picture of which specific samples are being memorized.

Questions

  1. Two-Model Dependency: Does this one-model method still rely on the use of two models, with one being a pretrained model used for estimating dataset-level correlations? How does the choice of such a pretrained model affect the results?

  2. Sample-level Metrics: Providing some discussion as to why aggregate metrics were chosen and how they compare to sample-level results could offer more detailed insights. Can the method provide sample-level memorization scores, and how would this affect the interpretation of the results?

  3. Are there specific cases where this approach may fail or provide misleading results? It was hard to decipher the limitations discussed at the end of Sections 3 and 4.

Limitations

  1. Background Annotation Model Concerns: The potential overlap in data between background annotation models like Grounded-SAM and the dataset being tested should be addressed. If these models were not trained on disjoint sets, it could lead to misleading results regarding memorization.

  2. Sample-level Insights: The lack of sample-level memorization scores is a significant limitation. Providing a discussion as to why aggregate metrics were chosen and how they compare to sample-level results could offer more detailed insights into which specific samples are being memorized and help in understanding the model's behavior better.

Author Rebuttal

Thank you for the thoughtful comments!

Novelty of the work

The original déjà vu memorization work [2] requires training two different models with the same SSL architecture on two disjoint splits of the training data, which is computationally expensive. The novelty of our work lies in proposing creative ways of estimating déjà vu memorization for any pre-trained vision or vision language (VLM) model using a generic correlation-detection classification model. Our approach is less expensive, can be used for multiple pre-trained SSL architectures, and exhibits accuracy comparable to the two-model tests proposed in the original déjà vu paper [2].

Grounded-SAM's training set

This is a good point, and this was one of our major considerations in annotation model selection. As our annotation model, we use Recognize Anything (RAM), a component of Grounded-SAM. While our representation learning models are trained on ImageNet, RAM does not use ImageNet during training. Their paper also says that ImageNet has an unusual tag presence, and is hence inappropriate for their training task.

Sample-level memorization

In order to obtain dataset-level memorization, we aggregate sample-level memorization results.

Vision

As described in Figure 1 of [2], for each background crop we predict the foreground label both with an SSL model (e.g., pre-trained VICReg) and with the correlation classifier. If the SSL model predicts the correct class and the correlation classifier fails to, we mark that example as memorized. We then aggregate all memorized examples to obtain a dataset-level view.
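As a rough illustration (not the paper's exact code), the per-example decision and its aggregation could look like the following, where `ssl_pred`, `corr_pred`, and `true_label` are assumed arrays of predicted and ground-truth foreground labels:

```python
import numpy as np

def memorized_mask(ssl_pred, corr_pred, true_label):
    """An example counts as memorized when the SSL model recovers the
    foreground label from the background crop but the correlation
    classifier does not."""
    ssl_pred, corr_pred, true_label = map(np.asarray, (ssl_pred, corr_pred, true_label))
    return (ssl_pred == true_label) & (corr_pred != true_label)

# Dataset-level memorization is the aggregate rate over all examples:
# mem_rate = memorized_mask(ssl_pred, corr_pred, true_label).mean()
```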

VLM

For VLMs, we evaluate the sample-level precision and recall gaps between the objects recovered from the target and reference models, and report the aggregate population-level statistics in the main paper.

We include the sample-level statistics for both Vision and VLM in the attached 1-page PDF (see Figures 1 and 2). See the global rebuttal for more details.

We would be happy to include these visualizations in the revision.

Two-Model Dependency

Vision

  • One of the models is the target model for which we want to measure memorization. This can be any pre-trained SSL model.
  • The other model is the reference model used to detect dataset-level correlations. It is trained once on (image crop, foreground label) pairs. We still have two models, but 1) we do not have to train a target model, since we can use a pre-trained one, and 2) we train the correlation classification model only once and use it as a reference across all pre-trained SSL vision models.

VLM

  • One of the models is the target model for which we want to measure memorization. This can also be any pre-trained VLM.
  • The other model is a pre-trained language model (such as GTE) which we use to capture correlations. We only need to compute the caption embeddings for kNN search once with this model, and they can be reused for all target models.
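For illustration, a minimal sketch of this one-time embedding and kNN indexing step, assuming the sentence-transformers library and the public `thenlper/gte-base` checkpoint as an illustrative GTE encoder (the captions below are placeholders, not our training data):

```python
from sklearn.neighbors import NearestNeighbors
from sentence_transformers import SentenceTransformer

# Placeholder captions standing in for the target VLM's training set.
train_captions = [
    "a dog playing in the park",
    "a red car parked on the street",
    "two people hiking a mountain trail",
]

# Embed the captions once with an off-the-shelf text encoder; the index is
# then reused as the reference for every target VLM we evaluate.
encoder = SentenceTransformer("thenlper/gte-base")
caption_embeddings = encoder.encode(train_captions, normalize_embeddings=True)

knn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(caption_embeddings)
dists, idx = knn.kneighbors(encoder.encode(["a dog running outdoors"]))
```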

Pre-trained models

For vision, the pre-trained models are the target models for which we measure memorization. We did not use a pre-trained reference model for vision since we did not find an open-source model appropriate for our task.

For VLM we use a pre-trained language model (GTE) [1] as a reference model in the VLM one-model tests. This model is fairly complex and has strong generalization in terms of language model (LM) capabilities as it has been trained on 788M text pairs crawled from various sources from the internet. While this can be replaced with any off-the-shelf language representation model, we would expect similar memorization gaps. In section 4.2.2, the pre-trained models correspond to the target VLMs that are pre-trained on existing datasets and are readily available as checkpoints for evaluation. Here we use the OpenCLIP's checkpoint pre-trained on YFCC15M dataset. The choice of the pre-trained target model will heavily impact the memorization results as the size and quality of the pre-training data will directly control the memorization capability of the model. For instance, the OpenAI CLIP model pre-trained on their 400M private data corpus will have different memorization than the YFCC15M pre-trained model. However, we would need access to the pre-training set to evaluate the memorization. Larger pre-training data typically leads to smaller population-level memorization.

Limitations of our work

Vision

  • The approach might be less successful if the training set for the correlation reference model is too small and the model cannot learn dataset-level correlations effectively. In our experiments we ensured a large subset (300k per class) for ImageNet. We recommend using a large and representative sample for training the correlation classifier.
  • Although [2] shows that supervised models memorize significantly less than SSL models, it is still possible that ResNet memorizes some of its training data. In contrast, Naive Bayes does not memorize its training data; however, it is a much simpler classifier. Overall, we have similar observations with both types of correlation classifiers.

VLM

  • The pre-training data of the reference LM may overlap with the target VLM's training data, or the reference LM may generalize better than the VLM. In either case, the reference model might match or outperform the VLM, and the memorization gap will be underestimated. This could happen if, for instance, we use a superior LM whose reasoning capability lets it infer beyond dataset-level correlations.
  • On the other hand, if the reference LM is simplistic or has poor language understanding, the memorization gap will be overestimated.

[1] Zehan Li, et al. "Towards General Text Embeddings with Multi-Stage Contrastive Learning." arXiv, 2023.

[2] Casey Meehan, et al. "Do SSL Models Have Déjà Vu? A Case of Unintended Memorization in Self-Supervised Learning." NeurIPS, 2023.

Comment

Thank you for the response; the authors have adequately answered my questions. I will raise my score.

Comment

We are happy to hear that our response was helpful. Thank you very much for raising the score.

Author Rebuttal

On the lack of sample-level memorization scores

Reviewers Aswd and Uwm9 brought up an important question about the lack of sample-level memorization metrics and scores.

In our work we obtain dataset-level metrics by aggregating sample-level memorization results, so it is straightforward to report the distribution of sample-level memorization scores. Figure 1 in the uploaded 1-page PDF visualizes the histogram of memorization confidence scores for the open-source VICReg model and the Vision Language models.

Vision Representation model

Here we used ResNet as a correlation detection model. The same experiment can be repeated for Barlow Twins, DINO and other representation learning models. Instead of ResNet we can also use Naive Bayes correlation classifier.

The memorization confidence for the i-th example is computed based on the following formula:

$$\mathrm{MemConf}(x_i) = \mathrm{Entropy}(\text{Correlation Classifier}) - \mathrm{Entropy}_{SSL}(\mathrm{KNN})$$

$\mathrm{Entropy}_{SSL}(\mathrm{KNN})$ is computed according to the description in Section 4 of [1], "Quantifying Déjà Vu Memorization".

$\mathrm{Entropy}(\text{Correlation Classifier})$ is the entropy over the softmax values of the correlation classifier.
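A minimal sketch of how this score could be computed, assuming we already have the correlation classifier's softmax vector and a class-probability vector derived from the SSL model's KNN neighbours (both hypothetical inputs here, not our exact pipeline):

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy of a probability vector."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    return float(-np.sum(p * np.log(p + eps)))

def mem_conf(corr_softmax, knn_class_probs):
    """MemConf(x_i) = Entropy(correlation classifier) - Entropy_SSL(KNN)."""
    return entropy(corr_softmax) - entropy(knn_class_probs)

# Example: a confident KNN vote combined with an uncertain correlation
# classifier yields a high memorization confidence.
print(mem_conf([0.25, 0.25, 0.25, 0.25], [0.97, 0.01, 0.01, 0.01]))
```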

Figure 1 in the 1-page PDF visualizes examples from both ends and the middle of the memorization confidence distribution. We observe that examples with high memorization confidence scores are rarer and more likely to be memorized. Examples in the middle of the distribution have labels that are easy to confuse with another class, e.g., black and gold garden spider with European garden spider. On the other hand, examples with negative memorization confidence have higher SSL (KNN) entropies and slightly lower correlation entropies; these seem to be examples for which the true label is not clear or visually obvious.

Vision Language Models

Figure 2 in the attached PDF shows the samples with a higher degree of memorization. The samples are sorted from high to low sample-level memorization, such that the top-L samples have the largest precision and recall gaps for recovering objects using the target and reference models. For this test, we find the gap between the objects recovered from the target and reference models for each training record, and estimate the precision and recall gaps. A positive gap indicates that the target model memorizes the training sample, and the magnitude of the gap indicates the degree of memorization. Some of these worst-case examples are shown in Figure 13 of the paper.
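As an illustration of the gap computation (simplifying object extraction and matching to plain set operations), a sketch could look like this:

```python
def precision_recall(predicted, ground_truth):
    """Set-based precision/recall of the objects recovered for one sample."""
    predicted, ground_truth = set(predicted), set(ground_truth)
    if not predicted or not ground_truth:
        return 0.0, 0.0
    tp = len(predicted & ground_truth)
    return tp / len(predicted), tp / len(ground_truth)

def memorization_gap(target_objs, reference_objs, gt_objs):
    """Positive gaps mean the target VLM recovers more of the sample's
    ground-truth objects than the reference model, i.e. memorization."""
    p_t, r_t = precision_recall(target_objs, gt_objs)
    p_r, r_r = precision_recall(reference_objs, gt_objs)
    return p_t - p_r, r_t - r_r

# Hypothetical example for a single training caption.
print(memorization_gap({"dog", "frisbee"}, {"dog"}, {"dog", "frisbee", "park"}))
```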

We will be happy to add these results to our paper.

[1] Casey Meehan, Florian Bordes, Pascal Vincent, Kamalika Chaudhuri, and Chuan Guo. “Do ssl models have déjà vu? a case of unintended memorization in self-supervised learning.”, NeurIPS, 2023.

Final Decision

The paper introduces a more efficient method to measure déjà vu memorization in representation learning models without the need to train multiple models. The method is validated across multiple datasets and models, including ImageNet-trained models and CLIP models, showing results consistent with the traditional approach. The proposed method can be applied to both image-only and vision-language representation learning models.

Please incorporate the changes requested by the reviewers and the clarifying discussions from the rebuttal period, such as those concerning the lack of sample-level memorization scores.