Membership Inference on Text-to-Image Diffusion Models via Conditional Likelihood Discrepancy
We propose a membership inference method for text-to-image diffusion models via conditional likelihood discrepancy, outperforming previous works on diverse datasets, with superior resistance against early stopping and data augmentation.
Abstract
Reviews and Discussion
This paper first identifies a condition discrepancy in diffusion models: generation results conditioned on the training text prompts can differ significantly between the member datasets and the hold-out datasets of the model. Based on this observation, the paper proposes a novel method for membership inference that exploits the difference between losses conditioned on groundtruth text prompts and on null (or partially null) text prompts as the feature. This method yields satisfying results.
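For concreteness, a minimal sketch of the scoring idea described above, assuming a Stable Diffusion checkpoint loaded via diffusers; the checkpoint name, timestep choices, and helper names are illustrative assumptions, not the authors' exact implementation:

```python
# Sketch: score an (image, prompt) pair by the gap between the null-conditional and
# text-conditional denoising losses, averaged over a few timesteps. Larger gaps are
# taken as evidence of membership (conditional overfitting).
import torch
from diffusers import StableDiffusionPipeline

device = "cuda"
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to(device)

@torch.no_grad()
def denoising_loss(latents, prompt, t):
    # The empty prompt "" plays the role of the (partially) null condition.
    ids = pipe.tokenizer(prompt, padding="max_length",
                         max_length=pipe.tokenizer.model_max_length,
                         truncation=True, return_tensors="pt").input_ids.to(device)
    text_emb = pipe.text_encoder(ids)[0]
    noise = torch.randn_like(latents)
    noisy = pipe.scheduler.add_noise(latents, noise, t)
    pred = pipe.unet(noisy, t, encoder_hidden_states=text_emb).sample
    return torch.mean((pred - noise) ** 2).item()

@torch.no_grad()
def clid_score(image, prompt, timesteps=(100, 300, 500)):
    # image: (1, 3, 512, 512) tensor scaled to [-1, 1]; the VAE maps it to latent space.
    latents = pipe.vae.encode(image.to(device)).latent_dist.mean * pipe.vae.config.scaling_factor
    score = 0.0
    for step in timesteps:
        t = torch.tensor([step], device=device)
        # Members benefit more from the true prompt, so their conditional loss drops
        # further relative to the null-text loss.
        score += denoising_loss(latents, "", t) - denoising_loss(latents, prompt, t)
    return score / len(timesteps)  # larger score -> more likely a training member
```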
Strengths
a) The paper investigates the membership inference of diffusion models, which is a meaningful topic with positive societal impacts.
b) The paper formally reveals an important phenomenon: diffusion models overfit the condition. This finding is useful for both membership inference and other scenarios (provided it is real).
Weaknesses
a) The primary weakness of this paper is its fundamental setting: we cannot assume access to the groundtruth conditions (text prompts) c in real-world membership inference on diffusion models. The whole proposed method rests on this unrealistic assumption, so its contribution to the progress of membership inference may be quite limited. For example, we do not know the prompts used in the training of Stable Diffusion and SDXL, where membership inference is needed for copyright data detection. As baselines, SecMI [1] and PIA [2] discuss how different prompts c (null, groundtruth, BLIP) influence their performance and show that their methods remain effective with BLIP captions. This robustness is needed for real-world usage of membership inference. However, this work relies entirely on the groundtruth prompts, so it seems not to have this necessary robustness.
b) In addition to a), the baseline comparison could be unfair, because one can easily adapt a similar conditional mechanism to SecMI [1] and PIA [2] to enhance their performance, which is not included in their default implementations. Notably, this is extra information, so it seems certain to improve their performance.
c) The two evaluation setups, over-training and real-world training, do not match real-world scenarios. For example, the authors train models on MS-COCO (2,500 images) for 150,000 (over-training) and 50,000 (real-world training) steps and evaluate the proposed method on them. This means 60 steps/image and 20 steps/image. However, Stable Diffusion is trained with only about 1 step/image on LAION [3]. Hence, there is a gap between the evaluation setup of the paper and that of the real world. By contrast, baselines like SecMI [1] and PIA [2] are evaluated on Stable Diffusion & LAION, indicating their effectiveness for real-world membership inference.
d) The finding of condition discrepancy has been revealed by [1] to some extent, since it shows a clear difference between using BLIP prompts and groundtruth prompts. This makes the novelty of this finding doubtful.
References:
[1] Duan, Jinhao, et al. "Are diffusion models vulnerable to membership inference attacks?." International Conference on Machine Learning. PMLR, 2023.
[2] Kong, Fei, et al. "An efficient membership inference attack for the diffusion model by proximal initialization." arXiv preprint arXiv:2305.18355 (2023).
[3] https://huggingface.co/runwayml/stable-diffusion-v1-5
[4] Ma, Zhe, et al. "Could It Be Generated? Towards Practical Analysis of Memorization in Text-To-Image Diffusion Models." arXiv preprint arXiv:2405.05846 (2024).
According to the rebuttal, I have raised my score to 7.
Questions
Can you introduce some prompt searching mechanisms and test your method based on these mechanisms? Since you have complete access to the diffusion model, it is possible to recover the text prompt without direct access to it. You can refer to [4].
Limitations
See Weaknesses.
Thank you for your recognition of the societal impact of our work and your acknowledgment of our contribution in first identifying the condition likelihood discrepancy. In the following, we address your concerns point by point. (Refer to the submitted PDF for Tab. A, Tab. B, Tab. C and Fig. A).
Weakness (a):
The concern about the assumption that the adversary has access to the groundtruth text of images.
Answer (a):
(1) We want to clarify that our threat model is indeed practical in real-world settings. First, for typical T2I DMs (such as Stable Diffusion), we can concurrently access their images and text prompts because the image-text data is publicly available [1]. Second, for other non-publicly trained models, we emphasize that the primary application of our method is for dataset owners to audit unauthorized usage (line 104), in which case both images and prompts are available to the party conducting MI. Third, the assumption of access to the entire data distribution is also a common setting in representative MI works [2].
(2) However, we genuinely understand your concern about whether our methods are still effective when the corresponding text is unknown. Therefore, we conduct additional experiments to show the effectiveness of our methods without the groundtruth text (Tab. A and Tab. B):
We assume that the adversary first generates the corresponding text of images and then conducts MI using pseudo-text.
We use two models, BLIP [3] and GPT4o-mini [4], to generate text. We still use the two setups in Sec. 4.1: over-training and real-world training. We observe that when using generated text for membership inference, both the baselines and our methods exhibit a performance decline. However, our method still broadly outperforms the baselines. We believe this is because the generated text still retains the key semantics of the image, which keeps our methods effective. We provide some examples generated by BLIP and GPT4o-mini for reference (Fig. A). Additionally, Tab. A and Tab. B show that using the text generated by GPT4o-mini yields better results than BLIP. We think this is because GPT4o-mini generates better captions.
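For reference, a hedged sketch of this pseudo-text pipeline, assuming the public BLIP captioning checkpoint from transformers; the model name, generation length, and the reuse of a CLiD-style scoring helper are illustrative assumptions rather than the exact rebuttal setup:

```python
# Sketch: when the groundtruth prompt is unknown, caption the image with BLIP and feed
# the generated caption to the same membership score used with groundtruth text.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base").to("cuda")

def pseudo_caption(image: Image.Image) -> str:
    # Generate a short caption that (hopefully) retains the key semantics of the image.
    inputs = processor(images=image, return_tensors="pt").to("cuda")
    out = captioner.generate(**inputs, max_new_tokens=30)
    return processor.decode(out[0], skip_special_tokens=True)

# Usage (clid_score and preprocess are assumed helpers, e.g. from the earlier sketch):
#   score = clid_score(preprocess(image), pseudo_caption(image))
```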
Weakness (b):
The concern about an unfair comparison caused by introducing extra information for our method.
Answer (b):
(1) We want to clarify that all experiments strictly maintain the same setting: all methods can access the image-text data (i.e., no extra information for our method). Baselines such as PIA, SecMI and PFAMI (in Sec. 4.1) still require inputting both images and text to calculate their indicators; otherwise, their performance degrades (refer to Tab. A, Tab. B and [5]).
(2) The experiments in Tab. A and Tab. B also demonstrate that our methods outperform the baselines even when groundtruth text is unavailable.
Weakness (c):
The concern about the fine-tuning setups (over-training and real-world training settings), and the lack of evaluation in the pretraining setting on Stable Diffusion & LAION.
Answer (c):
Open-source models make copyright infringement through fine-tuning increasingly easy (line 109). So we emphasize that MI methods should apply to both the fine-tuning and pretraining stages, and we explore both in our paper.
(1) For the fine-tuning stage, we first use the over-training setting, as it is commonly used by existing baselines such as SecMI and PFAMI. Our experiments indicate that this setting causes excessive overfitting, so MI methods cannot be differentiated. We then devise a real-world training setting according to official fine-tuning scripts [6]. Our method outperforms the baselines in both settings.
(2) For the pretraining stage, please note that we do conduct an evaluation on Stable Diffusion & LAION (refer to Sec. 4.6 and Tab. 5 in our paper). We use LAION-Aesthetics v2 5+ and LAION-2B MultiTranslated as member/hold-out sets to ensure distribution consistency between the training data and hold-out data, and our method outperforms the baselines.
We also conduct an extra pretraining experiment for comparison (Tab. C). We use the SDv1-2 architecture to train a model from scratch on MS-COCO. The results also show our method's effectiveness.
Weakness (d):
“The finding of condition discrepancy has been revealed by [5] to some extent”.
Answer (d):
Our finding differs from SecMI [5] in two respects:
(1) For a given data point $(x, c)$ with image $x$ and groundtruth text $c$, the indicator of [5] can be formalized as a function of the estimation error conditioned on $c$, i.e., $f(\ell_\theta(x, c))$.
And when using a different condition $\tilde{c}$ such as BLIP-text / null-text, [5] can be formalized as $f(\ell_\theta(x, \tilde{c}))$.
The "condition discrepancy" you mentioned is the difference in effectiveness of using $c$ compared with using $\tilde{c}$. [5] does not, as we do, compute the discrepancy between different conditional likelihoods of a single data point to conduct MI, nor does it derive the analytical form of the likelihood (Eq. (11)). Our method can be approximately formalized as $\mathrm{CLiD}(x, c) \approx \ell_\theta(x, \tilde{c}) - \ell_\theta(x, c)$, where $\ell_\theta(\cdot, \cdot)$ denotes the model's conditional estimation error.
So the findings and intuition behind these two works are essentially different.
(2) To our knowledge, we are the first to define the phenomenon of condition overfitting in T2I diffusion models and analytically derive the indicator CLiD for MI.
Question (1):
Conducting our method by recovering text prompts without direct access to them.
Answer (Q1):
We have added experiments. Please refer to "Answer (a)" above.
[2] Carlini, Nicholas, et al. "Membership inference attacks from first principles."
[3] https://github.com/salesforce/BLIP
[4] https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/
[5] Duan, Jinhao, et al. "Are diffusion models vulnerable to membership inference attacks?."
[6] https://huggingface.co/docs/diffusers/training/text2image##launch-the-script
Thank you for your rebuttal. I appreciate the effort and believe this work reaches state-of-the-art performance compared to existing baselines. However, I decide to hold my score for the following two reasons:
(1) The main setup of this work is misleading and creates a hallucination of success in MIA of diffusion models in real-world scenarios. As mentioned in my review, both the so-called over-training and real-world training setups are non-realistic for evaluating MIA of diffusion models, for they train models with a very high step/image ratio. This certainly leads to over-fitting and condition discrepancy. However, we can potentially avoid such discrepancy simply by expanding the scale of the training dataset and lowering the step/image ratio. This is widely adopted by both well-done fine-tuning (see Kohaku-XL-Eps: https://huggingface.co/KBlueLeaf/Kohaku-XL-Epsilon) and pre-training. In other words, this setup cannot validate the performance of MIA on diffusion models if the trainer really tries hard. This is not what I expect for effective MIA. Also, by accepting this work, follow-up research may continue to use this wrong setup and never carefully think about what the correct setup (like what this paper has done in Sec. 4.6) would be. Hence, I hold the score to highlight this point.
(2) The real-world impact of the work is very limited due to its poor performance on diffusion models trained with low step/image ratios, indicating that it scarcely pushes the boundary forward. As shown in Sec. 4.6, CLiD only yields a TPR of 3.44% at an FPR of 1% in the pre-training setup, and this is the only result on diffusion models trained with low step/image ratios. So I can only suppose that this is its performance on average, which is far from a qualified method to be used on real-world models with low step/image ratios. I have noticed that all baselines perform worse. But this only seems to indicate that we cannot get effective MIAs by exploiting the loss value and we should explore some new directions. Hence, we should no longer encourage exploring MIAs with the diffusion loss like this, because this is not the way we can find MIAs with real societal impacts.
Furthermore, I would like to note that the supplementary experiment on MS-COCO and SDv1-4 does not address my concerns because it uses the over-training setup and suffers from the flaw I mentioned above. I do not give a lower score because I believe even the effort to reach the state-of-the-art performance in toy setups should be valued, to some extent. However, we really need to try something real (and harder, certainly) in the task of diffusion MIA.
Thank you for reading our responses and providing further feedback.
For your opinions that "the real-world impact of the work is very limited due to its poor performance with one step/image ratios" and "this work is misleading and creates hallucination of success in MIA of DM in real-world scenarios", with our highest respect, we disagree. We justify the significance and practicability of our work with the following three points:
1. The results of pretraining in Tab. 5 come from stringent settings. Their “imperfect” results do not indicate the real-world impact of our work being limited.
For pretraining, we use the strictest evaluation setting (randomly selected samples and a consistent distribution), which causes the low numbers in Tab. 5. This setting is even stricter than real copyright infringement scenarios. For example, in the LAION dataset, many data points appear multiple times. Related works [1,2,3] indicate that most privacy leakage and copyright issues in the T2I generation process are due to duplicated training data, which means these data points are not trained at a one step/image ratio.
To validate this, we use the LAION data related to privacy leaks and copyright issues used by existing works [2,4] as the training set, use LAION MultiTrans as the hold-out set, and evaluate our method with two top-tier baseline works [5,6] under the pretraining MI setting. The model we use is Stable Diffusion v1-4. We report the results in the table below. As shown, all three methods show better results than in Tab. 5, and our method still shows a significant improvement.
| Method | ASR | AUC | TPR@1%FPR | Query |
|---|---|---|---|---|
| PIA | 65.00 | 71.67 | 16.82 | 2 |
| SecMI | 67.61 | 74.54 | 13.91 | 12 |
| CliD-th | 81.42 | 89.93 | 31.14 | 15 |
The experiment above indicates that the results under the strict setting in Tab. 5 do not imply that the MIA method is practically useless. Additionally, while the top-tier baselines achieve near random-guess accuracy (around 50%) in this strict setting, our method achieves an ASR and AUC of 61.32% and 67.64%, respectively, demonstrating a significant improvement that should not be ignored.
2. MI methods do have practical significance under finetuning settings with multi step/image ratio.
We want to clarify that evaluating MI methods for finetuning is of great significance. The release of open-source models [7,8,9] and the popularity of open-source platforms [15,16] make it easy for anyone to finetune and release models. For example, a malicious model trainer can easily collect an artist's works, finetune a model to copy the style or the concepts created by the artist, publish it, and claim ownership. This scenario has been broadly adopted in previous works [10, 11].
We notice that you cite Kohaku-XL-Eps [12] to claim that a one step/image ratio should be used in fine-tuning. However, this model is fine-tuned on a dataset of 5.2 million samples [13, 14] with around 1,000 types of style prompts. On average, each style corresponds to over 5,000 samples. In most cases, an artist will not produce such a large volume of image-text data, and the time cost of implementing such fine-tuning is comparable to pre-training (over ten days [12]), making it uncommon compared with typical projects on open-source platforms [15]. Additionally, we quote the original statement from the paper [21] on which the Kohaku-XL-Eps model is based: "To address dataset imbalance, we repeat each image a number of times within each epoch to ensure images from different classes are equally exposed during training." This also indicates that a multi step/image ratio is necessary when the data volume for certain classes is limited.
We select the most widely used fine-tuning scripts and widely adopted fine-tuned models from open-source platforms such as Huggingface [16] and CivitAI [15] below for further validation.
| Finetuned Models/Scripts | Description | Steps/image (epochs) |
|---|---|---|
| Finetuning on Pokémon dataset [17,18] | Official HuggingFace script for finetuning Stable Diffusion on Pokémon dataset for concept generation. | 20 |
| Finetuning on WikiArt dataset [19] | A finetuning project using WikiArt dataset that achieves great results. | 5 |
| Heart of Apple XL [20] | A highly effective artist style generation model, supporting approximately 700+ artist tags from Danbooru/Pixiv (and potentially more). | 10 |
| LyCORIS [21] | The paper on which the Kohaku-XL-Epsilon model training is based, supporting extensive style fine-tuning. | 10, 30 or 50 |
Our experiments in Fig. 2 cover all step/image ratios listed in the table above. Our method achieves approximately 78%, 85%, 96%, 99%, and 99% AUC values for ratios of 5, 10, 20, 30, and 50, respectively, demonstrating its effectiveness and surpassing the baselines.
3. Our work is not “misleading”. On the contrary, in our paper, we reveal the evaluation gap between the previous MIA works and the realistic scenario, and strive to adhere to realistic settings.
In our experiments, whether for finetuning or pretraining, we strive to adhere to realistic settings.
For finetuning, we emphasize that there is overfitting in existing MIA settings [5, 22]. We emphasize that the number of training steps is a crucial parameter affecting results (Sec. 4.2) and highlight that evaluating MI methods should involve the effectiveness trajectory (Sec. 4.3) over training steps (i.e., different step/image ratios).
For pre-training, we emphasize that existing works [5,6] lack distribution consistency between the training set and the hold-out set. This reveals that the selection of hold-out sets seriously affects the performance of MIA, and gives a more reasonable setting for pretrained DM MIA (as recognized by Reviewer gj8y).
In summary, in our paper, we first emphasize that MI for both the fine-tuning and pretraining stages is of practical significance. We then evaluate our method under practical experimental settings in both fine-tuning and pretraining stages, demonstrating its superiority. Furthermore, our first definition and validation of conditional overfitting will also contribute to future community research on data memorization in conditional diffusion models. Finally, we believe that, despite not achieving "very perfect" results in a portion of experiments, the contributions of a paper that indeed achieves SOTA results compared to previous top-tier baseline works [5, 6] under practical settings should not be overlooked.
We sincerely hope you can reconsider your rating score and we are open to answering any further questions you may have.
[1] Carlini, Nicolas, et al. "Extracting training data from diffusion models." 32nd USENIX Security Symposium (USENIX Security 23). 2023.
[2] Wen, Yuxin, et al. "Detecting, explaining, and mitigating memorization in diffusion models." The Twelfth International Conference on Learning Representations. 2024.
[3] Somepalli, Gowthami, et al. "Diffusion art or digital forgery? investigating data replication in diffusion models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
[4] Webster, Ryan. "A reproducible extraction of training images from diffusion models." arXiv preprint arXiv:2305.08694 (2023).
[5] Duan, Jinhao, et al. "Are diffusion models vulnerable to membership inference attacks?." International Conference on Machine Learning. PMLR, 2023.
[6] Kong, Fei, et al. "An Efficient Membership Inference Attack for the Diffusion Model by Proximal Initialization." The Twelfth International Conference on Learning Representations.
[7] Rombach, Robin, et al. "High-resolution image synthesis with latent diffusion models." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022.
[8] https://huggingface.co/CompVis/stable-diffusion-v1-4
[9] https://huggingface.co/stabilityai/stable-diffusion-2-1
[10] Wang, Zhenting, et al. "Diagnosis: Detecting unauthorized data usages in text-to-image diffusion models." The Twelfth International Conference on Learning Representations. 2023.
[11] Shan, Shawn, et al. "Glaze: Protecting artists from style mimicry by {Text-to-Image} models." 32nd USENIX Security Symposium (USENIX Security 23). 2023.
[12] https://huggingface.co/KBlueLeaf/Kohaku-XL-Epsilon
[13] HakuBooru - text-image dataset maker for booru style image platform. https://github.com/KohakuBlueleaf/HakuBooru
[14] Danbooru2023: A Large-Scale Crowdsourced and Tagged Anime Illustration Dataset. https://huggingface.co/datasets/nyanko7/danbooru2023
[15] https://civitai.com/
[17] https://huggingface.co/docs/diffusers/training/text2image
[18] https://huggingface.co/datasets/diffusers/pokemon-gpt4-captions
[20] https://civitai.com/models/272440/heart-of-apple-xl-love
[21] Yeh, Shih-Ying, et al. "Navigating text-to-image customization: From lycoris fine-tuning to model evaluation." The Twelfth International Conference on Learning Representations. 2023.
[22] Fu, Wenjie, et al. "A Probabilistic Fluctuation based Membership Inference Attack for Generative Models." arXiv preprint arXiv:2308.12143 (2023).
Thank you for your further efforts. However, it seems that you misunderstand my points.
[1] (hallucination of success) Although you mention the effect of step/image ratio, your main experiments are still conducted based on setups with severe overfitting (60 steps/image for over-training and 20 steps/image for real-world training). The success of the proposed method (and potentially, that of future follow-up works) does not mean anything for well-trained diffusion models with low step/image ratios. I am worried that this may provide wrong guidance for future following works.
[2] (moderate real-world impacts) This work assumes that the data owner has full sets of prompts. This is a fundamental over-assumption, because little copyrighted data, the main concern of unauthorized training, is annotated with text. Even if it is, the trainer tends to re-do the text annotation. Hence, the (only) real scenario of MIA should be with only the image and the model trained with low step/image ratios. This should not be recognized as the 'strictest' because it can be easily done if the trainer tries. So how does your method perform with only partial prompts (I agree that you can assume you have partial prompts because users can use open annotators like BLIP) and a model trained with a low step/image ratio? I do not see any results addressing this problem. But this should be the real scenario that people being infringed are faced with.
[Response-7]:
This should not be recognized as the 'strictest' because it can be easily done if the trainer tries.
We have previously demonstrated that low step/image ratios are typically not relevant to finetuning (refer to Response_1/2, "[Response-1]" and "[Response-2]" above).
And for pretraining datasets, even the widely used LAION dataset contains many duplicated data points [5,6,7], which leads to a multi step/image ratio.
Therefore, achieving a strict one step/image ratio is not easy in copyright risk scenarios.
[Response-8]:
So how does your method perform with only partial prompts (I agree that you can assume you have partial prompts because users can use open annotators like BLIP) and a model trained with a low step/image ratio? I do not see any results addressing this problem. But this should be the real scenario that people being infringed are faced with.
We re-conduct the experiment of Tab. 5 in Sec. 4.6, using BLIP-generated image captions instead of groundtruth text to align with your suggested setting. We report the results of our method and the top-tier baselines in the table below.
| Method | ASR | AUC | TPR@1%FPR | Query |
|---|---|---|---|---|
| PIA | 52.61 | 52.26 | 1.20 | 2 |
| SecMI | 52.41 | 52.50 | 1.50 | 12 |
| PFAMI | 53.01 | 52.25 | 0.40 | 20 |
| CliD-th | 58.91 | 61.25 | 3.21 | 15 |
As shown, when non-groundtruth text is used, the performance of the baselines also declines significantly, approaching random guessing. This also demonstrates that we do not "unfairly introduce additional information" in the evaluations in our paper. In contrast, our method still shows a significant improvement.
We acknowledge that our method (even including all existing MI works) does not achieve "perfect" results in Tab. 5 with one step/image ratio in pretraining setting (though we have achieved significant improvement compared to previous baselines).
However, we would like to emphasize that our work still holds significance and practicality in the following aspects:
- We have demonstrated that our method is effective in finetuning scenarios. (refer to “[Response-1]” above).
- We have also demonstrated that the finetuning setting is reasonable and potentially more common, as it enables lower-cost infringement compared to pre-training (refer to Response_1/2, "[Response-1]", "[Response-2]" and "[Response-6]" above).
- Additionally, we emphasize that the "imperfect" results of one step/image ratio do not limit the real-world impact of our work (refer to the first Table in Response_1/2).
Hence, in these scenarios, we believe our work is significant and practical, achieving notable improvements compared to the baselines.
Thank you again for your feedback. We will greatly appreciate if you could recognize the significance and practicality of our work, and we are also open to any further discussion.
[1] https://huggingface.co/docs/diffusers/training/text2image
[2] https://civitai.com/models/272440/heart-of-apple-xl-love
[4] Yeh, Shih-Ying, et al. "Navigating text-to-image customization: From lycoris fine-tuning to model evaluation." The Twelfth International Conference on Learning Representations. 2023.
[5] Carlini, Nicolas, et al. "Extracting training data from diffusion models." 32nd USENIX Security Symposium (USENIX Security 23). 2023.
[6] Somepalli, Gowthami, et al. "Diffusion art or digital forgery? investigating data replication in diffusion models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
[7] Webster, Ryan. "A reproducible extraction of training images from diffusion models." arXiv preprint arXiv:2305.08694 (2023).
[8] Carlini, Nicolas, et al. "Extracting training data from diffusion models." 32nd USENIX Security Symposium (USENIX Security 23). 2023.
[9] Carlini, Nicholas, et al. "Membership inference attacks from first principles."
Thanks. I will raise my score if you clearly express the idea that you recommend all future works only follow the setup of tab 5 in your next draft. Also, you should state that the success in your over training and real world training setups may cause hallucinations of MIA’s success and should not be the setup for the future work. All these claims should be placed in both introduction and experiments. Please show it in the rebuttal then I will raise my score. I appreciate your efforts.
We greatly appreciate your further feedback and your willingness to reconsider the rating score.
We acknowledge that the evaluation setting in Tab. 5 holds greater significance (i.e., MI on the pretraining setting with consistent data distribution between the training set and hold-out set). We also hope our paper can guide future works to focus on more realistic settings, moving towards harder and more practical MI tasks.
Based on your suggestions, we will make the following revisions in the updated version:
1. Revisions to Introduction
In Introduction, we will first emphasize in the third paragraph (line 32-line 47) that the existing evaluation setting [11, 13, 23] of MI on diffusion models does not align with the real-world scenario, such as (1) overfitting caused by excessively high step/image ratio and (2) inconsistent distribution between the training set and hold-out set.
Then, in the fifth paragraph, to reveal the hallucinated success of MI caused by overfitting and to recommend that future work focus on the more challenging pretraining MI setting, we will revise lines 61-67 as follows:
“... First, our methods consistently outperform existing baselines across various data distributions and training scenarios, including finetuning settings and the pretraining setting. Second, our experiments on finetuning settings with different training steps (Sec. 4.2) reveal that excessively high step/image ratios cause overfitting, leading to hallucinated success; we therefore develop a more realistic pretraining setting following [12], where our experiments reveal the insufficient effectiveness of existing membership inference works [11, 12, 13, 32], and we hope future works focus on this more challenging and realistic setting. Third, our comparison experiment with varying training steps (Sec. 4.3) indicates that the effectiveness of MI grows with higher step/image ratios and that MI should be evaluated under reasonable settings for realistic results. Next, ablation studies ...”
2. Revisions to Experiments
In the Experiments section, we will first move Sec. 4.6 "Performance on Pretrained Models" (Tab. 5) earlier and merge it with Sec. 4.2 "Main Results". We will emphasize the importance of developing MI methods for the pretraining setting as in Tab. 5 by adding the following statement:
“Experimental comparison between the finetuning and pretraining settings indicates that, while MI methods (both ours and existing ones) perform effectively in finetuning, they show insufficient performance in pretraining. Given the many available open-source pretrained models, we emphasize that developing effective MI methods for pretraining is a more challenging and significant task, which we leave for future work.”
Second, in Sec. 4.1, we will add the following statement to emphasize the significance of our pretraining setting:
“Although previous works [11, 13] conduct experiments in the pretraining setting using Stable Diffusion [41] and LAION [45], achieving seemingly effective results, we emphasize that they do not ensure distribution consistency between the training set and hold-out set. We use LAION-Aesthetics v2 5+ [45] and LAION-2B MultiTranslated [45] as member/hold-out sets to develop a realistic setting following [12], and evaluate the performance of MI methods on it.”
Third, we will include the additional experiments from the rebuttal in the Experiments section, such as using BLIP-generated image captions for MI in the pretraining setting (refer to Response_v2_2/2), to further guide future work toward this setup.
3. Revisions to Limitation Section
We will move the Limitation section to the main text and will revise the Limitation section (line 595-597) as follows:
“Despite the significant improvements in membership inference of text-to-image diffusion models across various data distributions and data sizes, this work still has limitations. First, due to the limited availability of open-sourced pretrained weights of text-to-image diffusion models, evaluations under the pretraining setting are not sufficiently comprehensive. Considering that the finetuning setting involves a multi step/image ratio, we acknowledge that MI for the pretraining setting is more challenging and realistic. We leave the investigation of more effective MI methods for pretrained models to future work. Second...”
Thank you again for your suggestions on revising our paper. We are open to any further comments on the changes to help our paper have a more valuable impact on the community.
Reference numbers are consistent with those in the paper.
Thanks. I will raise the score.
Thank you for recognizing our work and raising the rating score. We appreciate your critical feedback and revision suggestions. We believe your suggestions will help our paper encourage future work to focus on realistic and practical MI settings, enhancing its beneficial impact on the community. Thank you again!
Thank you for reading our responses and your further feedback. We provide our further responses regarding your new comments sentence by sentence:
[Response-1]:
[1] (hallucination of success) Although you mention the effect of step/image ratio, your main experiments are still conducted based on setups with severe overfitting (60 steps/image for over-training and 20 steps/image for real-world training).
First, as we mentioned (refer to the second table in Response_1/2), real-world finetuning projects (many of which are official scripts [1] or widely known models [2]) include step/image ratios of 5, 10, 20, 30, and even 50. From this perspective, we want to clarify that a 20 step/image ratio for finetuning cannot be considered "severe overfitting." Additionally, we highlight (refer to Response_1/2) that our method achieves approximately 78%, 85%, 96%, 99%, and 99% AUC values for ratios of 5, 10, 20, 30, and 50, respectively, demonstrating its effectiveness under all these ratios and surpassing the baselines.
Second, regarding the 60 step/image ratio of the “over-training setting”, please note that we include this experiment to emphasize that "such unrealistic over-training scenarios fail to reflect the effectiveness..." (line 258). This serves as a reminder for future MI work to avoid evaluations in such overfitting settings.
[Response-2]:
The success of the proposed method (and potentially, that of future follow-up works) does not mean anything for well-trained diffusion models with low step/image ratios.
First, we have provided widely recognized papers/examples [1, 2, 3, 4] (refer to Response_1/2) to demonstrate that most finetuning projects typically do not involve "low step/image ratios" (if "low" means less than 5). On a normal-sized dataset, low step/image ratios result in inadequate performance.
Second, even the most "well-trained" open-source model, Stable Diffusion, still contains repeated training data [5,6,7], causing a multi step/image ratio. These data are most likely to trigger copyright risks during the generation process [5,6,7]. We also provide an experiment (refer to the first Table in Response_1/2) to show that our method is effective for this kind of training data.
[Response-3]:
I am worried that this may provide wrong guidance for future following works.
In fact, compared to existing baselines, we have taken a step forward in guiding future works to use realistic settings (mentioned in Response_2/2).
We emphasize that the step/image ratio and the data distribution should reflect real-world scenarios of both finetuning (line 228) and pretraining (line 334).
More than that, in our paper, we ensure that data augmentation usage (line 232), threshold selection (line 248), and the usage of datasets of varying scales (line 220) all align as closely as possible with real-world scenarios. Compared with previous works, we believe this will guide future works toward more realistic scenarios (as recognized by Reviewer gj8y).
[Response-4]:
[2] (moderate real-world impacts) This work assumes that the data owner has full sets of prompts.
This assumption comes from representative MI works [8,9], in which the entire data distribution is assumed to be accessible. We adopt it and treat the image-text pair as a single data point.
Additionally, please note that we provide additional experiments showing: (1) our method achieves significant results even without groundtruth text (Tab. A and Tab. B in the submitted PDF), and (2) our method also achieves significant results even when the text is rephrased by the trainer (Tab. 4 in our paper).
[Response-5]:
This is a fundamental over-assumption, because little copyrighted data, the main concern of unauthorized training, is annotated with text. Even if it is, the trainer tends to re-do the text annotation.
Textual data is important for training T2I models and relevant to copyright, and some of it even holds commercial value [10]. On platforms [10,11], images are usually published or sold together with the corresponding text.
Furthermore, even without the groundtruth text usage, the text used for training/finetuning should match the key semantics of the groundtruth text. Otherwise, model performance declines (refer to Tab. 4 in our paper). If the trainer maintains key semantics while redoing the text annotation, our method remains effective (refer to Tab. 4 in our paper).
[Response-6]:
Hence, the (only) real scenario of MIA should be with only the image and the model trained with low step/image ratios.
With our highest respect, we disagree that "the (only) real scenario should be ... with low step/image ratios". We have demonstrated that the finetuning setting with a multi step/image ratio holds practical significance (refer to Response_1/2, "[Response-1]" and "[Response-2]" above). Considering its lower cost and smaller data requirement, the finetuning setting arguably holds even more significance than the pretraining setting (line 109).
This paper proposes a new MIA metric tailored for text-to-image diffusion models. More precisely, they assume that conditional overfitting is more severe than unconditional overfitting. Based on this assumption, a new MIA metric (CLiD) is proposed. The CLiD metric shows superior performance on various text-to-image diffusion models.
Strengths
- The problem (MIA) is a very important problem which requires a lot of thought, especially with the increased usage of diffusion models which are essentially trained on the entire internet. Current MIAs tailored for diffusion models (DMs) mainly focus on unconditional DMs. This work bridges the gap between conditional and unconditional DM MIA.
- The idea is straightforward, and the further validation using the gradual truncating operation is quite insightful.
- The experiments are very comprehensive. The experiments on overfitting and real-world scenarios reveal that current MIAs rely on severe overfitting and are not as strong as they claim (while CLiD is more sensitive to overfitting). The way the threshold is chosen is more reasonable than the current method (globally chosen). The experiments on distribution consistency reveal that the selection of non-members will seriously affect the performance of MIA, which gives a more reasonable setting for pretrained DM MIA. These insights are interesting and very helpful for the MIA community.
Weaknesses
- There are some typos. For example, "FPR@1%FPR" should be "TPR@1%FPR" in Tab. 3 and Tab. 4.
- One related work should be included and further discussed:
Wen, Yuxin, et al. "Detecting, explaining, and mitigating memorization in diffusion models." The Twelfth International Conference on Learning Representations. 2024.
Questions
I do not have any questions.
Limitations
The authors have already addressed the limitations of this paper.
We sincerely thank you for your time and efforts in reviewing our paper. Your recognition of the significance of our work and acknowledgment of our experiments’ comprehensiveness is deeply appreciated.
Weakness (1):
Typos in the paper.
Answer (1):
Thank you for carefully reviewing our paper and pointing out the typos. We will review it and correct all typos in the updated version.
Weakness (2):
One related work [1] missing.
Answer (2):
Thank you for pointing out this related work. This paper presents an interesting and valuable direction: detecting and mitigating memorization in diffusion models. In real life, it can be used to detect whether a model remembers specific prompts, which has practical significance for detecting copyright infringement. The main differences between that work and ours are:
- This paper [1] focuses on detecting and mitigating the diffusion model's memorization of specific tokens (i.e., prompt memorization detection). In contrast, our work primarily aims to determine whether a given image-text pair exists in the model's training dataset (i.e., membership inference).
- This paper [1] designs a simple and effective detection method based on the intuition that tokens involved in memorization typically lead to a larger magnitude of prediction. It also proposes two mitigation methods: an inference-time mitigation method and a training-time mitigation method. In contrast, our work is based on the broadly validated phenomenon of conditional overfitting, from which we analytically derive the MI indicator CLiD and propose two MI methods: CLiD_th and CLiD_vec.
We will include it in the related work section and add a further discussion in the updated version.
We thank you again for your careful review and valuable feedback. We appreciate your positive comments on our work and we are more than happy to answer any further questions you may have.
[1] Wen, Yuxin, et al. "Detecting, explaining, and mitigating memorization in diffusion models." The Twelfth International Conference on Learning Representations. 2024.
Thanks for your rebuttal. I keep my score as weak accept.
Thank you for your support of our work. We will polish our paper further and incorporate the related work you recommended in the final revision. Thank you again.
In this paper, the authors propose a novel membership inference attack for text-to-image diffusion models. By examining the discrepancy between the text-conditional predictions and unconditional predictions, the proposed method outperforms the SOTA method by a significant margin. In the end, the authors also show the proposed attack is robust to various defenses.
Strengths
- The paper is well-written.
- The method is simple and the motivation behind it is straightforward.
- The results look very promising. The proposed method outperforms all other methods by a big margin.
- The authors compare the method to various recent methods.
- The authors include various adaptive defenses, and the method seems to be very robust.
Weaknesses
- The paper claims the setting is grey-box, but I feel like it's just a white-box setting since the attacker needs the model weights to perform the loss calculation, and there's no such API in the real world. I feel like the authors should emphasize it, even though previous works claim such a setting is "grey-box."
Questions
- What happens if an image in the training data is associated with multiple different prompts? For instance, if (image, prompt A) appears 100 times in the training set, while (image, prompt B) appears only 10 times, will the method still be effective for (image, prompt B)?
Limitations
The authors do include the limitations in the appendix, and I appreciate it.
Thank you for your efforts in reviewing our paper and your valuable feedback. We are encouraged by your appreciation of our clear motivation, extensive experiments and promising results, as well as the comprehensive experiments on adaptive defenses and good writing. Below we address the detailed comments and hope that you may find our response satisfactory. (Refer to the submitted PDF for Tab. C, Tab. D and Fig. D)
Weakness (1):
Grey-box setting is almost equivalent to white-box in real-world scenarios.
Answer (1):
Thank you for your suggestion regarding our threat model. We acknowledge that in most cases in real-world scenarios, the "grey-box" setting is almost the same as the "white-box" setting. Therefore, we discuss them together in Sec. 5 (line 346). We use the term “grey-box” to maintain consistency with previous works [1,2].
We will emphasize this issue in the updated version.
Question (1):
The effectiveness on image-text datasets where a single image corresponds to multiple different texts (including imbalanced texts).
Answer (Q1):
Thank you for raising this interesting question. We conduct additional evaluation for both pretraining and fine-tuning training settings to answer this question:
(1) For the pretraining setting, since the original Stable Diffusion is trained on LAION, which contains one-to-one image-text pairs, we use the Stable Diffusion v1-2 [3] architecture to train a simple text-to-image model from scratch on the MS-COCO 2017 train dataset.
In the COCO 2017 dataset, each image corresponds to approximately five text descriptions, so we randomly sample among these descriptions during the training phase. We train for 25,000 steps with a large batch size of 64x4x4 using 4 H100 GPUs in about four days. To prevent overfitting, we use data augmentation during the training stage. We report the evaluation metrics for this model in Tab. C, and the results show our methods still outperform the baselines across all three metrics.
(2) For the fine-tuning setting, we create an image-text dataset with multiple imbalanced texts using data repetition from the MS-COCO dataset. Each image corresponds to two text descriptions, with a ratio of 9:1 in the dataset. All other settings follow the real-world training setting in Sec. 4.1 (50,000 training steps). Thus, (image, prompt A) and (image, prompt B) appear 18 times and 2 times in the training stage, respectively.
We report the results in Tab. D, which shows that performing MI with less frequent texts is slightly less effective than using more frequent texts. However, our method still demonstrates a significant improvement compared with baselines. We believe the key intuition here is that although one image corresponds to two different texts, the key semantic information in both texts is always consistent. Therefore, performing MI with either text yields a certain level of effectiveness.
Thank you again for reviewing our paper and your insightful questions. And we appreciate your positive comments on our work. We hope our responses above have addressed all your concerns. Please let us know if any follow-up questions you may have.
[1] Duan, Jinhao, et al. "Are diffusion models vulnerable to membership inference attacks?." International Conference on Machine Learning. PMLR, 2023.
[2] Kong, Fei, et al. "An Efficient Membership Inference Attack for the Diffusion Model by Proximal Initialization." The Twelfth International Conference on Learning Representations.
Thank you for providing the additional results. I believe this paper is very solid rn. Therefore, I keep my score positive.
We deeply appreciate that you find our work very solid. We will include the additional experiments in the updated revision and emphasize the threat model as your suggestion. Thank you again!
The paper addresses potential unauthorized data usage and the privacy concerns in text-to-image diffusion models. The authors introduce a novel membership inference method, Conditional Likelihood Discrepancy (CLiD), which leverages the identified phenomenon of conditional overfitting in these models. They propose two practical membership inference methods, CLiD_th and CLiD_vec, which indicate membership by measuring the KL divergence between conditional distribution of image-text pairs and the distribution of images. The results shows superior performance of their methods compared to existing baselines, particularly in real-world training scenarios with common data augmentation techniques. Also, their method shows robustness to overfitting mitigation strategies like early stopping and adaptive defenses. Their experiments across multiple datasets validate the effectiveness and robustness of CLiD in detecting training data in text-to-image diffusion models.
Strengths
- Conditional Likelihood Discrepancy, a novel method for membership inference on text-to-image diffusion models, is a significant contribution and significantly outperforms existing methods.
- For the empirical validation of Assumption 3.1, the authors provide thorough results using various metrics such as FID, Wasserstein Distance, Kernel MMD, and 1-NN.
- Experimental results show that the CLiD_th and CLiD_vec methods significantly outperform existing baselines in terms of ASR, AUC, and TPR@1%FPR. This includes various training scenarios with data augmentation, highlighting the robustness and effectiveness of their approach.
- By focusing on a foundation text-to-image diffusion model like SD, which is widely used in practice, the paper addresses a timely and relevant problem of data privacy and unauthorized usage in a practical context.
Weaknesses
- Since the theoretical results depend on Assumption 3.1, I believe that validating this assumption on other dataset domains is important. Would you observe the same phenomenon if you tested on domain-specific datasets, such as faces (CelebA, FFHQ, ...)?
- The authors do not present empirical results on the benchmarks that previous papers have tested on, such as CelebA, Tiny ImageNet, and the CIFAR datasets.
Questions
- What is the main difference between the CLiD_th and CLiD_vec methods in terms of accuracy and effectiveness? Is one preferred over the other in general, or in specific settings? Why do some tables and figures not show CLiD_vec results?
- What is the benefit of evaluating the overfitting scenario? Ideally, shouldn't the evaluation setting be as close as possible to real-world scenarios?
- How do these methods compare to existing membership inference methods in terms of computational complexity and runtime?
Limitations
Yes.
Thank you for appreciating the novelty and the effectiveness of our work as well as providing valuable feedback. Below we address the detailed comments and hope that you may find our response satisfactory. (Refer to the submitted PDF for Tab. E and Fig. B)
Weakness (1):
The concern about assumption validation in domain-specific datasets.
Answer (1):
Thank you for your suggestion. This assumption arises from the overfitting of text-to-image diffusion models to the conditional distribution, so this phenomenon is widely present in image-text datasets, including domain-specific datasets.
First, the Pokemon dataset we use is an open-source dataset containing images of Pokemon characters [1]. Due to its distinct style compared to MS-COCO and Flickr, the Pokemon dataset can be considered a domain-specific dataset. Our experimental results (Tab. 1 and Tab. 2) show that our method is effective on this domain-specific dataset.
Second, we conduct extra experiments to show that the assumption also holds on other domain-specific datasets such as faces (Fig. B). The MMCelebA dataset [2] is a multimodal facial dataset that includes faces and corresponding text descriptions. We repeat the experiments in Sec. 3.2 and Appendix 1. The results show that the distribution distance between the member set and the hold-out set is consistently higher than that with truncated conditions, validating our assumption.
Weakness (2):
The lack of evaluation on other datasets, such as CIFAR.
Answer (2):
(1) Thank you for your suggestion. Current MI methods mainly focus on unconditional DMs, which is not suitable for real-world applications. Hence, in this paper, we design an MI method specifically for text-to-image models (conditional DMs). Our method therefore primarily targets image-text data, which is more aligned with real-world scenarios where copyright issues are prevalent.
(2) However, in principle, our method is applicable to class-conditional datasets such as CIFAR-10 as well. We conduct extra experiments using CIFAR-10 to further validate this (Tab. E).
We revised Eq. (14) and Eq. (16) to align with class-conditional DMs as follows:
- First, we estimate CLiD between the groundtruth label of each data point and every other label:
  $\mathrm{CLiD}(x, y) \approx \frac{1}{C-1} \sum_{\tilde{y} \neq y} \left[ \ell_\theta(x, \tilde{y}) - \ell_\theta(x, y) \right],$
  where $y$ refers to the groundtruth label of a data point, $\tilde{y}$ refers to another (false) label, $C$ is the number of classes, and $\ell_\theta(\cdot, \cdot)$ denotes the class-conditional estimation error.
- Then we use the threshold-based attack method for the final classification: predict "member" if $\mathrm{CLiD}(x, y) \geq \tau$.
We use the 50,000 CIFAR-10 training images to train a class-conditional DM. We use RandomFlip augmentation to prevent overfitting. In Tab. E, we can observe that with only simple data augmentation, the baselines' performance on the CIFAR-10 dataset declines compared to what is claimed in their papers. In contrast, our method shows a clear improvement, indicating its effectiveness on CIFAR-10.
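For concreteness, a minimal sketch of this class-conditional adaptation under stated assumptions (not the paper's exact code); `class_cond_loss` is a hypothetical helper returning the denoising MSE of a class-conditional diffusion model for an (image, label, timestep) triple:

```python
# Sketch: average the loss gap between every false label and the groundtruth label,
# then apply a threshold (CLiD_th analogue for class-conditional models).
import numpy as np

NUM_CLASSES = 10  # CIFAR-10

def clid_class_score(model, image, y_true, class_cond_loss, timesteps=(100, 300, 500)):
    gaps = []
    for y_false in range(NUM_CLASSES):
        if y_false == y_true:
            continue
        gap = 0.0
        for t in timesteps:
            # Members should lose more when the true label is swapped for a wrong one.
            gap += class_cond_loss(model, image, y_false, t) - class_cond_loss(model, image, y_true, t)
        gaps.append(gap / len(timesteps))
    return float(np.mean(gaps))

def threshold_attack(score, tau):
    # Predict "member" if the score exceeds the chosen threshold tau.
    return score >= tau
```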
Question (1):
What is the difference between CLiD_th and CLiD_vec? Why do some tables and figures not show CLiD_vec results?
Answer (Q1):
Compared to CLiD_th (Eq. (16)), which uses a threshold, CLiD_vec (Eq. (18)) employs a simple classifier (we use XGBoost in the paper) to distinguish between the member set and the hold-out set. Since the classifier's objective is accuracy (i.e., ASR), this method typically achieves higher ASR and AUC but a lower TPR@1%FPR compared to CLiD_th (Tab. 1 and Tab. 2).
Nevertheless, both methods essentially use CLiD (Eq. (11)) as the MI indicator. Therefore, in further analysis experiments, we select only one method to save space and computational cost.
In the updated version, we will include the CLiD_vec results in these tables and figures.
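For reference, a hedged sketch of the CLiD_vec classification step as described above; the feature extraction is assumed to produce one loss-gap value per timestep, and the XGBoost hyperparameters are illustrative rather than the paper's settings:

```python
# Sketch: instead of thresholding one aggregated score, fit a small XGBoost classifier
# on per-timestep discrepancy vectors computed from auxiliary member / hold-out samples.
import numpy as np
from xgboost import XGBClassifier

def fit_clid_vec(member_feats, holdout_feats):
    # member_feats, holdout_feats: arrays of shape (n_samples, n_timesteps).
    X = np.concatenate([member_feats, holdout_feats], axis=0)
    y = np.concatenate([np.ones(len(member_feats)), np.zeros(len(holdout_feats))])
    clf = XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")
    clf.fit(X, y)
    return clf

# Membership probability for a new discrepancy vector v of shape (n_timesteps,):
#   p_member = clf.predict_proba(v[None, :])[0, 1]
```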
Question (2):
What is the benefit of evaluating the overfitting scenario?
Answer (Q2):
Yes, we also emphasize that the evaluation setting should align with real-world scenarios (line 258). We use the over-training setting because it is commonly adopted by previous works such as SecMI and PFAMI. Our experiments under this setting (Tab. 1) indicate excessive overfitting, which prevents MI methods from being properly evaluated. Therefore, we then develop a real-world training setting and show that our method outperforms the baselines in both settings.
Question (3):
The evaluation of computational complexity and runtime compared with baselines.
Answer (Q3):
Since diffusion model inference is the main computational process in conducting MI, the computational complexity and runtime are proportional to the query count. We provide the query counts for our methods and the baselines (Tab. 1, Tab. 2) and a detailed analysis in Appendix E. Our methods outperform the baselines when their query counts are about the same (e.g., SecMI and PFAMI). Additionally, in Fig. 3, when we set "M=0, N=1" (Q=4), our method achieves an AUC of 0.923, higher than the 0.654 of SecMI (Q=12). This shows that even with less computational time, our method still significantly surpasses the baselines.
We thank you again for your valuable feedback. We hope our responses above have addressed all your concerns and questions. We are happy to answer any follow-up questions you may have.
[1] https://huggingface.co/datasets/lambdalabs/pokemon-blip-captions
Thanks for the clarifications. Great work. I keep my score as is.
Thank you for your support of our work. We will include additional experiments in the final manuscript, as stated in the rebuttal. Thank you again!
We thank all reviewers for their constructive feedback. We are encouraged by your appreciation of our clear motivation and positive societal impact (Reviewers iMS8, gj8y, and k8sd), innovative and pioneering method (Reviewers meQ8, iMS8, and gj8y), and comprehensive and practical experiments (Reviewers meQ8, iMS8, and gj8y).
We have responded to each reviewer individually.
We have also uploaded a rebuttal PDF that includes our additional experiments as follows:
- Table A: We conduct experiments using generated text for membership inference (MI) when groundtruth text is not available (over-training setting).
- Table B: We conduct experiments using generated text for MI when groundtruth text is not available (real-world training setting).
- Table C: We train a text-to-image diffusion model from scratch using the SDv1-2 structure on the MS-COCO 2017 dataset and report the results of different MI methods.
- Table D: We finetune the model using an imbalanced image-text dataset (i.e., each image corresponds to multiple texts with varying proportions) and evaluate the effectiveness of different MI methods.
- Table E: We train a class-conditional diffusion model (class-conditional DM) on CIFAR-10 and present evaluation results of MI methods.
- Figure A: Examples of groundtruth text and text generated by BLIP and GPT4o-mini on MS-COCO (refer to Table A and Table B).
- Figure B: Further validation of Assumption 3.1 on a domain-specific dataset (MMCelebA).
We hope all of your concerns have been well-addressed in our responses. We are more than willing to address any follow-up questions you may have.
This paper proposes an algorithm to identify unauthorized data usage for text-to-image diffusion models’ training, through the membership inference method. Observing that models tend to overfit the text-conditional image distributions rather than marginal distributions of images, an analytical indicator is proposed to measure the KL divergence of two distributions, thus identifying the membership of images. Reviewers commend the significance of the observation and the proposed method and agree that the proposed algorithm is simple, straightforward, and promising. The rebuttal is detailed and carefully clarifies reviewers’ concerns including implications of experiment results, impacts, and applicability of algorithms when the ground truth text prompt is unknown. Given the consensus among all the reviewers, the AC recommends the acceptance of this paper.