MemBench: Memorized Image Trigger Prompt Dataset for Diffusion Models
We introduce a benchmark containing prompts that induce memorized images in text-to-image diffusion models. Using this benchmark, we evaluate various memorization mitigation methods.
Abstract
Reviews and Discussion
MemBench is a novel benchmark designed to evaluate image memorization issues in text-to-image diffusion models. The work includes a benchmark dataset for evaluating image memorization in text-to-image generation models (the authors claim to be the first), and the benchmark itself contains comprehensive metrics on memorization.
In addition to the benchmark itself, the authors provide an analysis of the effectiveness of various mitigation approaches.
Strengths
The proposed MCMC-based search is a novel idea for quickly searching for problematic prompts.
The design of MemBench is well thought out, and its metrics are comprehensive enough to support wide adoption.
Weaknesses
The work uses Stable Diffusion 2 to show empirical evidence of memorization and how well the benchmark works. However, the proposal seems inapplicable to models like Stable Diffusion 3 (or other models newer than SD2) where the training data is unknown. In Appendix D, the authors did not show much evidence of memorization for SD3. This makes me wonder how widely applicable this work is.
Questions
In Appendix D, the authors mention they cannot do a reverse image search on SD3-generated images since its training set is unknown. Have you tried a reverse search against the LAION-5B dataset, even though there is no confirmation that SD3 used LAION-5B? Given the size and popularity of LAION, I would assume it (partially) overlaps with the training sets of many modern text-to-image diffusion models.
We sincerely appreciate the reviewer's valuable feedback. However, there may have been a misunderstanding about our related work and motivation. We would be truly grateful if these points could be revisited during the rebuttal period.
Before addressing the questions, we summarize the contributions highlighted in the global response:
- MCMC trigger prompt searching algorithm: the only approach currently capable of searching for trigger prompts that replicate training images when the training data (LAION) is no longer accessible. Despite not using the training data, it requires significantly fewer computational resources than prior methods [C1, C2] that do require it.
- MemBench: we are the first to 1) provide a general prompt scenario, 2) provide an absolute standard for performance evaluation, and 3) evaluate image quality.
For details, we kindly refer the reviewer to the global response.
[C1] Extracting Training Data from Diffusion Models, USENIX’23.
[C2] A Reproducible Extraction of Training Images from Diffusion Models, Arxiv preprint.
W1. The method seems inapplicable to models like Stable Diffusion 3 with unknown training data. (The work uses Stable Diffusion 2 to show empirical evidence)
We would like to point the reviewer to Figure 6, where we provide Stable Diffusion 3 memorization candidates found by our algorithm. The repetitively generated images shown in Figure 6 meet the criteria of the memorization detection method proposed by Carlini et al. [C1] and yield high values under the detection method established by Wen et al. [C3]. All this evidence suggests that the images found by our algorithm are Stable Diffusion 3 memorized data. However, as stated in Appendix E of the revised paper, we did not create a dataset for newer models like Stable Diffusion 3, where the training data is unknown, because we cannot verify whether the images our algorithm found are actually in the training dataset. Building a benchmark without knowing whether the training data exists on the web is unreasonable.
Also, we would like to point out that we already demonstrated the generalizability of our algorithm by applying it to various models, including Stable Diffusion 1, Stable Diffusion 2, Realistic Vision, and DeepFloydIF in Table 1 of the main paper, as well as Stable Diffusion 3 in Figure 6. The prompts found across these models show that our method is generalizable and efficient. We additionally provided examples of Stable Diffusion 1 memorized images and prompts in Appendix G (Appendix H in the revised paper). The reason we focus on Stable Diffusion 2 in Section 6 is to investigate why image memorization occurs even without image repetition in the training data, a factor widely regarded as a major contributor to memorization. Stable Diffusion 2 is not the only model for which we show empirical evidence of our method.
[C3] Detecting, Explaining, and Mitigating Memorization in Diffusion Models, ICLR’24.
W2. Have you attempted to reverse search in LAION-5B for Stable Diffusion 3?
As stated in lines 129–131 of the main paper, LAION-5B has been made inaccessible, making it impossible to query images from it. Our core contribution lies precisely in developing an algorithm to identify trigger prompts despite LAION-5B being inaccessible.
We hope that our additional experiments and responses have effectively addressed the reviewer’s concerns. If there are any further questions or clarifications needed, we would be happy to address them.
Thank you for the clarification. I appreciate the response.
If there are any additional discussion points or questions, we are happy to discuss. Thank you.
This paper introduces MemBench, a benchmark for evaluating memorization mitigation methods in diffusion models. The core technical contribution is an MCMC-based sampling approach that efficiently finds prompts triggering generation of training data, building upon the memorization detection metric (Dθ) from Wen et al. (2024). The benchmark includes both memorized image trigger prompts and evaluation metrics to assess mitigation methods' effectiveness. The authors demonstrate their method finds substantially more trigger prompts than previous work and reveal issues with existing mitigation approaches.
However, the paper's organization and presentation significantly obscure its contributions. The main technical innovation - using MCMC for efficient prompt sampling - is not clearly highlighted and is buried under various other aspects. The benchmark itself, while valuable, would benefit from clearer exposition of its components and limitations.
Strengths
- Novel MCMC-based approach for finding memorization triggers
- Comprehensive evaluation framework for mitigation methods
- Large-scale trigger prompt discovery
- Practical insights into mitigation limitations
- Not requiring training data
Weaknesses
- Critical Methodological Limitations:
- The paper relies heavily on pre-trained language models (specifically BERT) for prompt sampling but fails to acknowledge this as a fundamental limitation. This is particularly problematic as such models have a training cutoff date (pre-2018 for BERT), making it impossible to find triggers containing newer terms or concepts.
- The evaluation framework potentially misses a significant portion of memorization cases due to this temporal limitation.
- The dependency on pre-trained language models restricts the generalizability of the method.
- Unclear and Redundant Writing:
- The paper's findings section contains three nearly identical statements that could be consolidated: Quote: "MemBench reveals several key findings:
- All image memorization mitigation methods result in a reduction of Text-Image alignment between generated images and prompts...
- The mitigation methods affect the image generation capabilities of diffusion models, which can lead to lower image quality...
- The mitigation methods can cause performance degradation in the general prompt scenario..." These are essentially making the same point about performance degradation.
- The paper makes claims about metric superiority without proper justification: Quote: "while Ren et al. (2024) measured FID, the Aesthetic Score offers a more straightforward way to evaluate individual image quality and better highlight these issues." No evidence or explanation is provided for why Aesthetic Score is "more straightforward" or "better."
- Key metrics are used without proper introduction: The paper extensively uses SSCD scores without ever explaining what they measure or their significance.
- Poor Communication of Technical Contributions:
- The paper's main technical innovation (efficient prompt sampling) is not clearly highlighted.
- The findings section contains redundant statements about mitigation methods that could be consolidated.
- Claims about metric choices (e.g., aesthetic score vs FID) lack proper justification.
Questions
My main problem with the paper is the organization of the writing. I like the idea and think it is novel; however, the writing does not focus on the main contribution and fails to properly reference some of the things I mentioned in the weaknesses section.
- Language Model Limitations:
- How do you address the temporal limitation of using BERT (trained pre-2018) for prompt generation?
- What percentage of potential memorization cases might be missed due to this limitation?
- Have you considered using more recent language models or alternative approaches?
- Methodological Clarity:
- Could you provide specific details about the computational constraints that prevented certain comparison methods from being included in the main body of the paper, not just in the appendix?
- Can you clarify the source and context of the reported AUC values from Wen et al. 2024 in the main body of the paper?
- Metric Choices and Evaluation:
- What empirical evidence supports the choice of the Aesthetic Score over FID? Please also include FID in the paper.
We thank the reviewer for the valuable feedback. Thank you for recognizing our contributions, including the efficient MCMC searching algorithm that operates without training data, the extensive prompt collection in MemBench, and our thorough evaluation of mitigation methods. We have conducted and provided all the experiments requested by the reviewer. Through this revision, we believe our submission has been further improved.
W1. Unclear writing: 1) The technical innovation of the paper is not clearly highlighted. 2) The benchmark findings section should be consolidated. 3, 4) Provide a detailed explanation of the SSCD metric / the AUC value of the detection method [C1]. 5) Provide the specific computational constraints that prevented certain comparison methods from being included in the main body of the paper, not just the appendix.
Thank you for recognizing one of our contributions, the technical innovation of searching for memorized images in diffusion models without relying on training data. We have addressed all the points raised by the reviewer and incorporated the changes into the revised paper, marked in red for convenience (Cons 1: lines 78-84 / Cons 2: 71-75 / Cons 3: 179-184 / Cons 4: 204-210). Regarding Cons 5, we would like to confirm whether it refers to moving the constraints of other methods (PEZ, etc.) to the main paper. If so, we have moved them to lines 371-378. However, if it refers to the other works [C2, C3], we have already discussed them in the main paper in lines 388-391.
[C1] Detecting, Explaining, and Mitigating Memorization in Diffusion Models, ICLR’24.
[C2] Extracting Training Data from Diffusion Models, USENIX’23.
[C3] A Reproducible Extraction of Training Images from Diffusion Models, Arxiv preprint.
W2. BERT (pre-2018) is outdated: it cannot find trigger prompts containing new terminology
Thank you for pointing out this important aspect. To address the reviewer's concern, we conducted several experiments and propose solutions below. We found that the number of trigger prompts containing new terms is not as significant as feared.
W2-1. Have you considered using more recent language models?
During this rebuttal, we implemented our MCMC algorithm with a model trained on a newer dataset, the 2023 Wikipedia. We ran 250 MCMC sampling runs and identified 55 unique memorized prompts. However, none of these prompts contained new terms, and all of the images generated from them had previously been found with BERT. This suggests that sentences with new terms are likely rare and indicates that the 2018 BERT model we used is sufficient to uncover a significant portion of trigger prompts.
W2-2. A solution to BERT's temporal limitation
Additionally, we propose a novel approach based on Algorithm 1 that utilizes BERT to find prompts containing recent terminology. Specifically, we replace the initially masked sentence "[MASK] [MASK] … [MASK]" with a reinitialized sentence that includes a new term, such as "[MASK] … [New Term] … [MASK]", before performing the MCMC sampling process. A simplified sketch of this idea is shown below.
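For illustration only, the following minimal sketch shows this new-term reinitialization and one resampling step with a Hugging Face BERT masked LM. The `memorization_score` hook and the Metropolis-style acceptance rule are simplified placeholders, not the exact proposal and acceptance steps of Algorithm 1 in the paper.

```python
# Minimal illustrative sketch (not the paper's exact Algorithm 1): new-term
# initialization and one MCMC resampling step over a BERT masked LM.
# `memorization_score` is a hypothetical stand-in for the detection-metric-based
# score used to accept or reject proposals.
import random
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

def init_ids(length, new_term=None):
    """Token ids for '[MASK] ... [MASK]', optionally with a new term inserted."""
    ids = [tokenizer.mask_token_id] * length
    if new_term is not None:
        term_ids = tokenizer.encode(new_term, add_special_tokens=False)
        start = random.randrange(max(1, length - len(term_ids) + 1))
        ids[start:start + len(term_ids)] = term_ids
    return [tokenizer.cls_token_id] + ids + [tokenizer.sep_token_id]

@torch.no_grad()
def propose(ids, pos):
    """Gibbs-style proposal: re-mask one position and resample it from BERT."""
    cand = list(ids)
    cand[pos] = tokenizer.mask_token_id
    logits = mlm(input_ids=torch.tensor([cand])).logits[0, pos]
    cand[pos] = int(torch.multinomial(torch.softmax(logits, dim=-1), 1))
    return cand

def mcmc_step(ids, memorization_score, frozen=()):
    """One step: positions in `frozen` (e.g., the inserted new term) are never resampled;
    the candidate is accepted with higher probability when its memorization score increases."""
    pos = random.choice([i for i in range(1, len(ids) - 1) if i not in frozen])
    cand = propose(ids, pos)
    old = memorization_score(tokenizer.decode(ids[1:-1]))
    new = memorization_score(tokenizer.decode(cand[1:-1]))
    accept = random.random() < min(1.0, torch.exp(torch.tensor(new - old)).item())
    return cand if accept else ids
```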
To demonstrate the effectiveness of this method during the rebuttal, we conducted an experiment, where we trained Stable Diffusion to memorize 22 sentences containing new terms (e.g., COVID-19, Clubhouse, Tenet, …) and their corresponding images, to evaluate whether the proposed method works. In the trigger prompt searching process, we initialized sentences by randomly inserting these new terms into the masked format, such as “[MASK] … COVID-19 … [MASK],” and conducted MCMC sampling. As a result, we successfully retrieved 13 out of the 22 memorized images. This demonstrates that random initialization of new terms in a sentence can effectively retrieve the memorized images associated with them.
W2-3. What percentage of potential memorization cases might be missed due to this limitation?
In W2-1, we noted that the portion of trigger prompts containing new terms is likely low. To confirm this in another way, we conducted a controlled counterpart to the experiment in W2-2, using the original Stable Diffusion model instead of the fine-tuned version. We inserted 40 new terms, including those used in W2-2, into masked sentences as initialization and performed MCMC sampling. However, no prompts containing new terms were retrieved. This suggests that the proportion of prompts with new terms is likely very low.
Additionally, we investigated the trigger prompts provided by a recent study [C3] that had access to the LAION dataset and examined all of its prompts and images to identify memorized images. Upon reviewing their dataset, we found no prompts containing new terms introduced after 2018. As demonstrated, our algorithm is not restricted to the 2018 BERT model, and our series of experiments shows that new terms are a minority and not significant at this moment; thus, we believe this does not discount the notable contributions of our work.
[C3] A Reproducible Extraction of Training Images from Diffusion Models, Arxiv preprint.
W3. Why is the Aesthetic Score better than FID? / Please also include FID
As the reviewer requested, we conducted a deeper investigation into the Aesthetic Score and FID. While Ren et al. [C4] measured FID, FID does not provide meaningful information about image quality. They reported that FID decreases when mitigation methods are applied, and our experiments on MemBench similarly showed a decrease in FID. This is because FID does not evaluate the quality of individual images but rather assesses the quality of the generated image pool, which reflects diversity. When memorization mitigation methods are applied, memorized images are no longer generated, leading to an increase in the diversity of the generated image pool and a lower FID. Therefore, the Aesthetic Score, which measures the quality of individual images, is necessary. For details, we kindly refer the reviewer to the global response.
We hope that our additional experiments and responses have effectively addressed the reviewer’s concerns. If there are any further questions or clarifications needed, we would be happy to address them.
[C4] Unveiling and Mitigating Memorization in Text-to-image Diffusion Models through Cross Attention, Arxiv preprint.
If there are any additional discussion points or questions, we are happy to discuss. Thank you.
I want to thank authors for addressing my comments. I think the paper adds value to the community. However I will still keep my score as 6.
This paper studies the memorization issue in text-to-image diffusion models, where generated images are almost or exactly the same as training images, leading to privacy concerns. The paper proposes a new trigger prompt dataset containing prompts that lead to memorized generations. It evaluates several baseline mitigation strategies on the proposed dataset as well as on non-triggering prompts and concludes that existing strategies still have room for improvement, as they degrade the overall performance of diffusion models.
Strengths
- The paper is well presented, with good clarity, a clear structure, and easy-to-follow writing.
- The paper addresses an important and practical task: the reverse process of the diffusion model can invoke almost or exactly the same memorized images from the training data. This can lead to privacy issues, which are well discussed and motivated in the introduction section.
- This paper proposes 3,000, 1,500, 309, and 1,352 memorized image trigger prompts for different generative models, compared to a previous benchmark that provides 325, 210, 162, and 354 prompts. Adding additional trigger prompts as a dataset is beneficial for a more comprehensive evaluation of mitigation strategies.
Weaknesses
- Limited contribution.
- While the additional trigger prompts introduced in this paper can indeed aid in evaluating existing mitigation strategies and provide further assessment of their performance, such a contribution remains somewhat incremental. Rather than addressing a critical need in the field, it adds to existing resources that are already robust. For example, prior methods [1, 2] have utilized 500 trigger prompts, which a recent work [3] systematically organized by extracting and categorizing them into three types (Matching Verbatims, Retrieval Verbatims, and Template Verbatims) based on their distinct characteristics. This categorization already offers a comprehensive framework, making the additional prompts in the present work more of an enhancement than a necessary innovation. I understand that this work's main contribution is proposing a new benchmark trigger prompt dataset, unlike others that focus on proposing strategies to detect and mitigate the memorization problem; still, this core contribution (the proposed prompt dataset) is quite limited.
- I have also carefully read all the claimed contributions in the paper and appreciate the authors' effort in presenting them in a well-structured way that is easy to follow. However, my concern remains, as the other claimed contributions are also trivial. For example, the idea of using general (non-memorized) prompts in addition to trigger prompts for evaluation is a good point. However, it is more of a good practice for related work than a significant contribution. Also, the finding that all mitigation methods can result in a reduction of text alignment reflects the trade-off nature of mitigation strategies, which is already well documented in related work and is not a new finding.
[1] Detecting, Explaining, and Mitigating Memorization in Diffusion Models.
[2] Unveiling and Mitigating Memorization in Text-to-image Diffusion Models through Cross Attention.
[3] A Reproducible Extraction of Training Images from Diffusion Models.
- Concerns regarding motivation. I find the descriptions (motivations) in lines 46-50 misleading, where the paper states: "The current studies (Wen et al., 2024b; Somepalli et al., 2023b) have adopted the following workaround: 1) simulating memorization by fine-tuning T2I diffusion models for overfitting on a separate small and specific dataset of {image, prompt} pairs, and 2) assessing whether the images used in the fine-tuning are reproduced from the query prompts after applying mitigation methods …" In fact, these papers fine-tune on constructed non-triggering image-prompt pairs to mitigate the memorization (overfitting) issue as training-time strategies, and they also propose inference-time mitigation strategies that do not rely on the construction of prompt datasets. Both are quite practical, contrary to the paper's claims.
Questions
Please check out the weaknesses section, where the limited contribution and motivation are the major concerns that make me inclined to reject this paper.
We sincerely appreciate the reviewer's valuable feedback. There may have been some misunderstanding about our related work and motivation. We would be truly grateful if these points could be revisited during the rebuttal period.
Before addressing the questions, we summarize the contributions highlighted in the global response:
- MCMC trigger prompt searching algorithm: the only approach currently capable of searching for trigger prompts that replicate training images when the training data (LAION) is no longer accessible. Despite not using the training data, it requires significantly fewer computational resources than prior methods [C1, C2] that do require it.
- MemBench: we are the first to 1) provide a general prompt scenario, 2) provide an absolute standard for performance evaluation, and 3) evaluate image quality.
For details, we kindly refer the reviewer to the global response.
[C1] Extracting Training Data from Diffusion Models, USENIX’23.
[C2] A Reproducible Extraction of Training Images from Diffusion Models, Arxiv preprint.
W1. Concerns regarding motivation: Contrary to what is stated in lines 44-48, train-time and inference-time mitigation methods [C3, C4] do not require the construction of a trigger prompt dataset to operate.
- Full text of lines 44-48 that reviewer referenced: “As an adhoc assessment method, the current studies [C3, 4] have adopted the following workaround: 1) simulating memorization by fine-tuning T2I diffusion models for overfitting on a separate small and specific dataset of {image, prompt} pairs, and 2) assessing whether the images used in the fine-tuning are reproduced from the query prompts after applying mitigation methods.”
- Our motivation: As we explicitly indicated with the phrase "As an ad hoc assessment method," the motivation of these sentences was to highlight the inappropriate method these works use for evaluating their inference-time mitigation methods; we have not claimed that those methods require trigger prompt datasets to operate. Due to the limited number of trigger prompts inherently found in Stable Diffusion, prior works heavily fine-tuned Stable Diffusion to artificially create trigger prompts and memorized images, which were then used to evaluate inference-time mitigation methods. We clearly confirmed this through email communication with the authors of Wen et al. [C3] before submitting our paper. As we stated in lines 48-50, demonstrating the effectiveness of mitigation methods on an artificially fine-tuned Stable Diffusion does not guarantee their performance on trigger prompts already present in the original Stable Diffusion model. This gap underscores the necessity of our evaluation benchmark.
- Additionally, we thoroughly described both train-time and inference-time mitigation methods in the related work section and explicitly stated that our benchmark was developed to evaluate inference-time mitigation methods, given the current limitations of train-time mitigation methods. We kindly refer the reviewer to lines 88-111 in the revised paper for clarification.
- We kindly ask for additional clarification if our interpretation of the question is incorrect.
[C3] Detecting, Explaining, and Mitigating Memorization in Diffusion Models, ICLR’24.
[C4] Understanding and Mitigating Copying in Diffusion Models, CVPR’23.
W2. Limited Contribution
W2-1. Prior methods [C3, 5] have utilized 500 memorized prompts from the prior dataset [C2]
- First, we would like to clarify the facts: the prior method [C3] has not been tested on the mentioned dataset, and even the other work [C5], which evaluated its method on that dataset, used only 325 prompts.
- The prior dataset [C2] the reviewer referenced corresponds to the dataset compared with MemBench in Table 1 of the main paper. However, upon examining the GitHub repository of recent work [C2], we found that only 325 of the 500 samples are trigger prompts, while the remaining 175 are labeled as type 'N,' which denotes false samples rather than memorized images. The difference between 325 [C2] and 3000 samples (ours) is significant, and 325 is a tiny number for a test set. We contacted the authors of the referenced paper [C5], and they clearly confirmed via email that only 325 samples were used.
- As stated in W1, another work [C3] the reviewer referenced does not evaluate the mitigation method on [C2], instead, their evaluations rely on an intentionally fine-tuned Stable Diffusion model.
- Papers [C3, 5] conduct experiments exclusively on Stable Diffusion 1. However, whether a method that performs well on Stable Diffusion 1 would generalize to other models remains uncertain. In contrast, our work provides experiments on Stable Diffusion 2 in Table 4 of the supplementary material.
[C5] Unveiling and Mitigating Memorization in Text-to-image Diffusion Models through Cross Attention, Arxiv preprint.
W2-2. The trade-off between text-image alignment and similarity score is not a new finding.
- In the main paper, lines 500-501, we already stated that Wen et al. [C3] identified a trade-off between text-image alignment and similarity score. However, our additional finding, detailed in lines 501-505, highlights that prior methods fail to sufficiently lower the similarity score while maintaining text-image alignment when compared to the reference performance we provide using Google Image Search. Previously, the assessment of methods relied on relative evaluation of the trade-off between text-image alignment and similarity score, as there was no absolute reference for comparison; thus, one could not judge whether an algorithm is usable or not. Our contribution lies in establishing this absolute reference, which is a notable contribution in this field. For details on the reference performance, please refer to lines 419-430 of the main paper.
W2-3. Prior work [C2] already provides a comprehensive framework with categorization.
- As stated in W2-2, we are the first to establish an absolute reference, and, as the reviewer acknowledged, we provide a practical general prompt scenario and more prompts that strengthen comprehensive evaluation. These contributions represent the distinct advantages of our paper as a benchmark compared to the prior work [C2].
- Furthermore, while the prior work [C2] provides a categorization, it does not define metrics, as its purpose is not to establish a benchmark. However, proposing a comprehensive benchmark with well-defined metrics is crucial to prevent subsequent studies from selectively using metrics to their advantage, which is what our benchmark does. For instance, Wen et al. [C3] (Fig. 4.a in the referenced paper) did not evaluate image quality, and Ren et al. [C5] (Fig. 6.a in the referenced paper) did not assess text-image alignment for inference-time mitigation. Our contribution lies in presenting comprehensive metrics to build a more robust framework.
- As stated in the global response, we are the first to measure image quality via the Aesthetic Score. Wen et al. have only measured text-image alignment and similarity. While Ren et al. measured FID, FID does not provide meaningful information about image quality. For details, we kindly refer the reviewer to the global response.
We hope that our additional experiments and responses have effectively addressed the reviewer’s concerns. If there are any further questions or clarifications needed, we would be happy to address them.
If there are any additional discussion points or questions, we are happy to discuss. Thank you.
Thank you for your detailed response. I acknowledge the additional contributions made in this work compared to existing studies. However, my assessment remains that the additional contributions are quite limited. Upon carefully reviewing the most relevant prior works cited in the paper during my initial review, I found that many of the claims overlap with points already addressed in those studies. Although the rebuttal has again emphasized this paper's contributions, I find them to be in line with my original assessment that they are marginal. While this work does provide some marginal improvements, these contributions, such as organizing a larger prompt dataset or incorporating general (non-memorized) prompts for evaluation, feel relatively minor in their impact.
To strengthen the paper, it would benefit from a clearer articulation of the limitations of existing works and a focus on making more substantial contributions. This could provide a stronger motivation for the work and better demonstrate its significance beyond the current incremental advances. I wish to maintain my original rating of 5.
This paper proposes a benchmark to evaluate memorization mitigation methods in text-to-image diffusion models. MemBench includes a large set of trigger prompts that replicate training images, potentially causing privacy or copyright issues. It assesses mitigation methods by examining their effectiveness in preventing memorized images without degrading general prompt performance.
Strengths
- The paper reformulates the search for memorized image prompts as an optimization problem, which allows for more efficient prompt discovery.
- It employs a Markov Chain Monte Carlo (MCMC) method to solve the optimization problem, resulting in a significantly larger set of memorized prompts compared to previous methods.
- MemBench provides evaluation of existing memorization mitigation methods in diffusion models.
Weaknesses
- This paper does not consider the detection problem. Wen et al. and Ren et al. both use detection before mitigation, so they do not always mitigate on general prompts. It does not make sense to compare generation performance on general prompts.
- The evaluation metrics are not new. Existing papers such as Wen et al. and Ren et al. use all the metrics mentioned in this paper.
- The dataset construction requires white-box access, which makes it ineffective for API-based models. In addition, although it does not use the training set, it requires an API to search for images on the web, which is not efficient.
- The major problem is the benchmark itself, which does not provide a reasonable perspective or new metrics.
Questions
Besides providing a larger dataset, what is the contribution of the new dataset?
W2. Metrics Are Not New (Wen et al. [C3] and Ren et al. [C4] utilized all the provided metrics)
We kindly ask the reviewer to revisit this point, as there seems to be a misunderstanding of the related works. As stated in lines 63-65 of the main paper and in the global response, we are the first to measure image quality via the Aesthetic Score. Wen et al. have only measured text-image alignment and image similarity. While Ren et al. measured FID, FID does not provide meaningful information about image quality; thus, we propose a remedy. For details, we sincerely ask the reviewer to refer to the global response.
Furthermore, Ren et al. (Fig. 6.a in the referenced paper) did not assess text-image alignment for inference time mitigation methods. To address these selective omissions of metrics in individual studies, our work not only suggested the image quality metric (Aesthetic Score), but also highlighted the necessity of evaluating clearly defined metrics all at once to see multiple aspects.
W3. The proposed dataset construction method is not applicable to API-call-based models.
Our paper's objective is to create a benchmark for evaluating existing mitigation methods, all of which operate in, and are limited to, the white-box diffusion model setting. Thus, applying these mitigation methods to API-based models is not feasible, which is why we focused our study on models with accessible weights. Therefore, this should not be considered a weakness of our work.
W4. Reverse Image Search API is inefficient.
The Tineye API that we utilized in the paper requires <0.1s per sample, which is efficient. As mentioned in the related work of the main paper, our approach is far more efficient than existing works [C1,2]. Furthermore, previous methods are no longer usable due to the inaccessibility of LAION 2B. In this regard, our approach is not only efficient but also effective in that our verification checks web-scale data beyond a single training dataset in contrast to the existing methods.
W5. Lack of a reasonable perspective and new metrics
In W1, we demonstrated the presence of a reasonable perspective, and in W2, we highlighted the advantages of the new metric, the Aesthetic Score. Reviewer Dijx acknowledged that providing additional trigger prompts as a dataset is beneficial for a more comprehensive evaluation. Additionally, we would appreciate it if the reviewer could also consider the contribution of our MCMC algorithm, which enables efficient construction of our dataset in contrast to the existing small-scale datasets, as highlighted in the global response.
We respectfully request that the reviewer revisit our clarification regarding the concerns raised and the notable contributions outlined in our global response, and reconsider the evaluation during the rebuttal period. Thank you.
We thank the reviewer for the feedback. However, there may have been some misunderstandings about our related work and motivation. We have addressed all the comments and would be truly grateful if these points could be revisited during the rebuttal period.
Before addressing the questions, we summarize the contributions highlighted in the global response:
- MCMC trigger prompt searching algorithm: the only approach currently capable of searching for trigger prompts that replicate training images when the training data (LAION) is no longer accessible. Contrary to W4, despite not using the training data, it requires significantly fewer computational resources than prior methods [C1, C2] that do require it.
- MemBench: we are the first to 1) provide a general prompt scenario, 2) provide an absolute standard for performance evaluation (contrary to W5), and 3) evaluate image quality (contrary to W2 and W5).
For details, we kindly refer the reviewer to the global response.
[C1] Extracting Training Data from Diffusion Models, USENIX’23.
[C2] A Reproducible Extraction of Training Images from Diffusion Models, Arxiv preprint.
W1. Why do the authors consider the general prompt scenario given that a detection method exists?
This is because we found that the accuracy of the detection methods [C3, C4] is low on other models, and the evaluation of detection methods in these works is limited to specific data. We conducted a test to evaluate the detection methods [C3, C4] on different test datasets using Stable Diffusion 2. Each test dataset consisted of the union of a memorized and a non-memorized prompt subset. While fixing the memorized subset, we varied the non-memorized subset across the union sets during the test. For the non-memorized prompt subsets, Ren et al. [C4] used prompts generated by ChatGPT, while we randomly sampled 500 real prompts from the COCO dataset and the DiffusionDB dataset [C5], and measured AUROC accordingly. We used the AUROC value to quantify a method's ability to distinguish between memorized and non-memorized prompts. For the non-memorized subsets used in this experiment, we manually verified that the prompts were not memorized.
| Detect. Meth. \ Dataset | GPT (reported by Ren et al.) [C4] | COCO | Diff. DB. [C5] |
|---|---|---|---|
| Wen et al. [C3] | 0.922 | 0.732 | 0.824 |
| Ren et al. [C4] | 0.997 | 0.996 | 0.682 |
Table 1. AUROC of detection methods on Stable Diffusion 2 with various non-memorized prompt datasets. GPT stands for the prompt dataset generated by ChatGPT, and those results are quoted directly from Ren et al.
As shown in the table, changing the non-memorized prompt dataset results in lower AUROC values. This implies that the detection methods overfit to ChatGPT prompts. The decrease in AUROC for COCO and DiffusionDB compared to ChatGPT-generated prompts is due to the tendency of the detection methods to classify more detailed prompts as memorized. Notably, DiffusionDB, a dataset comprising prompts from actual users of Stable Diffusion, is much more suitable for evaluating the performance of detection methods. In particular, Ren et al.'s method requires identifying the specific layers capable of detecting memorization through validation experiments with both memorized and non-memorized prompts. This imposes a significant challenge, as it requires full access to the training dataset of each model to accurately verify truly memorized prompts. In practice, this is often impractical, given that many generative models have inaccessible training datasets (e.g., LAION-2B is no longer accessible).
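For clarity, the AUROC protocol above can be summarized with the minimal sketch below; `detection_score` is a hypothetical placeholder for a detection method's per-prompt score (higher meaning more likely memorized), not the actual implementation of Wen et al. [C3] or Ren et al. [C4].

```python
# Minimal sketch of the AUROC protocol: a fixed memorized subset is compared
# against a swappable non-memorized subset (GPT-generated, COCO, or DiffusionDB).
from sklearn.metrics import roc_auc_score

def detector_auroc(memorized_prompts, non_memorized_prompts, detection_score):
    """AUROC of a detector on the union of a memorized and a non-memorized subset."""
    scores = [detection_score(p) for p in memorized_prompts] + \
             [detection_score(p) for p in non_memorized_prompts]
    labels = [1] * len(memorized_prompts) + [0] * len(non_memorized_prompts)
    return roc_auc_score(labels, scores)
```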
[C3] Detecting, Explaining, and Mitigating Memorization in Diffusion Models, ICLR’24.
[C4] Unveiling and Mitigating Memorization in Text-to-image Diffusion Models through Cross Attention, Arxiv preprint.
[C5] DiffusionDB: A Large-scale Prompt Gallery Dataset for Text-to-Image Generative Models, ACL’23.
If there are any additional discussion points or questions, we are happy to discuss. Thank you.
We are grateful for the valuable feedback provided by the reviewers. Our paper introduces MemBench, the first benchmark designed to evaluate memorization mitigation methods in diffusion models, along with an efficient algorithm to search for memorized image trigger prompts. The reviewers commended that MemBench provides a large number of trigger prompts (M71r, Dijx, Ebmr), comprehensive evaluation (M71r, Dijx, Ebmr, 8Rkb), metrics (8Rkb), and a general prompt scenario (M71r, Dijx). Additionally, the reviewers acknowledged our innovative MCMC searching algorithm (M71r, Ebmr, 8Rkb), which can operate without training data (M71r) and enables larger-scale data construction.
While all of our contributions were acknowledged, it seems that individual reviewers might not have considered all of our contribution points at once in their reviews. In this global response, we organize our contributions into two parts: the MCMC searching algorithm and MemBench itself.
MCMC Sampling Algorithm Contribution
- Our MCMC trigger prompt searching algorithm is the only approach currently capable of searching for trigger prompts that replicate training images when the training data (LAION) is no longer accessible.
- Despite not using the training data, it requires significantly fewer computational resources (reduced GPU, computation, and memory demands) than prior methods [C1, C2] that do require it.
- Our method holds value because new benchmarks must be created for each new model. We have demonstrated its scalability across Stable Diffusion 1, Stable Diffusion 2, Realistic Vision, and DeepFloydIF.
Benchmark Contribution:
- MemBench is the first benchmark designed for memorization mitigation methods, offering a larger set of trigger prompts than the prior dataset [C2] and thereby enabling more comprehensive evaluation.
- MemBench introduces the general prompt scenario for the first time, targeting real-world applications of mitigation methods. Recent works [C3, C4] have not evaluated their mitigation methods on non-memorized prompts, but for real-world applications, methods should also be evaluated on them.
- MemBench establishes the first absolute standard for performance evaluation (reference performance) for mitigation methods. We suggest absolute reference criteria using open web data to understand the practical level of the mitigation methods.
- MemBench provides all the necessary metrics (similarity score, text-image alignment score, and image quality score) and an evaluation protocol to assess them all at once. The absence of an organized benchmark allows researchers to selectively report metrics that favor their methods. Recent works [C3, C4] did not evaluate image quality, and Ren et al. [C4] did not assess text-image alignment for inference-time mitigation.
- To the best of our knowledge, we are the first to evaluate image quality (Aesthetic Score). Reviewers pointed out that previous work [C4] already measured image quality (FID). However, in the discussion below, we reveal that image quality cannot be measured by FID in the memorization mitigation task.
We have conducted all the requested experiments and addressed questions. All revisions are highlighted in red in the revised paper. The line numbers referenced in each response correspond to the revised paper. Additionally, unless explicitly stated as newly added content, the referenced lines are the content that existed in the initial submission. We are grateful for all the improvements suggested by the reviewers.
[C1] Extracting Training Data from Diffusion Models, USENIX’23.
[C2] A Reproducible Extraction of Training Images from Diffusion Models, Arxiv preprint.
[C3] Detecting, Explaining, and Mitigating Memorization in Diffusion Models, ICLR’24.
[C4] Unveiling and Mitigating Memorization in Text-to-image Diffusion Models through Cross Attention, Arxiv preprint.
(Important) Why the Aesthetic Score (ours) instead of FID (previous work)?
| FID (↓ is better) | No miti. | RTA | RNA | Wen et al. | Ren et al. |
|---|---|---|---|---|---|
| FID reported by Ren et al. | 102.xx | - | - | 90-92 | 88-91 |
| FID measured on MemBench | 116.07 | 75.33 | 85.32 | 64.21 | 86.79 |
Table 1. FID reported by Ren et al. [C4] and FID measured on MemBench when mitigation methods are applied. FID was measured between the generated images from trigger prompts and the generated images from LAION (Ren et al.) or Flickr80 (ours).
The reviewers' major concern was that previous work [C4] reported FID, and that our Aesthetic Score is therefore not a new metric for measuring image quality. However, FID by definition focuses on the diversity of the generated image pool and therefore fails to measure image quality in the memorization mitigation task. As shown in Table 1, FID decreases when mitigation methods are applied. The underlying reason is that the mitigation methods prevent memorized images from being generated, thereby increasing the diversity of the generated images (as also noted by Ren et al.). Therefore, image quality should not be measured by FID but separately by the Aesthetic Score. In this regard, we are the first paper to highlight image quality degradation.
| | No miti. | RNA | RTA | Wen et al. | Ren et al. |
|---|---|---|---|---|---|
| Aesth. Mean | 5.28 | 5.06 | 5.10 | 5.13 | 5.18 |
| Aesth. Std | 0.44 | 0.59 | 0.56 | 0.65 | 0.59 |
Table 2. Mean and standard deviation of the Aesthetic Score measured on MemBench.
To further address the reviewers' requests regarding metrics, we measured the standard deviation of the Aesthetic Score. As shown in Table 2, applying mitigation methods leads not only to a lower Aesthetic Score but also to a larger standard deviation. This indicates that the diffusion model's outputs become unreliable.
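To make the distinction concrete, the minimal sketch below contrasts pool-level FID (via torchmetrics) with per-image scoring; `aesthetic_score` is a hypothetical placeholder for the aesthetic predictor used to compute the Aesthetic Score, not its actual implementation.

```python
# Minimal sketch: FID compares two image *pools* (so removing memorized images can
# lower it via added diversity), whereas the Aesthetic Score is averaged over
# individual images, exposing per-image quality and its spread.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

def pooled_fid(reference_images: torch.Tensor, generated_images: torch.Tensor) -> float:
    """FID between two uint8 image batches of shape [N, 3, H, W]."""
    fid = FrechetInceptionDistance(feature=2048)
    fid.update(reference_images, real=True)
    fid.update(generated_images, real=False)
    return float(fid.compute())

def per_image_quality(generated_images, aesthetic_score):
    """Mean and standard deviation of a per-image quality score."""
    scores = torch.tensor([aesthetic_score(img) for img in generated_images])
    return scores.mean().item(), scores.std().item()
```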
We have included these discussions in the revised version of the paper.
This paper presents a benchmark dataset featuring memorized image triggers, on which several memorization mitigation methods are evaluated. It has received mixed scores. Unfortunately, after the rebuttal, the negative reviewers maintained their original scores. I resonate with Reviewer DiJX's view that the novelty of this work is somewhat constrained. The primary contribution lies in creating a larger prompt dataset and including general (non-memorized) prompts for evaluation; however, the work seems to lack depth in its motivation, insights, and exploration of the limitations of existing works.
As a result, I have to recommend rejecting this paper. I encourage the authors to take the time to refine and reorganize their work before submitting it again.
Additional Comments from the Reviewer Discussion
After the rebuttal, only Reviewer M71r is weakly positive toward this paper, while the others keep borderline scores. I do appreciate that the proposed MCMC method provides a way to generate a larger prompt dataset. However, I agree with Reviewer DiJX that the novelty of this work is somewhat constrained. The primary contribution lies in creating a larger prompt dataset and including general (non-memorized) prompts for evaluation; however, the work seems to lack depth in its motivation, insights, and exploration of the limitations of existing works.
Reject