Image Copy Detection for Diffusion Models
We propose a timely and important ICD task, i.e., Image Copy Detection for Diffusion Models (ICDiff), designed specifically to identify the replication caused by diffusion models.
Abstract
Reviews and Discussion
This paper studies a novel task of Image Copy Detection for Diffusion Models (ICDiff). Different from the traditional Image Copy Detection task, ICDiff focuses on detecting and evaluating the degree of image copying in images generated by text-to-image (T2I) diffusion models. This is an important and meaningful task for the community to study, given the emergence and influence of T2I diffusion models on downstream tasks such as image editing, icon generation, and style transfer. Effectively and precisely detecting generated images that may "copy" the content/style/characters of artificially created images in the training data can help end users avoid issues such as image copyright violation.
To study such a problem, the authors collect and propose a new dataset called D-Rep. To build this set, the authors first find text prompts from DiffusionDB that are highly similar to image titles from LAION-Aesthetics V2, using Sentence-BERT scores. Subsequently, they generate 40K images from these prompts using Stable Diffusion 1.5 to simulate possible image copies from LAION. To obtain the degree-of-copy annotations, the authors employ human annotators to score each generated-image-and-LAION-image pair on a six-level scale (0 to 5).
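For concreteness, this pre-filtering could look like the following sketch (the encoder name and threshold here are my illustrative assumptions, not the paper's settings):

```python
# Hypothetical sketch of the described prompt-title pre-filtering step.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

prompts = ["a castle on a hill at sunset"]             # from DiffusionDB
captions = ["medieval castle overlooking the sunset"]  # LAION-Aesthetics image titles

prompt_emb = encoder.encode(prompts, convert_to_tensor=True)
caption_emb = encoder.encode(captions, convert_to_tensor=True)

# Cosine similarity between every prompt and every caption.
scores = util.cos_sim(prompt_emb, caption_emb)  # shape: (num_prompts, num_captions)

# Keep highly similar prompt-title pairs as candidates for image generation.
candidate_pairs = (scores > 0.8).nonzero().tolist()
```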
Furthermore, the authors propose and train a probability-density-based embedding model (PDF-Embedding) to score the degree of image copying for input images. 90% of the collected D-Rep dataset is used as the training set for the proposed method.
In the experiments, the proposed PDF-Embedding detects image copies in the D-Rep test set more effectively and efficiently than the compared methods. However, the quantitative experiments are limited to the proposed dataset. In addition, the comparison is somewhat unfair because many of the compared methods (such as GPT-4V, CLIP, and DINOv2) are not trained or fine-tuned on the training split of D-Rep, while the proposed method is.
Strengths
- [Clear writing]: The writing is clear. The structure of the paper is easy to follow.
- [Novel and important task]: The studied and proposed task, Image Copy Detection for Diffusion Models (ICDiff), is novel, practical, and meaningful. It is important for the community to address the image-copy issues caused by the training data of T2I diffusion models. This will significantly help downstream tasks avoid copyright issues.
- [Good baseline]: The proposed PDF-Embedding is an efficient and solid baseline for the proposed task.
Weaknesses
- [Incompleteness]: This paper is not complete in itself. The authors largely use "refer to Section XXX in the Appendix" throughout the paper. In Sec. 5.1 (Line #185), the entire Training Details section contains only one sentence: "Please refer to Section E in the Appendix". I understand that space in the main manuscript is limited and the authors want to use it for more important content. However, a paper should be complete in itself. After carefully reading the paper, I cannot tell: i) how the model is trained, what the architecture is, and what the configurations are; ii) how the other methods are compared — do the authors train them, and how are they trained or adapted to the current task? These details are important for readers of the main paper to understand the method and the experiments.
- [Limited experiments]: The quantitative experiments are conducted on only one dataset (4K images in the test set) in Tab. 1. In addition, this dataset is generated by only one specific T2I model, Stable Diffusion 1.5. Different generative models might have different image-copy patterns, so the current experiments can hardly validate the effectiveness and advantages of the proposed method over other methods. More comprehensive experiments should be conducted. When switching prompts (data) and generative models, zero-shot methods such as CLIP, M-LLMs (GPT-4V), or DINOv2 may generalize better. Without large-scale experiments, it is hard to draw a conclusion.
- [Unfair comparison]: In Tab. 1, the compared vision-language models (CLIP, GPT-4V) and self-supervised learning models (e.g., DINOv2) are not trained or fine-tuned on the evaluation dataset's training split, whereas the proposed method is. This makes the comparison unfair. To ensure a fair comparison, either the compared methods should be trained or fine-tuned on the same training split, or all methods should be evaluated on novel datasets in a zero-shot manner. With the current comparison, the advantage and effectiveness of the proposed method are not clear.
- [Duplicate content]: Sec. 5.2 and Sec. 5.3 overlap heavily.
Questions
Q1: Did the authors study the quality of text matching using Sentence-BERT? Pure linguistic matching can be quite ambiguous and subsequently lead to incorrect matches. For instance, "containers" can match both the huge shipping containers at a port and the plastic containers (such as lunch boxes) in a kitchen context. In addition, did the authors study different scores, such as the CLIP text-encoder score instead of the Sentence-BERT score? As observed in [1], for text related to visual content, the CLIP encoder seems better than Sentence-BERT.
Q2: How do you define image copy for diffusion models? What kind of "copy" should be detected? This is important for the studied task. For instance, when a user prompts a T2I diffusion model with "A portrait of the Mona Lisa.", a highly similar image that copies the famous Mona Lisa portrait will of course be generated. But is this an image copy? Detecting this kind of copy might not be helpful for downstream tasks. The task should be defined better so that the method can be designed better. From the visualization in Fig. 3, I found that the high-score copies are mainly "object copies". These copies might occur frequently if the prompt asks the diffusion model to generate such an image. By conditioning and limiting the prompt text, this can be solved directly, so why is the detection useful? In my opinion, a worth-detecting image copy would be when prompting "Generate a yellow cartoon-style mouse" yields a Pikachu-like image without Pikachu being explicitly mentioned in the prompt.
Q3: How well does the proposed method generalize, quantitatively, to images generated by other diffusion models, such as SDXL or DALL-E 3?
[1] Chen, Z., Chen, G. H., Diao, S., Wan, X., & Wang, B. (2023). On the Difference of BERT-style and CLIP-style Text Encoders. arXiv preprint arXiv:2306.03678.
Limitations
Yes.
We sincerely appreciate your positive feedback and helpful suggestions. We hope our paper will meet your approval once we address the following concerns.
Q1. The heavy use of "refer to Section XXX in the Appendix".
A1. Due to limited space, we have indeed placed the expected implementation details in the Appendix. Since the camera-ready version will have one more page, we will move these details into the main paper. We also summarize them below:
- How the model is trained, what is the architecture, what are the configurations?
We implement our PDF-Embedding using PyTorch and distribute its training over 8 A100 GPUs. The ViT-B/16 serves as the backbone and is pre-trained on the ImageNet dataset using DeiT. We resize images to a resolution of 224 × 224 pixels before training. A batch size of 512 is used, and the total number of training epochs is 25 with a cosine-decreasing learning rate.
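A minimal sketch of this setup is shown below; the backbone, resolution, batch size, epoch count, and cosine schedule follow the description above, while the optimizer and learning rate are illustrative assumptions:

```python
# Minimal sketch of the stated training configuration.
import timm
import torch

# ViT-B/16 backbone with DeiT pre-training on ImageNet.
model = timm.create_model("deit_base_patch16_224", pretrained=True)

batch_size = 512  # distributed over 8 A100 GPUs in the paper
epochs = 25
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)  # optimizer/LR assumed
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

# for epoch in range(epochs):
#     for images_a, images_b, levels in loader:  # 224x224 image pairs
#         ...train step...
#     scheduler.step()
```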
- How the other methods are compared, do the authors train them, how are they trained or adapted to the current task?
In Table 1, we apply the methods in a zero-shot manner. Specifically, we use these models as feature extractors and calculate the cosine similarity between pairs of image features; the exception is GPT-4V Turbo, which we ask to output a similarity score directly.
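A minimal sketch of this zero-shot protocol, assuming CLIP as the feature extractor (other extractors, e.g., DINOv2, slot in the same way):

```python
# Score an image pair by cosine similarity of frozen pre-trained features.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

def pair_similarity(path_a: str, path_b: str) -> float:
    images = torch.stack([preprocess(Image.open(p)) for p in (path_a, path_b)]).to(device)
    with torch.no_grad():
        feats = model.encode_image(images)
    feats = feats / feats.norm(dim=-1, keepdim=True)  # L2-normalize
    return (feats[0] @ feats[1]).item()               # cosine similarity
```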
For the five different methods compared in Table 2, we ensure a fair comparison with the proposed method by using the same training schedule, network architecture, and configuration as described above.
Q2. Generalizability to other datasets or diffusion models.
A2. We provide the required experimental results in the table of the attached PDF, along with the associated analysis in the “Common Concerns” section. These experimental results validate the generalizability of our model across different diffusion models.
Q3. Table 1: Unfair comparison with other models.
A3. We apologize for causing this confusion. The purpose of Table 1 is to demonstrate that all existing models fail to solve the proposed ICDiff task, highlighting the necessity for a specialized ICDiff model rather than to compare our model against them. For details, please refer to the “Common Concerns” section.
Q4. The overlap between Sec 5.2 and Sec 5.3.
A4. We will delete the first part, which contains an inappropriate/unfair comparison, in Section 5.3.
Q5. The quality of text matching using Sentence-BERT.
A5. We are aware that relying solely on linguistic matching can lead to incorrect matches; however, this does not affect the quality of our D-Rep dataset. This is because text matching is used only as a pre-filtering step, and we engage professional labelers after identifying candidate pairs. Specifically, as shown in Lines 113 to 121, these labelers assess the pairs from a visual perspective; for example, they distinguish between large shipping containers and small plastic kitchen containers, assigning such mismatched pairs a replication level of 0. As Fig. 3 illustrates, only a small fraction of the dataset consists of image pairs at Level 0, indicating that the number of mismatched pairs is not large. In conclusion, this approach ensures that mismatched pairs also provide supervision signals (as negative samples) to train models rather than compromising dataset quality.
Q6. The definition of copy or replication.
A6. The definition of copy/replication in this paper follows [1]:
“We say that a generated image has replicated content if it contains an object (either in the foreground or background) that appears identically in a training image, neglecting minor variations in appearance that could result from data augmentation.”
According to [1], this definition “focuses on object-level similarity because it is likely to be the subject of intellectual property disputes”. After clarifying this, we provide a point-to-point reply to your questions.
- The generated Mona Lisa is not a copy, and thus this kind of detection is not helpful for downstream tasks.
We respectfully disagree. First, it is important to note that this paper aims to detect replication itself, providing a basis for subsequent copyright checks rather than directly detecting copyright infringement. Furthermore, such cases can also raise genuine infringement problems; therefore, detecting them assists in copyright protection. Specifically, although the Mona Lisa is in the public domain and not subject to copyright, diffusion models are equally capable of generating other famous works, such as "The Persistence of Memory", which remains under copyright by the Dalí Foundation. This is illustrated in the figure of the attached PDF. Utilizing these generated images for profit would constitute copyright infringement against the Foundation. Therefore, it is reasonable to regard such cases as copies.
- Limiting the prompt solves the copying problem directly.
We respectfully disagree, because: (1) Many open-source diffusion models, like Stable Diffusion, do not have such limits, so users can easily generate copies of copyrighted images. (2) Although commercial models restrict the prompt text, as shown by [2], this approach does not completely prevent object-level copying. For instance, Fig. 1 in [2] shows that the copyrighted Superman logo can be generated by ChatGPT without mentioning Superman directly.
- Detecting the case without explicitly being mentioned in a prompt.
This case will also be addressed by our method because we match visual features rather than prompts. For instance, whether the prompt is 'Generate a yellow, cartoon-style mouse' or 'Generate a Pikachu,' if the generated image resembles Pikachu, our algorithm can match it with a copyrighted Pikachu image.
[1] Diffusion art or digital forgery? investigating data replication in diffusion models. CVPR, 2023.
[2] On Copyright Risks of Text-to-Image Diffusion Models. arXiv, 2023.
Thanks to the authors for providing such a detailed rebuttal. Most of my concerns are well addressed.
However, I strongly suggest that the authors thoroughly polish the writing and structure of the manuscript in the revision. Page limits exist for most publications; yet a paper should be complete in itself and allow readers to understand the necessary preliminaries, methods, and experiments from the main paper alone. I find that the current manuscript has not reached this standard yet.
Nevertheless, this is good work with valid motivation, and I would like to accept this paper for the task it studies. Therefore, I raised my rating after the rebuttal.
We sincerely thank you for your positive rating. We will definitely move the required information to the camera-ready version (which has one more page) and polish our manuscript accordingly.
This paper constructs a Diffusion-Replication dataset aimed at solving the image copy detection problem for diffusion models. The paper proposes a strong baseline named PDF-Embedding, which transforms the replication level into a probability density function used as the supervision signal. Extensive experimental results and analysis demonstrate the effectiveness of the proposed method.
Strengths
- The topic is timely and interesting. The presentation is clear and easy to follow.
- This paper creates a valuable dataset, which is important to identify the replication caused by diffusion models.
- This paper gives a reasonable theoretical explanation for the proposed method. The analysis of the proposed PDFs is convincing on three primary functions: Gaussian, linear, and exponential.
- The experiments are extensive and reasonable results are achieved.
Weaknesses
- As shown in Figure 5 and Figure 15, the authors use different values of A for each function, but there is no explanation of either the selection rule or the influence of A.
- As shown in Table 2, the training time, inference time, and matching time are not better than those of other methods.
- This paper creates a valuable dataset, but details of the dataset, such as its label distribution, are missing; these are important in practical applications.
Questions
- The authors use different values of A for Eqs. 3-5, but there is no explanation of either the selection rule or the influence of A.
- Details of the dataset, such as its label distribution, are missing.
Limitations
Yes, the authors addressed the limitations and potential negative societal impact of their work.
We sincerely thank you for your positive feedback and helpful suggestions. We address your questions below.
Q1. The selection rule and influence of $A$.
A1. We thank you for this insightful question. We provide the selection rule and the influence of $A$ here.
Selection rule.
We select $A$ such that the supervision signal remains a valid PDF. Because the random variables are discrete in practice, $A$ cannot be selected arbitrarily, and its value must lie within a certain range. The example below demonstrates how to calculate the range of $A$:
For the Gaussian function, according to Eq. 16:
$$
\sum_{x \in \{0,\,0.2,\,0.4,\,0.6,\,0.8,\,1\}} A \cdot \exp\!\left(-\frac{\left(x-p^{l}\right)^{2}}{2\sigma^{2}}\right) = 1,
\qquad
A \cdot \exp\!\left(-\frac{\left(x-p^{l}\right)^{2}}{2\sigma^{2}}\right) \geq 0,
$$
we have:
$$
6 \cdot A \;\geq\; \sum_{x \in \{0,\,0.2,\,0.4,\,0.6,\,0.8,\,1\}} A \cdot \exp\!\left(-\frac{\left(x-p^{l}\right)^{2}}{2\sigma^{2}}\right) = 1.
$$
Therefore, we have $A \geq 1/6$. As shown in Fig. 15, we experimentally vary $A$ within this range. Similarly, we can calculate the range of $A$ for the Linear and Exponential functions.
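For concreteness, the following sketch checks this constraint numerically; the label value $p^l = 0.6$ and the root-finding bracket are illustrative choices:

```python
# Numerical check: a sigma that normalizes the discrete Gaussian PDF
# exists only when A >= 1/6.
import numpy as np
from scipy.optimize import brentq

levels = np.array([0, 0.2, 0.4, 0.6, 0.8, 1.0])  # six levels mapped to [0, 1]

def prob_sum(sigma, A, p_l=0.6):
    return (A * np.exp(-((levels - p_l) ** 2) / (2 * sigma ** 2))).sum()

def solve_sigma(A, p_l=0.6):
    # prob_sum grows from ~A (sigma -> 0) to 6A (sigma -> inf), so it can
    # reach 1 only if A <= 1 <= 6A, i.e., 1/6 <= A <= 1.
    return brentq(lambda s: prob_sum(s, A, p_l) - 1.0, 1e-4, 1e4)

print(solve_sigma(A=0.3))   # succeeds: 0.3 >= 1/6
# solve_sigma(A=0.1)        # raises ValueError: no valid sigma exists
```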
Influence.
As shown in Fig. 15, for the Gaussian and Linear functions, a larger $A$ implies steeper supervision, while for the Exponential function, a larger $A$ implies smoother supervision. According to our intuition that "the probability of neighboring replication levels should be continuous and smooth," the parameter $A$ controls the expected smoothness of the learned distribution.
We thank you again for this insightful question, which helps make our method more theoretically sound. We will incorporate these points into our revised paper.
Q2. The efficiency of training, inference, and matching is not better than that of other methods.
A2. Thanks for this question. Since we use a set of vectors to calculate similarity instead of a single vector, it is reasonable to expect some computational overhead. However, as we argue in our paper, the overhead introduced by our method is negligible. Specifically: (1) during training, our method is only marginally slower than the baseline; (2) during inference, our method is likewise only marginally slower; (3) the time spent on matching is orders of magnitude smaller than that spent on inference. Furthermore, as shown in Lines 262 to 266, we find that in practice:
Our PDF-Embedding requires very little time for inference and only a small additional time for matching when comparing a generated image against a reference dataset of 12 million images using a standard A100 GPU. This time overhead is negligible compared to the time required for generation (several seconds).
In conclusion, given the significantly enhanced performance, the introduction of such a minimal computational overhead is worthwhile in practice.
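For intuition about the cost structure, a minimal sketch of the matching step is given below; the reference-set size is reduced for illustration (12 million in the paper), and the embedding dimension is an assumption:

```python
# Matching a query against N references is one batched dot product per level.
import torch

N, L, D = 10_000, 6, 512           # references, levels, embedding dim (D assumed)
query = torch.randn(L, D)          # six vectors for one generated image
reference = torch.randn(N, L, D)   # precomputed vectors for the reference set

scores = torch.einsum("ld,nld->nl", query, reference)  # (N, 6) per-level scores
pred_levels = scores.argmax(dim=1)                     # predicted level per pair
```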
Q3. The details of the proposed dataset, such as the label distribution.
A3. Thanks. We have provided the label distribution of the proposed dataset on the left side of Fig. 3 in the main paper. To make this clearer, we will highlight it in the main text. Additionally, we provide more labeling details in the response to Q2 of Reviewer 2Ky2, which enriches the details of the proposed dataset.
I have checked the feedback from my fellow reviewers as well as the corresponding rebuttals. The concerns appear to have been resolved satisfactorily. My own concerns have also been addressed effectively. Therefore, I raise my score to 7 and confidence to 5. Thanks for authors’ efforts.
We are glad that our efforts have satisfactorily addressed the concerns raised and thank you very much for your thorough review.
This paper introduces a novel task named ICDiff for detecting whether images generated by diffusion models replicate the training set. The authors construct a new dataset called D-Rep and propose a new embedding method called PDF-Embedding. This approach transforms the replication level of an image pair into a probability density function and uses it as the supervisory signal to train the model. The experimental results demonstrate that the proposed method outperforms many existing image copy detection methods.
Strengths
- Innovation: ICDiff is the first image copy detection method specifically aimed at replicas generated by diffusion models, filling a gap in current research.
- Dataset Construction: The creation of the D-Rep dataset provides a valuable resource for the study of image copy detection, with its replication level annotations offering clear guidance for model training and evaluation.
- Methodology: The PDF-Embedding method, which utilizes a probability density function as the supervisory signal, is both innovative and effective.
Weaknesses
See questions.
Questions
- Intuitively, we would use a single set of vectors to characterize the similarity between images, with larger dot-product values indicating higher similarity and smaller ones indicating the opposite. This paper, however, uses six sets of vectors to measure six levels of similarity, which seems counterintuitive. Specifically, when two images are almost identical, the set of vectors indicating level-zero similarity needs to be quite different between them; when two images are completely unrelated, that same set of vectors needs to be as consistent as possible. I am curious about the rationale behind this design.
- Although ICDiff performs well on the D-Rep dataset, its generalizability to other datasets or different types of diffusion models has not been verified. In particular, I am concerned whether other methods compared with ICDiff (such as SSCD) have been trained on the D-Rep training set? If so, how was the training conducted; if not, is this comparison on the D-Rep test set somewhat unfair?
- Regarding the design of the Relative Deviation (RD). The authors point out in Appendix B that certain mispredictions should incur a greater penalty than others; my question is, when $s^{l}=5$ and $s^{p}=0$, should the penalty be greater than when $s^{l}=3$ and $s^{p}=0$ (although both cannot be more wrong, the former seems more egregiously wrong)?
- How do the authors utilize the trained ICDiff model to assess the replication ratio of diffusion models? The paper seems to state only the conclusions (such as 10.91% or 20.21%) without explaining how these figures are derived. Specifically, ICDiff provides six replication levels for each image pair. When assessing the replication ratio of diffusion models, what criterion or which level do the authors use as the threshold?
Limitations
No limitations.
We sincerely thank you for your positive feedback and helpful suggestions. We address your questions below.
Q1. The rationale behind using six vectors.
A1. We appreciate this insightful question. As you say, when each set of vectors in our PDF-Embedding is considered separately, the design seems counterintuitive: typically, larger dot products indicate higher similarity, whereas in PDF-Embedding, when two images are more similar, the dot product of the set of vectors that indicates level-zero similarity needs to be smaller.
However, as the name of our method, Probability-Density-Function, suggests, under ideal conditions, our method is continuous and should not be focused on local regions. From the perspective of continuity, the rationale or intuition is that the probabilities of neighboring replication levels should be continuous and smooth. For instance, if an image-replica pair is annotated as level-3 replication, the probabilities for level-2 and level-4 replications should not be significantly low either. After training, the largest-scored entry indicates the predicted replication level. This intuition is experimentally verified by comparing our method against two standard supervision signals, namely “One-hot Label” and “Label Smoothing,” as shown in Table 2.
Furthermore, even if we only focus locally, learning such features by optimizing the neural network will not cause contradictions. The neural network is capable of capturing the underlying distribution and continuity, ensuring that the learned embeddings reflect the smooth transition between different levels of similarity.
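To make this intuition concrete, a minimal sketch is given below; the softmax temperature, the KL-divergence objective, and all numbers are illustrative assumptions rather than our exact training objective:

```python
# Six per-level scores supervised by a smooth PDF label instead of a one-hot
# label; the highest-scoring entry gives the predicted level.
import torch
import torch.nn.functional as F

level_scores = torch.tensor([0.86, 0.88, 0.90, 0.93, 0.97, 0.998])  # per-level dot products

log_pred = F.log_softmax(level_scores / 0.05, dim=0)            # temperature assumed
pdf_label = torch.tensor([0.02, 0.03, 0.05, 0.10, 0.25, 0.55])  # smooth label for level 5

loss = F.kl_div(log_pred, pdf_label, reduction="sum")
predicted_level = level_scores.argmax().item()  # 5
```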
Q2. Generalizability to other datasets or diffusion models.
A2. We provide the required experimental results in the table of the attached PDF, along with the associated analysis in the “Common Concerns” section. These experimental results validate the generalizability of our model across different diffusion models.
Q3. Table 1: Somewhat unfair comparison with other models.
A3. We apologize for causing this confusion. The purpose of Table 1 is to demonstrate that all existing models fail to solve the proposed ICDiff task, highlighting the necessity for a specialized ICDiff model rather than to compare our model against them. For details, please refer to the “Common Concerns” section.
Q4. The design of Relative Deviation (RD): it does not reflect more egregious errors in some cases.
A4. We appreciate your deep understanding of our Relative Deviation metric. Indeed, the Absolute Deviation (Eq. 12) indicates that when $s^{l}=5$ and $s^{p}=0$, the penalty is greater than when $s^{l}=3$ and $s^{p}=0$. Although we acknowledge that it is difficult to design an evaluation metric suitable for all cases, both Relative Deviation and Absolute Deviation fulfill the primary purpose of proposing the second evaluation metric. Specifically, as mentioned in Lines 132 to 134:
A limitation of the PCC is its insensitivity to global shifts. If all the predictions differ from their corresponding ground truth with the same shift, the PCC does not reflect such a shift and remains large. To overcome this limitation, we propose a new metric called the Relative Deviation (RD).
Both Relative Deviation and Absolute Deviation can reflect global shifts: for the same label, a larger distance from the label results in a higher deviation/penalty score.
Furthermore, if we required that the penalty when $s^{l}=5$ and $s^{p}=0$ be greater than the penalty when $s^{l}=3$ and $s^{p}=0$, we could not normalize the penalty to the range $[0, 1]$ for each pair, which may lead to overly optimistic results. This is because, assuming the maximum penalty is 1, we have
$$
1 = \mathrm{penalty}\left( s^{l}=5,\, s^{p}=0 \right) > \mathrm{penalty}\left( s^{l}=3,\, s^{p}=0 \right).
$$
In conclusion, while we acknowledge that our Relative Deviation may not reflect more egregious errors in some cases, it empirically aligns better with human intuition. We will also continue to seek better evaluation metrics.
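For clarity, here is a sketch of the two metrics as reconstructed from this discussion (our exact Eq. 12 and the RD definition may differ in form): AD normalizes by the global maximum error (5), while RD normalizes by the maximum error possible for the given label.

```python
# Reconstructed deviation metrics; formulas are assumptions consistent
# with the discussion above, not necessarily the paper's exact equations.
def absolute_deviation(s_label: int, s_pred: int) -> float:
    return abs(s_label - s_pred) / 5

def relative_deviation(s_label: int, s_pred: int) -> float:
    return abs(s_label - s_pred) / max(s_label, 5 - s_label)

print(absolute_deviation(3, 0))  # 0.6, despite being the worst prediction for label 3
print(relative_deviation(3, 0))  # 1.0, normalized to the worst case for label 3
print(relative_deviation(5, 0))  # 1.0, the worst case for label 5
```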
Q5. How to assess the replication ratio of diffusion models.
A5. Thank you for your reminder, and we apologize for missing this important detail. When assessing the replication ratio of diffusion models, we consider image pairs rated at Level 4 and Level 5 to be replications. We will add this to the revision.
Thank you for your detailed response. You summarized my questions as Q1-Q5, which I find to be appropriate. Regarding Q2, Q3, and Q5, I have no additional queries.
For Q1, I would like to delve deeper into a hypothetical scenario. Imagine figures A and B are identical (or almost identical), and we have a vector derived from A and another from B following identical procedures. In such a case, these vectors should have a minimal dot product. How does the model address this requirement?
For Q4, I would appreciate further clarification on the statement, "we cannot normalize the penalty to the range $[0, 1]$ for each pair, which may lead to overly optimistic results." Why would we have overly optimistic results?
Thank you again and look forward to your further response.
Thank you for your prompt response and for participating in the rebuttal discussion. We are happy to have addressed 3 out of 5 questions and would like to further discuss your follow-up concerns.
Q1 Further. Imagine figures A and B are identical (or almost identical), and we have a vector derived from A and another from B following identical procedures. In such a case, these vectors should have a minimal dot product. How does the model address this requirement?
A1 Further. Thank you for this insightful question. If images A and B are almost identical, their vectors at Level 0 are indeed very similar, resulting in a high inner product. However, the inner product at Level 5 is even larger. Therefore, because our method relies on the relative scale relationship — using the highest-scoring entry to indicate the predicted replication level — it remains effective. For instance, the second subfigure in the first row of Fig. 16 shows two almost identical images, where the original dot products for Levels 0 and 5 are 0.864 and 0.998, respectively. Our method successfully predicts this pair as Level 5.
Q4 Further. Why would we have overly optimistic results?
A4 Further. Sorry for the confusion, and we clarify this as follows:
Our statement was meant to highlight that the Relative Deviation is more intuitive because it normalizes the deviation: assigning a deviation score of 1 to the worst case and 0 to the best case. In contrast, the Absolute Deviation does not provide this intuitive scale. For example, the Absolute Deviation for the case where $s^{l}=3$ and $s^{p}=0$ is 0.6. Although this is the worst possible prediction for that label, one may be confused by the 0.6 deviation score and mistakenly conclude that the prediction has a certain degree of correctness.
I appreciate the thorough responses to my questions, the revisions made to the paper, and the additional experiments detailed in the "Common Concerns" section. As a result, I have adjusted my evaluation positively.
From my current understanding, to better capture the properties of an image, multiple vectors are required instead of one. Within this set, some vectors carry a significant amount of the image's information, while others carry less. When assessing the similarity between two images, as their differences increase, the dot product of all the corresponding vector pairs will decrease, but those of vectors that carry more substantial information about the images will decrease faster. Thus, the relative scale relationship could change. Is my understanding accurate?
We sincerely thank you for your positive evaluation. Your interpretation on the "6-vector rationale" is insightful and has given us great inspiration: it triggers a deeper thinking on the underlying mechanism and raises a very plausible explanation. We will also continue our study, and if we find anything new, we will let you know. Thanks again.
This paper proposes a new image copy detection model for diffusion models. A dataset with replication labels (0 to 5) is collected using human labelers and is used to train a model for replication grading. The proposed model estimates a PDF over replication labels and minimizes the error with respect to a PDF representation of the class labels. The problem is timely, and the proposed solution is novel.
Strengths
A new model and framework for image copy detection (ICD) is proposed in the paper. The paper is well-written, and the results indicate a significant step forward compared to the mentioned literature.
Weaknesses
- The use of continuous labels in [0, 1] would be more useful in practice.
- Some details of the labeling procedure are missing: how many labels per image, the variance for each labeler, and the variance across labelers.
- The same dataset, D-Rep, is used for evaluation.
Questions
- I don't see why the authors did not choose continuous labels instead of the 6-category approach. Instead of using argmax in Eq. (10), a weighted average score normalized between 0 and 1 would be more appropriate for practical applications.
- Some details of the labeling process are missing: how many images did each labeler label? What is the variance between the labelers for a specific image?
- Evaluation could be more robust using a separate dataset with different prompt/image-generation procedures.
Limitations
Yes.
We sincerely thank you for your positive feedback and helpful suggestions. We address your questions below.
Q1. The use of continuous labels in [0, 1] would be more useful in practice.
A1. Thanks. Following your suggestion, we implemented this idea and found that it improves the original performance while allowing for continuous predictions: the PCCs of all three variants of our PDF-Embedding improve. We will include this more useful implementation when we release the code.
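A minimal sketch of this continuous variant (the scores and softmax temperature are illustrative):

```python
# Replace the argmax of Eq. 10 with a probability-weighted average of the
# levels, normalized to [0, 1].
import torch

level_scores = torch.tensor([0.86, 0.88, 0.90, 0.93, 0.97, 0.998])
probs = torch.softmax(level_scores / 0.05, dim=0)

discrete_pred = level_scores.argmax().item()                      # original: a level in 0..5
continuous_pred = (probs * torch.arange(6.0)).sum().item() / 5.0  # continuous score in [0, 1]
```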
Q2. Details of the labeling procedure are missing.
A2. Thanks. Currently, we have 10 professional labelers and 40,000 image pairs. Initially, we assign 4,000 image pairs to each labeler. If a labeler is confident in their judgment of an image pair, they directly assign a label; otherwise, they place the pair in an undecided pool. On average, each labeler has about 600 undecided pairs. Finally, for each undecided pair, we vote to reach a final decision. For example, if the votes for an undecided pair are 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, the final label assigned is 3. As a result, we do not calculate the variance for each image, and we believe this process ensures high-quality labeling.
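A minimal sketch of the vote-resolution step:

```python
# The most common label among the votes wins.
from collections import Counter

votes = [2, 2, 2, 3, 3, 3, 3, 3, 4, 4]
final_label = Counter(votes).most_common(1)[0][0]  # -> 3
```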
Q3. Evaluation with a separate dataset.
A3. We appreciate this constructive suggestion and provide the required experimental results in the table of the attached PDF, along with the associated analysis in the “Common Concerns” section. These experimental results validate the generalizability of our model across different diffusion models.
Thanks for your reply; I am glad to see that my suggestion improved the original performance. It would be helpful to include the details you mentioned in A2 in the paper.
We sincerely thank you again for your suggestion and will include the labeling procedure from A2 in the camera-ready version.
Thanks and Responses to Common Concerns
We sincerely thank the ACs and reviewers for their dedicated efforts in reviewing our paper. We also thank all reviewers for their positive, thoughtful, and helpful feedback, and we will add all these suggestions to the final version of our paper.
We are encouraged that they found our work to be “novel” (Reviewer 2Ky2, Reviewer dJ32, and Reviewer vdEZ), “important” (Reviewer 2jDa and Reviewer vdEZ), “timely” (Reviewer 2Ky2 and Reviewer 2jDa), “valuable” (Reviewer dJ32 and Reviewer 2jDa), and “interesting” (Reviewer 2jDa). We are also glad that Reviewer 2jDa recognized our work as “This paper gives a reasonable theoretical explanation for the proposed method” and “The experiments are intensive and reasonable results are achieved.” Finally, we sincerely appreciate that Reviewer 2Ky2, Reviewer 2jDa, and Reviewer vdEZ think our paper is “well-written”/“clear”/“easy to follow”.
We have thoroughly addressed all the concerns raised by the reviewers in the separate responses below. Here, we address two common concerns: generalizability to other datasets or diffusion models (Reviewer 2Ky2, Reviewer dJ32, and Reviewer vdEZ) and unfair comparison (Reviewer dJ32 and Reviewer vdEZ).
- Generalizability to other datasets or diffusion models
Thanks for this constructive suggestion. Following the suggestions of Reviewer 2Ky2, Reviewer dJ32, and Reviewer vdEZ, we provide quantitative experimental results on 6 unseen diffusion models (in addition to the qualitative results in Fig. 6 of the manuscript) to further validate generalizability. The experimental results in the table of the attached PDF show that our model generalizes well compared to all other methods:
- Our PDF-Embedding is more generalizable than all zero-shot solutions, such as CLIP, GPT-4V, and DINOv2.
- Our PDF-Embedding still surpasses all other plausible methods trained on the D-Rep dataset in the generalization setting.
- Compared with testing on SD1.5 (the same domain), the proposed PDF-Embedding shows no significant performance drop in the generalization setting.
The quantitative evaluation protocol in the attached table: because the collection process for images from some diffusion models (see Appendix H) differs from the process used to build the test set of our D-Rep dataset, it is difficult to label all 6 levels in a short time, and the proposed PCC and RD are therefore not suitable. In the attached table, we adopt a quantitative evaluation protocol that measures the average similarity predicted by a model for given image pairs, which are manually labeled with the highest level. When normalized to a range of 0 to 1, a larger value implies better predictions. This setting is practical because, in the real world, most people's concerns focus on cases where replication indeed occurs. Due to time constraints, our team and labelers manually confirmed 100 such pairs for each diffusion model. Note that, to ensure fairness, the pre-filtering of these pairs was based on an internal model unrelated to any of the compared models.
- Unfair comparison
Thanks for this kind reminder. We did not intend to use Table 1 for direct comparison. The purpose of Table 1 was to investigate the characteristics of our new dataset, D-Rep, by benchmarking it against popular methods. The results show that D-Rep is very challenging: popular vision-language models (1st row), self-supervised learning models (2nd row), supervised pre-trained models (3rd row), and current ICD models (4th row) all fail to solve the new task. The reason is clear: these models are not specifically trained for it. Given this context, our method offers a reasonable solution. Furthermore, Table 2 validates the effectiveness of our method by fairly comparing it against five different methods trained on our D-Rep training set. Additionally, the table in the attached PDF provides a relatively fair comparison with these popular pre-trained methods by testing them and our method directly on image pairs generated by unseen diffusion models, further validating the superiority of our method. We will revise the caption and description in the manuscript to clarify that Table 1 is not intended for direct comparison.
Please let us know if you have any additional questions or concerns. We are happy to provide clarification.
Authors of Submission #367
The paper proposes a novel dataset: D-Rep, to study the problem of copy detection in images produced by diffusion models. The paper categorizes the copying levels and defines a probability density function on these levels for training a transformer-based copy detection setup.
The paper received all positive reviews, with reviewers acknowledging the timeliness and importance of the dataset and the task. There were questions on: i) the validity and usefulness of the proposed copy levels and their representations; ii) the training and implementation details; and iii) the organization of the paper. The authors provided a strong rebuttal addressing the concerns. The AC agrees with the reviewers on the importance of the dataset and recommends acceptance.
The authors should incorporate all the suggestions provided by the reviewers in the revised paper and should ensure the final paper is as complete as possible, as pointed out by Reviewer vdEZ.