PaperHub
Rating: 6.1/10
Poster · 4 reviewers
Scores: 3, 3, 3, 4 (min 3, max 4, std 0.4)
ICML 2025

Origin Identification for Text-Guided Image-to-Image Diffusion Models

OpenReview · PDF
Submitted: 2025-01-23 · Updated: 2025-07-24
TL;DR

We introduce the OriPID dataset and a generalizable method with theoretical guarantees to identify original images from their text-guided diffusion model translations.

Abstract

Text-guided image-to-image diffusion models excel in translating images based on textual prompts, allowing for precise and creative visual modifications. However, such a powerful technique can be misused for *spreading misinformation*, *infringing on copyrights*, and *evading content tracing*. This motivates us to introduce the task of origin **ID**entification for text-guided **I**mage-to-image **D**iffusion models (**ID$\mathbf{^2}$**), aiming to retrieve the original image of a given translated query. A straightforward solution to ID$^2$ involves training a specialized deep embedding model to extract and compare features from both query and reference images. However, due to *visual discrepancy* across generations produced by different diffusion models, this similarity-based approach fails when training on images from one model and testing on those from another, limiting its effectiveness in real-world applications. To solve this challenge of the proposed ID$^2$ task, we contribute the first dataset and a theoretically guaranteed method, both emphasizing generalizability. The curated dataset, **OriPID**, contains abundant **Ori**gins and guided **P**rompts, which can be used to train and test potential **ID**entification models across various diffusion models. In the method section, we first prove the *existence* of a linear transformation that minimizes the distance between the pre-trained Variational Autoencoder embeddings of generated samples and their origins. Subsequently, it is demonstrated that such a simple linear transformation can be *generalized* across different diffusion models. Experimental results show that the proposed method achieves satisfying generalization performance, significantly surpassing similarity-based methods (+31.6% mAP), even those with generalization designs. The project is available at https://id2icml.github.io.
Keywords
Diffusion Models, Origin Identification

Reviews and Discussion

Review
Rating: 3

This paper introduces a new task, identifying the original image for a generated image from text-guided image-to-image translation with diffusion models, which helps prevent the misuse of the generated content such as misinformation and copyright infringement. To deal with this problem, the authors build a dataset OriPID and propose a novel method with theoretical derivations. Although some questions remain regarding the claimed "theoretical guarantee" and generalization to other translation methods, the experimental analysis appears thorough and comprehensive. The paper is well-written and well-structured.

update after rebuttal

The authors have addressed my concerns and I recommend accepting this paper.

Questions for Authors

Please refer to the weaknesses mentioned above.

Claims and Evidence

Most of the claims are supported by clear and convincing evidence:

  1. The proposed dataset OriPID supports the experimental evaluations of the origin identification problem, in both seen and unseen settings regarding visual discrepancy between different diffusion models as shown in Figure 2.
  2. Existing methods have difficulties in handling the proposed task. They either completely fail (pre-trained deep embedding models in Table 2), or cannot generalize to unseen scenarios (fine-tuned similarity-based methods and specialized domain generalization methods in Table 4).
  3. The proposed method is supported with theoretical derivation and implementation details. It achieves good performance in origin identification in both seen and unseen settings as in Table 4, with advantages in generalization, efficiency and robustness. The ablation studies further demonstrate the efficacy of the proposed method.

Methods and Evaluation Criteria

The proposed method, which linearly transforms the VAE embeddings with a transformation learned via a metric-learning loss, and the evaluation criteria (mAP and Acc) both make sense.
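For illustration only, the sketch below shows one way such a pipeline could look, assuming diffusers' `AutoencoderKL` as the frozen pre-trained VAE and a CosFace-style margin loss; the checkpoint id, dimensions, and names (`W`, `prototypes`, `cosface_loss`, `retrieve`) are assumptions for this sketch, not the authors' released implementation.

```python
# Illustrative sketch: frozen VAE embeddings -> learned linear transformation ->
# CosFace-style metric learning. Checkpoint id, dimensions, and all names below
# are assumptions for this sketch, not the paper's released code.
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()

@torch.no_grad()
def vae_embed(images):                        # images: (B, 3, 512, 512) scaled to [-1, 1]
    z = vae.encode(images).latent_dist.mean   # (B, 4, 64, 64) latent means
    return z.flatten(1)                       # one flat vector per image

latent_dim, embed_dim, num_origins = 4 * 64 * 64, 512, 100_000
W = torch.nn.Linear(latent_dim, embed_dim, bias=False)                # the linear transformation
prototypes = torch.nn.Parameter(torch.randn(num_origins, embed_dim))  # one class weight per origin

def cosface_loss(feats, labels, s=64.0, m=0.35):
    # CosFace-style margin loss: subtract a margin from the target-class cosine, then scale.
    cos = F.linear(F.normalize(feats), F.normalize(prototypes))
    cos = cos - m * F.one_hot(labels, num_origins).float()
    return F.cross_entropy(s * cos, labels)

# Training (sketch): each generated image is labelled with the index of its origin.
# loss = cosface_loss(W(vae_embed(generated_batch)), origin_indices)

# Retrieval (sketch): rank precomputed reference features by cosine similarity.
def retrieve(query_images, reference_feats):  # reference_feats = W(vae_embed(reference_images))
    q = F.normalize(W(vae_embed(query_images)))
    return (q @ F.normalize(reference_feats).t()).argsort(dim=1, descending=True)
```

In this reading, the linear layer is the only trained retrieval component; the VAE stays frozen.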

Theoretical Claims

In the proof of Lemma 1, the derivation of $z_0'$ in Eq. (13) appears problematic. The estimation of noise by the trained network is a step-by-step process, meaning $z_0'$ cannot be directly obtained by reversing Eq. (12) in a single step. This makes the reasoning in Lines 235-239 for Theorem 1 questionable.

Experimental Design and Analyses

I checked all the experimental designs and analyses and found no further issues.

Supplementary Material

I reviewed all parts of the supplementary material.

Relation to Existing Literature

This paper proposes a novel task with an effective solution to address the potential misuse of text-guided image-to-image translation results by diffusion models. The benchmark built in this work and the proposed method should further facilitate the exploration of generative content detection and tracing for security concerns.

Essential References Not Discussed

One series of related works that is missing concerns text-guided image-to-image translation using diffusion models. The scope of translation mechanisms that the proposed method can cover remains unclear. There are some aspects that I think could impact the effectiveness of the proposed method:

  1. How do you encode the original image during translation? In Line 750, does "the default mode in the AutoPipelineForImage2Image of diffusers" mean directly adding noise to the original image latents as in SDEdit [1]? What if you use DDIM inversion [2] as in prompt-to-prompt [3]? How does the denoising strength affect the results? If the denoising strength is high, the translation results could largely rely on the text prompts. Does the method still perform well with denoising strength 1.0? Does it still make sense to retrieve the original image if the generated image bears little resemblance to it (Figure 10 (c) indoor)? How does the CFG affect the results?
  2. InstructPix2Pix and IP-Adapter encode the original image through VAE latents concatenation and CLIP embeddings, which is discussed in Appendix G. Figure 13 (b) is a bit confusing.
  3. Does attention control affect the results? Such as in prompt-to-prompt [3] and plug-and-play [4]?

[1] Meng, Chenlin, et al. "SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations." International Conference on Learning Representations, 2022.

[2] Song, Jiaming, Chenlin Meng, and Stefano Ermon. "Denoising Diffusion Implicit Models." International Conference on Learning Representations, 2021.

[3] Hertz, Amir, et al. "Prompt-to-Prompt Image Editing with Cross-Attention Control." International Conference on Learning Representations, 2023.

[4] Tumanyan, Narek, et al. "Plug-and-play diffusion features for text-driven image-to-image translation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.

Other Strengths and Weaknesses

Strengths:

  1. This paper proposes an important and interesting task.
  2. Most of the claims are supported with evidence as mentioned above.
  3. The paper is well-written and easy to follow.

Weaknesses:

  1. There are questions in the theoretical derivation.
  2. Some related works are missing sufficient discussion/comparisons, which could limit the scope of this work.

Other Comments or Suggestions

In the caption of Figure 10, should it be 3 different subjects instead of 6?

Author Response

We sincerely appreciate your efforts in reviewing our paper. We are encouraged that you find: (1) most of the claims are supported by clear and convincing evidence; (2) this paper proposes an important, novel, and interesting task; (3) the analysis appears thorough and comprehensive; (4) the paper is well-written, well-structured, and easy to follow. We address your questions below and will add these into the final version.

Q1. In the proof of Lemma 1, the derivation of z_0' in Eq. 13 appears problematic.

A1. We apologize for the confusion. We acknowledge that, in practice, this process is typically iterative, refining $z_t$ over many steps. However, Eq. 13 does not bypass the chain; rather, it is a mathematically equivalent alternative, as shown in [1] (see Eq. 15 and Progressive generation) and [2] (see Eq. 12 in https://shorturl.at/xhnd7). Furthermore, [1] also explicitly mentions that "there is also the possibility of predicting $x_0$, but this gave worse empirical quality than iterative noise-prediction."

In other words, Eq. 13 is a convenient analytical step; practically, the preference for multi-step denoising is due to empirical gains from refining the prediction gradually. Please also see a discussion here: https://shorturl.at/RBZER. Nevertheless, we are happy to further discuss and revise Eq. 12 to Eq. 13.
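For concreteness, the one-step identity being discussed, written in standard DDPM notation (which may differ slightly from the paper's Eq. (12)-(13)), follows directly from the forward-process relation:

```latex
% Forward process and its one-step inversion using the network's noise estimate
% (standard DDPM notation; symbols may differ slightly from the paper's Eq. (12)-(13)):
z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon
\quad\Longrightarrow\quad
\hat{z}_0 = \frac{z_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(z_t, t)}{\sqrt{\bar{\alpha}_t}}
```

This is the analytic prediction of the clean latent that [1] refers to as "predicting $x_0$"; the iterative sampler refines it step by step but does not change the identity itself.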

[1] Denoising Diffusion Probabilistic Models, NeurIPS 2020

[2] Improved Denoising Diffusion Probabilistic Models, ICML 2021

Q2. How do you encode the origin? Do you directly add noise to the origin latents as in SDEdit? Do DDIM inversion and attention control affect results, such as in prompt-to-prompt and plug-and-play?

A2. Thank you for the insightful question! The origins are first encoded by VAE, and noise is directly added to the resulting image latents, as in SDEdit. Nevertheless, DDIM inversion and attention control do not challenge the generalizability of our method. Experiments are shown below.

Experimental Setup. Since Prompt-to-Prompt cannot edit real images, we adopt its improved version, EDICT [1], a re-formulation of the DDIM process that allows mathematically exact inversion. Specifically, we:

  • select 1,000 origins and ask GPT-4o to generate inversion prompts;
  • input the inversion prompts, origins, and guidance prompts into EDICT and Plug-and-Play to generate translations. Here, we follow their original parameters;
  • search these queries in 1,000,000 images, consisting of origins and distractor images.

Experimental Results. The table below shows that our model, trained on SD 2 and the SDEdit scheme, successfully generalizes to these new schemes. Again, these experiments validate our Hypothesis 1, i.e., the generalization boundary of our method is determined by whether the generated image is conditioned on a VAE-encoded original image.

| Methods | mAP | Acc |
| --- | --- | --- |
| EDICT (SD 1.4) | 89.0 | 86.6 |
| Plug-and-Play (SD 1.5) | 99.8 | 99.7 |

[1] Edict: Exact diffusion inversion via coupled transformations, CVPR 2023

Q3. The influence of the denoising strength. Does it still make sense to retrieve the origin if the generated image bears little resemblance to it (Fig. 10 (c) indoor)?

A3. Thank you for the insightful question! The influence of denoising strength is discussed in Q1 by Reviewer PzDk. We acknowledge that it is challenging to clearly define resemblance. Nevertheless, the origin and generations in Fig. 10 (c) share similarities, such as the pool table, indoor setting, lighting placement, viewpoint, and room layout. We argue that retrieving the origins at very large strengths does not make sense, as shown by our failures: https://huggingface.co/datasets/ICML2025Rebuttal/ICML2025_Rebuttal/resolve/main/fail_example.pdf

Q4. How does CFG affect results?

A4. Thank you for the insightful question! In our paper, experiments are conducted at CFG=7.5, which is the default for most diffusion models. The table below shows our method performs well across many commonly used CFGs. These results are obtained by training on SD 2 at CFG=7.5 and testing on ColorfulXL at varying CFGs.

| CFG | 3.5 | 4.5 | 5.5 | 6.5 | 7.5 | 8.5 | 9.5 | 10.5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mAP | 94.9 | 93.9 | 92.3 | 90.5 | 88.8 | 87.7 | 86.3 | 84.8 |
| Acc | 94.0 | 92.8 | 91.0 | 89.0 | 87.1 | 85.9 | 84.4 | 82.6 |
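As a hedged illustration of how such a CFG sweep can be produced with diffusers' AutoPipelineForImage2Image (the pipeline named in the review above), a minimal sketch; the checkpoint id, prompt, input image, and strength are stand-ins, since the exact ColorfulXL repository id is not given in this thread:

```python
# Sketch of producing translated queries at varying CFG with diffusers' img2img pipeline.
# The checkpoint id, prompt, input image, and strength are illustrative stand-ins.
import torch
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import load_image

pipe = AutoPipelineForImage2Image.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

origin = load_image("origin.png").resize((512, 512))
prompt = "a watercolor painting of the same scene"

for cfg in [3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5, 10.5]:
    translated = pipe(prompt=prompt, image=origin, strength=0.7, guidance_scale=cfg).images[0]
    translated.save(f"query_cfg{cfg}.png")
```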

Q5. InstructPix2Pix and IP-Adapter encode the origin through VAE latents concatenation and CLIP embeddings, which is discussed in App. G. Fig. 13 (b) is a bit confusing.

A5. Thank you for the kind reminder! Fig. 13 (b) aims to provide a unified perspective on how InstructPix2Pix and IP-Adapter encode the origin. We apologize for omitting the concatenation in InstructPix2Pix. Following your advice, we will replace Fig. 13 (b) in the App. G with textual description in the Related Work.

Q6. In Fig. 10, there are 3 subjects instead of 6?

A6. Thank you for the kind reminder! You're right—we'll fix it to 3 in the final version.

Reviewer Comment

Thank you for the thorough and detailed response! Most of my concerns are addressed and I have no further questions. Congratulations on this interesting work!

Author Comment

We are grateful for your thoughtful and comprehensive feedback, and we are delighted to hear that we have successfully addressed the concerns raised!

Review
Rating: 3

This paper introduces the Origin Identification for Text-Guided Image-to-Image Diffusion Models (ID$^2$) task, aiming to retrieve the original image of a given translated query. The paper highlights the risks of misuse, including misinformation, copyright infringement, and evading content tracing. A key contribution is OriPID, a dataset containing a large-scale reference set of images and guided prompts, designed to test the generalizability of ID$^2$ models across different diffusion models. The paper also presents a theoretically guaranteed method that minimizes the distance between Variational Autoencoder embeddings of generated samples and their origins through a learned linear transformation, demonstrating generalizability across different diffusion models. Experimental results show a +31.6% mAP improvement over similarity-based retrieval methods.

update after rebuttal

All of my questions have been fully addressed, and I appreciate the thoughtful and detailed responses. I maintain my positive rating.

Questions for Authors

  • What happens if an image is generated by an auto-regressive model and later modified by a diffusion model? How well would ID$^2$ handle this scenario?
  • Has the paper tested the proposed approach on real-world, manually edited images (e.g., Photoshop modifications or other AI editing tools)? Would the method generalize to such cases?
  • Will the paper release the OriPID dataset?

Claims and Evidence

The paper claims that:

  1. Similarity-based methods fail to generalize across diffusion models → Supported by experiments, e.g., Table 4, Section 5.2, showing a sharp drop in mAP when training and testing on different models.
  2. A linear transformation can bridge the gap in VAE embeddings across models → Proven theoretically and empirically, with experiments demonstrating improved performance.

However, the claim of generalizability to all diffusion models lacks thorough validation, as only specific models (e.g., Stable Diffusion variants) are tested. The performance on other diffusion-based models, e.g., FLUX.1-dev, is not explored.

Methods and Evaluation Criteria

The evaluation metrics (mAP, top-1 accuracy) are standard and appropriate. The training/testing split across different models is a strong design choice to assess generalization.
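For reference, a minimal sketch of how mAP and top-1 accuracy can be computed in this setting, where each query has exactly one true origin in the reference pool; the function and variable names are illustrative assumptions, not the paper's evaluation script:

```python
# mAP and top-1 accuracy for retrieval with a single ground-truth origin per query.
# With exactly one relevant item per query, average precision reduces to 1 / rank,
# so mAP equals mean reciprocal rank in this setting. Names are illustrative.
import numpy as np

def evaluate(query_feats, ref_feats, gt_index):
    """query_feats: (Q, D); ref_feats: (N, D); gt_index: (Q,) index of each query's true origin."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    r = ref_feats / np.linalg.norm(ref_feats, axis=1, keepdims=True)
    sims = q @ r.T                                              # cosine similarities (Q, N)
    order = np.argsort(-sims, axis=1)                           # best match first
    ranks = np.argmax(order == gt_index[:, None], axis=1) + 1   # 1-based rank of the true origin
    return float(np.mean(1.0 / ranks)), float(np.mean(ranks == 1))   # (mAP, top-1 Acc)
```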

Theoretical Claims

The paper provides proofs for the existence and generalizability of a linear transformation that aligns VAE embeddings of original and modified images. The mathematical foundation appears correct.

Experimental Design and Analyses

The experiments are well-structured, covering:

  • Performance across seven diffusion models.
  • Efficiency comparisons with similarity-based and generalization-focused methods.

However, the paper does not include visualizations of successfully identifying the origin of a generated query, which would make the claim stronger. Visualizing retrieval cases, including both successes and failures, would significantly improve interpretability.

Supplementary Material

Yes, I have reviewed Supp. G and Fig. 10.

Relation to Existing Literature

The paper contributes to image modification detection within diffusion models.

Essential References Not Discussed

None

Other Strengths and Weaknesses

  • Strengths

    • The proposed linear transformation approach is simple, efficient, and theoretically grounded.
    • OriPID dataset provides a strong benchmark for future work on ID2.
    • Extensive experiments demonstrate generalization across multiple diffusion models.
  • Weaknesses

    • The paper does not include visualizations of successful origin identification results, which would make the claim stronger.
    • Does not discuss the commonly used diffusion-based model FLUX.1-dev.
    • Current T2I and I2I models are also driven by auto-regressive approaches, e.g., Janus-Pro [A], Emu3 [B], VAR [C]. However, this paper only works on diffusion-based methods.
    • Limited real-world validation: The dataset is synthetic, and real-world image alterations (e.g., Photoshop edits) are not explored.

    References:

    [A] Chen et al., Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling. ArXiv, 2025

    [B] Wang et al., Emu3: Next-Token Prediction is All You Need. ArXiv, 2024.

    [C] Tian et al., Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction. NeurIPS 2024

Other Comments or Suggestions

Contribution Point 3 in the introduction lists seven performance numbers but does not directly specify which numbers correspond to which models. I would suggest either listing them explicitly or providing an averaged score.

Author Response

We sincerely appreciate your efforts in reviewing our paper. We are encouraged that you find: (1) the proposed linear transformation approach is simple, efficient, and theoretically grounded; (2) the OriPID dataset provides a strong benchmark for future work on ID$^2$; (3) extensive experiments demonstrate generalization across multiple diffusion models. We address your questions below and will incorporate all your suggestions into the final version of our paper.

Q1. The performance on other diffusion-based models, e.g., FLUX.1-dev, is not explored.

A1. Thank you for this valuable suggestion. Accordingly, we evaluate our proposed method on images generated by FLUX.1-dev. The experiments below show that our method generalizes well to FLUX.1-dev.

| Methods | mAP | Acc |
| --- | --- | --- |
| Circle loss | 71.2 | 67.1 |
| SoftMax | 65.1 | 61.7 |
| CosFace | 68.9 | 64.2 |
| IBN-Net | 72.2 | 68.5 |
| TransMatcher | 76.9 | 72.6 |
| QAConv-GS | 76.8 | 71.9 |
| VAE Embed. (Ours) | 52.3 | 47.6 |
| Linear Trans. VAE (Ours) | 88.9 | 87.1 |

Q2. The paper does not include visualizations of successful/failed origin identification results, which would make the claim stronger.

A2. Thank you for this valuable suggestion. Accordingly, we visualize several successful origin identification results, available at https://huggingface.co/datasets/ICML2025Rebuttal/ICML2025_Rebuttal/resolve/main/success.pdf. Our observations indicate that the origin of queries—spanning various topics and generated by different diffusion models—can be effectively traced. The failure cases are available in the Appendix (Section E).

Q3. Current T2I and I2I models are also driven by auto-regressive approaches. However, this paper only works on diffusion-based methods.

A3. Thank you for your valuable comment. We acknowledge these auto-regressive advancements, and the current scope of our work is indeed focused on diffusion-based methods. This is because diffusion models remain a cornerstone in both research and industry due to their proven reliability and high-quality results. As a result, the dataset, method, and theory we introduce are directly applicable to practical scenarios and thus useful. Nevertheless, we agree that exploring auto-regressive approaches is an important future direction, and we plan to investigate them in our subsequent work.

Q4. Limited real-world validation: The dataset is synthetic, and real-world image alterations (e.g., Photoshop edits) are not explored.

A4. Thanks for this valuable suggestion. Accordingly, we evaluate our method on a new dataset, SEED-Data-Edit [1], which contains 52k image editing samples. These samples were collected from amateur photographers who posted their images along with editing requests. Photoshop experts then fulfilled these requests, providing the edited images as target images.

Experimentally, we (1) de-duplicate to get 10,274 image pairs; and (2) treat the edited (target) images as queries and search for them within a pool consisting of their origins along with 1,000,000 distractor images.

The experiments below show that: (1) our method generalizes effectively to real-world, manually edited images; and (2) it achieves the best performance compared to all competing methods.

| Methods | mAP | Acc |
| --- | --- | --- |
| Circle loss | 76.6 | 74.5 |
| SoftMax | 76.2 | 73.2 |
| CosFace | 73.1 | 70.1 |
| IBN-Net | 75.4 | 72.1 |
| TransMatcher | 78.3 | 76.4 |
| QAConv-GS | 74.4 | 71.9 |
| VAE Embed. (Ours) | 66.6 | 64.6 |
| Linear Trans. VAE (Ours) | 86.6 | 85.5 |

[1] Seed-data-edit technical report: A hybrid dataset for instructional image editing

Q5. Contribution Point 3 in introduction lists seven performance numbers but does not directly specify which numbers correspond to which models.

A5. Thanks for this suggestion. We will list them explicitly in the final version: our proposed method achieves 88.8%, 81.5%, 87.3%, 89.3%, 85.7%, 85.7%, and 90.3% mAP, respectively, for Stable Diffusion 2, Stable Diffusion XL, OpenDalle, ColorfulXL, Kandinsky-3, Stable Diffusion 3, and Kolors.

Q6. What happens if an image is generated by an auto-regressive model and later modified by a diffusion model?

A6. Thank you for the question. In response, we (1) generate 5,000 images using Janus-Pro-7B and modify them with ColorfulXL, and (2) use the modified images as queries to search within a pool containing the original Janus-Pro-7B outputs along with 1,000,000 distractor images.

The experimental results indicate that our method achieves similar performance whether the origin is a real image or one generated by an auto-regressive model.

| Origin | mAP | Acc |
| --- | --- | --- |
| Real | 89.3 | 87.7 |
| Auto-regressive | 90.5 | 88.4 |

Q7. Will the paper release the OriPID dataset?

A7. Yes! All the proposed datasets (including training, query, and gallery images) and all code (including training, testing, and dataset-generation code) will be made publicly available to facilitate future research.

Reviewer Comment

Thank you for the authors’ clarifications. All of my questions have been fully addressed, and I appreciate the thoughtful and detailed responses. I will maintain my positive rating. Well done!

Author Comment

We are pleased to know that our efforts have satisfactorily addressed the concerns raised, and we sincerely appreciate your insightful and thorough reviews!

Review
Rating: 3
  • This paper introduces a new problem, "origin identification", for text-guided image-to-image diffusion models, with the goal of retrieving the original image given a query image that was transformed by a text-conditioned diffusion model.
  • The paper proposes a new dataset OriPID, containing original images, text prompts, and query images produced by seven popular diffusion models.
  • The paper also provides a novel retrieval-based method that learns a single linear transformation so that an original image and its generated variant lie close in the transformed embedding space.
  • The paper includes theoretical arguments showing that such a linear transformation should exist and should generalize to unseen diffusion models.

Questions for Authors

See previous sections

Claims and Evidence

  1. Claim: The paper claims that for a well-trained text-to-image diffusion model and its VAE encoder, one can learn a single linear transform that maps each generated-image embedding close to its original-image embedding. Evidence: Theorem 1 shows the derivation and proves it. Empirical results in Table 2 also validate it.
  2. Claim: The same linear transform can work well for images generated by other, unseen diffusion models. Evidence: Theorem 2 shows the derivation and proves it. Empirical results in Fig. 9 also validate it.

Methods and Evaluation Criteria

  1. The proposed method simply consists of VAE embeddings and a linear transformation.
  2. For the proposed origin identification task, the paper uses mAP and accuracy for evaluation.
  3. The proposed method and evaluation criteria make sense and are intuitive.

Theoretical Claims

See the Claims And Evidence section.

Experimental Design and Analyses

  1. The paper conducts extensive experiments on the proposed OriPID dataset and compares the learned transformation with baseline methods, including classification models, self-supervised models, vision-language models, and image copy detection models.
  2. The paper ablates the proposed method using different VAE encoders, loss functions, and matrix ranks, showing that the proposed method is insensitive to the VAE encoder and that an MLP with activation leads to overfitting.

Supplementary Material

Yes. The supplementary material includes proofs of lemmas, GPT-4o prompts, more experimental results, and failure cases.

Relation to Existing Literature

  1. The proposed task is related to image copy detection, domain generalization, and text-guided image editing.
  2. The paper underscores the issue of manipulated images for malicious or illegal ends, aligning with concerns in generative‐model detection.

Essential References Not Discussed

No essential references missed.

Other Strengths and Weaknesses

Strengths

  1. The paper offers a large‐scale benchmark (OriPID), carefully curated, which should help standardize evaluations in this new domain.
  2. The proposed solution is simple yet theoretically motivated and empirically superior to baselines.

Weaknesses

  1. The proposed dataset does not contain the editing strength, and the paper does not show how the proposed method performs under different editing strengths.

Other Comments or Suggestions

See previous sections

Author Response

We sincerely appreciate your efforts in reviewing our paper. We are encouraged that you find our work (1) provides a large-scale and carefully curated benchmark, (2) proposes a novel retrieval-based method, (3) includes theoretical arguments, (4) is intuitive and makes sense, and (5) is empirically superior to baselines. We address your questions below and will incorporate all your suggestions into the final version of our paper.

Q1. The proposed dataset does not contain the editing strength, and the paper does not show how the proposed method performs with different editing strengths.

A1. Thanks for your kind reminders. During testing, the editing strengths for Stable Diffusion 2, Stable Diffusion XL, OpenDalle, ColorfulXL, Kandinsky-3, Stable Diffusion 3, and Kolors, are 0.9, 0.8, 0.7, 0.7, 0.6, 0.8, and 0.7, respectively. The editing strengths used in testing are manually set to prevent significant visual differences between the generated images and the original ones. During training, the editing strength for Stable Diffusion 2 is 0.9.

The table below shows how the proposed method performs under different editing strengths. We observe that although the training and testing images come from different diffusion models with varying editing strengths, the performance of our method remains consistently high across most editing strengths.

It is important to note that:

  • strength = 1 means it's almost like generating from pure noise, which is approximately equivalent to text-to-image generation. Therefore, it is reasonable that we cannot find the origins in that case.
  • As shown in https://huggingface.co/datasets/ICML2025Rebuttal/ICML2025_Rebuttal/resolve/main/fail_example.pdf, we give some examples of strengths where our method fails. These queries are indeed very visually dissimilar from the origins.
  • We do not change the training editing strength for Stable Diffusion 2 while varying test editing strength. That means our method is also generalizable across varying editing strengths.
| mAP | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1.0 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Stable Diffusion 2 | 100.0 | 100.0 | 100.0 | 99.9 | 99.9 | 99.8 | 99.2 | 97.5 | 88.8 | 43.2 |
| Stable Diffusion XL | 99.9 | 99.9 | 99.9 | 99.8 | 99.5 | 99.1 | 97.9 | 81.5 | 68.1 | 19.3 |
| OpenDalle | 100.0 | 100.0 | 99.9 | 99.9 | 99.6 | 98.2 | 87.3 | 49.5 | 13.2 | 1.8 |
| ColorfulXL | 100.0 | 99.9 | 99.9 | 99.8 | 99.4 | 97.9 | 89.3 | 61.0 | 19.5 | 2.6 |
| Kandinsky-3 | 100.0 | 99.9 | 99.7 | 99.2 | 97.6 | 85.7 | 61.9 | 14.5 | 2.1 | 0.0 |
| Stable Diffusion 3 | 100.0 | 100.0 | 99.9 | 99.9 | 99.7 | 99.4 | 98.5 | 85.7 | 30.1 | 0.0 |
| Kolors | 99.9 | 99.8 | 99.6 | 98.9 | 97.4 | 92.1 | 90.3 | 70.9 | 24.5 | 2.2 |

| Acc | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1.0 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Stable Diffusion 2 | 100.0 | 100.0 | 100.0 | 99.9 | 99.8 | 99.7 | 99.0 | 97.1 | 86.6 | 37.7 |
| Stable Diffusion XL | 99.9 | 99.9 | 99.8 | 99.7 | 99.3 | 99.0 | 97.5 | 78.8 | 63.9 | 15.7 |
| OpenDalle | 100.0 | 100.0 | 99.9 | 99.8 | 99.4 | 97.8 | 85.3 | 45.4 | 10.7 | 1.2 |
| ColorfulXL | 99.9 | 99.9 | 99.8 | 99.7 | 99.2 | 97.5 | 87.7 | 57.1 | 16.5 | 1.8 |
| Kandinsky-3 | 100.0 | 99.9 | 99.7 | 99.1 | 97.1 | 83.3 | 57.2 | 11.4 | 1.5 | 0.0 |
| Stable Diffusion 3 | 100.0 | 99.9 | 99.9 | 99.8 | 99.6 | 99.2 | 98.1 | 82.9 | 25.6 | 0.0 |
| Kolors | 99.9 | 99.8 | 99.5 | 98.6 | 96.9 | 90.8 | 88.8 | 67.4 | 20.5 | 1.5 |
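A brief note on why strength 1.0 behaves like text-to-image generation, assuming the standard SDEdit-style initialization used by diffusers (an assumption about the implementation, stated in DDPM notation):

```latex
% SDEdit-style initialization at strength s, starting denoising at t_0 = s * T:
z_{t_0} = \sqrt{\bar{\alpha}_{t_0}}\, z_{\mathrm{origin}} + \sqrt{1-\bar{\alpha}_{t_0}}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I).
% As s -> 1, \bar{\alpha}_{t_0} -> 0: almost no information about the origin survives,
% so the process approaches text-to-image generation from pure noise.
```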
Review
Rating: 4

This paper introduces the ''Origin Identification'' task for text-guided image-to-image diffusion models, aiming to retrieve the original image of a given modified image generated by diffusion models. The motivation for this task stems from security concerns, including misinformation, copyright infringement, and content tracing evasion. Moreover, this paper proposes a novel dataset, containing 100,000 original images with 20 guided prompts per image and 2,000,000 training images, and designs a theoretically guaranteed identification method.

Questions for Authors

Are the proposed datasets and codes publicly available to the academic community?

Claims and Evidence

Overall, this paper's claims are well-supported by theoretical arguments and experimental evidence. However, some claims could benefit from further clarification or additional empirical analysis:

  1. Testing on more diverse diffusion models (e.g., InstructPix2Pix, IP-Adapter) would strengthen this claim.

  2. This paper claims that the linear transformation approach is the best way to generalize across models, but it does not compare against other potential transformations (e.g., non-linear embeddings).

Methods and Evaluation Criteria

The proposed methods and evaluation criteria are well-aligned with the origin identification task. The OriPID dataset, linear transformation method, and evaluation metrics effectively address the task.

Theoretical Claims

The paper makes two major theoretical claims (the existence of a linear transformation that minimizes the distance between original and generated image embeddings, and the generalizability of this linear transformation across diffusion models), each supported by formal proofs.

Experimental Design and Analyses

The experimental design is generally well-structured, with appropriate datasets, strong baseline comparisons, and well-justified evaluation metrics. However, some aspects could be improved:

  1. Introducing hard negative mining (i.e., selecting the most confusing negatives to refine the model) and reporting error analysis on top failure cases would improve the quality of this paper.

  2. Add adversarial robustness evaluations or test against image compression and resizing distortions commonly seen in social media uploads.

Supplementary Material

I reviewed the supplementary sections, including the proofs, dataset details, additional experiments, and failure case analyses. Overall, the supplementary material is well-structured and informative.

Relation to Existing Literature

This paper's contributions align with several key areas in AIGC and AI security, particularly image provenance, diffusion models, and content attribution.

Essential References Not Discussed

This paper does a good job of covering related literature in image retrieval, diffusion models, and AI-generated content detection.

Other Strengths and Weaknesses

This paper presents a well-executed study on origin identification for text-guided diffusion models, with notable strengths in problem formulation, theoretical grounding, and practical evaluation. However, some limitations remain in generalization, robustness, and interpretability:

  1. The authors only tested on VAE-based diffusion models; it would be better to discuss the generalizability of the proposed method to future diffusion models, which may use different encoders.

  2. The proposed method is only tested on Gaussian blur and JPEG compression, but real-world adversaries may apply more advanced modifications.

  3. No evaluation on image cropping, resizing, or watermarking.

  4. No insight into why some transformations fail. This paper does not deeply analyze failure cases beyond mentioning hard negatives.

Other Comments or Suggestions

See above.

Author Response

We sincerely appreciate your efforts in reviewing our paper. We are encouraged that you find: (1) this paper's claims are well-supported; (2) the proposed methods and evaluation criteria are well-aligned with the origin identification task; (3) the experimental design is generally well-structured with well-justified evaluation metrics; (4) the supplementary is well-structured; and (5) this paper presents a well-executed study. We address your questions below and will add these into the final version.

Q1. Testing on more diverse diffusion models (e.g., InstructPix2Pix, IP-Adapter) would strengthen this claim. The authors only tested on VAE-based diffusion models.

A1. Thank you for your kind reminder. As shown in the Appendix (Table 9 and Sec. G), we have evaluated our model’s performance on InstructPix2Pix and IP-Adapter, which use VAE and CLIP to encode original images, respectively. The experiments show that our method successfully generalizes to InstructPix2Pix but fails on IP-Adapter.

Based on these experimental results and the theoretical analysis in the main paper, we argue that the generalization boundary of our method is whether the generated image is conditioned on a VAE-encoded original image. Notably, this VAE can vary in architecture and parameters—such as Stable Diffusion, Kolors, and Flux, and the conditioning schemes can also differ—such as those used in SDEdit and InstructPix2Pix (shown in our paper), or Prompt-to-Prompt and Plug-and-Play (see experiments in the Q2 of Reviewer gjdS).

Although this generalization boundary does not cover all possible scenarios, our method remains practically valuable because: (1) most current image-to-image methods indeed utilize a VAE to encode original images; and (2) as shown in Table 9, in fact, no existing methods succeed on IP-Adapter.

In conclusion, our method is currently the most effective one with substantially superior performance. Nevertheless, we acknowledge the importance of overcoming this generalization boundary and consider it as future work.

Q2. This paper does not compare linear transformation against other transformations.

A2. Thank you for your insightful question. As shown in Fig. 9 of the main paper and Sec. F of the Appendix, we have discussed the transformations of multilayer perceptrons (MLPs) with activation functions. We observe that this case leads to an overfitting problem.

Here, we add two alternative architectures: a single convolutional layer and a multi-head attention layer. The below experiments show that: (1) likely due to underfitting, the convolutional layer results in a performance drop; and (2) although the multi-head attention layer marginally improves performance on seen images, its performance on unseen images falls behind our method, due to overfitting.

| Method | mAP (Seen) | Acc (Seen) | mAP (Unseen) | Acc (Unseen) |
| --- | --- | --- | --- | --- |
| Embeddings of VAE | 51.0 | 47.0 | 46.9 | 43.0 |
| Convolution | 37.4 | 33.8 | 32.5 | 29.6 |
| Attention | 89.0 | 87.2 | 80.7 | 78.2 |
| Linear Transformation | 88.8 | 86.6 | 86.6 | 84.5 |

Q3. Introducing hard negative mining.

A3. Thank you for your valuable suggestion. Following it, we combine our currently used CosFace with a hard-mining triplet loss. However, as shown in the table below, this approach does not bring a performance improvement. This result is reasonable, as the original CosFace paper suggests that once we train with a large margin between classes, additional hard negative mining becomes unnecessary.

| Method | mAP (Seen) | Acc (Seen) | mAP (Unseen) | Acc (Unseen) |
| --- | --- | --- | --- | --- |
| With | 88.4 | 86.4 | 86.8 | 84.9 |
| Without | 88.8 | 86.6 | 86.6 | 84.5 |
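As a sketch of the combination described above, assuming a standard batch-hard triplet term added to the CosFace objective; the margin, weighting, and helper names (`batch_hard_triplet`, `lambda_tri`) are illustrative and not the authors' exact implementation:

```python
# Batch-hard triplet mining (sketch): for each anchor, take its farthest same-origin
# sample as the hardest positive and its closest different-origin sample as the
# hardest negative. Names and the margin are illustrative.
import torch
import torch.nn.functional as F

def batch_hard_triplet(feats, labels, margin=0.3):
    dist = torch.cdist(feats, feats)                                  # pairwise L2 distances
    same = labels[:, None] == labels[None, :]
    hardest_pos = dist.masked_fill(~same, float("-inf")).max(dim=1).values
    hardest_neg = dist.masked_fill(same, float("inf")).min(dim=1).values
    return F.relu(hardest_pos - hardest_neg + margin).mean()

# Combined objective (sketch): cosface_term is the existing CosFace loss on the same features.
# total_loss = cosface_term + lambda_tri * batch_hard_triplet(feats, labels)
```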

Q4. No evaluation on image cropping, resizing, or watermarking.

A4. Thank you for your valuable suggestion. Following it, we conduct experiments involving image cropping, resizing, and watermarking. As shown in https://huggingface.co/datasets/ICML2025Rebuttal/ICML2025_Rebuttal/resolve/main/mod.pdf, our method is relatively robust against these modifications.

Q5. This paper does not deeply analyze failure cases.

A5. Thanks for this insightful suggestion. We provide an analysis here: our model uses a VAE to compress high-dimensional inputs into a lower-dimensional latent representation. This compression tends to smooth out subtle local details, causing the model to lose critical fine-grained distinctions between similar yet different instances. Furthermore, the imposed prior encourages a uniform distribution in the latent space, forcing nuanced features from distinct instances into overlapping latent representations. These factors reduce the model’s capability to differentiate hard negatives from true positives.

Q6. Are the proposed datasets and codes publicly available?

A6. Yes! All the proposed datasets (including training, query, and gallery images) and all code (including training, testing, and dataset-generation code) will be made publicly available.

Reviewer Comment

Thank you for the detailed responses. Most of my concerns and questions have been addressed. I recommend including the additional experimental results and findings in the final version, as some of them are practical and interesting. I have accordingly raised my rating.

Author Comment

We sincerely appreciate the increased rating, and we will make sure to incorporate these experimental results and findings into the camera-ready version.

Final Decision

The reviewers acknowledge the significance of the target problem, find the proposed method to be well-grounded in theory, and consider both the theoretical analysis and empirical results to provide strong support for the paper’s claims. While there are some concerns regarding potential overclaims about generalizability and the superiority of the linear transformation—along with the lack of failure case analysis, detailed performance evaluation, and questions about the soundness of certain derivations—these issues have been adequately addressed in the rebuttal. All reviewers have given positive final ratings. The Area Chair agrees with the reviewers and considers the paper to be solid, provided the authors appropriately incorporate the clarifications and additional content from the rebuttal into the final version.