PolyJuice Makes It Real: Black-Box, Universal Red Teaming for Synthetic Image Detectors
We propose PolyJuice, a black-box red-teaming method that steers text-to-image generative models toward generating images that deceive a synthetic image detector.
Abstract
Reviews and Discussion
This paper proposes a novel method for attacking synthetic image detectors, named PolyJuice. By constructing a dataset of T2I-generated images labelled as true positives (correctly predicted as fake) and false negatives (falsely predicted as real), the method derives direction vectors of maximal change between TP and FN samples. During the attack, these direction vectors are applied in the T2I latent space, causing SID models to falsely predict synthetic images as real.
Through various experiments, the authors demonstrate that PolyJuice enhances the ability of T2I models to generate images that evade detection by SID methods.
Strengths and Weaknesses
Strengths:
- The paper proposes a novel approach to attacking SID models using only the T2I embeddings of TP and FN images.
- The paper is well written, clearly highlighting the importance of this research direction and the limitations it addresses.
- The proposed method is able to achieve high success rates in attacking SID models.
- The authors make use of multiple SID and T2I methods in their evaluation, as well as investigating the impact of PolyJuice on T2I models.
Weaknesses:
- The authors only make use of the COCO dataset. Although they provide reasons for this choice, it would still be useful to evaluate on other datasets, or at least on text prompts outside the COCO textual descriptions, for comparison.
- The performance of the method is highly dependent on the ability of SPCA (which the authors highlight when transferring directions from low-res to high-res images, i.e., the curse of dimensionality causes performance issues). Another variable that may affect the final direction vector is the size of the dataset. The authors do not conduct any ablation study on the impact the number of TP & FN samples has on the performance of PolyJuice.
- Further to the above point, could the authors give more detail on how they constructed the 20k TP and 20k FN images? The writing suggests they are labelled by the targeted SID model; however, more information on how these images were generated would benefit this paper.
- The authors' defence mechanism involves using PolyJuice images to calibrate SID models; could the authors also train an SID classifier with PolyJuice images included in the dataset and evaluate its robustness?
- The paper lacks discussion of defence methods against this attack, which raises ethical questions.
Questions
The work is novel and of a high-quality. Please see my weaknesses above for my questions.
Limitations
The authors touch upon the limitation of their method, and suggest a direction of future research to address it.
The authors mention that the motivation of this work is responsible red-teaming; however, more discussion is needed on how to mitigate the risk of such adversarial images. For instance, the authors visualise the CLIP embeddings of both PolyJuice-aided and non-PolyJuice synthetic images, which show two clear clusters. Could this be used to develop a mitigation strategy?
Lastly, a discussion of how the results of this work could influence policy should be included, even if only briefly. For example, the authors mention that images generated by their method seem to contain warmer colours; could such characteristics help identify synthetic images and be written into policy?
Final Justification
The authors have addressed my concerns and questions well, which has increased my confidence in this paper. However, I do not believe this work and the authors' response warrant a higher rating (from accept to strong accept).
Formatting Issues
No concerns.
Summary of the Review & Response
We greatly appreciate the reviewer's positive assessment of our work. The reviewer finds our approach novel (S1), our paper well written, well motivated (S2), and our work to be novel and of high quality in general (Q). We are glad that the reviewer identifies our usage of multiple SID and T2I methods in evaluation (S4), and the overall high success rate of our black-box method (S3) as strengths.
Primary concerns by the reviewer include:
1) extension to other text-description datasets (W1)
2) ablation on the number of samples needed to compute PolyJuice directions (W2 & W3)
3) potential defense mechanisms (W4 & W5, and L1 & L2)
We address:
1) by extending the evaluation to a new text description dataset (PartiPrompts) in Table R5
2) by providing ablation results on the effect of the number of samples on the success rate and clarifying the pipeline for selecting these samples (Table R6)
3) by discussing the potential defense mechanism that will be added to the final version of the manuscript (Table R7)
In the following, we address each of the concerns raised by the reviewer in detail.
Response to Weaknesses
W1. The authors only make use of the COCO datasets. Despite them providing reasons for only using this dataset, it would still be useful to make use of other datasets, or even using text prompt outside the COCO dataset textual descriptions for comparison.
We thank the reviewer for their suggestion, which improves the breadth of the paper. As noted in Sec. 4, we use COCO as it contains diverse text prompts that cover different domains, such as humans, animals, natural scenery, and common objects. This ensures the training data is not domain-specific. Per the reviewer’s suggestion, we evaluate PolyJuice attacks (directions learned from COCO) on a subset of text prompts from the PartiPrompts dataset (Yu et al., 2022) and present the results in Table R5. The results demonstrate the generalizability of the PolyJuice attacks to text prompts outside COCO.
Table R5. Attack success rate (%) of PolyJuice on text descriptions from the PartiPrompts dataset.
| T2I | Detector | Unsteered | PolyJuice (ours) |
|---|---|---|---|
| SD3.5 | UFD | 13 | 75 (+62) |
| SD3.5 | RINE | 8 | 100 (+92) |
| Flux-Dev | UFD | 78 | 96 (+18) |
| Flux-Dev | RINE | 42 | 86 (+44) |
| Flux-Schnell | UFD | 61 | 84 (+23) |
| Flux-Schnell | RINE | 31 | 56 (+25) |
W2. …The authors do not conduct any ablation study on the impact of the number of TP & FN has on the performance of PolyJuice.
To address this concern, we estimate the shift directions using 16K, 40K, and 52K data samples from FLUX[dev] and use them to perform attacks against UFD. The results presented in Table R6 demonstrate that approximating the steering directions with 16K data points yields the same result as the 40K we use in the paper (SR of 96.3%). Using more data points (52K) leads to a very small increase in success rate (SR of 96.5%), which implies the low sensitivity of PolyJuice SR to the number of training samples. This discussion and results will be added to the revised paper.
Table R6. Effect of the number of samples on PolyJuice's success.
| Training Samples | Success Rate (%) |
|---|---|
| 16K | 96.3 |
| 40K | 96.3 |
| 52K | 96.5 |
W3. Further to the above point, could the authors give more detail of how they constructed the 20k TP and 20k FN images? …
First, using a T2I model (e.g. FLUX), we generate a set of fake images using the text prompts available in the COCO training set. These fake images are then labeled by a given black-box SID as fake (TP) or real (FN). To have a balanced dataset, we sample an equal number (up to 20K) of FN and TPs from the labeled data. We will further clarify these details in the experiment settings subsection of Sec. 4 in the revised paper.
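To make the pipeline concrete, below is a minimal sketch of how such a labelling step could look. The function names `t2i_generate` and `sid_predict` are hypothetical stand-ins for the T2I sampler and the black-box SID query; this is an illustration under our stated assumptions, not code from the paper.

```python
import random

def build_tp_fn_dataset(prompts, t2i_generate, sid_predict, max_per_class=20_000):
    """Generate fake images, label them with the black-box SID, and balance TP/FN."""
    tp, fn = [], []  # TP: fake image predicted fake; FN: fake image predicted real
    for prompt in prompts:
        image, latents = t2i_generate(prompt)   # keep the per-step latents as well
        label = sid_predict(image)              # hard label from the SID: "fake" or "real"
        record = {"prompt": prompt, "latents": latents, "image": image}
        (tp if label == "fake" else fn).append(record)

    # Balance the two sets (up to max_per_class samples each).
    n = min(len(tp), len(fn), max_per_class)
    return random.sample(tp, n), random.sample(fn, n)
```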
W4. The authors defence mechanism involves the use of PolyJuice images to calibrate SID models, could the authors also train a SID classifier using PolyJuice images within the dataset and evaluate its robustness.
We would like to note that the calibration strategy used in the paper is based on EER (Equal Error Rate), which is a widely accepted method in the literature. Per the reviewer’s suggestion, we take the RINE detector and fine-tune it (by training an MLP head) using the attacks generated by PolyJuice. We then compare the detection rate of the original RINE, PolyJuice-calibrated RINE (Sec 4.3, Table 3), and the PolyJuice-trained RINE. Table R7 shows the results of the comparison, which demonstrate the contribution of PolyJuice in improving the defense of the SID (lowering FNR). From the result, we also find that the calibration of SIDs leads to a better FNR than fine-tuning.
Table R7. Comparison of the FNR of the original RINE, PolyJuice-calibrated RINE, and PolyJuice-trained RINE.
| T2I | RINE | PolyJuice-Cal. RINE | PolyJuice-trained RINE |
|---|---|---|---|
| SD3.5 | 15.1 | 3.8 | 6.9 |
| FLUX[dev] | 52.0 | 21.8 | 30.3 |
| FLUX[sch] | 39.6 | 18.4 | 22.6 |
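For readers unfamiliar with EER-based calibration, a minimal sketch of the threshold search follows. This is our own illustration, assuming access to continuous detector scores (higher = more likely fake) on a held-out set; it is not the calibration code used in the paper.

```python
import numpy as np

def eer_threshold(scores_real, scores_fake):
    """Pick the decision threshold where FPR (real flagged as fake) equals FNR (fake passed as real)."""
    scores_real, scores_fake = np.asarray(scores_real), np.asarray(scores_fake)
    candidates = np.sort(np.concatenate([scores_real, scores_fake]))
    best_t, best_gap = candidates[0], np.inf
    for t in candidates:
        fpr = np.mean(scores_real >= t)   # real images scored as fake
        fnr = np.mean(scores_fake < t)    # fake images scored as real
        if abs(fpr - fnr) < best_gap:
            best_t, best_gap = t, abs(fpr - fnr)
    return best_t
```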
W5. The paper lacks more discussion on the defence methods against this attack, which raises ethical questions.
We thank the reviewer for identifying this concern. First, PolyJuice requires a certain number of queries to the SID in order to construct the dataset and compute the steering direction. Therefore, commonly adopted security barriers, such as rate limits, can raise the difficulty of applying PolyJuice maliciously. Apart from security barriers, a model can become robust to PolyJuice by learning a representation space that collapses the TP and FN distributions. For more details, please refer to the response to L1.
In summary, while we acknowledge the potential for misuse, we believe that raising awareness of this vulnerability ultimately benefits the community. By highlighting the issue, our work can inspire further research into developing effective mitigation strategies.
Response to Limitations
L1. …For instance, the authors visualise the CLIP embeddings of both PolyJuice and non-PolyJuice aided synthetic images, which show two clear clusters of images, could this be used to develop some mitigation strategy?
One possible mitigation strategy is to train a head on top of the CLIP feature space that collapses the FN and TP clusters. We want to find a subspace that is orthogonal to the direction of shift between these two clusters, while preserving the information of the target attribute (i.e., real vs. fake). Ideas from invariant representation learning, such as AIFL (Xie et al. 2017) or FairerCLIP (Dehdashtian et al. 2024), can be adopted for this purpose. Specifically, in the notation of FairerCLIP, the target attribute would be the real vs. fake label, while the attribute to be removed would be TP vs. FN. This discussion can enrich the paper, so it will be added to the revised version.
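To make this concrete, here is a rough sketch of an AIFL-style instantiation under our own assumptions (a frozen CLIP image encoder providing 768-d features, a small trainable head, and a gradient-reversal adversary predicting TP vs. FN). It illustrates the direction of the discussion rather than an implemented or evaluated defense.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x
    @staticmethod
    def backward(ctx, grad):
        return -grad  # flip gradients so the shared projection removes TP/FN information

class InvariantHead(nn.Module):
    """Maps frozen CLIP features to a space that predicts real/fake
    while being adversarially stripped of TP-vs-FN information."""
    def __init__(self, dim=768, hidden=256):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.cls_real_fake = nn.Linear(hidden, 2)   # target: real vs. fake
        self.cls_tp_fn = nn.Linear(hidden, 2)       # nuisance to remove: TP vs. FN

    def forward(self, clip_feat):
        z = self.proj(clip_feat)
        return self.cls_real_fake(z), self.cls_tp_fn(GradReverse.apply(z))

# Training sketch: cross-entropy on both heads; the reversed gradient pushes
# `proj` to collapse the TP/FN clusters while keeping real/fake separable.
```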
L2. Lastly, a discussion of how the result of this work can influence policy should be had, even if only brief. I.e. The authors mention that images generated by their method seem to consist of warmer colours? Could such aspects of images help identify synthetic images which could be written into a policy?
We would like to clarify that the common pattern analyzed in Section B.1 of the supplementary material is SID-specific and is not generalizable to other SIDs, since the vulnerability of each model is different. As a result, PolyJuice needs to be applied to each SID to find that model's vulnerability, which is not necessarily the same for other detectors. However, a per-SID policy can be written based on the common patterns.
Please consider raising the scores if we address your concerns. If you have more questions, we would be happy to answer them.
Reference
Yu et al. "Scaling autoregressive models for content-rich text-to-image generation." TMLR (2022).
Xie, Qizhe, et al. "Controllable invariance through adversarial feature learning." NeurIPS (2017).
Dehdashtian et al. "Fairerclip: Debiasing clip's zero-shot predictions using functions in rkhss." ICLR (2024).
I thank the authors for their response to my questions and limitations.
The authors' response has cemented my original score, and I will increase my confidence in the score of this paper.
This paper introduces PolyJuice, a novel black-box and image-agnostic red-teaming method designed to improve the effectiveness of Synthetic Image Detectors (SIDs). Recognizing the limitations of existing red-teaming solutions—which often require white-box access to proprietary SIDs and rely on computationally expensive, image-specific optimizations—PolyJuice offers a universal approach. The core insight behind PolyJuice is the observation of a discernible distribution shift in the latent space of Text-to-Image (T2I) models between images correctly classified by an SID and those misclassified (i.e., synthetic images that the SID erroneously identifies as "real").
Strengths and Weaknesses
Strengths
1. Strong Empirical Results: I trust this paper provides compelling quantitative results, showing PolyJuice's effectiveness in boosting attack success rates (up to 84%) and, conversely, its utility in improving SID robustness (up to 30% reduction in FNR) when used for data augmentation.
2. Addresses a Critical Gap in Real-world: PolyJuice directly tackles the pressing issue of the "arms race" between rapidly advancing T2I models and the SIDs designed to counter them. Its ability to generate challenging, misclassified synthetic images is vital for evolving SID capabilities. The black-box nature of PolyJuice is highly significant for real-world applications, as many advanced SIDs are proprietary and only offer API access, making traditional white-box attacks infeasible.
3. Practical Threat Model: This paper employs a black-box threat model, which is more practical than previous settings. I think this is important for real-world applications.
Weaknesses
1. Writing and Structure: The writing and the general structure of this paper need to be improved, including how the latent-space shift helps to evade detection. I saw too much unnecessarily bolded text in the paper, which may reduce readability.
2. Threat Model is Missing: I didn't find a threat model for the proposed attacks in this paper, including a description of the attacker's capabilities, goals, and attack scenarios.
3. Confused about the perturbation optimization: If I understand correctly, PolyJuice first obtains hard predictions by querying the black-box API, and then forms a distribution in the latent space of T2I models. I am confused about the definition of the latent space here. Is it the latent space of the VAE or just the embedding space of CLIP? And how does PolyJuice find the optimization direction for the images?
Questions
See weakness please.
Limitations
yes
Final Justification
I think the authors need to improve the overall presentation of the whole paper.
Formatting Issues
Too much unnecessary bold text in the paper.
Summary of the Review & Response
We thank the reviewer for their thoughtful comments. We are encouraged that the reviewer finds our empirical results strong (S1), our problem domain to be of critical importance (S2), and our black-box threat model practical in the context of real-world settings (S3).
Concerns by the reviewer include:
1) certain stylistic choices made in the paper (W1)
2) lack of a formal definition of the threat model (W2)
3) Potential confusion about the latent space and the computation of the steering directions (W1 & W3)
We address
1) by explaining our rationale behind our stylistic choices
2) by formally defining our threat model
3) by providing clarifying details
Response to Weaknesses
W1a. The writing and the general structure of this paper needs to be improved, including how the latent space shifting helps to evade the detection.
We explain in Section 3.1 how PolyJuice identifies a direction in the T2I latent space that statistically correlates with the SID’s prediction of “realness”. This is based on a distribution shift between the latents of samples predicted as real versus fake (see Fig. 1b). By universally steering the latent in this direction during the image generation process (Fig. 2), we create samples that are more likely to land in SID failure regions (as further visualized in Fig. 5 and Fig. 7).
W1b. I saw too many unnecessary bolded text in the paper, which may reduce the readability of the paper.
We put these short, bold sentences at the beginning of the paragraphs so that a first-time reader can quickly skim the whole paper and get a general idea of the paper. We would appreciate it if the reviewer could point to the exact places where the bold text is reducing the readability of the paper, so we can address them in the revised version.
We hope these clarifications make the core mechanism of PolyJuice more accessible and improve the overall readability of the paper.
W2. Threat Model is Missing: I didn't find the threat model for the proposed attacks in this paper, including the description to the attacker's capability, goal and attack scenarios.
We are encouraged that in item 3 of the strength section, the reviewer finds our threat model practical and applicable to real-world scenarios. We adopt the suggestion of the reviewer and formally define the threat model for our method in the following.
We extend the notation defined in Sec. 2 of the paper to our threat model.
Threat Model:
- Attacker’s Goal: Given a text prompt and an initial latent, the attacker aims to use a text-to-image (T2I) generative model to generate synthetic images that deceive a target synthetic image detector (SID) into misclassifying them as real.
- Attacker’s Capability:
- Black-box access to the SID: The attacker can only query the SID and observe hard labels (real/fake), without access to model weights or gradients.
- Full access to the T2I generative model: The attacker can manipulate the latent space and control the generation process of a text-to-image (T2I) model.
- Sufficient number of queries: The attacker can generate a dataset of image-latent-label triplets to analyze the SID's behavior in response to various inputs.
- Attack Scenario:
- Step 1: The attacker queries the black-box SID with fake images and obtains hard labels, constructing a dataset of TP and FN samples.
- Step 2: The attacker pre-computes steering directions in the T2I latent space that correlate with increased probability of being misclassified as real by the SID (Eq. 3).
- Step 3: At test time, the attacker applies this universal direction to arbitrary prompts, producing images that evade detection by the SID (Eq. 6).
We will add this formal threat model to the revised version of the paper.
W3a. …Is the latent space of vae or just the embedding space in CLIP?
The latent space mentioned in the paper is the latent space of the VAE used by a latent diffusion/flow-based T2I model, and not the embedding space of a CLIP model.
W3b. …how PolyJuice find the optimization direction of the images?
We first form a set of true positive (TP) and false negative (FN) images using hard labels obtained from the black-box SID. Our goal is to find the direction of the distribution shift from TP to FN (see Fig. 1b) at each image generation step. These directions are estimated by maximizing the statistical dependence (measured through HSIC) between the latents and the labels, i.e., by calculating SPCA. This optimization has a closed-form solution (L121–L123). At inference time, these directions are used to steer the trajectory of the T2I model (Eq. 6) such that the final generated image is guided towards high-error-rate regions (see Fig. 2).
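For illustration, here is a minimal NumPy sketch of a per-timestep direction estimate and the steering step, written in generic notation. It reflects the standard linear-kernel SPCA closed form for binary labels; the exact expressions in the paper's Eq. 3 and Eq. 6 may differ in detail, and the function names are ours.

```python
import numpy as np

def steering_direction(latents, labels):
    """Linear-kernel SPCA direction for binary labels.

    latents: (n, p) flattened T2I latents at one generation step.
    labels:  (n,) +1 for FN (predicted real), -1 for TP (predicted fake).
    With a linear label kernel, the HSIC-maximizing subspace is rank-1; its basis
    vector is proportional to sum_i (y_i - mean(y)) * x_i, which for balanced data
    matches the FN-mean minus TP-mean of the latents.
    """
    X = latents - latents.mean(axis=0, keepdims=True)
    y = labels - labels.mean()
    u = X.T @ y                       # the single leading eigenvector, up to scale
    return u / np.linalg.norm(u)

def steer(latent, direction, strength):
    """Schematic steering step: push the latent along the TP -> FN shift."""
    return latent + strength * direction.reshape(latent.shape)
```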
Please consider raising your score if we have addressed your concerns. If you have additional questions, we would be happy to answer them.
Dear Reviewer qKp6,
Thank you for your thoughtful review. In our rebuttal, we’ve clarified your questions regarding the steering mechanism and added a formal threat model to address your concerns. Please feel free to also refer to our responses to other reviewers, where we present extensive experimental results to address concerns.
As the discussion period nears its end, we kindly encourage you to share any remaining questions or thoughts, as it would help to make a more informed decision on the paper.
If our rebuttal has already addressed your concerns, we would greatly appreciate it if you could consider revisiting or updating your evaluation scores.
Thanks to the authors for the detailed responses. I will increase my score for encouragement. I hope the authors can improve the overall presentation of this paper in the final version.
The paper introduces PolyJuice, a universal red-teaming method that steers text-to-image generators toward regions that fool synthetic-image detectors (SIDs) while requiring only black-box label access. It computes a single "realness shift" direction per diffusion timestep via supervised PCA on latent vectors of true-positive vs. false-negative samples, then applies this vector during sampling. The pre-computed directions transfer across prompts and resolutions, enabling query-efficient, image-agnostic attacks. Experiments with three generators (SD v3.5, FLUX-dev, FLUX-sch) and two SIDs (UFD, RINE) show up to 84 p.p. increase in false-negative rate and demonstrate that augmenting training data with PolyJuice images can cut detector errors by up to 30 p.p.
Strengths and Weaknesses
Strengths
- Novelty: Black-box, universal unrestricted attack on SIDs leveraging a latent-space distribution shift.
- Efficiency: One-off offline direction discovery and low→high-resolution transfer avoid costly per-image optimization.
- Effectiveness: Large gains in SID false-negative rates across multiple generators and resolutions.
- Practical utility: PolyJuice images improve detector robustness when added to training data.
- Clarity: Motivation, algorithm, and experiments are explained with helpful figures.
Weaknesses
- Baseline Coverage: No comparison to prior black-box or transfer-based UA attacks; only a "no-steering" baseline. A quick search revealed Kotyan et al. 2024, proposing a black-box attack using an evolutionary algorithm. I think the authors should conduct a more thorough literature survey and a better quantitative study to show their attack's effectiveness over other black-box attacks. In fact, I would even urge a comparison against white-box attacks to quantify the gap between white-box and black-box attacks; good black-box attacks should minimize this performance gap. Simply showing the effectiveness of their own approach while neglecting existing works does not give a good baseline comparison.
- Limited Detectors: Evaluation covers just two academic SIDs that are not even state-of-the-art. Once again, a shallow literature survey from this perspective is revealed: a quick search clearly indicates UFD is not state-of-the-art, while Baraldi et al. 2024 (CoDE) and Chen et al. 2024 (DRCT) are more general SID models and would test the capabilities of PolyJuice better. A bad detector will be easy to fool, but a strong general detector will be harder. It is imperative to test and validate the attack against SOTA SID models.
- Image Quality Metrics: While the authors evaluate detection performance, the lack of reported image quality for adversarial samples, and of a comparison to non-adversarial samples, casts doubt over the claims made in the article. Zhong et al. 2023 demonstrate that the performance of SID models decreases when image quality is affected; therefore, it is imperative to check and validate that the image quality of adversarial samples is not degraded. If image quality is degraded, then the claims of the article are wrong, as PolyJuice would be greedily fooling the SID by decreasing image quality rather than directing image-generation models towards its blind spots. Slightly degraded images, compared to non-degraded non-adversarial images, also point to a distribution shift affecting SID models not explicitly trained for this distribution shift.
- Potential Gaps in Theoretical Claims and Assumptions: The latent size of modern diffusion models is far bigger than the sample size, so in this "large-p, small-n" regime the sample covariance is necessarily rank-deficient, and the eigenvectors are extremely sensitive to noise. There is no formal discussion of query noise or finite-sample error; I suggest adding a concentration inequality or error bound. As a result, the steering direction may capture sampling artefacts (like poor image quality) rather than a genuine class-separating shift. Further, for the steering direction, no proof is supplied that this convex combination maximises HSIC in the directional (rank-1) setting; the classical SPCA solution would instead select the single leading eigenvector. A short lemma is needed to justify why the proposed weighting improves, or even preserves, the HSIC objective.
- Reverse Process Validity: Adding the steering direction at each step perturbs the reverse posterior. The proof lacks a guarantee that the perturbed trajectory remains within the manifold on which the score network was trained, raising the possibility of off-manifold artefacts that are trivially hard for the detector. Samuel et al. 2023 showed that if the latent vector is manipulated such that its norm changes, then image quality decreases, reiterating that the generated images might be degraded compared to original samples, leading to misclassification by the SID.
- Ablations for Hyperparameter Sensitivity: All discussion of hyperparameters is relegated to the appendix; however, no ablations are conducted to validate the selection of the hyperparameters. An ablation study would better validate these choices.
- Overarching Red-Teaming Claims: The authors suggest that the attack they propose can be used as a red-teaming tool. The main purpose of a red-teaming tool is to identify an actionable vulnerability that can be used to improve robustness. The current article only proposes an adversarial attack; it hypothesizes about the vulnerability and makes no contribution to improving the defence of SIDs, nor shows evidence for the same. Therefore, I think the claim is overarching and should be avoided. The contribution is simply a black-box adversarial attack on SID models.
- Untested Commercial Claims: The authors highlight that the method can be used on commercial SID models but lack tests to validate this. To test the true "black-box" claim, it is essential to apply the attack to commercial SID models like Sightengine (https://dashboard.sightengine.com/ai-image-detection) and AI or Not (https://www.aiornot.com/dashboard/home) and demonstrate the capability; otherwise, I strongly urge the authors not to exaggerate and to refrain from making unwarranted claims.
In general, I find two major limitations, concerning the literature survey and overexaggerated claims:
- The literature survey of the article is highly unsatisfactory and outdated in 2025, both from the point of view of black-box attacks using diffusion models and of the synthetic image detectors chosen for the experiments, failing to account for many studies that appeared in 2024.
- Claims made in the article are not substantiated but are rather boldly presented as contributions.
Comment on Scope (I won't term it a weakness, but the very narrow scope impacts significance)
The proposed method does not use domain information specific to SID, yet its application was limited to SID. I would appreciate it if the authors could comment on why this narrow domain was targeted, when the proposed approach could be used for any image classification task: object classification, NSFW classification? If the authors can demonstrate this, it would raise the significance and applicability of this research to a broader community. Further, even within SID, specialised domains like facial images are not evaluated. Recently, Guo et al. 2025 proposed an interpretable SID using an LLM, showing a shift of the SID task from binary classification to a more complex task. It would be nice to see if the proposed attack can also fool such latest models, which move SID beyond simple binary classification.
References
- Samuel, D., Ben-Ari, R., Darshan, N., Maron, H., & Chechik, G. (2023). Norm-guided latent space exploration for text-to-image generation. Advances in Neural Information Processing Systems, 36, 57863-57875.
- Zhong, N., Xu, Y., Li, S., Qian, Z., & Zhang, X. (2023). Patchcraft: Exploring texture patch for efficient ai-generated image detection. arXiv preprint arXiv:2311.12397. https://fdmas.github.io/AIGCDetect/data/
- Baraldi, L., Cocchi, F., Cornia, M., Baraldi, L., Nicolosi, A., & Cucchiara, R. (2024, September). Contrasting deepfakes diffusion via contrastive learning and global-local similarities. In European Conference on Computer Vision (pp. 199-216). Cham: Springer Nature Switzerland.
- Chen, B., Zeng, J., Yang, J., & Yang, R. (2024, July). Drct: Diffusion reconstruction contrastive training towards universal detection of diffusion generated images. In Forty-first International Conference on Machine Learning.
- Kotyan, S., Mao, P. Y., Chen, P. Y., & Vargas, D. V. (2024). Breaking Free: How to Hack Safety Guardrails in Black-Box Diffusion Models!. arXiv preprint arXiv:2402.04699.
- Guo, X., Song, X., Zhang, Y., Liu, X., & Liu, X. (2025). Rethinking Vision-Language Model in Face Forensics: Multi-Modal Interpretable Forged Face Detector. In Proceedings of the Computer Vision and Pattern Recognition Conference (pp. 105-116).
Questions
See mainly the weaknesses for longer discussions, but I would appreciate crisp answers to the following:
- What is the minimum number of SID queries needed to approximate the steering direction within 5 pp of reported success rates?
- How does PolyJuice compare to a transfer-based black-box attack (e.g., DiffPGD trained on a surrogate SID)?
- Have you evaluated on non-COCO domains, e.g., face-focused, medical, or multi-modal detectors?
- How sensitive are results to the dimensionality d of the SPCA subspace and to the choice of λₜ?
- Could non-linear kernels in SPCA further improve attack success? Any preliminary experiments?
Limitations
- No discussion of query-based cost constraints or rate-limiting defenses at commercial endpoints (given the claim of fooling even commercial APIs).
- Highly limited literature survey neglecting latest SOTA models on SID and black-box attacks using Diffusion Models
- Does not address potential legal/ethical implications of releasing universal attack directions.
- Assumes access to clean TP/FN split; impact of noisy SID labels is not explored.
Final Justification
The authors have addressed my concerns in the rebuttal, and I hope that with the additional discussion that followed, the article's quality will be improved. I highly encourage the authors to add the results they posted in the rebuttal to the final version.
Formatting Issues
N/A
Summary of Reviews & Response
We are encouraged that the reviewer highlights the novelty (S1), efficiency (S2), effectiveness (S3), and practical utility (S4) of our red-teaming approach as well as the clarity of our paper (S5) and its broader applicability beyond SID (in “comment on scope”).
Reviewer’s primary concerns are:
1) literature review (W1 & 2, Q2, L2)
2) paper’s claims (W4, 5, 7 & 8)
3) image quality evaluation (W3)
We address:
1) by demonstrating PolyJuice’s effectiveness on four additional SID models in Table R3, and by comparing it to a transfer-based black-box attack (Q2) in Table R4. We point out that the paper already evaluates against RINE (ECCV 2024), an SID that is as recent as or newer than the reviewer’s suggestions (W2) and has a lower FNR (Table R3).
2) by clarifying misunderstandings on latent space size, data requirements, and SPCA components. We refer to Table 3, where we empirically show that PolyJuice reduces SID error (and also in Table R7 reviewer 9sRv-W4), directly supporting our red-teaming claims. Although we discuss PolyJuice’s applicability to commercial SIDs, we make no performance claims, as access to those models was not granted.
3) by providing image quality score (CLIP IQA) and prompt alignment score in Table R2.
Response to Weaknesses
W1a. Comparison to Kotyan et al. (EvoSeed)
EvoSeed is not scalable to modern T2I models; it needs up to 16,500 images to craft a single attack for SD. To match our experiments (SR(%) over 1,000 attacks), it would need up to 16.5M images, which is practically infeasible.
W1b. Comparison to transfer-based attack
We perform a transfer attack on the RINE detector with DiffPGD and a surrogate (Q2). Table R4 shows that PolyJuice achieves a better SR(%) than the transfer attacks.
Table R4. SR(%) of PolyJuice vs. the regular and realistic variants of transferred DiffPGD.
| | FLUX[dev] | SD3.5 |
|---|---|---|
| Unsteered | 52.4 | 15.3 |
| DiffPGD (regular) | 57.2 | 23.5 |
| DiffPGD (realistic) | 70.1 | 32.1 |
| PolyJuice (ours) | 81.2 | 99.7 |
W1c. Comparison to white-box attack
We believe such a comparison is not informative. A black-box attack with 40% SR appears weak vs. a white-box attack with 90%, but if the theoretical max. for black-box attacks is 41%, then it's very strong. So, comparing against white-box is not enough to judge a black-box attack.
W2. Not SOTA / Limited SID
We believe there is a misunderstanding. While the reviewer questions UFD’s relevance, they overlook our inclusion of RINE (ECCV 2024), a recent and competitive SID published alongside CoDE and after DRCT, which qualifies it as a SOTA academic SID. UFD remains a critical baseline, widely used or built upon by DRCT, CoDE, NPR, FatFormer, and RINE.
To address the reviewers’ suggestions (2n7t & 719g), we evaluate PolyJuice against four new SIDs and report strong improvements in Table R3.
Table R3. SR(%) of unsteered SD3.5 samples vs. PolyJuice.
| SID | Unsteered | PolyJuice |
|---|---|---|
| UFD | 12.8 | 80.6 (+68) |
| RINE | 15.3 | 99.7 (+84) |
| CoDE (linear) | 43.3 | 100.0 (+56) |
| DRCT(UFD) | 25.3 | 100.0 (+74) |
| NPR | 6.0 | 87.6 (+81) |
| FatFormer | 5.5 | 62.1 (+56) |
W3a. Image Quality Metrics
Table R2 reports (i) CLIP-IQA scores for image quality and (ii) CLIP alignment scores for prompt consistency. The results show that PolyJuice significantly boosts attack success while preserving both quality and alignment, confirming it targets SID blind spots rather than exploiting degradation.
Table R2. Quality & alignment scores vs. SR (%) on UFD & RINE, using FLUX[dev].
| | UFD SR | UFD CLIP-IQA | UFD CLIP Score | RINE SR | RINE CLIP-IQA | RINE CLIP Score |
|---|---|---|---|---|---|---|
| 256-Unsteered | 67.6 | 0.8427 | 30.57 | 52.4 | 0.8427 | 30.57 |
| 256-PolyJuice (ours) | 96.3 | 0.8410 | 30.45 | 81.2 | 0.8457 | 30.52 |
| 512-Unsteered | 84.0 | 0.8535 | 30.92 | 77.2 | 0.8535 | 30.92 |
| 512-PolyJuice (ours) | 98.9 | 0.8487 | 30.94 | 96.7 | 0.8526 | 30.84 |
| 1024-Unsteered | 75.6 | 0.8657 | 30.86 | 82.4 | 0.8657 | 30.86 |
| 1024-PolyJuice (ours) | 98.4 | 0.8633 | 30.91 | 94.9 | 0.8503 | 30.69 |
W3b. Regarding Zhong et al.
Zhong et al. show that UFD suffers up to an 8% drop when applied to degraded images, compared to clean images. In stark contrast, PolyJuice improves the attack success rate against UFD by up to 67%. This substantial gap shows that the improved attack success is not caused by degradation (supported by Table R2).
W3c. Stance on slight degradation
Although Table R2 shows that PolyJuice preserves image quality, it’s worth noting that many recent SIDs are explicitly trained to be robust against degradations like blurring, pixelization, etc. (e.g. CoDE, Sec. 3-2, pg. 5). Therefore, if a slightly degraded image still bypasses such defenses, it exposes a critical failure mode that needs to be addressed.
W4a. “large-p, small-n” assumption
The reviewer’s assumption is incorrect. For 256x256 images, the latent size in the FLUX/SD3.5 is p=16,384 (i.e. 16x32x32), while we use n=40,000 data points. Moreover, due to the resolution transferability of PolyJuice (Sec 4.4), we do not need to estimate this direction again for higher resolutions. Table R6 (reviewer 9sRv-W2) also shows that the eigenvector is not sensitive to the number of data points.
W4b. Selection of SPCA eigenvectors
For a label variable with c classes, the subspace’s dimensionality (the number of non-zero eigenvalues) is c − 1. In our binary case, c = 2, so the dimensionality is one and Eq. (4) reduces to the single leading eigenvector, as expected by the reviewer. We will clarify this in the revised paper.
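For completeness, here is a short sketch of the standard linear-kernel SPCA argument in generic notation; the symbols are ours and need not match the paper's Eq. (4).

```latex
% Generic supervised-PCA setup (notation is ours, not the paper's):
% X \in \mathbb{R}^{n \times p}: latents,  Y \in \{0,1\}^{n \times c}: one-hot labels,
% H = I_n - \tfrac{1}{n}\mathbf{1}\mathbf{1}^\top,  K_Y = Y Y^\top.
\[
  \max_{U^\top U = I_d}\ \operatorname{tr}\!\left( U^\top X^\top H K_Y H X\, U \right)
\]
% Since Y\mathbf{1}_c = \mathbf{1}_n and H\mathbf{1}_n = 0, the columns of HY are
% linearly dependent, so \operatorname{rank}(HY) \le c-1 and hence
% \operatorname{rank}(X^\top H K_Y H X) \le c-1: at most c-1 eigenvalues are non-zero.
% For binary labels (c = 2) the maximizer is the single leading eigenvector.
```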
W5. Samuel et al. & Reverse Process Validity
Samuel et al. is only concerned with the initial Gaussian noise prior, not the trajectory. They show that if the initial noise is perturbed such that its norm deviates from the value expected under the Gaussian prior, the image quality significantly drops. There is no contradiction with our findings, as we do not apply any perturbation to the initial noise; we only apply the direction starting from step 1 (L134 & Eq. 6).
Previous papers (e.g., Universal Guidance) justify choosing the steering strength empirically. Table R2 shows that PolyJuice’s success is not due to image degradation.
W6. Ablation on Hyperparameters
We ablate w.r.t the number of samples by using n = {16K, 40K, 52K}. The results, shown in Table R6 (reviewer 9sRv, W2), indicate that PolyJuice directions are not sensitive to n. We also visualize the correlation between successful attacks and (i) steering strength and (ii) steering timesteps in Fig. 9 and Fig. 10, respectively.
W7. Overarching Red-Teaming Claims
We show in Sec. 4.3 that using PolyJuice attacks to calibrate the detectors improves their detection rate by up to 30%. In fact, this is acknowledged as a strength by the reviewer (S4. Practical utility), and also by reviewer 71g9 (S4). Table R7 (rev. 9sRv, W4) also supports this.
W8. Untested Commercial Claims
We clarify that PolyJuice does not claim to demonstrate empirical performance on commercial models but rather enables the possibility of red teaming them due to its black-box design.
During the rebuttal period, we were unable to access commercial SIDs, due either to their ToS or to the companies not granting access.
Comment on Scope
We appreciate the reviewer’s recognition of the method’s broad applicability. Our work focuses on red teaming SIDs due to the urgent need to address risks posed by AI-generated content (AIGC), making it a timely and critical problem. While this paper targets SIDs, we hope future works explore its applications to other domains.
Response to Questions
Q1. …approximate the steering direction within 5 pp …?
As shown in Table R6 (reviewer 9sRv-W2), approximating the steering directions with 16K points yields the same result as the 40K we use in the paper (SR of 96.3%). Using more data points, 52K, leads to a very small increase (96.5%), which indicates the low sensitivity of the directions to n.
Q2. …DiffPGD trained on a surrogate SID?
Addressed in W1 & Table R4; we apply DiffPGD (with 2 T2Is) on a surrogate SID (ResNet50) and transfer to the black-box RINE. We find that PolyJuice has better attack performance.
Q3. Non-COCO domains
PolyJuice's primary focus is on red teaming general AI-gen. image detectors, rather than domain-specific ones. As a dataset beyond COCO, we also evaluate on PartiPrompts (Table R5, reviewer 9sRv-W1).
Q4. Sensitivity to d and λₜ
The dimensionality of the subspace is determined by the number of classes of the label and, in the binary case, is equal to one; therefore, there is no variability here w.r.t. d. For λₜ sensitivity, please refer to Sec. A.3 of the supplementary material.
Q5. Possible usage of non-linear SPCA
Thank you for the thoughtful question. As noted in L320, while a non-linear kernel may offer a boost in success rate by improving separability, it adds significant complexity due to the need for solving a pre-image problem.
Response to Limitations
We address (L1) in the response to W5 of reviewer 9sRv, (L2) in W1 & W2, and (L3) in Sec. 6 and W4 & W5 of reviewer 9sRv. We would appreciate clarification on the noisy TP/FN scenario described in (L4).
Please consider raising your score if we addressed your concerns. We are happy to answer more questions.
Dear Reviewer 2n7t,
Thank you for your valuable feedback on our submission. We have carefully addressed your comments and concerns in our rebuttal, supported by experimental results, which were released three days ago.
As the end of the discussion period approaches and your score is below the acceptance threshold, we kindly encourage you to share any remaining questions or thoughts to further clarify our responses. If our rebuttal has already addressed your concerns, we would greatly appreciate it if you could consider revisiting or updating your evaluation scores.
We deeply value your time and effort in reviewing our work and look forward to your thoughts.
Dear Reviewer 2n7t, Thank you for taking the time to engage with our responses and for finalizing your evaluation. Your valuable feedback has enriched our paper. As we no longer can see the revised scores, we would be very grateful if you could give us feedback on our last response and whether you have any further questions or suggestions.
Thank you for the detailed clarifications. I am partially convinced by the arguments made by the authors.
Baseline Coverage
I appreciate the authors' experiment on DiffPGD. That resolves my concern about baselines, since now there is at least one attack compared quantitatively.
Though I am not convinced by the authors' argument against comparison with white-box attacks. It is true that a black-box attack will be less potent than a white-box attack. However, I feel strongly that we need to close the gap with better optimizations. I am not convinced that black-box attacks have a theoretical maximum; black-box performance depends on the optimizations used, and commenting on this gap for the task would better drive the development of stronger black-box optimizations that close it. If the authors really feel that there is a theoretical maximum, they need to demonstrate it.
Limited Detectors
I appreciate the authors' clarification about RINE and the addition of experiments using CoDE and DRCT. This resolves my concerns about limited evaluation on SID detectors.
Image Quality Metrics
I appreciate the authors' addition of CLIP-based image metrics, which resolves my concern to some extent. I'd like to note that I share the authors' opinion here that PolyJuice does not degrade image quality. As such, I would like to see concrete, irrefutable evidence for it; otherwise, it is quite easy to misconstrue the result as exploiting a vulnerability to image quality.
Therefore, I would urge the authors to also use some non-CLIP-based (or, in general, non-model-based) image quality metrics, like FID, to measure image degradation, since it is possible for CLIP to have some robustness against image degradation, in which case these degradations would not be measured by CLIP-based or other model-based IQMs.
Theoretical Claims
I appreciate the authors' clarification on my large-p, small-n assumption. While I agree that the authors have used a large-n, small-p setting in their experiments, I think it is pretty clear that this is a theoretical limitation, and as such it should be mentioned explicitly in the article to prevent large-p, small-n experiments by people trying the method.
Reverse Process Validity
I am not convinced by the argument made about reverse-process validity. As I mentioned in my earlier comment, I share the opinion that image degradation is not happening, but I am not thoroughly convinced theoretically or empirically. The authors have to understand that image degradation goes beyond perceivable degradation like blurring. For example, JPEG compression at 85% results in a visually similar image, but some information is lost. Therefore, while exemplar images can demonstrate that perceivable degradations are not happening, I am not convinced whether non-perceivable degradations are happening or not.
Ablation Studies
Thank you for the experiments, I have no further concerns about ablations.
Overarching Claims
Thank you for the argument and for directing me towards Section 4.3. I misjudged Table 3.
Commerical Claims
I stand by my point that unless a commercial system is actually attacked, this should not be mentioned as a contribution. It would be misleading to say that your attack can be tried on commercial systems when there is always a chance that the attack might fail. Unless you demonstrate explicitly that the attack does not fail for commercial systems, it should not be listed as a contribution. It is more like future work, mentioned perhaps in the conclusions (but I would argue even against that).
Scope
I strongly feel the work goes well beyond an attack against SID.
Questions
Thank you for the answers.
Final Comments: Once again, thank you for your rebuttal. All of my concerns except concrete evidence against image degradation (empirical: the point on image quality metrics; theoretical: reverse-process validity) have been addressed; as such, I am willing to raise my score. I again urge the authors to provide concrete, indisputable evidence that no form of vulnerability to image degradation is being exploited by PolyJuice.
We are happy that we fully addressed the reviewer’s concerns and questions regarding the literature review (W1 & 2), theoretical claims (W4), ablation study (W6), and overarching claims (W7). We are encouraged that the reviewer is willing to increase their score, and in the following, we provide more evidence and arguments for the remaining partial concerns regarding 1) image quality metrics, and 2) comparison to white-box in detail.
Regarding Image Quality
(a) …to also use some non-CLIP-based (or, in general, non-model-based) image quality metrics like FID to measure image degradation (b) …these degradations will not be measured using CLIP-based or any model-based IQM
FID is also model-based: We note that FID is also a model-based approach, as it is calculated in the feature space of InceptionV3, thus potentially suffering from the same limitations as CLIP-based metrics.
Non-model-based IQM: We adopt NIQE (Natural Image Quality Evaluator) as a non-model-based IQM, and show in Table R8 that there is no notable change in the image quality. We also note that the same trend as CLIP-IQA is followed by this non-model-based metric, thus agreeing with the results of CLIP-IQA in Table R2.
Table R8: Model-based (CLIP-IQA) & non-model-based (NIQE) quality assessment.
| | UFD SR(%) (↑) | UFD CLIP-IQA (↑) | UFD NIQE (↓) | RINE SR(%) (↑) | RINE CLIP-IQA (↑) | RINE NIQE (↓) |
|---|---|---|---|---|---|---|
| Unsteered | 67.6 | 0.8427 | 6.57 | 52.4 | 0.8427 | 6.57 |
| PolyJuice (ours) | 96.3 | 0.8410 | 6.64 | 81.2 | 0.8457 | 6.39 |
Difficult to compute FID during a short rebuttal: Nonetheless, following the reviewer’s suggestion, we would like to compute FID too. However, the standard practice for computing FID requires 50K attacks per setting, which is not feasible in the short rebuttal period. In the revised version, we will add FID alongside the current CLIP-IQA metric to enhance the experimental results of the paper. For now, we thus provide CLIP-IQA, which is also an acceptable metric for image quality (as shown in Table R8 and Table R9).
Red Teaming vs. Attacker’s Perspective: During PolyJuice's development, we found that PolyJuice-steered images can help SIDs improve detection of unsteered/undegraded images (Sec. 4.3; Table R7 in response to Reviewer 9sRv), suggesting these images contain features useful beyond simple degradation or noise. Since PolyJuice meets its red-teaming objective, a qualitative analysis of image quality was deemed sufficient in the original submission. However, recognizing that image quality is critical from an attacker's perspective, we will include a more extensive evaluation of PolyJuice-steered attacks in the final revision.
Whether CLIP-IQA can measure degradation. In Table R9 (below), we find that CLIP-IQA captures both obvious (blurring) and less visible (JPEG compression) image degradation.
(c) …image degradation goes beyond perceivable degradation like blurring. For example, JPEG compression at 85% results in a visually similar image, but some information is lost. … I am not convinced whether non-perceivable degradations are happening or not.
We agree that both perceivable and non-perceivable degradations are valid concerns. We thus investigate whether CLIP-IQA can actually detect such degradations.
We apply JPEG compression at 85% (imperceptible) and Gaussian blurring (perceptible), and observe in Table R9 that the CLIP-IQA metric captures both.
As the reviewer expects, perceptible degradation, like blurring, is very easily captured by CLIP-IQA (score sharply drops by ~12%). More importantly, we observe that imperceptible degradation, like JPEG compression, is also captured by CLIP-IQA to some extent (the score noticeably drops by ~4%). This change isn’t notably reflected in the alignment score, which is meant to measure prompt faithfulness rather than the objective quality of the image.
Therefore, CLIP-IQA is indeed sensitive to various kinds of degradations, even less visible ones like JPEG compression, and is thus an acceptable metric.
Table R9: Quality and Faithfulness metrics vs. visible degradation (Blur) and imperceptible degradation (JPEG compression).
| | CLIP-IQA (↑) | CLIP Align. Score (↑) |
|---|---|---|
| Unsteered | 0.8427 | 30.57 |
| w/ JPEG (85%) | 0.8084 (-4%) | 30.01 (-0.5%) |
| 256 w/ Blur | 0.7196 (-12%) | 29.29 (-1.2%) |
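A sketch of the degradation probe behind Table R9 follows. The `clip_iqa` argument is a placeholder for whichever CLIP-IQA implementation is used; only the Pillow JPEG and blur calls are standard library usage, and the exact radius and quality values are illustrative.

```python
import io
from PIL import Image, ImageFilter

def jpeg_85(img: Image.Image) -> Image.Image:
    """Imperceptible degradation: re-encode at JPEG quality 85."""
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=85)
    buf.seek(0)
    return Image.open(buf).convert("RGB")

def gaussian_blur(img: Image.Image, radius: float = 2.0) -> Image.Image:
    """Perceptible degradation: Gaussian blur."""
    return img.filter(ImageFilter.GaussianBlur(radius))

def degradation_probe(images, clip_iqa):
    """Compare mean CLIP-IQA on clean vs. degraded copies of the same images."""
    variants = {"clean": images,
                "jpeg85": [jpeg_85(im) for im in images],
                "blur": [gaussian_blur(im) for im in images]}
    return {name: sum(clip_iqa(im) for im in ims) / len(ims)
            for name, ims in variants.items()}
```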
[Part 2 continued in next comment]
Image Quality (Continued from Part 1)
(b) Since it's possible for CLIP to have some robustness against image degradation… (d) As such, I would like to see concrete, irrefutable evidence for it. Otherwise, it's quite easy to misconstrue the result as exploiting a vulnerability to image quality.
We have three additional pieces of evidence that together support a reliable conclusion that image quality is not being exploited:
- A non-model-based quality metric, NIQE, agrees with the CLIP-IQA results (Table R8).
- CLIP-IQA is able to detect both perceivable and less visible degradation (Table R9).
- Fine-tuning with PolyJuice identified images improves SID detection on unsteered/undegraded images (Table 3 & R7).
We will add these valuable discussions, in addition to quality metrics like CLIP-IQA and FID, in the revised version of the paper.
Gap with White-Box Attacks
It is true that a black-box attack will be less potent than a white-box attack. However, I feel strongly that we need to close the gap with better optimizations.
In Table 1, we showed that the average success rate for PolyJuice is 89.4% and 91.7% for UFD and RINE, improving over unsteered by 30% and 38% respectively.
Since an ideal white-box attack has 100% SR, the gap between PolyJuice’s success rate and an ideal white-box attack is about ~10% on average.
To validate this gap, we conducted white-box DiffPGD attack against RINE and UFD and report the results in Table R10; we observe that the success rate of white-box attack is indeed almost 100%. We will add a discussion about this gap in the revised version.
Table R10. SR(%) of PolyJuice vs. the regular and realistic variants of white-box (w.) DiffPGD.
| | RINE FLUX[dev] | RINE SD3.5 | UFD FLUX[dev] | UFD SD3.5 |
|---|---|---|---|---|
| Unsteered | 52.4 | 15.3 | 67.6 | 12.8 |
| w. DiffPGD (regular) | 100.0 | 100.0 | 100.0 | 99.7 |
| w. DiffPGD (realistic) | 100.0 | 100.0 | 100.0 | 99.7 |
| PolyJuice | 81.2 | 99.7 | 96.3 | 80.6 |
Theoretical Claims
I appreciate the authors' clarification on my large-p, small-n assumption. While I agree that the authors have used a large-n, small-p setting in their experiments, I think it is pretty clear that this is a theoretical limitation, and as such it should be mentioned explicitly in the article to prevent large-p, small-n experiments by people trying the method.
In the revised version, we will specify the need for large n and small p as a necessary condition for reliable application of PolyJuice.
Commercial Claims
I stand by my point that unless a commercial system is actually attacked, this should not be mentioned as a contribution. It would be misleading to say that your attack can be tried on commercial systems when there is always a chance that the attack might fail. Unless you demonstrate explicitly that the attack does not fail for commercial systems, it should not be listed as a contribution. It is more like future work, mentioned perhaps in the conclusions (but I would argue even against that).
We agree with the reviewer that mentioning commercial systems (e.g., in the contributions) might lead to confusion; therefore, we will remove this claim from the paper and only use it as motivation for why developing black-box methods in general is crucial.
Scope
I strongly feel the work goes well beyond an attack against SID.
Thank you for your encouraging comments. We’re thrilled that you see PolyJuice’s potential beyond SID attacks, and we hope its generality will spark new applications and inspire further work across a variety of domains.
Final Comments
Once again, thank you for your rebuttal. All of my concerns except concrete evidence against image degradation (empirical: the point on image quality metrics; theoretical: reverse-process validity) have been addressed; as such, I am willing to raise my score. I again urge the authors to provide concrete, indisputable evidence that no form of vulnerability to image degradation is being exploited by PolyJuice.
Thank you for your positive feedback. We have addressed the reviewer's remaining concerns regarding image degradation (Part 1) and highlighted the gap with white-box attacks (Part 2). We also note that we will add all of these additional discussions, along with the FID score, in the final version of the manuscript.
This paper introduces PolyJuice, a novel and versatile black-box red-teaming framework designed to launch unrestricted adversarial attacks against synthetic image detectors (SIDs). Unlike prior methods that rely on white-box access or image-specific optimization, PolyJuice identifies and exploits systematic weaknesses in SIDs by discovering a universal shift in the T2I latent space. By leveraging this insight, PolyJuice can efficiently steer generated images toward SID failure modes, significantly increasing attack success rates. Moreover, augmenting training datasets with PolyJuice-generated examples leads to notable improvements in SID robustness, with detection accuracy increasing by up to 30%.
Strengths and Weaknesses
Strengths:
- The manuscript is generally well-written and easy to follow, with clear motivation and logical flow.
- The topic of improving the robustness of AIGC image detection addresses an urgent and practical challenge in the era of powerful text-to-image models.
- The proposed method is novel in that it introduces a black-box, image-agnostic red-teaming strategy by exploiting latent space distribution shifts — a departure from typical white-box or per-image attack strategies.
- The method not only enables effective attacks but also contributes positively by improving detector robustness when used for data augmentation.
Weaknesses:
- It is unclear whether the observed latent space distribution shift (between correctly and incorrectly classified images) is consistent across different detectors. If the distribution shift is detector-specific, how transferable or generalizable is the proposed PolyJuice approach?
- The method assumes that a linear direction in latent space can universally guide images towards detector failure modes. Is this assumption valid across diverse prompts and image categories? A discussion or ablation study would help clarify.
- While the method improves attack success rates, it is important to analyze whether the manipulated/generated images remain visually realistic and prompt-faithful. Otherwise, there’s a risk that PolyJuice may just generate degraded or unrealistic images that trivially evade detection.
- The authors evaluate PolyJuice primarily on UFD and RINE as target synthetic image detectors. However, it remains unclear how well the method generalizes to more recent and potentially more robust detectors such as FatFormer (CVPR 2024) and NPR (CVPR 2024).
Questions
- Please refer to Weaknesses
- Is the observed latent space distribution shift consistent across different synthetic image detectors? Specifically, does PolyJuice need to re-estimate the steering direction for each SID individually, or can the same direction be transferred between detectors?
- How does the latent steering affect the visual quality and semantic consistency of the generated images? Are there metrics or human evaluations to ensure the attack does not simply degrade image realism?
Limitations
The proposed method relies on identifying a latent space shift that separates correctly and incorrectly classified samples for a specific detector. This shift may not generalize across detectors, potentially limiting the reusability of the learned steering direction.
Final Justification
I thank the authors for their response. The authors need to discuss in detail the impact of the method on image quality in the final version.
Formatting Issues
No major formatting issues were found.
Summary of the Review & Response
We thank the reviewer for their positive feedback. We are pleased that the reviewer finds the manuscript to be well-written, clearly motivated, and easy to follow (S1), and we view this as a key strength of our presentation. The reviewer identifies that our paper addresses an urgent and practical challenge regarding AI-generated image detection (S2) through a novel attack strategy (S3). Further, the reviewer highlights our positive contribution to improving detection methods (S4).
Primary concerns and questions are regarding:
1) whether the attack directions are reusable between detectors (W1, L1)
2) whether the method generalizes over diverse prompts (W2)
3) prompt faithfulness and quality of the generated images (W3)
4) experiments on additional SID models (W4)
We address:
1) by clarifying that the attack directions are SID-specific, and that transfer attacks are more relevant for white-box approaches
2) by highlighting the diversity of the prompts used in PolyJuice evaluation (Table R1); we also extend PolyJuice to captions beyond COCO using PartiPrompts (Table R5, response to reviewer 9sRv, W1)
3) by computing CLIP alignment score (prompt faithfulness) and CLIP-IQA score (quality) in Table R2
4) by evaluating the effectiveness of PolyJuice on 4 additional SID models (Table R3)
Response to Weaknesses
W1. It is unclear whether the observed latent space distribution shift is consistent across different detectors. If the distribution shift is detector-specific, how transferable or generalizable is the proposed PolyJuice approach?
We first clarify that the distribution shift is detector-specific, as different detectors have distinct vulnerabilities. The shifts are not consistent unless the models share similar weaknesses. As shown in Sec B.1 of the supplementary material, the common patterns differ across models, reflecting variations in training data and techniques. In general, transferability is more relevant in the context of white-box attacks where one must craft an attack on a surrogate detector in order to apply a transfer attack to a target black-box model. In contrast, we focus on black-box attacks where we are able to target each model directly.
W2. The method assumes that a linear direction in latent space can universally guide images towards detector failure modes. Is this assumption valid across diverse prompts and image categories? A discussion or ablation study would help clarify.
Thank you for the suggestion; we agree that such an ablation is a useful addition to the paper. We first inspect the successful attacks generated by PolyJuice using COCO validation prompts and categorize them according to the available meta-labels. Table R1 shows the success rate per prompt category. We observe that PolyJuice improves the success rate across all categories, demonstrating that the direction generalizes across diverse prompts and image categories. We also extend PolyJuice to another diverse dataset, i.e., PartiPrompts (see Table R5, in response to reviewer 9sRv-W1).
Table R1. Success rate (%) per prompt category, unsteered vs. PolyJuice-steered (on COCO).
| Category | UFD Unsteered | UFD PolyJuice | RINE Unsteered | RINE PolyJuice |
|---|---|---|---|---|
| Person | 13.7 | 69.5 | 16.1 | 83.5 |
| Animal | 12.5 | 70.8 | 9.3 | 90.7 |
| Food | 14.4 | 58.1 | 11.3 | 88.8 |
| Vehicle | 10.0 | 73.5 | 20.1 | 79.5 |
| Furniture | 18.15 | 66.1 | 17.3 | 82.6 |
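For concreteness, the breakdown in Table R1 can be produced with a routine like the sketch below. This is an illustration only, not our released code: generate_latent, decode_image, detector, and the steering strength alpha are hypothetical placeholders; the only method-specific step is adding one fixed direction to every T2I latent.

```python
# Illustrative sketch only: generate_latent, decode_image, and detector are
# hypothetical stand-ins for the actual T2I pipeline and the target SID.
from collections import defaultdict

def success_rate_by_category(prompts, categories, direction, alpha,
                             generate_latent, decode_image, detector):
    """Apply one universal latent shift and report attack success per prompt category."""
    hits, totals = defaultdict(int), defaultdict(int)
    for prompt, category in zip(prompts, categories):
        z = generate_latent(prompt)            # unsteered T2I latent for this prompt
        z_steered = z + alpha * direction      # the same linear shift is reused for every prompt
        image = decode_image(z_steered)        # render the steered latent
        fooled = not detector(image)           # detector(image) is True if the image is flagged as fake
        hits[category] += int(fooled)
        totals[category] += 1
    return {c: 100.0 * hits[c] / totals[c] for c in totals}
```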
W3. While the method improves attack success rates, it is important to analyze whether the manipulated/generated images remain visually realistic and prompt-faithful. Otherwise, there’s a risk that PolyJuice may just generate degraded or unrealistic images that trivially evade detection.
First, we want to clarify our stance on how an SID should treat an “unrealistic” image versus a “degraded” one. Since the goal of red teaming is to discover and mitigate failure modes, attacks that are not necessarily realistic (e.g., cartoonish features that fool the SID) are also valuable to us, as they reveal the failure modes of the target SID. An ideal SID should trivially detect such unrealistic images as fake.
We agree with the reviewer’s remark that it is important to generate attacks that are prompt-faithful and of reasonable image quality. To measure prompt faithfulness, we compute the CLIP alignment score between the generated attacks and the input prompts; to measure image quality, we use the CLIP-IQA score (Wang et al. 2023). We report the alignment and image quality scores alongside the success rate for FLUX[dev] in Table R2. We find that the CLIP and CLIP-IQA scores are comparable to those of the unsteered models, suggesting that PolyJuice does not evade detection simply by degrading the images.
Table R2. Quality & alignment scores vs. SR (%) on UFD and RINE, using FLUX[dev]
| Setting | UFD SR | UFD CLIP-IQA | UFD CLIP Score | RINE SR | RINE CLIP-IQA | RINE CLIP Score |
|---|---|---|---|---|---|---|
| 256-Unsteered | 67.6 | 0.8427 | 30.57 | 52.4 | 0.8427 | 30.57 |
| 256-PolyJuice (ours) | 96.3 | 0.8410 | 30.45 | 81.2 | 0.8457 | 30.52 |
| 512-Unsteered | 84.0 | 0.8535 | 30.92 | 77.2 | 0.8535 | 30.92 |
| 512-PolyJuice (ours) | 98.9 | 0.8487 | 30.94 | 96.7 | 0.8526 | 30.84 |
| 1024-Unsteered | 75.6 | 0.8657 | 30.86 | 82.4 | 0.8657 | 30.86 |
| 1024-PolyJuice (ours) | 98.4 | 0.8633 | 30.91 | 94.9 | 0.8503 | 30.69 |
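Both metrics are standard and reproducible with off-the-shelf tooling. The sketch below is a minimal illustration, assuming torchmetrics provides CLIPScore and CLIPImageQualityAssessment with their default CLIP backbones; the images and prompts shown are placeholders rather than our actual evaluation data.

```python
# Minimal sketch, assuming torchmetrics (>= 1.0) exposes CLIPScore and
# CLIPImageQualityAssessment; `images` and `prompts` are placeholder inputs.
import torch
from torchmetrics.multimodal.clip_score import CLIPScore
from torchmetrics.multimodal import CLIPImageQualityAssessment

clip_score = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
clip_iqa = CLIPImageQualityAssessment(data_range=1.0)  # default "quality" prompt pair

images = torch.randint(0, 255, (4, 3, 512, 512), dtype=torch.uint8)  # placeholder generated attacks
prompts = ["a photo of a dog playing on a beach"] * 4                # placeholder input captions

alignment = clip_score(images, prompts)       # CLIP alignment (prompt faithfulness), higher is better
quality = clip_iqa(images.float() / 255.0)    # per-image CLIP-IQA quality scores in [0, 1]
print(f"CLIP Score: {alignment.item():.2f}  CLIP-IQA: {quality.mean().item():.4f}")
```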
W4. The authors evaluate PolyJuice primarily on UFD and RINE as target synthetic image detectors. However, it remains unclear how well the method generalizes to more recent and potentially more robust detectors such as FatFormer (CVPR 2024) and NPR (CVPR 2024).
First, we note that RINE (ECCV 2024), which is evaluated in our paper, is a recent detector concurrent with those noted by the reviewer. We thank the reviewer for suggesting these SID models, which enrich the empirical results of the paper. Accordingly, we evaluate PolyJuice on additional SID models: NPR and FatFormer (suggested by reviewer 71g9), as well as DRCT and CoDE (suggested by reviewer 2n7t). The additional results are shown in Table R3: PolyJuice improves the attack success rate by 56.7 percentage points on CoDE, 75 on DRCT, 82 on NPR, and 56 on FatFormer. These results will be added to Table 1 of the paper in the final revision.
Table R3. SR(%) of unsteered SD3.5 samples vs. PolyJuice.
| Detector | Unsteered SD3.5 | PolyJuice-Steered SD3.5 |
|---|---|---|
| UFD | 12.8 | 80.6 (+68) |
| RINE | 15.3 | 99.7 (+84) |
| CoDE (linear) | 43.3 | 100.0 (+56) |
| DRCT(UFD) | 25.3 | 100.0 (+74) |
| NPR | 6.0 | 87.6 (+81) |
| FatFormer | 5.5 | 62.1 (+56) |
Response to Questions
Q2. Is the observed latent space distribution shift consistent across different synthetic image detectors? Specifically, does PolyJuice need to re-estimate the steering direction for each SID individually, or can the same direction be transferred between detectors?
As mentioned in the response to W1, different models have different vulnerabilities, so the directions differ and need to be estimated for each SID. However, estimating a new steering direction incurs negligible additional computational cost: once the images and latents have been generated for the first SID, we only need to 1) predict the labels using the new SID, and 2) solve the eigenvalue problem in Eq. 3.
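To make the cost argument concrete, the sketch below illustrates the re-estimation step. Because Eq. 3 is not reproduced in this response, the code uses a simplified supervised-PCA-style stand-in with a linear (rank-one) label kernel, for which the leading eigen-direction has a closed form; it is not the exact objective from the paper.

```python
# Simplified stand-in for Eq. 3 using a linear (rank-one) label kernel; the
# paper's actual objective may differ. `label_with_new_sid` is a hypothetical helper.
import numpy as np

def reestimate_direction(latents: np.ndarray, detected: np.ndarray) -> np.ndarray:
    """latents: (N, D) cached T2I latents; detected: (N,) with 1 = TP (flagged fake), 0 = FN (missed)."""
    X = latents - latents.mean(axis=0, keepdims=True)   # center the cached latents
    y = detected.astype(np.float64)
    y -= y.mean()                                        # center the labels
    # For a rank-one label kernel L = y y^T, the top eigenvector of X^T L X = (X^T y)(X^T y)^T
    # is X^T y up to normalization; a general kernel would need a full eigen-decomposition.
    v = X.T @ y
    return v / np.linalg.norm(v)

# Re-targeting a new detector reuses the cached latents and only re-labels them:
#   detected_new = label_with_new_sid(latents)            # decode each latent and query the new SID
#   d_new = reestimate_direction(latents, detected_new)
```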
Q3. How does the latent steering affect the visual quality and semantic consistency of the generated images? Are there metrics or human evaluations to ensure the attack does not simply degrade image realism?
The latent steering performed by PolyJuice has a negligible effect on the visual quality and semantic consistency of the generated images, as shown in the qualitative analysis in Sec. B of the supplementary material. Further, we employ the CLIP-IQA score and the CLIP alignment score to measure the image quality and prompt faithfulness of the generated attacks compared to unsteered images; the results are provided in Table R2. They support our remark that quality and consistency are practically unchanged by PolyJuice while the attack success rate (SR) improves significantly.
Response to Limitations
L. The proposed method relies on identifying a latent space shift that separates correctly and incorrectly classified samples for a specific detector. This shift may not generalize across detectors, potentially limiting the reusability of the learned steering direction.
As noted above, the directions are not necessarily transferable across SIDs due to differences in their vulnerabilities. However, if two SIDs’ predictions overlap significantly, the direction estimated for one SID can potentially be useful in attacking the other.
Please consider updating your scores if we have addressed your concerns. If you have more questions, we would be happy to answer them.
References
Wang et al. "Exploring CLIP for Assessing the Look and Feel of Images." AAAI (2023).
I would like to thank the authors for their rebuttal. I have read their responses as well as the reviews from fellow reviewers. Some common concerns were raised in the initial review, including the impact on image quality.
The rebuttal addressed some of my doubts. In line with Reviewer 2n7t's concerns, I believe the authors should also employ non-CLIP-based image quality metrics to evaluate the images more comprehensively.
We thank the reviewer for their valuable feedback, and we are glad to have addressed most of their doubts. Regarding non-CLIP/non-model-based image quality metrics (as suggested by the reviewer), please refer to our two-part response to Reviewer 2n7t.
Paper summary
This paper introduces a black-box red-teaming method designed to deceive synthetic image detectors. The core idea is to identify a discriminative direction within the latent space of a generative model, a vector separating detected from undetected synthetic samples, and then steer the image generation process along this vector. This steering creates outputs that are more effective at fooling the detectors. The authors demonstrate through experiments that their method is robust and effective against a range of detectors, making it a valuable tool for auditing and understanding the vulnerabilities of such systems.
Meta-Review and Recommendation
The review process for this paper was highly constructive. The reviewers initially raised valid points regarding the need for additional experiments, clarity, and the scope of the claims. The authors' detailed rebuttal successfully addressed most of these concerns, leading to a consensus for acceptance (ratings of borderline accept or higher). One remaining issue is the measurement of image quality; the authors have committed to including metrics such as FID in the final version. Given the paper's contributions and the effective resolution of the major concerns during the discussion period, the AC recommends Accept. The work represents a solid and timely contribution, and the authors have shown they are committed to delivering a polished final manuscript.