PaperHub
Rating: 6.8 / 10 · Decision: Poster · 4 reviewers
Individual scores: 3, 4, 5, 5 (min 3, max 5, std 0.8)
Confidence: 3.5 · Novelty: 2.5 · Quality: 2.5 · Clarity: 2.8 · Significance: 2.5
NeurIPS 2025

FerretNet: Efficient Synthetic Image Detection via Local Pixel Dependencies

OpenReview · PDF
Submitted: 2025-05-11 · Updated: 2025-10-29

Abstract

Keywords
Synthetic Image Detection · Local Pixel Dependencies · Markov Random Fields · Image Forensics

Reviews and Discussion

Official Review
Rating: 3

The authors address the challenge of synthetic image detection by focusing on local pixel dependencies, observing that generative models often introduce subtle inconsistencies within small neighborhoods. To capture these artifacts, they compute the median value within a local context window for each pixel and measure the difference between the original image and this median-filtered version, generating a feature map that highlights local pixel dependencies. Their proposed model, FerretNet, is built using a series of Ferret Blocks that incorporate dilated and grouped convolutions to effectively capture multi-scale features. The approach is evaluated on images generated by both GANs and LDMs. To justify their architectural choices, the authors conduct ablation studies on the neighborhood window size, the treatment of the center pixel, the selection of the median operation, and the influence of different backbone networks.
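
For readers who want a concrete picture of the feature described above, the following is a minimal sketch of a local median-difference map in PyTorch. The 3x3 window, the exclusion of the center pixel, and all function names are illustrative assumptions; the paper's exact LPD definition may differ.

```python
import torch
import torch.nn.functional as F

def median_difference_map(img: torch.Tensor, window: int = 3) -> torch.Tensor:
    """Sketch of a local median-deviation feature.

    img: (B, C, H, W) tensor in [0, 1]. Returns a map of the same shape:
    each pixel minus the median of its local neighborhood (center excluded),
    highlighting local inconsistencies a detector can learn from.
    """
    pad = window // 2
    padded = F.pad(img, (pad, pad, pad, pad), mode="reflect")
    # Unfold every window x window neighborhood: (B, C*window*window, H*W).
    patches = F.unfold(padded, kernel_size=window)
    b, c, h, w = img.shape
    patches = patches.view(b, c, window * window, h, w)
    # Drop the center pixel so the median reflects only the neighbors.
    center = (window * window) // 2
    idx = [i for i in range(window * window) if i != center]
    neighbors = patches[:, :, idx, :, :]
    median = neighbors.median(dim=2).values
    return img - median

# Example: the resulting map, rather than the raw image, is fed to a classifier.
lpd_like = median_difference_map(torch.rand(1, 3, 256, 256))
```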

Strengths and Weaknesses

Primary Justifications: My negative assessment is primarily due to the lack of clarity in the methodology (see W1, W3–W6) and the unconvincing empirical results (see W8–W10).

Strengths:

S1) The ablation studies are comprehensive and provide strong justification for the proposed method.

S2) The core idea of modeling local pixel dependencies is conceptually interesting and novel.

Weaknesses:

W1) While the introduction presents the Synthetic-Pop dataset as a contribution, the paper provides minimal detail about its curation, contents, or validation methodology.

W2) The related work section lacks essential background on the foundational models, limiting the reader's ability to understand the context.

W3) Figure 1 is unintuitive, overly verbose, and does not add meaningful insight.

W4) The argument in Section 3.1 is unconvincing. It draws vague parallels between LDMs, GANs, and VAEs, despite significant conceptual differences in their latent spaces (e.g., dimensionality, inference steps). These connections are not rigorously justified. This section should either be substantially clarified or removed.

W5) Line 134 references the MRF assumption without explaining why it is applicable in this context. This should be explicitly motivated and justified.

W6) The training procedure for FerretNet is insufficiently described (see Q2 for more detail).

W7) The Synthetic-Pop dataset appears to omit key SOTA models, such as FLUX.

W8) Results in Table 1 are not competitive with SOTA; FatFormer outperforms the proposed FerretNet.

W9) Table 2 shows only a marginal (~2%) average improvement on diffusion datasets, which is concerning given that FerretNet underperforms in Table 1.

W10) If Synthetic-Pop is intended as a novel benchmark, its utility is questionable: performance exceeds 96%, suggesting the dataset may not be sufficiently challenging to provide value to the community.

Questions

Q1) Lines 141–142 state that the model performs better with an even number of pixels, but this seems counterintuitive, especially since the neighborhoods used (e.g., 3×3, 5×5, 7×7) are all odd-sized. Could the authors clarify this claim?

Q2) How is FerretNet trained? While the architecture and the LPD mechanism are described, the training loss function is not clearly specified.

Q3) Were any of the evaluated methods VAEs? VAEs are referenced multiple times throughout the paper, but it's unclear whether they are included in the comparisons.

Q4) What is the origin of the Synthetic-Aesthetic dataset? Is it a novel dataset introduced in this work, or does it come from prior literature?

Limitations

yes

Justification for Final Rating

My primary concerns during the first review cycle centered on the lack of clarity in the writing and what I initially felt were unconvincing results.

I now find the results of the main method to be convincing. While the method does not outperform SOTA approaches, it achieves comparable performance with significantly fewer parameters, which is a meaningful contribution.

However, I still find several aspects of the paper unclearly presented and potentially misleading, particularly in relation to the proposed dataset (presented as 1/3 of the paper's contribution). The dataset is positioned as a major contribution early in the paper, but this is not adequately supported throughout. First, its usefulness appears limited: while it helps differentiate between existing methods, it is unlikely to serve as a valuable benchmark for future work due to saturation. Second, the dataset is not well-situated within the context of related work. Third, its overall presence and impact in the paper are minimal relative to its prominence in the introduction, leading to inconsistencies in how its significance is perceived. Fourth, due to technical limitations, the authors were unable to present results for the most recent SOTA models, which makes it difficult to understand how the proposed dataset fits within the context of the current literature.

Beyond the dataset, other portions of the manuscript (such as certain figures and the section describing the underlying intuitions) also struck me as unclear or potentially misleading.

Following a discussion with the authors, they have agreed to revise and clarify many of these areas. As a gesture of good faith in their commitment to improving the paper, I have increased my score from 2 to 3. That said, the extent of the unclear and potentially misleading content leads me to believe that the revised version would still benefit from additional peer review focused on clarity. For this reason, I am not comfortable raising my score further at this time.

Formatting Issues

none

Author Response

Dear Reviewer,

Thank you for your detailed review. While we appreciate the time you have invested, we must respectfully state that we believe your overall negative assessment is based on several fundamental misunderstandings of our paper's content and a misinterpretation of our experimental results. We have identified numerous claims in your review that are factually inconsistent with the information presented in our manuscript.

We will address each of your points below, providing direct references to the paper to correct these inaccuracies.

1. Clarifications on Factual Misunderstandings in the Review

A significant portion of the negative evaluation stems from claims that are directly contradicted by the text of our paper.

  • Regarding W6 & Q2 (Missing Training Loss): You state that the training loss function is not specified. This is incorrect. As clearly stated on Page 6, Line 216, "Binary Cross Entropy with Logits Loss (BCEWithLogitsLoss) is adopted as the loss function." The entire training setup, including the optimizer, learning rate, and data augmentation, is detailed in Section 5.2 (Lines 212-217).

  • Regarding Q4 (Source of Synthetic-Aesthetic Dataset): You ask for the source of this dataset. This information is explicitly provided. On Page 6, Lines 205-209, we state that the images were "sampled from the Simulacra Aesthetic Captions (SAC) dataset" and the real images were from "LAION-Aesthetics V2".

  • Regarding Q1 (Even vs. Odd Pixels): You claim we state the model "performs better when the number of pixels is even," which you find counter-intuitive. This is a significant misreading of our text. We never make this claim. On Page 4, Lines 140-142, we state our zero-masking strategy is "particularly beneficial when the neighborhood contains an even number of pixels." This refers to the number of neighboring pixels used for the median calculation (e.g., a 3x3 window has 8 neighbors after excluding the center—an even number), not the window size itself. This standard practice ensures the median is always an actual pixel value from the neighborhood, enhancing statistical robustness.

  • Regarding W5 (MRF Justification): You claim we do not explain the applicability of the MRF assumption. This is the very foundation of our method, explained in Section 4.1. On Page 4, Lines 133-135, we state: "According to the Markov Random Field (MRF) assumption, the probability distribution of a pixel depends only on its local neighborhood." We then immediately provide the mathematical formulation (Equation 2) and build our entire LPD feature (Equation 5) upon this principle. The method itself is the justification for its applicability.

  • Regarding Q3 (Inclusion of VAEs): Your question about VAEs highlights a key trend in generative AI. While we mention VAEs in our introduction to provide historical context, their primary role in the current state-of-the-art is not as standalone generators due to limitations in output fidelity. Instead, VAEs are now a critical component within Latent Diffusion Models (LDMs), where they function as powerful encoders and decoders for the latent space. Therefore, our mention of VAEs serves to trace the developmental lineage of generative techniques, acknowledging that the LDMs we extensively evaluate are fundamentally built upon them. Our evaluation correctly focuses on the final generative models (GANs, LDMs) as per established benchmarks.
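
To make the loss-function point under W6 & Q2 above concrete, the quoted setup corresponds to a standard binary-detection training step along the following lines. This is a minimal sketch; the detector, optimizer, and learning rate here are placeholders rather than the paper's actual configuration.

```python
import torch
import torch.nn as nn

# Placeholder detector: maps an LPD-style feature map to one logit per image.
detector = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 1),
)
criterion = nn.BCEWithLogitsLoss()                            # loss named in the rebuttal
optimizer = torch.optim.Adam(detector.parameters(), lr=1e-4)  # placeholder settings

def train_step(lpd_maps: torch.Tensor, labels: torch.Tensor) -> float:
    """One optimization step; labels are 1 for synthetic images, 0 for real."""
    logits = detector(lpd_maps).squeeze(1)
    loss = criterion(logits, labels.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```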

2. Rebuttal of Methodological and Experimental Criticisms

  • Regarding W8 & W9 (Performance vs. SOTA): Your claim that our results are "not competitive" and improvements are "marginal" is a severe mischaracterization of the evidence presented.

    • Efficiency vs. Accuracy Trade-off: You state that FatFormer outperforms FerretNet in Table 1. You fail to mention the colossal difference in model size, which is a central point of our paper. As shown in Table 4 (Page 8), FatFormer has 577.3M parameters, while our FerretNet has only 1.1M (~0.2% of the size) and runs ~8.7x faster. Achieving competitive accuracy (95.9% vs. 98.4%) with a model that is 500 times smaller and vastly more efficient is a significant achievement, not a weakness.
    • Superiority on Diffusion Models: You call the improvement in Table 2 "marginal." We respectfully disagree. Our FerretNet outperforms the SOTA FatFormer (96.9% vs 95.0% ACC). Defeating a model 500x its size on the most recent and relevant class of generative models is a clear and significant contribution, directly contradicting your assessment.
  • Regarding W10 (Utility of the Synthetic-Pop Dataset): You argue that because our method exceeds 98% accuracy, the dataset is "not challenging." This logic is flawed. The utility of a benchmark is to differentiate methods. As shown in Table 3 (Page 7), on our Synthetic-Pop dataset, our method achieves 98.3% ACC, while existing SOTA methods like FatFormer and NPR fail dramatically, scoring only 70.5% and 77.4% respectively. This does not show the dataset is easy; it proves the exact opposite. It demonstrates that modern, high-quality generators pose a significant challenge to existing detectors, and it is precisely this challenge that our method successfully overcomes. The dataset is therefore highly valuable to the community.

  • Regarding W1, W2, W7 (Dataset Details, Related Work, and Missing Models):

    • (W1) Key details of Synthetic-Pop are provided in Lines 199-204. We state the 6 models used, the source of captions and real images, and the total size. We commit to releasing the dataset with full documentation.
    • (W2) A deep review of foundational models is beyond the scope of a paper focused on a novel detection method. Our related work section correctly focuses on synthetic image detection literature, which is standard practice.
    • (W7) Thank you for pointing this out. FLUX is indeed a significant and very recent model. We did consider its inclusion during the construction of our Synthetic-Pop benchmark. Unfortunately, at the time of our experiments, we encountered significant technical challenges in deploying the model in a way that would ensure a stable and fair comparison, a common issue with novel, complex architectures upon their initial release. We therefore proceeded with a set of six other highly popular, diverse, and representative text-to-image models that form a strong and challenging basis for evaluation. We are actively working to integrate it and will certainly include it in future iterations of our work.
  • Regarding W3 & W4 (Clarity and Justification of Section 3): Section 3 provides a high-level, unifying conceptual framework to motivate our search for common artifacts (latent space deviations and decoding inconsistencies). It is not intended as a rigorous mathematical proof of equivalence, but as an accessible intuition for why a universal detector might be possible. This framing is a valid and common rhetorical tool in scientific papers. Figure 1 serves as a standard visual aid for this two-stage process.


In summary, we believe this review contains critical factual errors and misinterpretations that led to an unfounded negative conclusion. Our paper introduces a lightweight, efficient, and novel method grounded in statistical principles. It demonstrably outperforms the state-of-the-art on modern diffusion models and on a new challenging benchmark, all while being over 500 times smaller. We have provided comprehensive experiments and ablations that you yourself have acknowledged as a strength (S1). We kindly ask the Area Chair to consider these points and re-evaluate the validity of this review.

Comment

Tone and Professionalism in the Rebuttal

While I appreciate the authors' detailed rebuttal and their efforts to address my concerns point-by-point, I found the tone of the response to be occasionally dismissive and unnecessarily adversarial, particularly the appeal to the AC to disregard my feedback entirely. Phrases such as “This is incorrect,” “a severe mischaracterization,” and “this logic is flawed” read as confrontational rather than collegial. In the peer review process, I believe it is important for rebuttals to remain focused on clarification and factual correction while maintaining a tone of professional respect. The tone here detracted from the response and may not foster productive dialogue.

Furthermore, I would like to highlight that questions in a review serve as an opportunity to clarify areas where the reviewer has flagged that they may not completely understand the work. Characterizing them as "Factual Misunderstandings in the Review" implies bad faith on the part of the reviewer, which is not the intent behind raising clarifying questions. Instead, it can be productive to take these as feedback that some aspects of the paper could be clarified, as future readers may be confused as well.

While this did not influence my technical assessment, I felt it was worth noting in the interest of maintaining a constructive and respectful review process. I also appreciate the feedback and acknowledge that, although I aimed to make my original comments clear and actionable, I could have provided more detail to better convey my intent.

Response to the technical aspects (questions and weaknesses)

W6 & Q2: Thank you for clarifying this. I was expecting the details to come closer to the architecture description, and I believe the gap with the Dataset Information in between may have caused me to overlook these details.

Q4: Lines 205–209 motivated me to ask about dataset origin because they describe a mix of subsampled datasets, which could be interpreted as a novel composition. Its placement (after your Synthetic-Pop dataset) adds to this ambiguity. Since Synthetic-Aesthetic is not clearly presented as either novel or previously established, I sought clarification to better understand its role in your contribution and the literature.

Q1: Thank you for the clarification. Again, tone awareness would be appreciated in question answering.

W5: I understand your rebuttal response, but I stand by my original feedback: explicitly motivating why you would expect the MRF to hold in this context (even on an intuitive level) would give a much stronger theoretical basis to why your method works.

Q3: This makes sense, the answer is satisfactory.

W8-9: This is a good point, thank you for reiterating that I was perhaps focusing on the wrong aspects. I agree with your statements.

W10: I do not believe my reasoning here is flawed. If performance on the dataset is already near saturation, it limits the potential for future models to build upon it (one of the typical purposes of releasing a dataset). That said, I appreciate the clarification that the dataset is intended primarily for differentiating among existing models. This does add more value than I had initially recognized, though still less than what is typical for datasets aimed at long-term benchmarking.

W1: I initially felt that Synthetic-Pop received relatively limited attention in the paper, especially given its framing as a key contribution in the introduction. However, on further reflection, I do not identify any specific missing details. In that case, the current level of coverage seems adequate, though the emphasis in the introduction may overstate its relative importance.

W2: Given that the task involves detecting images generated by foundational generative models, some context about these models is essential. Since Synthetic-Pop is presented as a key contribution, briefly situating the models used to generate it within related work is also important. A "deep review" isn’t required; just a short, high-level paragraph would suffice to provide the necessary context.

W7: This is understandable. Could you elaborate more on the challenges associated with FLUX?

W4: The intended purpose of this section (as “accessible intuition for why a universal detector might be possible”) was not clear to me as a reader. This lack of clarity may have distracted from, rather than supported, the intended intuition. Clarifying this upfront could help readers better engage with the argument.

W3: The response "Figure 1 serves as a standard visual aid for this two-stage process" does not provide any clarity for the figure. I stand by my initial feedback.

Final note

I hope my responses have helped clarify the concerns I raised, and that we can continue engaging in a constructive discussion of the paper's potential weaknesses rather than setting them aside, as the peer-review process is intended.

Comment

Dear Reviewer,

Thank you for your detailed follow-up and for this continued discussion.

First and foremost, we would like to sincerely apologize for the tone of our previous rebuttal. Upon reading your feedback, we recognize our wording was overly confrontational and did not foster the collegial and constructive dialogue the peer-review process deserves. That was a mistake on our part. Your feedback is a valuable reminder that clarity and respect are paramount, and we are genuinely grateful for this opportunity to continue the conversation productively.

We appreciate you reconsidering and accepting our clarifications on several key points (W6/Q2, Q4, Q1, Q3, W8-9). For the remaining issues, we view your feedback as excellent suggestions for improving our paper's clarity. Below, we outline our plan to incorporate your suggestions into the final manuscript.

Regarding W5: Strengthening the MRF Motivation This is an excellent point. You are right that a stronger intuitive foundation would benefit the paper.

  • Our Plan: We will add a brief motivation to Sec. 4.1. We will explain that natural images exhibit strong local statistical consistency due to the physics of light and matter. We will then argue that generative models often struggle to perfectly replicate these subtle statistics, leading to microscopic disruptions that our LPD method is designed to quantify. This will provide the clearer theoretical grounding you suggested.

Regarding W10 & W1: Refining the Contribution of the Synthetic-Pop Dataset This is a very thoughtful point about the role of benchmarks, and we appreciate your perspective.

  • Our Plan: We agree our introduction may have slightly overstated its long-term role. We will revise our wording to more accurately frame Synthetic-Pop's primary value as a crucial "differentiator for current models." We will position it as a timely and necessary "stress test" that highlights the limitations of existing detectors against high-fidelity generators, rather than as a benchmark for long-term progress.

Regarding W2 & W4: Improving Clarity and Context for Section 3 Thank you; your feedback makes it clear this section needs better signposting.

  • Our Plan: 1) At the start of Section 3, we will add an introductory sentence clarifying its purpose as a high-level, intuitive framework to motivate our search for universal artifacts. 2) We will also add a short paragraph to the Related Work to briefly contextualize the models used in Synthetic-Pop, as you suggested.

Regarding W7: Elaborating on the Challenges with FLUX This is a reasonable request for more detail.

  • Elaboration: The primary challenges were twofold: 1) The prohibitive computational resources required by FLUX were unsuitable for the large-scale, stable image generation needed for a robust benchmark. 2) The available open-source implementations were still maturing at the time, which could introduce confounding variables into our analysis. To ensure fairness and replicability, we prioritized models with stable, well-established implementations. We hope to include FLUX in future work as its ecosystem matures.

Regarding W3: Improving the Clarity of Figure 1 We appreciate you standing by your feedback; our initial response was insufficient. The figure clearly needs more support.

  • Our Plan: We will substantially revise the caption of Figure 1. The new, more detailed caption will explicitly walk the reader through the diagram. It will clarify that the figure illustrates a conceptual, two-stage abstraction of the generation process, not a literal architectural diagram. We will explicitly state that the "Decoder" stage is the conceptual locus of the decoding-induced artifacts (like smoothing and color discontinuities) that our LPD method is designed to detect, directly linking the figure's components to the paper's core ideas.

We are very grateful for this constructive dialogue. We believe incorporating these changes, prompted directly by your insightful feedback, will make our paper substantially stronger. We hope our responses and our detailed plan for revision have fully addressed your remaining concerns.

Comment

Dear Authors,

Thank you for your renewed commitment to maintaining a respectful and constructive dialogue.

Your latest responses have addressed my concerns more thoroughly, and I have no further points to raise at this time.

I appreciate your time and engagement.

Comment

Dear Reviewer, Thank you for your final message and for the positive conclusion to our discussion. We are sincerely grateful for your thorough engagement throughout this process. Your feedback has been invaluable in helping us improve the paper, and we deeply appreciate the time and effort you dedicated to our work.

Official Review
Rating: 4

The paper presents FerretNet, a lightweight convolutional network that detects synthetic images by exploiting local pixel dependencies (LPD) derived from Markov-Random-Field-motivated, zero-masked median deviations between each pixel and its neighborhood. The paper also provides a large-scale benchmark dataset, Synthetic-Pop, covering six cutting-edge generators. Extensive experiments show that the method is effective.
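
For reference, the MRF-motivated, zero-masked median deviation summarized above can be written roughly as follows; the notation is ours and need not match the paper's Equations 2 and 5.

```latex
% Markov Random Field locality: a pixel depends only on its neighborhood
P\bigl(x_i \mid \mathbf{x}_{\setminus i}\bigr) \;=\; P\bigl(x_i \mid x_{\mathcal{N}(i)}\bigr),
\qquad
% Local pixel dependency feature: deviation from the neighborhood median
\mathrm{LPD}_i \;=\; x_i - \operatorname{median}\{\, x_j : j \in \mathcal{N}(i) \,\}.
```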

优缺点分析

Strengths:

  1. The paper is easy to follow.
  2. The method is efficient.

Weaknesses:

  1. No experiments on common post-processing (JPEG, resizing, rotating ...).
  2. The idea of locality-based cues is somewhat incremental. This makes the novelty limited.
  3. The authors should explore the adaptive attacks of this method.

Questions

  1. Please report results on common post-processing (JPEG, resizing, rotating ...) as these are ubiquitous in social media.
  2. Please clarify the conceptual differences between FerretNet and NPR [1].
  3. Generative models could explicitly minimize median deviation artifacts. Please explain why we should not consider this adaptive threat model.

[1] Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, Ping Liu, and Yunchao Wei. Rethinking the up-sampling operations in CNN-based generative networks for generalizable deepfake detection. In CVPR 2024

Limitations

yes

Justification for Final Rating

All my concerns have been addressed.

Formatting Issues

no

Author Response

Dear Reviewer,

Thank you for your valuable feedback. We are pleased that you found our paper easy to follow and our method efficient. Your questions regarding post-processing robustness, novelty, and adaptive attacks are critical, and we are pleased to provide new experimental results and clarifications below.

1. Regarding Robustness to Common Post-Processing

This is a critical point, as real-world images are rarely pristine. We fully agree and, as per your suggestion, have conducted a comprehensive robustness analysis on the ForenSynths test set against three of the most common post-processing operations: JPEG compression, image resizing, and rotation. The detailed ACC/AP results are as follows.

A note on the image resizing experiment: For this test, we adopted a more stringent and realistic evaluation protocol. Instead of resizing all images to a fixed input size, as is common for batch processing, we fed the arbitrarily resized images (with dynamic resolutions) directly into the models. As FatFormer's architecture does not support dynamic resolutions, it was excluded from this specific test.

| Method    | No Attack | JPEG (Q=100) | JPEG (Q=75) | Resize (S=0.75) | Resize (S=1.25) | Rotation D=[-45°, 45°] |
|-----------|-----------|--------------|-------------|-----------------|-----------------|------------------------|
| FreqNet   | 91.5/98.5 | 50.5/66.6    | 50.1/51.8   | 65.2/85.8       | 64.9/82.8       | 79.9/91.6              |
| NPR       | 92.5/96.1 | 55.0/59.3    | 50.0/49.1   | 83.9/84.9       | 78.9/81.8       | 86.7/90.7              |
| FatFormer | 98.4/99.7 | 96.5/99.4    | 71.7/89.8   | --              | --              | 68.1/96.8              |
| FerretNet | 95.9/99.3 | 55.1/67.8    | 50.2/49.4   | 81.4/94.3       | 80.8/95.4       | 88.2/98.0              |

Analysis of Results:

  • JPEG Compression: As expected, heavy JPEG compression (Q=75) severely degrades artifact signals, causing a significant performance drop across all methods. We note that lightweight methods without large-scale pre-training (including our FerretNet) are more affected than FatFormer, which likely learned features more robust to compression from its massive pre-training dataset. This highlights a common challenge for all lightweight detectors.
  • Image Resizing: In this more challenging dynamic-resolution test, our FerretNet demonstrates strong robustness, outperforming other lightweight methods in both down-sampling and up-sampling scenarios. This indicates that our method is not sensitive to scale changes.
  • Image Rotation: In the rotation robustness test, FerretNet achieves the best performance among all methods, even significantly surpassing the heavyweight FatFormer. This result provides strong evidence for the superiority of our design: LPD features are based on local, orientation-agnostic statistical dependencies, not on fixed spatial or frequency patterns, making them inherently immune to geometric transformations like rotation.

We thank you for your valuable suggestion. This analysis is crucial for evaluating the practical potential of our method, and we commit to including the full experimental details and the table in the appendix of the final version for readers' reference.
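
As a rough illustration of this perturbation protocol (not the authors' actual evaluation code; the quality factors, scales, and angle range come from the table above, everything else is an assumption), the three operations could be applied with Pillow as follows:

```python
import io
import random
from PIL import Image

def jpeg_compress(img: Image.Image, quality: int) -> Image.Image:
    """Re-encode as JPEG at the given quality (e.g., 100 or 75)."""
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).copy()

def resize(img: Image.Image, scale: float) -> Image.Image:
    """Rescale by a factor (e.g., 0.75 or 1.25); under the stricter protocol
    the detector then sees this dynamic resolution directly."""
    w, h = img.size
    return img.resize((int(w * scale), int(h * scale)), Image.BILINEAR)

def rotate(img: Image.Image, max_deg: float = 45.0) -> Image.Image:
    """Rotate by a random angle drawn from [-max_deg, max_deg] degrees."""
    return img.rotate(random.uniform(-max_deg, max_deg), resample=Image.BILINEAR)
```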

2. Regarding Novelty: From "Locality" to "Statistical Anomaly"

We understand your concern regarding the novelty of the "locality" concept. We wish to clarify that our core innovation is not simply "using local cues," but how we fundamentally model "locality."

Many prior works rely on detecting specific, heuristic-based local artifacts, such as checkerboard patterns from GANs or specific frequency anomalies. This approach is akin to "pattern matching" and can lose effectiveness as generator models evolve and artifact patterns change.

Our approach is philosophically different, shifting the detection paradigm from "pattern matching" to "statistical anomaly detection":

  • A Principled Statistical Foundation: Our LPD feature is built on a cornerstone of natural image statistics—the Markov Random Field (MRF) assumption. We do not hunt for any known artifact "signatures." Instead, we measure the degree to which an image violates the local pixel statistical dependencies that should exist in a natural scene.
  • A Model-Agnostic Artifact Signal: This method provides a model-agnostic and more universal artifact signal. Whether an artifact is caused by unnatural smoothness from a generator's decoder or an abrupt change from a diffusion denoising step, it will leave a trace at the local statistical level that LPD can capture.

We believe this principled, universally applicable artifact modeling approach based on statistical foundations represents a substantive contribution, not merely an incremental extension of locality-based ideas.

3. Conceptual Differences with NPR

This is an excellent question that helps to precisely position our work. The difference can be summarized with an analogy:

  • NPR looks for the "tool marks." It focuses on detecting the fixed, high-frequency patterns, like "fingerprints," left in an image by specific upsampling operations (e.g., nearest-neighbor interpolation). Its logic is: "I found traces left by Tool A, therefore this image is a forgery."

  • LPD checks if the "scene obeys the laws of physics." It does not care what "tool" was used. It checks if the final result is statistically plausible. Its logic is: "The distribution of pixels in this local region does not follow the statistical laws of the natural world, therefore it is a forgery." Whether the artifact was caused by upsampling, diffusion denoising, or some other unknown technique, LPD can detect the anomaly as long as it disrupts natural local statistics.

In short, NPR is fingerprinting a specific cause, while LPD is detecting a universal effect.

4. Regarding the Threat of Adaptive Attacks

This is a very forward-thinking question. An adaptive attack, where a generator is trained to explicitly fool a detector, is the ultimate test of robustness.

  • The Generator's Dilemma: While an attacker could theoretically add "minimizing LPD artifacts" as a loss term during training, this would place them in a dilemma. To completely eliminate LPD artifacts, the generator would need to perfectly replicate the extremely complex local statistics of natural images at the pixel level. This is an incredibly difficult optimization problem in itself. Forcing the model to do this would likely act as strong regularization that could harm the overall image quality, diversity, or semantic coherence of the generated images. The attacker must make a difficult trade-off between "undetectability" and "image fidelity."

  • Raising the Bar for Attackers: Our method essentially raises the bar for attacks. An attacker can no longer get away with patching a few superficial, specific artifacts; they must now solve a much more fundamental and difficult problem of statistical simulation.

  • Future Work: We completely agree that this is a vital research direction. As a defense, we plan to explore and enhance our model's robustness in future work by incorporating LPD as part of a discriminator in an adversarial training loop.
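
To spell out the adaptive threat model discussed in the "Generator's Dilemma" bullet above, the attacker's combined objective might look like the following; the penalty term, weighting, and norm are hypothetical, chosen only to illustrate the trade-off.

```latex
% Hypothetical adaptive objective: the original generator loss plus a penalty
% that suppresses median-deviation (LPD-like) artifacts in generated images \hat{x}
\mathcal{L}_{\mathrm{adaptive}} \;=\; \mathcal{L}_{\mathrm{gen}}
\;+\; \lambda\, \mathbb{E}_{\hat{x} \sim G}\!\left[
\bigl\lVert \hat{x} - \operatorname{median\_filter}(\hat{x}) \bigr\rVert_{1} \right].
```

Driving the second term to zero requires the generator to reproduce natural local statistics exactly, which is precisely the fidelity-versus-undetectability trade-off described above.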

Comment

I believe all my concerns have been addressed. Thank you to the authors for their efforts.

Comment

Dear Reviewer,

Thank you very much for your positive follow-up and for confirming that your concerns have been addressed.

We sincerely appreciate your time and the insightful feedback you provided throughout this process. Your efforts have been instrumental in helping us improve the quality and clarity of our work.

Thank you again.

Official Review
Rating: 5

The paper targets the task of detecting whether an image is real or machine-generated. It uses a very simple approach -- computing the difference between the image and its median-filtered version. A light classification network with a small receptive field is trained on this difference map to decide whether the image is fake or real. Substantial experiments are conducted with images generated by various models, including GANs and diffusion models. The comparison with existing works shows that this paper is more effective at detecting generated images while using a much smaller network.

Strengths and Weaknesses

Strengths:

1/ The paper is easy to follow. The idea is simple and lightweight. It is pretty surprising that such a simple method can beat many existing approaches.

2/ The ablation experiments are well designed. They carefully compare different design choices. The ablation on the receptive field is especially helpful. The large margin in this ablation is a bit surprising and might reveal the core of the approach -- the local neighborhood is all that matters.

3/ I like the analysis in Section 3 about the sources of the artifacts in generated images, though it does not seem to be tightly related to the method design with the median filter etc.

Weakness:

The authors claim the proposed FerretNet is novel because “it adopts a dual-path parallel architecture to increase the receptive field”. However, this concept goes back at least to GoogleNet (preferring width over depth in deep neural networks), and there have since been many variations of GoogleNet. Therefore I’m not sure whether the proposed FerretNet is novel or necessary. Although the authors provide an ablation on FerretNet by comparing it to Xception and ResNet50, it is not quite convincing to me, because it is very possible that the worse performance of Xception/ResNet50 comes from overfitting the training data with 20x more parameters than the proposed FerretNet. I’m very curious to see whether the performance would get better or worse if FerretNet were made larger (e.g., 20x larger).

Questions

It would be nice if you could make a clear connection between the analysis in Section 3 and the proposed method in Section 4. If not, I would suggest moving Section 3 to the appendix or to later in the paper as an independent discussion. In its current form it feels like the motivation of the method lives in Section 3, but I couldn't get it.

The ablation on FerretNet might be worth revisiting. I can see why a lightweight shallow network works well because of its limited receptive field, but I can't see why a deeper network works worse, other than overfitting. It would be nice to analyze this further and show why.

Limitations

Yes

Justification for Final Rating

All my concerns have been resolved through the communication with the authors. I'm happy that the authors are willing to take my suggestions on revising how to better present their techniques. The paper itself provides a simple yet effective approach which is neat so I would recommend for acceptance.

Formatting Issues

None.

Author Response

Dear Reviewer,

Thank you for your constructive feedback and for appreciating the simplicity, effectiveness, and well-designed ablation studies of our work. Your insightful questions have helped us to further probe the core principles of our method and clarify our contributions. We are pleased to provide the following responses and new experimental results.

1. Clarifying the Architectural Contribution of FerretNet

Thank you for this insightful question, which gives us the opportunity to precisely define our architectural contribution.

  • Clarifying our Contribution: We agree that the general concept of multi-branch architectures for expanding receptive fields, such as in GoogLeNet, is well-established. The key novelty of FerretNet, therefore, lies not in inventing the multi-branch paradigm itself, but in the principled and highly-specialized architectural engineering for a unique task: detecting subtle artifacts from LPD feature maps. Our contribution is a demonstration of how a carefully co-designed set of lightweight components (depthwise separable, grouped, and dilated convolutions) can create an extremely efficient yet powerful feature extractor for this specific type of low-level signal—a problem for which, as we show, standard deep networks are ill-suited.

  • New Experiment on Scaling FerretNet: To directly address your insightful question, "if you make FerretNet larger (e.g., 20x larger), if the performance will get better or worse," we conducted a new set of experiments by scaling our model up and down. We report the average results across the ForenSynths, Diffusion-6-cls, Synthetic-Pop, and Synthetic-Aesthetic test sets.

| Method             | Channels            | Blocks       | Parameters | ACC / AP    |
|--------------------|---------------------|--------------|------------|-------------|
| FerretNet-S        | (32, 64)            | (2, 2)       | 0.13 M     | 93.1 / 97.6 |
| FerretNet-B (Ours) | (96, 192)           | (2, 2)       | 1.06 M     | 97.1 / 99.6 |
| FerretNet-L        | (96, 192, 384, 768) | (2, 2, 6, 2) | 21.51 M    | 96.6 / 99.4 |

  • Analysis and Conclusion: These results provide a clear answer and strongly support your intuition.
    • Making the model significantly larger (FerretNet-L, ~20x parameters) results in a slight performance degradation.
    • This confirms your hypothesis that overfitting is the primary reason why deeper, larger networks like Xception/ResNet50 (and even a bloated FerretNet-L) perform worse. Our detection task relies on identifying subtle, low-level, local statistical artifacts present in the LPD maps. These maps are more akin to residual signals than semantic images.
    • Large, deep networks are designed to learn high-level semantic features. When applied to LPD maps, they are prone to overfitting to spurious correlations or specific artifact patterns from the training generator (ProGAN), which ultimately harms their generalization ability to unseen artifacts. Our proposed FerretNet-B hits the "sweet spot," being sufficiently expressive to capture the necessary features without having the excess capacity that leads to overfitting.

We will incorporate this new ablation study and the accompanying analysis into the final version of the paper, as it provides a much stronger justification for our lightweight design.
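
As a purely illustrative sketch (not the authors' FerretNet implementation), a dual-branch block built from the components named above, i.e., depthwise, grouped, and dilated convolutions, might look like the following; the channel count, group count, and dilation rates are assumptions.

```python
import torch
import torch.nn as nn

class DualBranchBlock(nn.Module):
    """Hypothetical block: two parallel branches with different dilation rates
    enlarge the receptive field while keeping the parameter count small."""
    def __init__(self, channels: int = 96, groups: int = 4):
        super().__init__()
        def branch(dilation: int) -> nn.Sequential:
            return nn.Sequential(
                # Depthwise 3x3 convolution with the given dilation rate.
                nn.Conv2d(channels, channels, 3, padding=dilation,
                          dilation=dilation, groups=channels, bias=False),
                # Grouped pointwise convolution mixes channels cheaply.
                nn.Conv2d(channels, channels, 1, groups=groups, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
        self.branch1 = branch(dilation=1)   # micro-local context
        self.branch2 = branch(dilation=2)   # meso-local context
        self.fuse = nn.Conv2d(2 * channels, channels, 1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = torch.cat([self.branch1(x), self.branch2(x)], dim=1)
        return self.fuse(out) + x           # residual connection

# Example usage on a 96-channel feature map derived from an LPD input.
y = DualBranchBlock(96)(torch.rand(1, 96, 64, 64))
```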

2. Clarifying the Connection Between Section 3 (Artifact Analysis) and Section 4 (Method)

Thank you for pointing out the need for a clearer link between these two sections. We apologize if the connection was not sufficiently explicit. The logical flow from our theoretical analysis to our method design is as follows:

  • Section 3 (The 'Why'): This section establishes the fundamental reason why artifacts exist. We analyze how various generative models, despite their different architectures, share a common challenge: synthesizing high-frequency details from lower-dimensional information. This process inevitably leads to microscopic inconsistencies.

  • Section 4.1 (The 'What'): This section proposes LPD as a tool to quantify these microscopic inconsistencies. Based on the assumption that natural images exhibit strong local pixel dependencies (a Markov Random Field property), LPD is designed to measure the breakdown of these dependencies. Thus, LPD is the direct methodological response to the problem identified in Section 3.

  • Section 4.2 (The 'How'): This section designs FerretNet to effectively analyze the LPD maps generated in the previous step. As shown in our paper (e.g., Figure 2), the LPD artifacts are spatially non-uniform; their patterns and scales vary across the image. Therefore, a specialized network with multiple receptive field branches (our FerretNet) is required to capture these diverse, non-stationary artifact responses, from micro-local to meso-local scales.

In the final manuscript, we will revise the beginning of Section 4 to explicitly state this logical flow (Why -> What -> How) to ensure the motivation for our method is clear and well-connected to our analysis.


Once again, we sincerely thank you for your sharp and constructive feedback, which has significantly helped us strengthen the paper. We hope our response and the new experiments have addressed your concerns.

Comment

Thank you for the response.

Based on the authors' analysis of the proposed FerretNet, it is reasonable to assume that a shallow GoogleNet (for example, 2 layers) could also achieve decent performance. So I'm still not convinced of the necessity of the proposed FerretNet. However, if the authors frame it as an implementation detail rather than as a novelty/contribution, I would be fine with it.

"Why->What->How" is a good flow, but I don't think it fits this paper nicely. After reading Section 3, I was expecting the authors to propose solutions based on those reasons behind the artifacts (e.g., Section 3.2 mentions latent distribution deviations, so I would expect the authors' solution to detect such deviations). I think Section 3 works well as a background section, but not as a motivation section (currently it reads like a motivation section, but then at Section 4 I realized it is not). It would be great if the authors could carefully revisit the paper writing in the revised version.

Comment

Dear Reviewer,

Thank you so much for the detailed follow-up and for this very helpful clarification of your perspective. Your continued engagement is incredibly valuable to us.

Regarding the necessity of the proposed FerretNet:

You have raised an excellent and very precise point. We completely agree that the general principle of using a specialized, shallow network is key for this task. Your suggestion to frame FerretNet as a carefully engineered 'implementation detail' rather than a standalone architectural novelty is a much more accurate and appropriate way to present our work. Thank you for this excellent guidance.

To further validate this perspective and directly follow up on your suggestion, we conducted an additional experiment comparing FerretNet with a shallow GoogleNet. The results were highly illuminating:

| Method             | Parameters | ACC / AP    |
|--------------------|------------|-------------|
| FerretNet-S        | 0.13 M     | 93.1 / 97.6 |
| FerretNet-B (Ours) | 1.06 M     | 97.1 / 99.6 |
| FerretNet-L        | 21.51 M    | 96.6 / 99.4 |
| GoogleNet-S        | 0.68 M     | 77.7 / 85.7 |
| GoogleNet          | 5.99 M     | 80.2 / 87.1 |

This new evidence beautifully reinforces your point. While a generic shallow network like GoogleNet-S is lightweight, it is not well-suited for this specific task. The results confirm that while the principle is "shallow architecture," the implementation is critical. Our FerretNet, though an "implementation detail," is a demonstrably superior one for processing LPD feature maps.

  • Our Plan for Revision: In the final manuscript, we will revise our claims accordingly. We will de-emphasize the novelty of the dual-path structure itself and incorporate this new table and analysis. We will position FerretNet as a purpose-built, highly efficient network designed specifically to process our LPD feature maps—a design choice whose effectiveness is now validated not only by our previous ablations but also by this direct comparison you inspired.

Regarding the narrative flow from Section 3 to Section 4:

You have also provided a very insightful analysis from a reader's perspective. We now understand how mentioning two potential artifact sources in Section 3 could create an expectation for two corresponding solutions, leading to a sense of disconnect. This is extremely helpful feedback.

  • Our Plan for Revision: To resolve this, we will carefully revise the writing in Section 3, following your advice. We will reframe it to serve more explicitly as a background section that establishes the general principle that all generative models introduce artifacts. We will clarify that while these artifacts have multiple high-level sources (e.g., latent space, decoding), our work chooses to focus on detecting a powerful and universal effect—the disruption of local pixel statistics—which is effectively captured by our LPD method. This will ensure a much smoother and more logical transition into Section 4.

Your feedback has been invaluable in helping us refine not just the technical details, but the core narrative of our paper. We are confident that these revisions, made directly in response to your thoughtful guidance, will significantly improve the paper's clarity and precision.

Thank you once again for your constructive engagement.

Comment

Thank you for the response and additional results. I have no further question/suggestions at this point.

Comment

Dear Reviewer, Thank you for your final message. We are very pleased that our response and the additional results were helpful. We are sincerely grateful for your insightful feedback and continued engagement throughout this process. Your guidance has been instrumental in helping us strengthen the paper's core narrative and technical analysis. Thank you once again for your time and expertise.

Official Review
Rating: 5

This paper proposes FerretNet, a lightweight method for detecting synthetic images by exploiting artifacts from generative models. The authors identify two artifact sources (i.e., latent distribution deviations and decoding-induced smoothing). They introduce Local Pixel Dependencies (LPD), a median-based reconstruction strategy rooted in Markov Random Fields, to capture texture, edge, and color inconsistencies. FerretNet uses depth-wise separable and dilated convolutions to process LPD features. Trained on 4-class ProGAN data, FerretNet achieves 97.1% accuracy across 22 generative models in an open-world benchmark, outperforming SOTA by 10.6%.

Strengths and Weaknesses

Strength:

  1. This paper presents a well-structured methodology that proposes an effective cue LPD for detecting synthetic images, including visualization and comprehensive ablation studies to validate the design.

  2. The proposed method exhibits strong cross-model generalization ability with an accuracy of 97.1% across 22 generators and demonstrates high efficiency of 772 FPS on a single card, making it suitable for real-world deployment.

Weaknesses:

  1. The proposed method is trained exclusively on the 4-class data set from ProGAN, which includes cars, cats, chairs, and horses. This training approach limits exposure to the diverse artifacts typically produced by diffusion models. As a result, the performance of FerretNet shows a notable decline when evaluated against diffusion models. For instance, it achieves an accuracy of 91.4% on DALL-E (as shown in Table 2), compared to 98.8% for the SOTA method.

  2. Although Figure 2 clearly shows the discontinuities in the LPD, the paper does not provide intuitive interpretations of how the LPD map influences the final decision. Additionally, it lacks visual examples of the LPD maps near the decision boundaries, as well as instances of true negatives and false positives.

  3. The quantitative results reported in Section 5.3 on three different test sets use unaligned comparison methods: 9 on ForenSynth, 7 on Diffusion-6-cls, and 3 on Synthetic-Pop. This raises concerns about fairness in the comparisons.

  4. There are several missing citations and comparison methods (if applicable):

[1] CO-SPY: Combining Semantic and Pixel Features to Detect Synthetic Images by AI, CVPR 2025.

[2] Orthogonal Subspace Decomposition for Generalizable AI-Generated Image Detection, ICML 2025.

[3] Detect Fake with Fake: Leveraging Synthetic Data-driven Representation for Synthetic Image Detection, ECCV 2024.

[4] Improving Synthetic Image Detection Towards Generalization: An Image Transformation Perspective, KDD 2025.

Questions

  1. The proposed method is based on the assumption that the image generation pipeline incorporates a latent space along with a decoder. While most contemporary image generation techniques utilize an encoder-decoder design to facilitate higher resolution training, some methods do not encode images into a latent space. Examples of such methods include vanilla DDPM and AR approaches. How would the proposed method perform in comparison to these types of methods that do not align with this assumption?

  2. Are there differences in how the proposed method detects diffusion-based and AR-based generative images? This includes the differences in the LPD features and the detection performances. Incorporating such comparisons enhances our understanding of the usability of the proposed method and the contributions to the generation community.

  3. In relation to the second weakness identified, Figure 2 illustrates the patterns of the proposed LPD maps. I am curious if there is a distinguishing "signature" that differentiates real images from synthetic ones. Providing clear examples (e.g., they could possibly be zoomed-in sections of edge or texture regions) could enhance the interpretability of the decisions made by the proposed method.

Limitations

Yes. The paper discusses limitations regarding the effectiveness of compression-altered synthetic images and suggests improvements for detecting these altered images.

No discussion of potential negative societal impact is provided.

Justification for Final Rating

My concerns have been addressed through the rebuttal and discussions. I believe the paper is technically sound, considering the explanations and additional results provided during the discussions. Therefore, I am inclined to accept this paper.

However, the authors should include the additional comparisons, analyses, and visualizations mentioned in the discussion in the final version, if the paper is accepted.

Formatting Issues

In Section 5.3 Main Results, the quantitative comparison tables (Table 1 and Table 2) have the meanings of their horizontal and vertical axes swapped, making them difficult to read.

Author Response

Dear Reviewer,

Thank you for your detailed and insightful review of our work. Your thoughtful comments are invaluable for improving the quality of our paper. We have carefully considered each of your points and have provided clarifications and new experimental results accordingly. We hope that our response adequately addresses your concerns.

1. Regarding Training on ProGAN and Performance on Diffusion Models (Weakness 1)

We appreciate the reviewer's focus on this point, as it touches upon the core philosophy of our work.

  • Intentional Experimental Design: Our experimental setup was by design. We aimed to simulate a challenging and realistic scenario: training a lightweight model on a limited, known source of artifacts (ProGAN) to test its ability to generalize to unseen generators with entirely different artifact patterns (like diffusion models). This follows the stringent "Train on One, Test on Many" evaluation paradigm in the field of universal forgery detection, which is designed to assess a model's "zero-shot" generalization capability.

  • Performance vs. Efficiency Trade-off: We acknowledge the accuracy gap with the SOTA method, FatFormer, on DALL-E. However, we wish to highlight the context and significance of our result:

    • Extreme Efficiency: Our FerretNet has only 1.1M parameters, which is approximately 0.2% of FatFormer's size (577.3M). Furthermore, our method requires no large-scale pre-training. In contrast, FatFormer is a modified version of CLIP and relies on costly pre-training on large-scale datasets.
    • Excellent Generalization: Despite this massive efficiency advantage, our model still achieves a high accuracy of 91.4% on the completely unseen DALL-E dataset, significantly outperforming most baseline methods.

Therefore, our central contribution is not to surpass the SOTA accuracy on a specific dataset but to achieve an unprecedented and exceptional balance between computational cost, deployment efficiency, and cross-generator generalization ability. This makes our method highly valuable for practical, resource-constrained applications, which we believe is a significant contribution.

2. Regarding Applicability to Non-"Latent Space-Decoder" Architectures (Question 1)

This is a very insightful question that gets to the heart of generative model diversity.

  • Universality of the Core Mechanism: The "latent space + decoder" framework used in Section 3 of our paper serves as an accessible abstraction for mainstream generators. However, our core module, LPD (Local Pixel Dependencies), is not fundamentally dependent on the existence of an explicit latent space. The essence of LPD is to capture local inconsistencies in the pixel domain that arise from any process of "up-sampling" or "generation." Whether it's a GAN decoding from a low-dimensional vector, a DDPM progressively denoising to create high-frequency details, or an autoregressive (AR) model making pixel-by-pixel (patch-by-patch) predictions, the process inevitably introduces local artifacts that do not conform to the statistical properties of natural images. LPD is designed to detect precisely these subtle, process-induced flaws.

  • New Experimental Verification: To robustly demonstrate the universality of LPD, we have conducted new experiments as you suggested on two models that do not strictly follow the classic "latent space-decoder" paradigm: the AR-based model VAR [1] and a vanilla DDPM. Each test set contains 500 synthetic and 500 real images (DDPM images from DIRE [2], VAR images generated by us). We evaluated our FerretNet, trained only on ProGAN, against them. The results are as follows:

| Method           | VAR (ACC / AP) | DDPM (ACC / AP) |
|------------------|----------------|-----------------|
| NPR              | 82.9 / 83.7    | 99.2 / 100.0    |
| Freq             | 95.3 / 98.7    | 87.7 / 99.9     |
| FatFormer        | 83.9 / 91.1    | 57.0 / 86.5     |
| FerretNet (Ours) | 97.8 / 99.9    | 99.7 / 100.0    |

Conclusion: The results clearly show that FerretNet achieves the best performance on both AR and DDPM models, even significantly outperforming the heavyweight FatFormer. This strongly supports our claim that the local pixel-domain inconsistencies captured by LPD are a more fundamental and universal type of artifact feature, which generalizes exceptionally well to unseen generators with diverse artifact patterns. We will add this experiment and a corresponding analysis to the final version of our paper.
[1] Visual autoregressive modeling: Scalable image generation via next-scale prediction, NeurIPS 2024.
[2] Dire for diffusion-generated image detection, CVPR 2023.

3. Regarding LPD Interpretability and Visualization (Weakness 2 & Question 2, 3)

We wholeheartedly agree that enhancing interpretability is crucial for understanding our method.

  • The "Signature" of LPD Artifacts: As the reviewer astutely observed, LPD does reveal a distinct artifact "signature." As shown in Figure 2 of our main paper:

    • The LPD maps of real images typically present as uniform, low-energy, noise-like textures, reflecting the local smoothness and consistency of natural scenes.
    • The LPD maps of synthetic images expose high-energy, structural artifacts, such as bright lines, grids, or patches, especially along object edges, on smooth surfaces (like skin), or in complex textures. These are the unnatural traces left by the generative model as it "invents" details.
  • Commitment to Add Visualization Analysis in Final Version: To more intuitively demonstrate how LPD influences the final decision and to address your concerns about decision boundaries and misclassifications, we commit to adding a new, dedicated section for visualization analysis in the appendix of the camera-ready version. This section will include:

    • Typical successful cases: Side-by-side comparisons of real/fake image patches, their corresponding LPD maps, and heatmaps to clearly showcase the artifact "signatures."
    • Decision boundary cases: Examples where the artifact signals in the LPD maps are weak, making them challenging for the model.
    • Failure case analysis: Examples of false positives (real images misclassified, perhaps due to strong natural periodic patterns) and false negatives (undetected fakes, perhaps with very subtle artifacts) to provide a deeper insight into the limitations of our method.

We believe these additions will greatly enhance the paper's interpretability and completeness.
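
One hypothetical way to produce the heatmaps mentioned in this plan is to pool the magnitude of the LPD map over small patches; everything in the sketch below, including the patch size, is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def lpd_energy_heatmap(lpd_map: torch.Tensor, patch: int = 16) -> torch.Tensor:
    """Average absolute LPD response per patch: real images tend to yield flat,
    low-energy maps, while synthetic images show high-energy structure."""
    energy = lpd_map.abs().mean(dim=1, keepdim=True)   # (B, 1, H, W)
    return F.avg_pool2d(energy, kernel_size=patch)     # coarse heatmap
```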

4. Regarding Unaligned Comparison Methods (Weakness 3)

We understand the reviewer's concern about the fairness of comparisons. We would like to clarify that our comparison strategy strictly adheres to established and recognized evaluation benchmarks and protocols within the field.

  • Following Established Protocols: As mentioned in Section 5.1.1, our experimental setup follows the protocols established in prior works.
  • Evolution of Benchmarks: The benchmarks ForenSynths, Diffusion-6-cls, and Synthetic-Pop were introduced at different times to evaluate the then state-of-the-art (SOTA) or most representative methods. For instance, when FatFormer was evaluated for its generalization to diffusion models, it was compared against methods like DIRE and Ojha et al. on the Diffusion-6-cls benchmark. Therefore, our use of these same comparison methods on each benchmark is a standard academic practice to ensure direct and fair comparability with those foundational works.
  • Improving Clarity: To make this rationale clearer to readers, we will add a sentence in the experimental section of the final version to explicitly explain the basis for selecting comparison methods for each benchmark.

5. Regarding Missing Citations (Weakness 4)

Thank you very much for pointing out these highly relevant and recent papers! We have carefully reviewed them.

  • Commitment to Update: We will incorporate a detailed discussion of these four papers [1-4] in the "Related Work" section of the final version, analyzing their connections and differences with our work.
  • Regarding Experimental Comparison: Given the limited rebuttal period, a full and fair reproduction and comparison against these very recent methods is not feasible. However, we commit to checking for publicly available code for these methods while preparing the camera-ready version. If feasible, we will do our best to include them in our benchmark tests to provide an even more comprehensive evaluation.

6. Regarding Other Issues (Societal Impact and Table Format)

  • Societal Impact: Thank you for the reminder. We will add a "Broader Impact / Limitations" section in the final manuscript. We will state that while our work aims to be defensive, any detection method can potentially be used by attackers to evaluate the "undetectability" of their models, thus contributing to a technological "arms race."
  • Table Format: We apologize for the reduced readability of Tables 1 and 2, which were transposed due to page limitations. In the final version, we will prioritize readability and make every effort to reformat these tables to be more intuitive.
Comment

Thank you for your detailed response, which has addressed most of my concerns. However, I still have a few remaining issues:

Interpretability: I understand the differences in LPD maps between real and synthetic images, and I can describe them in natural language. What I'm interested in knowing is whether your designed network can distinguish these differences and how it interprets them. This is why I would like to see some intuitive LPD examples related to the decision boundaries, specifically focusing on cases of false positives and false negatives. I do not need definitions of false positives and false negatives.

Performance: As the authors acknowledge, the purpose of this work is not to achieve state-of-the-art performance but rather to demonstrate extreme efficiency and excellent generalization. Currently, there is only one table (Table 4) in the paper illustrating the efficiency of the proposed method, and it compares just 3 methods across 2 metrics. Since the emphasis is on extreme efficiency, a more comprehensive comparison is needed.

Regarding excellent generalization, I assume this is represented in Tables 1, 2, and 3. However, it remains unclear how the comparison methods were trained. Were they trained on their own proposed data and then tested on different sets, or were they trained on the same data as this work? If it is the latter, then the proposed method may not truly represent the best in terms of generalization.

I suggest the authors carefully consider the whole point of the proposed method.

Missing comparison methods: If the authors only run inference with the comparison methods' released checkpoints, the extra comparisons would be relatively straightforward, as many of the missing comparison methods have already released their checkpoints.

Comment

Dear Reviewer,

Thank you for your follow-up and insightful feedback. We have addressed your remaining points with new analysis and experiments, and outline our revision plan below.

1. Regarding Interpretability: How the Network Interprets LPD Maps

You raised an excellent and much deeper point, prompting us to refine our analysis of how the network makes its decisions. Our goal is to provide an interpretation that is fully consistent with our method's proven high generalization capability.

  • Revisiting the Core Hypothesis: Our method's high accuracy across 22 diverse generators proves that the combination of LPD and FerretNet learns a highly generalizable representation of artifacts, not just ProGAN-specific patterns.

  • A More Scientific Analysis of Failure Cases: The failure cases do not contradict this general success but instead help define the boundaries of our model's decision-making.

    • False Positive Case (The Real Rock Formation): The natural geological strata in this real image create large-scale, regular patterns. Our LPD module correctly flags these as statistically unusual. The network, having learned that strong, structured statistical deviations are a primary indicator of artificiality, misclassifies it. This demonstrates the model's high sensitivity, but also reveals its limitation when faced with rare, real-world scenes that are themselves highly structured "outliers."

    • False Negative Case (The Synthetic Cows): In the synthetic image of cows, the LPD map correctly highlights widespread, high-energy artifacts. The failure is not one of feature detection but of decision thresholding. The network aggregates all detected artifact signals to produce a final "fakeness" score. In this specific case, despite numerous detected artifacts, their combined weighted sum did not push the final score over the threshold required for a "fake" classification. This is a rare instance where a state-of-the-art generator produced an image that sits precisely on the "real" side of our model's learned decision boundary.

  • Key Insight and Conclusion: These failures are best understood as rare edge cases. Our method's strength is its robust statistical analysis; its limitation is that this model, like any, has a decision threshold that can be narrowly missed by hyper-realistic fakes or wrongly triggered by statistically extreme real images. This analysis reinforces, rather than contradicts, our core claim of high generalizability.

  • Our Plan for Revision: We will incorporate this more consistent and scientifically sound analysis into the appendix, complete with visual examples.
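As a concrete illustration of how such visual examples could show which regions of the LPD input drive the fakeness score, a Grad-CAM-style heatmap (mentioned later in this thread) is one option. The sketch below is a generic PyTorch implementation under the assumptions that the detector outputs a single logit per image and that `target_layer` points at its last convolutional block; it is not tied to FerretNet's actual internals.

```python
import torch
import torch.nn.functional as F

def gradcam_on_lpd(model, lpd_tensor, target_layer):
    """Grad-CAM-style heatmap over an LPD input of shape (1, C, H, W).
    `model` and `target_layer` are placeholders; a scalar "fakeness" logit
    per image is assumed as the model output."""
    feats, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))

    score = model(lpd_tensor)      # assumed scalar logit per image
    model.zero_grad()
    score.sum().backward()         # gradients of the fakeness score
    h1.remove(); h2.remove()

    # Weight each feature channel by its average gradient, then combine.
    weights = grads[0].mean(dim=(2, 3), keepdim=True)
    cam = F.relu((weights * feats[0]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=lpd_tensor.shape[-2:], mode="bilinear",
                        align_corners=False)
    return (cam / (cam.max() + 1e-8)).squeeze().detach()
```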

2. Regarding Performance, Generalization, and Comprehensive Comparisons

  • Clarification on Fair Comparison Methodology: You asked critical questions about our comparison. To ensure the fairest and most rigorous comparison possible, we have standardized our evaluation protocol:

    1. Unified Training Data: All methods, including our own, were trained or fine-tuned on the same limited ProGAN 4-class dataset; the one exception is CO-SPY, for which we use the official checkpoints trained on the larger ProGAN 20-class and SD-v1.4 datasets.
    2. Unified Parameter Calculation: In the same spirit, we unified the parameter-counting method for all models in the table below; this is why FatFormer's parameter count differs from the figure in our main text, and it ensures a true apples-to-apples comparison of model size (a minimal counting sketch is given after the table).

    We will make these protocols explicit in our final manuscript.

  • Comprehensive Comparison Table: Following your suggestion, we expanded our comparisons to include the methods you mentioned.

| Methods | Ref | Image size | Params | FLOPs | FPS | ACC / AP |
| --- | --- | --- | --- | --- | --- | --- |
| CO-SPY [1] | CVPR 2025 | 384² | 963.05M | 644.80G | 26.3 | 76.5 / 83.8 |
| FatFormer | CVPR 2024 | 224² | 492.59M | 269.92G | 88.6 | 86.1 / 91.0 |
| FreqNet | AAAI 2024 | 256² | 1.85M | 2.58G | 200.2 | 79.2 / 86.8 |
| NPR | CVPR 2024 | 256² | 1.44M | 2.29G | 720.9 | 86.5 / 89.4 |
| SAFE [2] | KDD 2025 | 256² | 1.44M | 2.29G | 770.2 | 96.8 / 99.3 |
| FerretNet (Ours) | - | 256² | 1.06M | 2.38G | 772.1 | 97.1 / 99.6 |
  • Our Plan for Revision: We will integrate this more comprehensive table and our clarified methodology into the paper.
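For reference, the sketch below (referred to above) illustrates one straightforward way to obtain unified parameter counts and FPS figures such as those in the table: summing trainable parameters and timing batched forward passes at a fixed input size. The 256² input, warm-up schedule, and CUDA timing are illustrative assumptions rather than the exact measurement script behind the table; FLOPs can be obtained analogously with a profiling tool.

```python
import time
import torch

def count_params(model: torch.nn.Module) -> float:
    """Total trainable parameters, in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

def measure_fps(model, input_size=(1, 3, 256, 256), runs=100, device="cuda"):
    """Rough throughput estimate (images/second) on random inputs."""
    model = model.to(device).eval()
    x = torch.randn(*input_size, device=device)
    with torch.no_grad():
        for _ in range(10):            # warm-up iterations
            model(x)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(runs):
            model(x)
        torch.cuda.synchronize()
    return runs * input_size[0] / (time.time() - start)
```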

Thank you again for your feedback. We have worked diligently to address your points and welcome the opportunity to clarify any remaining concerns that might prevent a more positive assessment of our work.

References:

[1] CO-SPY: Combining Semantic and Pixel Features to Detect Synthetic Images by AI, CVPR 2025.
[2] Improving Synthetic Image Detection Towards Generalization: An Image Transformation Perspective, KDD 2025.

Comment

May I ask where the "Real Rock Formation" and "Synthetic Cows" images are?

Comment

Dear Reviewer,

Thank you for your follow-up question. We are very glad that you are interested in these specific examples, as they are indeed very insightful for understanding the model's behavior.

You've asked a very important question about their location. Due to the official NeurIPS rebuttal policy, which we must strictly adhere to, we are prohibited from including any external links or embedding images in this response.

However, to be as transparent as possible within these constraints, we can provide the exact origins of the images we analyzed for your reference:

  • The "Real Rock Formation" (False Positive): This is a real photograph sourced from the LAION-Aesthetics V2 (6.5+) dataset, which was used as part of the real image pool for our test sets.

  • The "Synthetic Cows" (False Negative): This is a synthetic image from our newly constructed Synthetic-Pop dataset. It was specifically generated by the RealVisXL-4.0 model.

We are fully committed to providing these crucial visualizations to the community. We promise to feature these exact examples, along with the detailed Grad-CAM analysis we discussed, prominently in the appendix of our final paper.


Thank you again for your valuable engagement, which is helping us prepare a much stronger final version of the paper. We have worked diligently to address all your points.

If, from your perspective, any concerns remain that you feel prevent a more positive assessment of our work, we would be very grateful for the opportunity to clarify them further.

Comment

OK. Thank you for your responses and hard work. I don't have any more questions then.

Final Decision

This paper proposes FerretNet, a lightweight CNN for synthetic image detection that leverages Local Pixel Dependencies (LPD), computed as residuals of median-filtered images and motivated by Markov Random Fields. The main claim of the paper is that forgery detection should be considered a statistical anomaly detection problem whose solutions should generalize across unseen generative models. FerretNet runs at high frame rates and uses 1.1M parameters, compared with heavier competing approaches in the literature such as FatFormer.

Strengths: The method is simple. It leverages an MRF assumption wherein the center pixel value in a given neighborhood is conditionally dependent on its neighbors and can be imputed via the median of the values in the neighborhood. The residual between this prediction and the observed value is then fed as input to a lightweight deep network that uses depthwise separable and dilated convolutions. The experimental results demonstrate accuracy competitive with FatFormer while being lightweight and faster. Moreover, the generic nature of the algorithm enables the model to be trained on one fake dataset (ProGAN data) while testing on many generative paradigms (including various GANs, diffusion, and autoregressive models). Experimental methodology: comprehensive ablations, robustness tests (e.g., resizing, rotation), and interpretability analyses are provided. Judging from the detailed results in Tables 1, 2, and 3, the new algorithm is competitive with FatFormer but mainly excels on the Synthetic-Pop dataset that the authors constructed.

Weaknesses: a) Several reviewers noted that the Synthetic-Pop dataset is not clearly framed; while it differentiates current detectors, its long-term benchmarking value may be limited. b) Interpretability of results: while the authors clarified LPD "signatures," reviewers requested more intuitive visualizations at decision boundaries (false positives/negatives). c) The architectural contribution (the dual-path FerretNet) is careful engineering for LPD feature maps rather than a fundamentally new design (the authors acknowledge this yet still present it as a novelty). d) No error bars are presented in the tables, so the statistical significance of the claimed results is unclear.

Reviewer feedback, rebuttal and consensus: While a number of reviewers rated the paper highly, and there was a heated dialogue with requests for clarification of details from the remaining reviewer, the reviewers agree overall that the paper is worth accepting. As an area chair with extensive experience in statistical image analysis, I find that there is a lack of rigor in the theoretical justification for why this method works. I would have liked to see a more detailed discussion of the nature of the conditional probability distributions (in local neighborhoods) and of why median residuals would perform better than alternatives (e.g., residuals from other robust predictors such as a trimmed mean or mode).