PaperHub
Overall rating: 7.3/10 · Spotlight · 4 reviewers
Reviewer scores: 5, 5, 3, 5 (min 3, max 5, std 0.9)
Average confidence: 4.5
Novelty 3.3 · Quality 3.0 · Clarity 3.3 · Significance 3.3
NeurIPS 2025

Dual Data Alignment Makes AI-Generated Image Detector Easier Generalizable

OpenReview · PDF
Submitted: 2025-04-29 · Updated: 2025-10-29

Abstract

Keywords
AIGC Detection

Reviews and Discussion

Official Review

Rating: 5

This paper proposes Dual Data Alignment (DDA), a method to improve the generalization of AI-generated image detectors by aligning real and synthetic images in both pixel and frequency domains. Unlike prior work that only aligns pixel-level content, DDA also corrects frequency-level biases that detectors may exploit. The authors demonstrate that detectors trained on DDA-aligned data significantly outperform baselines across diverse benchmarks, including unseen generative models and real-world datasets. Two new datasets, DDA-COCO and EvalGEN, are also introduced for robust evaluation.

Strengths and Weaknesses

Strengths:

  1. Proposes a dual-domain alignment (pixel + frequency), addressing overlooked frequency-level biases in existing methods.
  2. Achieves strong generalization across 8 diverse benchmarks, outperforming prior state-of-the-art detectors.
  3. The DDA method is efficient to implement and significantly reduces training data generation time.

Weaknesses:

  1. Since using VAE reconstruction to build training datasets has already been proposed in [1], the main novelty of this paper, to my understanding, lies in two aspects: (i) Frequency-Level Alignment, which replaces high-frequency DCT coefficients of synthetic images with those from real images; and (ii) Pixel-Level Alignment, a mixup-style operation applied between real and synthetic images.
  2. Regarding the first point, I have two main concerns. First, it is unclear whether the authors apply the DCT transform at the original image resolution or in 8×8 blocks—this is not explicitly specified in the paper. Second, the authors appear to define the high-frequency region as a bottom-right rectangle in the DCT space. However, DCT coefficients increase in frequency in a zigzag pattern from the top-left to the bottom-right. As such, this rectangular selection might omit some high-frequency components, making the design potentially suboptimal.
  3. For the second point, while mixup is a well-established technique in image classification, it has not been applied in AIGI detection before, so I acknowledge this as a valid contribution. However, in Fig. 9(c), the ablation results show that even when $r_{\text{pixel}}=0$ or $r_{\text{pixel}}=1$, the detector still achieves high accuracy. This is counterintuitive—could the authors clarify why extreme values still perform well?
  4. The detector is fine-tuned using DINOv2 with an input size of 336×336. It would be helpful for the authors to include ablations on different commonly used backbones (e.g., CLIP, ResNet-50) and input sizes (e.g., 224×224), since most baselines are trained under these settings. Also, any justification for selecting DINOv2 as the backbone would be appreciated.
  5. The proposed method appears to be model-agnostic. Can the authors demonstrate whether DDA-aligned data could also enhance performance when used to train other existing detection methods?

In summary, this paper tackles a critical challenge in AIGI detection—robust generalization in real-world scenarios—and achieves state-of-the-art performance across multiple benchmarks. However, several technical and design questions remain to be addressed.

[1] Aligned Datasets Improve Detection of Latent Diffusion-Generated Images, ICLR'25

Questions

See the weakness part.

Limitations

See the weakness part.

Final Justification

I have read the author's response, and most of the concerns have been addressed. The only remaining concern is the issue raised in Q1 regarding High-Frequency Region Selection. Although the rectangular-0.5 setting shows performance similar to the zigzag method, this choice is not theoretically convincing, as it does not align with the high-frequency distribution of the DCT and may miss some compressed high-frequency components.

I hope the authors will provide a more in-depth discussion or revise the corresponding method in the revised version. Compared to the overall contribution of this paper, this minor concern does not overshadow its merits. I am willing to update my rating to accept after discussion. Good luck!

Formatting Concerns

N/A

Author Response

Thank you for acknowledging the strong generalization and efficiency of our proposed DDA. We address your remaining concerns below.


Q1: Regarding the first point, I have two main concerns. First, it is unclear whether the authors apply the DCT transform at the original image resolution or in 8×8 blocks—this is not explicitly specified in the paper. Second, the authors appear to define the high-frequency region as a bottom-right rectangle in the DCT space. However, DCT coefficients increase in frequency in a zigzag pattern from the top-left to the bottom-right. As such, this rectangular selection might omit some high-frequency components, making the design potentially suboptimal.

Thank you for pointing out this concern.

  • DCT Resolution: We clarify that the DCT transform in our method is applied in 8×8 blocks, consistent with the standard JPEG compression pipeline. This design reflects a practical consideration: real images in most datasets (e.g., GenImage, ForenSynth) are JPEG-compressed, while many synthetic images are stored in PNG format. Applying block-wise DCT allows us to more effectively capture and mitigate compression-related frequency biases.

  • High-Frequency Region Selection: Below we conduct additional experiments using a zigzag-pattern-based frequency selection. As shown in Table 1 below, DDA with $T_{\text{freq}} = 0.2$ (zigzag) achieves a comparable performance to $T_{\text{freq}} = 0.5$ (rectangular). This suggests that our method is robust to the precise frequency indexing scheme. We will clarify both the DCT resolution and the frequency selection strategy in the revised manuscript.

Table 1. Ablation study of DDA using zigzag-based vs. rectangular high-frequency region selection.

| T_freq | GenImage | DRCT-2M | DDA-COCO | EvalGEN | Synthbuster | Chameleon | SynthWildx | Avg |
|---|---|---|---|---|---|---|---|---|
| zigzag-0.1 | 96.4 | 96.7 | 93.1 | 93.2 | 93.8 | 72.5 | 82.2 | 89.7 |
| zigzag-0.2 | 96.4 | 98.5 | 94.8 | 94.1 | 93.6 | 73.8 | 83.1 | 90.6 |
| zigzag-0.3 | 94.0 | 95.8 | 94.9 | 94.1 | 92.4 | 71.9 | 82.5 | 89.4 |
| zigzag-0.4 | 92.3 | 93.3 | 94.9 | 94.6 | 92.6 | 72.5 | 80.8 | 88.7 |
| zigzag-0.5 | 91.2 | 92.4 | 97.3 | 95.1 | 90.6 | 69.1 | 79.2 | 87.8 |
| rectangular-0.5 | 95.5 | 97.4 | 94.3 | 94.0 | 94.6 | 74.3 | 84.0 | 90.6 |
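For concreteness, the block-wise operation described above can be sketched as follows. This is an illustrative NumPy/SciPy sketch, not the authors' implementation: the function names, the mask construction, and the reading of `t_freq` as the selected fraction per axis (rectangular) or along the diagonal (zigzag) are our assumptions.

```python
import numpy as np
from scipy.fft import dctn, idctn

def hf_mask(t_freq: float, scheme: str = "rectangular", n: int = 8) -> np.ndarray:
    """Boolean mask over an n x n DCT block marking 'high-frequency' coefficients.

    rectangular: a bottom-right square covering roughly a t_freq fraction per axis.
    zigzag: coefficients beyond a diagonal (row + col) cutoff, which follows the
            JPEG zigzag ordering of increasing frequency more closely.
    """
    idx = np.arange(n)
    if scheme == "rectangular":
        cut = int(round((1.0 - t_freq) * n))
        return (idx[:, None] >= cut) & (idx[None, :] >= cut)
    if scheme == "zigzag":
        cut = int(round((1.0 - t_freq) * 2 * (n - 1)))
        return (idx[:, None] + idx[None, :]) > cut
    raise ValueError(f"unknown scheme: {scheme}")

def align_block(real_block: np.ndarray, fake_block: np.ndarray,
                t_freq: float = 0.5, scheme: str = "rectangular") -> np.ndarray:
    """Replace the high-frequency DCT coefficients of a synthetic 8x8 block with
    those of the co-located real block, then transform back to pixel space."""
    d_real = dctn(real_block.astype(np.float64), norm="ortho")
    d_fake = dctn(fake_block.astype(np.float64), norm="ortho")
    mask = hf_mask(t_freq, scheme, n=real_block.shape[0])
    d_fake[mask] = d_real[mask]
    return idctn(d_fake, norm="ortho")
```

Applied independently to every 8×8 block, this reproduces the qualitative difference between the selection schemes compared in Table 1: the rectangular mask selects a corner of the block, while the zigzag-style mask selects everything beyond a diagonal cutoff.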

Q2: For the second point, while mixup is a well-established technique in image classification, it has not been applied in AIGI detection before, so I acknowledge this as a valid contribution. However, in Fig. 9(c), the ablation results show that even when $r_{\text{pixel}}=0$ or $r_{\text{pixel}}=1$, the detector still achieves high accuracy. This is counterintuitive—could the authors clarify why extreme values still perform well?

Thank you again for your careful reading and constructive feedback. We apologize for the confusion caused by the x-axis notation in Fig. 9 (c). The label should be $R_{\text{pixel}}$, consistent with the figure caption. Specifically:

  • $R_{\text{pixel}} = 0.0$ means no pixel-level mixup is applied (i.e., frequency alignment only).
  • $R_{\text{pixel}} = 1.0$ means the pixel-level mixup ratio $r_{\text{pixel}}$ is sampled from a uniform distribution $U[0, 1]$ during training (see Eq 3 of main paper).

The strong performance at $R_{\text{pixel}} = 0$ is not contradictory. It reflects the fact that frequency-domain alignment alone is already highly effective—achieving 91% accuracy in our ablation—since VAE reconstructions provide substantial low-level alignment. Similarly, the strong performance at $R_{\text{pixel}} = 1.0$ aligns with expectations. We will correct the axis label in Fig. 9 and add this clarification in the revision.
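As an illustration, a minimal sketch of the mixup-style pixel-level alignment as we read it from this response. The sampling rule $r_{\text{pixel}} \sim U[0, R_{\text{pixel}}]$, which makes $R_{\text{pixel}} = 0$ equivalent to no mixup and $R_{\text{pixel}} = 1$ equivalent to $r_{\text{pixel}} \sim U[0, 1]$, is one plausible reading and not necessarily the exact form of Eq. 3 in the paper; the function and variable names are ours.

```python
import numpy as np

def pixel_align(real: np.ndarray, fake: np.ndarray,
                R_pixel: float = 0.8, rng=None) -> np.ndarray:
    """Mixup-style blend of a real image and its frequency-aligned synthetic
    counterpart (both float arrays in [0, 1] with identical shapes).

    Assumed rule: r_pixel ~ U[0, R_pixel], so R_pixel = 0 returns the synthetic
    image unchanged (frequency alignment only) and R_pixel = 1 recovers U[0, 1].
    """
    rng = rng or np.random.default_rng()
    r_pixel = rng.uniform(0.0, R_pixel)
    return r_pixel * real + (1.0 - r_pixel) * fake
```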


Q3: The detector is fine-tuned using DINOv2 with an input size of 336×336. It would be helpful for the authors to include ablations on different commonly used backbones (e.g., CLIP, ResNet-50) and input sizes (e.g., 224×224), since most baselines are trained under these settings. Also, any justification for selecting DINOv2 as the backbone would be appreciated.

Thank you for your comment. We clarify that we have already provided ablation studies on both input sizes (Appendix Table 6) and backbone architectures (Appendix Table 7).

  • Input Sizes: Table 2 presents results for input resolutions of 224, 252, 280, 336, 392, 448, and 504. The results show that DDA remains consistently effective across all tested resolutions.

  • Backbones: Table 3 compares DINOv2 and CLIP ViT-B/16. We observe that DINOv2 outperforms CLIP, likely due to its stronger focus on low-level, pixel-sensitive features that are more effective for capturing DDA-aligned artifacts. In contrast, CLIP is optimized for high-level semantics. We also attempted to train DDA with ResNet-50, but it failed to converge—likely due to insufficient representational capacity for modeling subtle DDA-induced artifacts.

Table 2. Ablation study of DDA across different input sizes.

| Input Size | GenImage | DRCT-2M | DDA-COCO | EvalGEN | Synthbuster | Chameleon | SynthWildx | Avg |
|---|---|---|---|---|---|---|---|---|
| 224 | 94.9 | 96.7 | 95.9 | 97.2 | 88.9 | 71.9 | 80.3 | 89.4 ± 9.8 |
| 252 | 95.3 | 96.7 | 95.0 | 94.1 | 92.4 | 72.0 | 84.0 | 89.9 ± 8.9 |
| 280 | 95.7 | 96.2 | 95.6 | 95.4 | 91.9 | 70.1 | 84.6 | 89.9 ± 9.7 |
| 392 | 92.9 | 96.5 | 92.0 | 95.7 | 93.9 | 71.8 | 89.6 | 90.3 ± 8.5 |
| 448 | 93.4 | 97.2 | 90.7 | 89.5 | 95.8 | 65.7 | 89.9 | 88.9 ± 10.6 |
| 504 | 93.0 | 93.0 | 92.7 | 95.8 | 93.3 | 73.2 | 86.2 | 89.6 ± 7.8 |
| 336 | 95.5 | 97.4 | 94.3 | 94.0 | 94.6 | 74.3 | 84.0 | 90.6 ± 8.4 |

Table 3. Ablation study of DDA across different backbones.

| Method | Backbone | GenImage | DRCT-2M | DDA-COCO | EvalGEN | Synthbuster | Chameleon | SynthWildx | Avg |
|---|---|---|---|---|---|---|---|---|---|
| Fatformer | CLIP ViT-L/14 | 62.8 | 52.2 | 49.9 | 45.6 | 56.1 | 51.2 | 52.1 | 52.8 ± 5.4 |
| UnivFD | CLIP ViT-L/14 | 64.1 | 61.8 | 51.4 | 15.4 | 67.8 | 50.7 | 52.3 | 51.9 ± 17.5 |
| DRCT | CLIP ViT-L/14 | 84.7 | 90.5 | 62.3 | 77.7 | 84.8 | 56.6 | 55.1 | 73.1 ± 14.8 |
| C2P-CLIP | CLIP ViT-L/14 | 74.4 | 59.2 | 49.9 | 38.9 | 68.5 | 51.1 | 57.1 | 57.0 ± 11.9 |
| AIDE | CLIP ConvNeXt | 61.2 | 64.6 | 50.0 | 15.0 | 53.9 | 63.1 | 48.8 | 50.9 ± 17.1 |
| DDA | CLIP ViT-B/16 | 95.2 | 80.3 | 97.9 | 96.2 | 55.5 | 46.6 | 62.0 | 76.2 ± 21.4 |
| DDA | CLIP ViT-L/14 | 97.0 | 80.4 | 98.8 | 99.2 | 68.3 | 67.7 | 71.8 | 83.3 ± 14.7 |
| DDA | DINOv2 ViT-L/14 | 95.5 | 97.4 | 94.3 | 94.0 | 94.6 | 74.3 | 84.0 | 90.6 ± 8.4 |

Q4: The proposed method appears to be model-agnostic. Can the authors demonstrate whether DDA-aligned data could also enhance performance when used to train other existing detection methods?

Thanks for your insightful comment. To assess whether DDA-aligned data benefits existing detection models, we conducted additional experiments that isolate the effect of DDA. Specifically, we replaced the synthetic training images in baseline methods with DDA-aligned counterparts, while keeping all other components—including model architecture, training settings, and loss functions—unchanged. The results, summarized in Table 4, show consistent and significant improvements in accuracy, confirming that DDA-aligned data enhances generalization.

Table 4. Evaluation of baseline methods with and without DDA-aligned synthetic data.

| Method | GenImage | DRCT-2M | DDA-COCO | EvalGEN | Chameleon | WildRF | AVG |
|---|---|---|---|---|---|---|---|
| UnivFD | 64.1 | 61.8 | 51.4 | 15.4 | 50.7 | 55.3 | 49.8 |
| UnivFD + DDA | 92.4 (↑28.3) | 76.1 (↑14.3) | 78.2 (↑26.8) | 98.7 (↑83.3) | 65.6 (↑14.9) | 56.6 (↑1.3) | 77.9 (↑28.1) |
| Fatformer | 62.8 | 52.2 | 49.9 | 45.6 | 51.2 | 58.9 | 53.4 |
| Fatformer + DDA | 65.5 (↑2.7) | 58.9 (↑6.7) | 68.6 (↑18.7) | 77.0 (↑31.4) | 54.0 (↑2.8) | 51.3 (↓7.6) | 62.6 (↑9.2) |
| DRCT | 84.7 | 90.5 | 62.3 | 77.7 | 56.6 | 50.6 | 70.4 |
| DRCT + DDA | 91.7 (↑7.0) | 86.2 (↓4.3) | 77.3 (↑15.0) | 97.2 (↑19.5) | 68.0 (↑11.4) | 54.8 (↑4.2) | 79.2 (↑8.8) |
Comment

Accepted

Comment

Dear Reviewer,

Thank you for your continued support throughout the rebuttal phase! We deeply appreciate your confidence in our work and are grateful for your constructive feedback during the review process.

Best regards,

Authors

Comment

I have read the author's response, and most of the concerns have been addressed. The only remaining concern is the issue raised in Q1 regarding High-Frequency Region Selection. Although the rectangular-0.5 setting shows performance similar to the zigzag method, this choice is not theoretically convincing, as it does not align with the high-frequency distribution of the DCT and may miss some compressed high-frequency components.

I hope the authors will provide a more in-depth discussion or revise the corresponding method in the revised version. Compared to the overall contribution of this paper, this minor concern does not overshadow its merits. I am willing to update my rating to accept after discussion. Good luck!

Comment

Thank you so much for your positive feedback! It encourages us a lot.

We are glad our responses have addressed your concerns and appreciate your willingness to recommend acceptance after discussion! Your suggestion on high‑frequency region selection is valuable; we will refine the discussion and add further comparisons in the revision.

We sincerely thank you for your thoughtful comments and time, which have been essential in improving the quality of our work.

Official Review

Rating: 5

This paper proposes a new method for AI-generated Image (AIGI) detection. The paper hypothesizes that dataset bias, both in pixel-level semantics and in the frequency domain, is an important factor behind the insufficient generalizability of existing detection methods to unseen generative models. Therefore, this paper, for the first time, introduces a dual data alignment (DDA) pipeline to minimize the effect of dataset bias and create a real vs. AI-generated dataset from MS-COCO (referred to as DDA-COCO). By fine-tuning a DINOv2 backbone using LoRA on DDA-COCO, the proposed method achieves state-of-the-art performance in terms of balanced accuracy on several existing benchmarks as well as two new benchmarks introduced in this paper.

Strengths and Weaknesses

Strengths

  1. The paper is well-written, and it is easy to follow.
  2. The frequent use of clear visualizations and figures helps with understanding the concepts and the proposed novelties.
  3. The proposed method is properly motivated and is an easy-to-understand and elegant solution.
  4. The proposed method shows robust performance across several datasets, achieving state-of-the-art results.
  5. The experimental results are extensive and the proposed method's performance is compared against recently published papers in top-tier venues on recent benchmarks.

Weaknesses

  1. The main hypothesis of the paper (dataset bias is the problem and that the data alignment helps) is not verified in isolation. Although extensive experimental results show strong performance of the proposed method, the strong performance cannot be solely attributed to the use of the data alignment pipeline. There are several differences between the proposed method and existing approaches other than the use of DDA. For example, the original source of data used in this paper is different from those used in competitive methods. Additionally, the backbone, the fine-tuning strategy and the input size to the backbone are all different compared to existing methods. Given this, it is hard to be sure if DDA is the main reason behind the proposed method's strong performance.
  2. The paper does not report threshold-less metrics such as AP or AUROC which are commonly used in many published papers in this area. Threshold-less metrics are important measures of the separability between the representation of real vs AI-generated samples. Additionally, the difference in accuracy numbers can be attributed to the poor choice of the decision thresholds.
  3. The mechanism for choosing a decision threshold is not discussed in the paper.

Questions

Referring to the weaknesses section I have the following questions:

  1. How would existing methods' performance change if DDA was used to minimize the training dataset bias?
  2. How AP or AUROC of the proposed method compare with that of the existing methods?
  3. How thresholds for different methods are chosen in this study?

Based on the authors' response, I would be willing to increase my rating to 5 (accept).

Limitations

Yes

Final Justification

All of my concerns are addressed in the authors' rebuttal, and I feel more confident in accepting the paper. Therefore, I raise my initial rating to 5.

Formatting Concerns

No major formatting concerns were identified.

Author Response

We are grateful for your positive recognition of our novelty, extensive experiments, and writing! We address your remaining concerns below.


Q1: The main hypothesis of the paper (dataset bias is the problem and that the data alignment helps) is not verified in isolation. Although extensive experimental results show strong performance of the proposed method, the strong performance cannot be solely attributed to the use of the data alignment pipeline. There are several differences between the proposed method and existing approaches other than the use of DDA. For example, the original source of data used in this paper is different from those used in competitive methods. How would existing methods' performance change if DDA was used to minimize the training dataset bias?

Thank you for this thoughtful question.

We respectfully clarify that, in line with established evaluation practices, we use the official checkpoints released by the original authors for baseline methods—a standard protocol also followed in prior works such as FatFormer, C2P-CLIP, AIDE, and DRCT—ensuring consistency and fairness in comparison.

To directly address your concern, we conduct a controlled one-to-one comparison, where we adopt the same architecture, training strategy, and real image source as the competitive method, but replace its synthetic images of its training set with DDA-aligned images. This setup isolates the impact of DDA while keeping all other factors constant. Preliminary results show clear performance improvements.

Table 1. Controlled one-to-one comparison of existing methods with and without DDA-aligned training data. Each baseline method (UnivFD, FatFormer, DRCT) is retrained under identical settings, replacing only the original synthetic training data with DDA-aligned samples. DDA consistently improves performance across all datasets.

| Method | GenImage | DRCT-2M | DDA-COCO | EvalGEN | Chameleon | WildRF | AVG |
|---|---|---|---|---|---|---|---|
| UnivFD | 64.1 | 61.8 | 51.4 | 15.4 | 50.7 | 55.3 | 49.8 |
| UnivFD + DDA | 92.4 (↑28.3) | 76.1 (↑14.3) | 78.2 (↑26.8) | 98.7 (↑83.3) | 65.6 (↑14.9) | 56.6 (↑1.3) | 77.9 (↑28.1) |
| Fatformer | 62.8 | 52.2 | 49.9 | 45.6 | 51.2 | 58.9 | 53.4 |
| Fatformer + DDA | 65.5 (↑2.7) | 58.9 (↑6.7) | 68.6 (↑18.7) | 77.0 (↑31.4) | 54.0 (↑2.8) | 51.3 (↓7.6) | 62.6 (↑9.2) |
| DRCT | 84.7 | 90.5 | 62.3 | 77.7 | 56.6 | 50.6 | 70.4 |
| DRCT + DDA | 91.7 (↑7.0) | 86.2 (↓4.3) | 77.3 (↑15.0) | 97.2 (↑19.5) | 68.0 (↑11.4) | 54.8 (↑4.2) | 79.2 (↑8.8) |

Q2: The paper does not report threshold-less metrics such as AP or AUROC which are commonly used in many published papers in this area. Threshold-less metrics are important measures of the separability between the representation of real vs AI-generated samples.

Thank you for this thoughtful suggestion regarding threshold-independent evaluation metrics. We clarify that, in line with prior works such as C2P-CLIP, DRCT, AlignedForensics, and AIDE, our main paper reports balanced accuracy for comparability. Following your suggestion, we have additionally computed AP and AUROC scores for our method. Our method DDA achieves state-of-the-art performance, with average scores of 0.964 (AP) and 0.967 (AUROC)—outperforming all baselines by a non-trivial margin. These results confirm the superior performance of DDA.

Table 2. Overall Comparison of AP / AUROC. Bold numbers indicate the best score per row; values in parentheses denote the absolute improvement over the original method.

| Method | DRCT-2M | GenImage | Synthbuster | SynthWildx | WildRF | AIGCDetection Benchmark | ForenSynth | Chameleon | AVG | MIN |
|---|---|---|---|---|---|---|---|---|---|---|
| NPR (CVPR'24) | 0.403/0.271 | 0.501/0.440 | 0.509/0.515 | 0.529/0.533 | 0.742/0.702 | 0.464/0.372 | 0.450/0.338 | 0.517/0.551 | 0.514/0.465 | 0.403/0.271 |
| UnivFD (CVPR'23) | 0.857/0.864 | 0.825/0.838 | 0.792/0.797 | 0.521/0.463 | 0.624/0.541 | 0.868/0.879 | 0.918/0.921 | 0.477/0.554 | 0.735/0.732 | 0.477/0.463 |
| FatFormer (CVPR'24) | 0.478/0.386 | 0.715/0.684 | 0.580/0.560 | 0.572/0.584 | 0.759/0.707 | 0.920/0.907 | 0.981/0.975 | 0.614/0.608 | 0.702/0.676 | 0.478/0.386 |
| SAFE (KDD'25) | 0.577/0.554 | 0.539/0.554 | 0.542/0.527 | 0.496/0.491 | 0.707/0.621 | 0.520/0.524 | 0.542/0.545 | 0.506/0.571 | 0.554/0.548 | 0.496/0.491 |
| C2P-CLIP (AAAI'25) | 0.707/0.652 | 0.923/0.909 | 0.876/0.859 | 0.671/0.685 | 0.751/0.727 | 0.933/0.921 | 0.982/0.978 | 0.464/0.442 | 0.788/0.772 | 0.464/0.442 |
| AIDE (ICLR'25) | 0.702/0.705 | 0.755/0.767 | 0.499/0.448 | 0.466/0.438 | 0.714/0.647 | 0.792/0.806 | 0.768/0.740 | 0.430/0.454 | 0.641/0.626 | 0.430/0.438 |
| DRCT (ICML'24) | 0.961/0.965 | 0.939/0.949 | 0.901/0.903 | 0.576/0.598 | 0.595/0.534 | 0.907/0.917 | 0.890/0.898 | 0.663/0.719 | 0.804/0.810 | 0.576/0.534 |
| AlignedForensics (ICLR'25) | 0.998/0.998 | 0.930/0.947 | 0.796/0.805 | 0.870/0.849 | 0.905/0.854 | 0.807/0.798 | 0.670/0.650 | 0.835/0.854 | 0.851/0.844 | 0.670/0.650 |
| DDA (ours) | 0.998/0.998 | 0.990/0.991 | 0.992/0.993 | 0.972/0.971 | 0.982/0.981 | 0.989/0.990 | 0.969/0.972 | 0.824/0.841 | 0.965/0.967 | 0.824/0.841 |
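For reference, both threshold-free metrics and the fixed-threshold balanced accuracy used elsewhere in the paper can be computed from per-image scores in a few lines; below is a minimal scikit-learn sketch with toy labels and scores (variable names ours).

```python
import numpy as np
from sklearn.metrics import (average_precision_score, balanced_accuracy_score,
                             roc_auc_score)

# Toy example: y_true = 1 for synthetic, 0 for real;
# y_score = detector's predicted probability that an image is synthetic.
y_true  = np.array([0, 0, 0, 1, 1, 1])
y_score = np.array([0.10, 0.35, 0.55, 0.40, 0.80, 0.95])

ap    = average_precision_score(y_true, y_score)        # threshold-free
auroc = roc_auc_score(y_true, y_score)                  # threshold-free
bacc  = balanced_accuracy_score(y_true, y_score > 0.5)  # fixed 0.5 threshold

print(f"AP = {ap:.3f}, AUROC = {auroc:.3f}, balanced accuracy = {bacc:.3f}")
```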

Q3: The mechanism for choosing a decision threshold is not discussed in the paper.

Thank you for pointing this out. We clarify that our DDA-based binary classifier uses a fixed decision threshold of 0.5: samples with predicted logits greater than 0.5 are classified as synthetic, and those below as real. No threshold tuning or calibration is applied during evaluation.

Comment

I thank the authors for their rebuttal and addressing my concerns. I think it would be very helpful to add these new results and clarifications to the paper or the supplementary material, especially Table 1 in the rebuttal. All of my concerns are addressed, and I feel more confident in accepting the paper. Therefore, I raise my initial rating to 5.

Comment

Dear Reviewer,

Thank you for your thoughtful feedback and for raising the rating! We greatly appreciate your suggestions and will be sure to incorporate these updates into the revised manuscript.

Best regards,

Authors

Comment

Dear Reviewer,

Thank you again for your valuable efforts and constructive advice in reviewing our paper. As the discussion period nears its end, we look forward to your feedback on our responses. We have made every effort to address all your concerns and are happy to clarify any points or discuss any remaining questions.

Best regards,

Authors

Official Review

Rating: 3

This study identifies that single reconstruction alone does not suffice to achieve comprehensive alignment between real and synthetic image pairs. To address this limitation, the authors propose a novel approach termed Dual Data Alignment (DDA), which aligns synthetic images with their real counterparts in both pixel and frequency domains, thereby reducing bias in AIGI detectors. Furthermore, two new AIGI datasets are presented to expand testing scenarios across diverse domains. Extensive evaluations conducted on eight benchmark datasets validate the effectiveness of the proposed methodology.

Strengths and Weaknesses

Strengths:

  1. The writing of this manuscript is generally clear and easy to follow.
  2. The topic of detecting AIGC (AI-Generated Content) images is both timely and interesting.
  3. The authors have conducted extensive experiments, which demonstrate the effectiveness and relevance of the proposed method.
  4. The proposed method is novel in that it simultaneously considers both pixel-level and frequency-domain alignment.

Weaknesses:

  1. The authors claim that existing datasets suffer from biases in format, content, and size. However, these biases can often be mitigated through data augmentation or by expanding the dataset, without the need for complex reconstruction-based approaches. For instance, JPEG compression and cropping augmentation could address format and size biases effectively.
  2. The primary objective of reconstruction-based methods is typically to uncover the intrinsic differences between real and fake images, rather than to align real and fake data distributions.
  3. It is unclear whether the proposed method and the baselines in Table 3/4/5/6 were trained under the same conditions, particularly with respect to the training dataset. Clarification on this point would be helpful for a fair comparison.
  4. Since the proposed method uses DINOv2 as the backbone, whereas some baselines rely on CLIP or ResNet, it would be important to discuss how the choice of backbone affects the results. This would help ensure a fair and meaningful comparison.
  5. The proposed method has not been evaluated on the ForenSynths dataset (CNNSpot CVPR 2020), a commonly used benchmark for detecting CNN-generated images. This limits the completeness of the experimental validation.
  6. The generalization capability of the proposed DDA method to GAN-generated images is not discussed. It would be beneficial to evaluate its performance on such images to better understand its applicability across different types of generative models.

Questions

  1. Please refer to weakness.
  2. The authors should clarify the fairness of the experimental comparisons. Specifically, in Table 3, it is unclear whether the proposed method and the baseline methods were trained on the same datasets and under comparable settings. This information is critical for a fair evaluation.
  3. The backbone used in the proposed method is DINOv2, while some baselines adopt different backbones such as CLIP or ResNet. The authors should discuss how the choice of backbone influences performance and whether it contributes significantly to the observed improvements.

Limitations

The paper should provide a more comprehensive discussion on the motivations behind image reconstruction as well as address issues related to experimental fairness, ensuring greater transparency and equity in the evaluation process.

Final Justification

I would like to thank the authors for their rebuttal. In their response, some of my doubts were addressed. However, I still have concerns about fair comparisons. The authors claim to have used the official checkpoints of the baselines, but different training data were used for different baselines in the table, which makes the comparison unfair. Overall, the dataset proposed by the authors is interesting, but the experimental comparisons are concerning. I keep my score.

Formatting Concerns

NPR was presented at CVPR 2024.

Author Response

We appreciate your positive comments on our novelty, extensive experiments, and writing! We address your remaining concerns below.


Q1: The biases in format, content, and size can be mitigated through data augmentation or by expanding the dataset, without the need for reconstruction.

Thank you for raising this important concern.

Content and size bias: Reconstruction-based methods are able to generate aligned synthetic counterparts, preserving content while altering only generation-specific characteristics. In contrast, dataset expansion cannot guarantee precise semantic alignment in every detail (e.g., object types, textures, and layouts), leaving content bias unaddressed.

On the complexity of reconstruction-based approaches: We respectfully disagree that reconstruction-based methods are overly complex. With modern frameworks (e.g., diffusers), VAE reconstruction is accessible and computationally lightweight. Table 10 of our main paper shows DDA is more efficient than many existing baselines in terms of generation time.
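As an illustration of how lightweight this step is, a single VAE round trip can be written in a few lines with the diffusers library. This is a sketch under assumptions: the checkpoint name (stabilityai/sd-vae-ft-mse) and the preprocessing are ours and may differ from the paper's exact setup.

```python
import numpy as np
import torch
from diffusers import AutoencoderKL
from PIL import Image

# Assumed checkpoint; the VAE actually used in the paper may differ.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()

@torch.no_grad()
def vae_reconstruct(img: Image.Image) -> Image.Image:
    """Encode a real image into the VAE latent space and decode it back,
    yielding a content-aligned synthetic counterpart.
    (Image sides should typically be multiples of 8 for the SD VAE.)"""
    x = torch.from_numpy(np.array(img.convert("RGB"))).float()
    x = x.permute(2, 0, 1).unsqueeze(0) / 127.5 - 1.0        # [1, 3, H, W] in [-1, 1]
    latents = vae.encode(x).latent_dist.mode()               # deterministic latent
    recon = vae.decode(latents).sample.clamp(-1.0, 1.0)
    recon = ((recon.squeeze(0).permute(1, 2, 0) + 1.0) * 127.5).to(torch.uint8)
    return Image.fromarray(recon.numpy())
```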

Format bias: Due to the asymmetric encoding in real vs. synthetic training images (JPEG-compressed real vs. PNG synthetic), JPEG augmentation can result in double-compressed real images versus single-compressed synthetic images. Consequently, models may learn to associate stronger compression artifacts with authenticity. We empirically substantiate this in Tables 1 and 2:

  • Table 1 shows that VAE reconstruction + JPEG augmentation exhibits a significant drop (↓22.0) when tested on JPEG-format synthetic images, indicating format bias. In contrast, DDA maintains stable performance (↑3.0).

  • Table 2 evaluates the frequency-based detector SAFE. Even with JPEG augmentation, SAFE suffers a significant drop (↓21.0) in accuracy on JPEG-format images. This highlights that augmentation-only methods fail to completely eliminate format bias.

Table 1. Evaluation of JPEG compression augmentation for mitigating format bias. VAE reconstruction with JPEG compression augmentation (VAE Rec. + JPEG Aug) versus VAE reconstruction with our proposed Dual Data Alignment (VAE Rec. + DDA). We report accuracies on detecting PNG-format and JPEG-format synthetic images on GenImage.

| Method | Format | Midjourney | SD14 | SD15 | ADM | GLIDE | Wukong | VQDM | BigGAN | AVG ± STD |
|---|---|---|---|---|---|---|---|---|---|---|
| VAE Rec. + JPEG Aug | PNG | 86.5 | 100.0 | 99.8 | 86.0 | 86.5 | 99.9 | 91.3 | 68.9 | 89.9 ± 10.6 |
| VAE Rec. + JPEG Aug | JPG | 92.2 | 98.9 | 98.9 | 45.2 | 67.3 | 99.2 | 40.2 | 1.3 | 67.9 ± 36.3 (↓22.0) |
| VAE Rec. + DDA | PNG | 93.5 | 99.7 | 99.5 | 86.0 | 84.2 | 99.5 | 89.5 | 93.6 | 93.2 ± 6.2 |
| VAE Rec. + DDA | JPG | 94.3 | 99.9 | 99.6 | 93.6 | 91.0 | 99.7 | 94.1 | 97.1 | 96.2 ± 3.4 (↑3.0) |

Table 2. Evaluation of format bias mitigation for SAFE.

| Method | Format | Midjourney | SD14 | SD15 | ADM | GLIDE | Wukong | VQDM | BigGAN | AVG ± STD |
|---|---|---|---|---|---|---|---|---|---|---|
| SAFE | PNG | 91.2 | 99.5 | 99.4 | 64.7 | 93.3 | 97.2 | 93.3 | 96.6 | 91.9 ± 11.4 |
| SAFE | JPG | 0.5 | 1.7 | 2.0 | 1.5 | 8.2 | 3.0 | 2.7 | 4.6 | 3.0 ± 2.4 (↓88.9) |
| SAFE + JPEG Aug | PNG | 90.3 | 96.8 | 96.3 | 62.2 | 91.9 | 89.7 | 73.1 | 89.2 | 86.2 ± 12.1 |
| SAFE + JPEG Aug | JPG | 60.7 | 61.5 | 61.0 | 80.6 | 83.1 | 63.3 | 76.2 | 35.1 | 65.2 ± 15.3 (↓21.0) |
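To make the asymmetric-encoding argument above concrete, the two augmentation paths can be sketched as follows; the file names and quality setting are illustrative, not the authors' pipeline.

```python
from io import BytesIO
from PIL import Image

def reencode_jpeg(img: Image.Image, quality: int = 95) -> Image.Image:
    """Re-encode an image as JPEG in memory and reload it."""
    buf = BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).copy()

# Typical dataset layout: real photos stored as JPEG, synthetic images as PNG.
real = Image.open("real_photo.jpg")    # already JPEG-compressed once at capture or storage
fake = Image.open("synthetic.png")     # lossless

# Naive JPEG augmentation applied to both classes:
real_aug = reencode_jpeg(real)         # compressed twice overall
fake_aug = reencode_jpeg(fake)         # compressed once overall
# A detector can therefore still associate compression strength with the "real" label,
# which is the format bias that the frequency-level alignment is designed to remove.
```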

Q2: Unclear whether the proposed method and the baselines in Table 3/4/5/6 were trained under the same conditions.

Thank you for raising this important concern.

Clarification on training conditions: We respectfully clarify that for the comparisons in Tables 3–6, we follow established practices by using the official checkpoints released by the original authors for all baseline methods. This evaluation protocol is also adopted in prior work such as AIDE and DRCT.

  • Fair comparison: We acknowledge that DDA may benefit from certain training setups in Table 3. To provide a more comprehensive and balanced view, we include an extended evaluation in Appendix Table 1 (see Table 3 below). DDA achieves SoTA on 9 out of 10 datasets.

  • One-to-one comparisons: In Table 4 we conduct additional experiments where all training variables are held constant, and the only change is substituting the synthetic training data with DDA-aligned counterparts. These controlled results consistently show that DDA significantly enhances generalization performance, isolating the impact of our alignment strategy.

Table 3: Comprehensive evaluation of DDA against state-of-the-art detectors on 10 benchmark datasets comprising 561k images from 12 GANs, 52 diffusion models, and 2 autoregressive models, including 3 in-the-wild datasets.

| Method | GenImage | DRCT-2M | DDA-COCO | EvalGEN | Synthbuster | ForenSynth | AIGCDetection Benchmark | Chameleon | Synthwildx | WildRF | Avg | Min |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NPR (CVPR'24) | 51.5 | 37.3 | 28.1 | 59.2 | 50.0 | 47.9 | 53.1 | 59.9 | 49.8 | 63.5 | 50.0 ± 10.7 | 28.1 |
| UnivFD (CVPR'23) | 64.1 | 61.8 | 3.6 | 15.4 | 67.8 | 77.7 | 72.5 | 50.7 | 52.3 | 55.3 | 52.1 ± 24.2 | 3.6 |
| FatFormer (CVPR'24) | 62.8 | 52.2 | 3.3 | 45.6 | 56.1 | 90.1 | 85.0 | 51.2 | 52.1 | 58.9 | 55.7 ± 23.6 | 3.3 |
| SAFE (KDD'25) | 50.3 | 59.3 | 0.5 | 1.1 | 46.5 | 49.7 | 50.3 | 59.2 | 49.1 | 57.2 | 42.3 ± 22.3 | 0.5 |
| C2P-CLIP (AAAI'25) | 74.4 | 59.2 | 2.0 | 38.9 | 68.5 | 92.1 | 81.4 | 51.1 | 57.1 | 59.6 | 58.4 ± 25.0 | 2.0 |
| AIDE (ICLR'25) | 61.2 | 64.6 | 1.2 | 15.0 | 53.9 | 59.4 | 63.6 | 63.1 | 48.8 | 58.4 | 48.9 ± 22.3 | 1.2 |
| DRCT (ICML'24) | 84.7 | 90.5 | 30.4 | 77.7 | 84.8 | 73.9 | 81.4 | 56.6 | 55.1 | 50.6 | 68.6 ± 19.4 | 30.4 |
| AlignedForensics (ICLR'25) | 79.0 | 95.5 | 86.6 | 77.0 | 77.4 | 53.9 | 66.6 | 71.0 | 78.8 | 80.1 | 76.6 ± 11.2 | 53.9 |
| DDA (ours) | 95.5 | 97.4 | 94.3 | 94.0 | 94.6 | 85.5 | 93.3 | 74.3 | 84.0 | 95.1 | 90.8 ± 7.3 | 74.3 |

Table 4. One-to-One comparisons.

| Method | GenImage | DRCT-2M | DDA-COCO | EvalGEN | Chameleon | WildRF | AVG |
|---|---|---|---|---|---|---|---|
| UnivFD | 64.1 | 61.8 | 51.4 | 15.4 | 50.7 | 55.3 | 49.8 |
| UnivFD + DDA | 92.4 (↑28.3) | 76.1 (↑14.3) | 78.2 (↑26.8) | 98.7 (↑83.3) | 65.6 (↑14.9) | 56.6 (↑1.3) | 77.9 (↑28.1) |
| Fatformer | 62.8 | 52.2 | 49.9 | 45.6 | 51.2 | 58.9 | 53.4 |
| Fatformer + DDA | 65.5 (↑2.7) | 58.9 (↑6.7) | 68.6 (↑18.7) | 77.0 (↑31.4) | 54.0 (↑2.8) | 51.3 (↓7.6) | 62.6 (↑9.2) |
| DRCT | 84.7 | 90.5 | 62.3 | 77.7 | 56.6 | 50.6 | 70.4 |
| DRCT + DDA | 91.7 (↑7.0) | 86.2 (↓4.3) | 77.3 (↑15.0) | 97.2 (↑19.5) | 68.0 (↑11.4) | 54.8 (↑4.2) | 79.2 (↑8.8) |

Q3: The impact of backbone.

Thank you for this question. We respectfully point out that we have already conducted ablation studies on backbones in Appendix Table 7. Below, we provide a simplified version of the results. While DDA performs best with DINOv2, it still significantly outperforms all baseline methods when using CLIP.

Table 5. Ablation study on backbone.

| Method | Backbone | GenImage | DRCT-2M | DDA-COCO | EvalGEN | Synthbuster | Chameleon | SynthWildx | Avg |
|---|---|---|---|---|---|---|---|---|---|
| Fatformer | CLIP ViT-L/14 | 62.8 | 52.2 | 49.85 | 45.6 | 56.1 | 51.2 | 52.1 | 52.8 ± 5.4 |
| UnivFD | CLIP ViT-L/14 | 64.1 | 61.8 | 51.4 | 15.4 | 67.8 | 50.7 | 52.3 | 51.9 ± 17.5 |
| DRCT | CLIP ViT-L/14 | 84.7 | 90.5 | 62.3 | 77.7 | 84.8 | 56.6 | 55.1 | 73.1 ± 14.8 |
| C2P-CLIP | CLIP ViT-L/14 | 74.4 | 59.2 | 49.9 | 38.9 | 68.5 | 51.1 | 57.1 | 57.0 ± 11.9 |
| AIDE | CLIP-ConvNeXt | 61.2 | 64.6 | 50.0 | 15.0 | 53.9 | 63.1 | 48.8 | 50.9 ± 17.1 |
| DDA | CLIP ViT-L/14 | 97.0 | 80.4 | 98.8 | 99.2 | 68.3 | 67.7 | 71.8 | 83.3 ± 14.7 |
| DDA | DINOv2 ViT-L/14 | 95.5 | 97.4 | 94.3 | 94.0 | 94.6 | 74.3 | 84.0 | 90.6 ± 8.4 |

Q4: The proposed method has not been evaluated on ForenSynths.

Thank you for raising this important concern. We respectfully clarify that DDA has already been evaluated on ForenSynth in Appendix Table 3.

Comment

Dear Reviewer,

Thank you for your thoughtful feedback throughout the review process. As the discussion period comes to a close, we would like to inquire if there are any remaining concerns and would be happy to provide further clarification on any points.

Best regards,

Authors

Official Review

Rating: 5

This paper introduces Dual Data Alignment (DDA), a method to improve the generalizability of AI-generated image (AIGI) detectors by addressing dataset biases. Existing detectors struggle with new data because they often overfit on non-causal attributes like image format or size. DDA aligns synthetic and real images in both pixel and frequency domains, a crucial improvement over pixel-level alignment alone, which still leaves frequency-level discrepancies. The method involves VAE reconstruction, high-frequency fusion, and pixel mixup. DDA demonstrates significant performance improvements across diverse benchmarks, including new datasets like DDA-COCO and EvalGEN, highlighting its ability to create more unbiased and robust AIGI detectors

Strengths and Weaknesses

Strengths: Quality based on robust experimental validation - The paper presents extensive evaluations across eight diverse datasets, including "in-the-wild" benchmarks, which significantly strengthens its claims of improved generalizability. The consistent outperformance of DDA over state-of-the-art methods (e.g., +12.4% on GenImage, +9.8% on Synthbuster, +17.7% on EvalGEN) is a strong indicator of its effectiveness. The inclusion of new, challenging test sets like DDA-COCO and EvalGEN is particularly valuable for rigorously assessing detector performance against new generative architectures and aligned data. The robustness analysis under various post-processing methods (JPEG compression, resizing, blurring) further validates DDA's practical utility, showing its ability to maintain high performance even when images are altered. This is critical for real-world deployment where images are frequently compressed or modified. The paper tackles the fundamental issue of dataset bias in AIGI detection, a well-recognized challenge that hinders the real-world applicability of detectors. By identifying and addressing both pixel-level and, crucially, frequency-level misalignment, the authors pinpoint a subtle yet significant source of bias that previous reconstruction methods missed.

Clarity - The paper clearly articulates the "frequency-level misalignment" problem with existing reconstruction methods, using Figure 3 and Figure 4 to visually and empirically demonstrate the issue. This strong motivation for DDA's dual-domain alignment is a major plus. The paper addresses very clear methodology, Figure 5 provides a clear visual pipeline of the proposed method.

Significance - The primary significance lies in the demonstrated improvement in generalizability for AIGI detectors. This is a crucial step towards more reliable and deployable fake image detection systems, addressing a pressing societal concern regarding misinformation and fraud.

Weaknesses: Quality - The paper provides limited explanation of the $T_{\text{freq}}$ and $R_{\text{pixel}}$ parameter selection and needs more information.

Clarity - "Theory Assumptions and Proofs" Claim: The paper states "Yes" for providing full assumptions and complete/correct proofs in Section 3.2. However, Section 3.2 primarily describes the methodology and motivation, not formal theorems, lemmas, or rigorous mathematical proofs in the traditional sense. This checklist answer might be misleading and could be interpreted as a claim of theoretical contribution that isn't fully supported by the content of Section 3.2.

Questions

  1. Clarify and Substantiate the "Universal Upsampling Artifact" Claim. Question: The paper states: "We hypothesize that this artifact arises during the VAE-based decoding process". While interesting, this remains a hypothesis. What specific characteristics define this "universal upsampling artifact"? Can the authors provide more empirical evidence or theoretical reasoning to support its "universality" across diverse generative models beyond the VAE decoding stage?

  2. Address Performance on Heavily Post-Processed Images (Chameleon Dataset). Question: The paper acknowledges that "Our method performs relatively lower on FLUX... and struggles to detect images with strong post-processing artifacts, as shown in the results on the Chameleon dataset". Given that real-world scenarios often involve aggressive post-processing, how do the authors envision mitigating this limitation? Is DDA inherently sensitive to certain types of artifacts, or are there planned extensions to improve robustness in these challenging cases?

  3. Re-evaluate "Theory Assumptions and Proofs" Claim for Clarity. Question: The NeurIPS checklist response states "Yes" for "For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?" with a justification referring to Section 3.2. However, Section 3.2 primarily describes the methodology and provides intuitive explanations, rather than formal theoretical results, theorems, or proofs. This might be a misunderstanding of what constitutes a "theoretical result" in this context.

Limitations

Yes

Final Justification

Accepted.

Formatting Concerns

NA

Author Response

We sincerely thank you for your valuable time and comments. We are encouraged by your positive comments on the significance and robust experimental validation of our work! We address your remaining concerns as follows.


Q1: The paper provides limited explanation of the $T_{\text{freq}}$ and $R_{\text{pixel}}$ parameter selection and needs more information.

Thank you for the helpful suggestion. We clarify that $T_{\text{freq}}$ and $R_{\text{pixel}}$ control the degree of alignment in the frequency and pixel domains, respectively. Increasing their values enhances alignment strength, which can improve sensitivity to subtle generative artifacts. However, excessively strong alignment may shift the decision boundary too close to real images, potentially reducing true-positive accuracy. To balance this trade-off, we empirically set $T_{\text{freq}} = 0.5$ and $R_{\text{pixel}} = 0.8$, based on comprehensive validation in Figure 9 of the main paper.


Q2: Clarify and Substantiate the "Universal Upsampling Artifact" Claim. Question: The paper states: "We hypothesize that this artifact arises during the VAE-based decoding process". While interesting, this remains a hypothesis. What specific characteristics define this "universal upsampling artifact"? Can the authors provide more empirical evidence or theoretical reasoning to support its "universality" across diverse generative models beyond the VAE decoding stage?

Thank you for this insightful question.

• What the universal artifact is: We believe that the universal upsampling artifact stems from deterministic local correlations introduced by fixed, low-rank upsampling operations (e.g., bilinear interpolation, transposed convolution). These components, widely used in the decoders of VAEs, GANs, and diffusion models, project low-dimensional latents to high-resolution outputs. However, due to their limited representational capacity, they cannot fully capture the complexity of natural image statistics. As a result, generated images often exhibit reduced local rank and unnatural pixel dependencies—properties rarely seen in real images. These artifacts are therefore architectural in origin, rather than model-specific. Similar observations have been made in prior works such as NPR [1] and SPSL [2]. A toy numerical illustration of this effect follows this list.

• Empirical evidence for universality beyond VAE decoding: To substantiate universality, we highlight the results in Appendix Table 1. Our DDA detector, trained only on aligned VAE-reconstructed images, achieves SoTA on 9 out of 10 benchmarks. This strong generalization suggests that the universality is not confined to a specific generation mechanism.
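As a toy numerical illustration of the first point above (our own sketch, not an experiment from the paper): a fixed upsampling operator applied to a random low-resolution input produces far less variation inside each 2×2 output block than unconstrained high-resolution content, i.e., the kind of deterministic local correlation that NPR [1] exploits.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
latent = torch.randn(1, 1, 32, 32)                      # stand-in for a low-resolution latent
up = F.interpolate(latent, scale_factor=2, mode="bilinear", align_corners=False)

def within_block_variation(x: torch.Tensor) -> float:
    """Mean absolute deviation of each pixel from the top-left pixel of its 2x2 block."""
    blocks = x.unfold(2, 2, 2).unfold(3, 2, 2)          # [1, 1, H/2, W/2, 2, 2]
    return (blocks - blocks[..., :1, :1]).abs().mean().item()

iid = torch.randn_like(up)                              # unconstrained high-frequency content
print(within_block_variation(up), within_block_variation(iid))
# The upsampled signal shows much smaller within-block variation than the i.i.d. one,
# reflecting the local correlations imposed by the fixed upsampling kernel.
```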

Table 1: Comprehensive evaluation of DDA against state-of-the-art detectors on 10 benchmark datasets totaling 561k images generated by 12 GANs, 52 diffusion models, and 2 autoregressive models, including 3 in-the-wild datasets. Generator types are indicated in parentheses (G = GAN, D = Diffusion, AR = Auto-Regressive). All detectors are evaluated using official checkpoints. To mitigate format bias, JPEG compression (quality 96) is applied to GenImage, ForenSynth, and AIGCDetectionBenchmark. "DDA (ours) + JPEG Aug" denotes training with additional random JPEG compression augmentation.

| Method | GenImage | DRCT-2M | DDA-COCO | EvalGEN | Synthbuster | ForenSynth | AIGCDetection Benchmark | Chameleon | Synthwildx | WildRF | Avg | Min |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (Generator types) | 1G + 7D | 16D | 6D | 5D + 2AR | 9D | 11G | 7G + 9D | Unknown | 3D | Unknown | | |
| NPR (CVPR'24) | 51.5 | 37.3 | 28.1 | 59.2 | 50.0 | 47.9 | 53.1 | 59.9 | 49.8 | 63.5 | 50.0 ± 10.7 | 28.1 |
| UnivFD (CVPR'23) | 64.1 | 61.8 | 3.6 | 15.4 | 67.8 | 77.7 | 72.5 | 50.7 | 52.3 | 55.3 | 52.1 ± 24.2 | 3.6 |
| FatFormer (CVPR'24) | 62.8 | 52.2 | 3.3 | 45.6 | 56.1 | 90.1 | 85.0 | 51.2 | 52.1 | 58.9 | 55.7 ± 23.6 | 3.3 |
| SAFE (KDD'25) | 50.3 | 59.3 | 0.5 | 1.1 | 46.5 | 49.7 | 50.3 | 59.2 | 49.1 | 57.2 | 42.3 ± 22.3 | 0.5 |
| C2P-CLIP (AAAI'25) | 74.4 | 59.2 | 2.0 | 38.9 | 68.5 | 92.1 | 81.4 | 51.1 | 57.1 | 59.6 | 58.4 ± 25.0 | 2.0 |
| AIDE (ICLR'25) | 61.2 | 64.6 | 1.2 | 15.0 | 53.9 | 59.4 | 63.6 | 63.1 | 48.8 | 58.4 | 48.9 ± 22.3 | 1.2 |
| DRCT (ICML'24) | 84.7 | 90.5 | 30.4 | 77.7 | 84.8 | 73.9 | 81.4 | 56.6 | 55.1 | 50.6 | 68.6 ± 19.4 | 30.4 |
| AlignedForensics (ICLR'25) | 79.0 | 95.5 | 86.6 | 77.0 | 77.4 | 53.9 | 66.6 | 71.0 | 78.8 | 80.1 | 76.6 ± 11.2 | 53.9 |
| DDA (ours) | 95.5 | 97.4 | 94.3 | 94.0 | 94.6 | 85.5 | 93.3 | 74.3 | 84.0 | 95.1 | 90.8 ± 7.3 | 74.3 |
| DDA (ours) + JPEG Aug | 94.3 | 97.9 | 92.8 | 98.3 | 88.8 | 83.2 | 89.6 | 81.7 | 88.0 | 94.9 | 91.0 ± 5.7 | 81.7 |

Q3: Address Performance on Heavily Post-Processed Images (Chameleon Dataset). Question: The paper acknowledges that "Our method performs relatively lower on FLUX... and struggles to detect images with strong post-processing artifacts, as shown in the results on the Chameleon dataset". Given that real-world scenarios often involve aggressive post-processing, how do the authors envision mitigating this limitation? Is DDA inherently sensitive to certain types of artifacts, or are there planned extensions to improve robustness in these challenging cases?

Thank you for highlighting this important concern. We address this limitation through two directions: (1) enhancing DDA's robustness to heavy post-processing, and (2) integrating DDA with vision-language models (VLMs) to incorporate semantic-level signals.

(1) Enhance DDA's robustness: While DDA’s performance declines on challenging datasets like Chameleon, it still outperforms all existing methods. Moreover, as shown in the Table 1 (referenced in our response to Q2), applying stronger data augmentations (e.g., random JPEG compression) could effectively improve robustness, enabling DDA to further achieve 81% balanced accuracy on Chameleon. To our knowledge, this marks the first detector to exceed 80% accuracy.

(2) Integration with VLM: To further enhance resilience to aggressive edits, we plan to incorporate semantic-level cues that persist through low-level corruption—such as implausible object configurations (e.g., “a person with three hands”) or physically impossible scenes. These high-level inconsistencies complement DDA’s pixel- and frequency-level modeling. In preliminary experiments, prompting Qwen2.5-VL-32B as “RealismNet, a multimodal expert who determines whether an image could be photographed in the real world without digital manipulation” allowed the model to consistently flag semantically implausible content, suggesting strong potential as a complementary detection signal.

Our future work will explore a hybrid detection framework, where a vision-language model helps localize reliable regions in the image that are less affected by post-processing. DDA can then focus on those regions to detect subtle generation artifacts. This synergy between semantic robustness and low-level generalization offers a promising path toward robust, real-world AI-generated image detection.


Q4: "Re-evaluate "Theory Assumptions and Proofs" Claim for Clarity.

Thank you for pointing this out. We acknowledge the misunderstanding regarding the checklist item on "Theory Assumptions and Proofs." Section 3.2 provides methodological intuition rather than formal theorems or proofs. We will revise our response to “[N/A]” to more accurately reflect the content of the paper.


[1] Rethinking the Up-Sampling Operations in CNN-based Generative Network for Generalizable Deepfake Detection, CVPR 2024.

[2] Spatial-Phase Shallow Learning: Rethinking Face Forgery Detection in Frequency Domain, CVPR 2021.

Comment

Dear Reviewer,

Thank you for your encouraging recognition of our work. We truly appreciate the time, effort, and thoughtful advice you have provided throughout the review process. As the discussion period comes to an end, we look forward to your reflections on our responses and are happy to further clarify any points.

Best regards,

Authors

Comment

Dear Authors and Reviewers,

As you know, the deadline for author-reviewer discussions has been extended to August 8. If you haven’t done so already, please ensure there are sufficient discussions for both the submission and the rebuttal.

Reviewers, please make sure you complete the mandatory acknowledgment AND respond to the authors’ rebuttal, as requested in the email from the program chairs.

Authors, if you feel that any results need to be discussed and clarified, please notify the reviewer. Be concise about the issue you want to discuss.

Your AC

Final Decision

The recommendation is based on the reviewers' comments, the area chair's evaluation, and the author-reviewer discussion.

This paper proposes a dual data alignment (DDA) approach for both pixel and frequency domains. The resulting synthetically generated images are shown to be effective in improving the detection performance of various AI-generated image classifiers. All reviewers find the studied setting novel and the results provide new insights. The major concern of the initial version was on the fair evaluation versus baselines, given that the baseline detectors were trained on different datasets. The authors’ rebuttal has successfully addressed the major concerns of reviewers, by providing a controlled experiment (one-to-one comparison) on baseline models with and without DDA.

In the post-rebuttal phase, most reviewers were satisfied with the authors’ responses and agreed on the decision of acceptance. Overall, I recommend acceptance of this submission. I also expect the authors to include the new results and suggested changes during the rebuttal phase in the final version.

Also, given that the proposed method is quite generic and improves different AIGI detectors at large with notable gains, I recommend it for a spotlight presentation.