PaperHub
Score: 8.3/10 · Oral · 4 reviewers
Ratings: 5, 4, 4, 4 (lowest 4, highest 5, std dev 0.4)
ICML 2025

Orthogonal Subspace Decomposition for Generalizable AI-Generated Image Detection

Submitted: 2025-01-24 · Updated: 2025-07-24
TL;DR

We introduce a novel approach via orthogonal subspace decomposition for generalizable AI-generated image detection.

Abstract

Keywords
AI-Generated Image Detection · Face Forgery Detection · Deepfake Detection · Media Forensics

Reviews and Discussion

Review (Rating: 5)

In this paper, the authors explain the lack of generalization in deepfake detection from the perspective of SVD. Specifically, they argue that due to the limited nature of fake features, models trained on deepfake datasets tend to produce low-rank feature matrices, which leads to a failure in capturing key components necessary for distinguishing real images and unseen samples. To address this issue, they propose freezing most of the parameters in a pre-trained model, preserving its ability to detect real images, while only adjusting a subset of parameters to adapt to forgery patterns. This parameter decomposition is achieved through singular value decomposition (SVD).
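The decomposition the review describes can be sketched in a few lines. This is a hypothetical illustration with numpy; the function name and the exact split are assumptions, not the paper's implementation:

```python
import numpy as np

def split_weight_by_svd(weight, keep_rank):
    """Split a pre-trained weight into a 'principal' part spanned by the top
    `keep_rank` singular directions (kept frozen to preserve pre-trained
    knowledge) and a low-rank 'residual' part (the only part fine-tuned)."""
    U, S, Vt = np.linalg.svd(weight, full_matrices=False)
    principal = (U[:, :keep_rank] * S[:keep_rank]) @ Vt[:keep_rank, :]
    residual = (U[:, keep_rank:] * S[keep_rank:]) @ Vt[keep_rank:, :]
    # By construction, principal + residual reconstructs the original weight,
    # and the two parts live in mutually orthogonal subspaces.
    return principal, residual
```

During fine-tuning, the effective weight would be `principal + residual`, with gradients flowing only through the residual part; with `n - r = 1`, the residual spans a single singular direction.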

Questions for Authors

Please see weaknesses

Claims and Evidence

The proof that explains the loss of generalization via the rank obtained from SVD is complete.

Methods and Evaluation Criteria

Yes, the authors adopted standard datasets and evaluation metrics commonly used in deepfake detection.

Theoretical Claims

The reviewer examined the authors’ explanation of the “asymmetry phenomenon” and found it convincing.

Experimental Design and Analysis

[1] The experimental setting of cross-dataset and cross-method evaluation is feasible, and the chosen baselines are appropriate.

[2] The ablation study examines the impact of individual components, the effects of switching the backbone, and comparisons with other PEFT methods, making the experimental design relatively comprehensive.

Supplementary Material

The reviewer has checked the expansion of the proof and the supplementation of the experimental section, and finds them convincing.

Relation to Existing Literature

Extensive research has been conducted on the generalization issue in deepfake detection, with numerous studies addressing its existence and potential causes—including the distribution differences between real and fake features, as discussed in this paper. However, this paper provides a novel perspective by further analyzing the problem through the lens of SVD. In particular, the idea can be useful in addressing challenges such as analyzing overfitting problems, examining the expressivity of feature spaces, and preserving pre-trained knowledge while adapting to downstream tasks.

Missing Essential References

N/A

Other Strengths and Weaknesses

The reviewer does not have major concerns, some minor weaknesses are listed below:

[1] Many details in the analysis are missing. For instance, which testing datasets are used to generate the visualizations in Figure 5 and Figure 6?

[2] The implementation and experimental details of GenImage are missing. It is not specified whether the authors used the same settings as those in the UniversalFakeDetect benchmark.

[3] In Figure 6, the number of main principal components is heavily influenced by the testing dataset. If the dataset changes, the values may differ. The authors should provide additional visualizations for different testing datasets. For instance, face deepfakes might be less diverse than general AI-generated images, leading to inconsistent results.

[4] The average results in Table 4 don’t align with those mentioned in the text.

[5] How is the retained rank r determined? Would adjusting r for different datasets lead to better or different results? The authors also have not provided a clear explanation for why n-r=1 achieves the best results.

[6] The font size in Figure 6 is too small, particularly in the legend.

Other Comments or Suggestions

The reviewer has no further comments on the paper. However, the reviewer would like to offer one additional suggestion on writing: the concept of the "asymmetry phenomenon in AI-generated image detection" seems like a rebranding of the generalization issue and comes across as somewhat overly elaborate. This does not affect the reviewer's assessment of the paper's quality. However, it would be ideal if this section could be condensed a bit.

Author Response

We sincerely thank Reviewer SVkD for the constructive comments, insightful questions, and useful suggestions. We address the reviewer's concerns below.

Q1. The authors should provide additional visualizations for different testing datasets. For instance, face deepfakes might be less diverse than general AI-generated images, leading to inconsistent results.

R1. Thanks for your valuable suggestion.

  • Indeed, the number of principal components (effective rank) varies across datasets because our method measures the "effective dimension" of the feature space, which depends on the input data distribution.
  • To validate this, we analyzed the effective rank on multiple test datasets, including CDF-v2 (face-focused) and UniversalFakeDetect (general-content). Key observations from Table 5 are:
    • General datasets exhibit a higher effective rank than face-specific datasets. This is because general datasets contain diverse objects, leading to a more complex feature space.
    • Our SVD-based method outperforms LoRA and full fine-tuning (FFT) in capturing a higher-rank feature space, preserving more discriminative information.

Table 5: Evaluation results on different testing datasets.

| Tuning Method | Train Dataset | Test Dataset | Effective Rank | Mean Accuracy (%) |
| --- | --- | --- | --- | --- |
| SVD (ours) | FF++ (face) | CDF-v2 (face) | 159 | 95.60 |
| LoRA | FF++ (face) | CDF-v2 (face) | 137 | 89.40 |
| FFT | FF++ (face) | CDF-v2 (face) | 57 | 85.70 |
| SVD (ours) | UniversalFakeDetect (general) | UniversalFakeDetect (general) | 316 | 95.19 |
| LoRA | UniversalFakeDetect (general) | UniversalFakeDetect (general) | 304 | 93.03 |
| FFT | UniversalFakeDetect (general) | UniversalFakeDetect (general) | 238 | 86.22 |
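As a sketch of how such an effective rank of a feature space might be measured via PCA (the 0.99 explained-variance cutoff and the function name are assumptions; the paper's exact procedure may differ):

```python
import numpy as np

def effective_rank_pca(features, var_threshold=0.99):
    # features: (num_samples, dim) array extracted from the detector backbone.
    # Counts the principal components needed to explain `var_threshold`
    # of the total variance; the 0.99 cutoff is an assumed default.
    centered = features - features.mean(axis=0, keepdims=True)
    s = np.linalg.svd(centered, compute_uv=False)
    var_ratio = s**2 / np.sum(s**2)      # explained variance per component
    cum = np.cumsum(var_ratio)
    return int(np.searchsorted(cum, var_threshold) + 1)
```

A feature matrix dominated by a handful of directions yields a small effective rank, matching the paper's claim that naive fine-tuning collapses the feature space.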

Q2. The authors have not provided a clear explanation for why n-r=1 achieves the best results.

R2. Thanks for the interesting questions. Our choice of using a lower rank (specifically, n-r=1) for fine-tuning DFD is primarily motivated by two critical factors:

  • First, the nature of the real-fake classification task itself makes it relatively straightforward. Specifically, fake samples in existing training sets tend to exhibit a limited number of distinctive forgery patterns (FF++ contains only four forgery types), each with relatively simple and consistent characteristics.
    • Due to this simplicity and limited diversity, a low-rank adaptation with a small rank (e.g., "n-r"=1, 4, or 16) is sufficient for the model to effectively learn these forgery patterns. As demonstrated by Table 5 in our paper, choosing ranks of 1, 4, or 16 yields very similar performance results. Given this observation, we prioritize efficiency and parameter economy, making rank 1 the optimal choice.
  • Second, the inherent characteristics of binary classification further justify selecting a smaller rank. Binary classification tasks typically do not require the model to learn extensive and nuanced patterns, but rather to identify just enough distinctive features to separate the two classes, making the learned feature space inherently constrained.
    • To illustrate, when distinguishing between cats and dogs, the classifier might achieve good accuracy simply by examining the tails, without needing detailed knowledge about specific breeds such as Corgis or Shibas. Thus, binary classification inherently simplifies the complexity of the learning problem, meaning that employing a higher rank would not provide significant additional benefit.
    • In contrast, more complex tasks like multi-class classification (as discussed in our extended framework proposed in our response R1 of the Reviewer cn1X) necessitate a higher rank (the value of "n-r"). Indeed, our experiments indicate that increasing the value of "n-r" in multi-class scenarios leads to improved performance, highlighting the relationship between task complexity and optimal rank selection.

Table 6: Optimal rank selection based on the task complexity. We evaluate all the models on the Chameleon dataset.

| Setting | n-r=1 | n-r=16 | n-r=32 | n-r=64 |
| --- | --- | --- | --- | --- |
| Binary | 70.27 | 69.34 | 69.08 | 68.87 |
| Multi-task | 72.62 | 72.89 | 73.76 | 74.18 |

Q3. Many details in the analysis are missing, such as Figure 5 and Figure 6.

R3. Thanks for the kind mention. All analysis figures, including Figure 5 and Figure 6, are tested on FF_c23 (test), which is commonly used as the within-domain testing dataset. We will clarify it in the revision.


Q4. The implementation and experimental details of GenImage are missing.

R4. Thanks for your kind mention. We will add the implementation details and clarify them clearly in the revision.

Reviewer Comment

Thanks for the response. The authors have addressed the additional experiments and the setting of n-r thoroughly, which are key concerns of mine. Therefore, I am considering increasing my score. I also recommend that the authors carefully revise the camera-ready version in accordance with the clarifications provided in their rebuttal.

Review (Rating: 4)

This paper proposes Effort, a novel SVD-based adapter tuning method, for generalizable AI-generated image detection. The key idea is constructing two orthogonal subspaces, where the principal components preserve the pre-trained knowledge from the vision foundation models while the residual components are utilized to learn new domain-specific forgery patterns for detection. The paper also leverages PCA to quantify the effective rank of the learned feature space, explaining the failure of conventional detectors.

Questions for Authors

See the weaknesses

Claims and Evidence

The authors have provided appropriate references, clear visualizations, and thorough analysis, making their work technically sound and well-justified.

Methods and Evaluation Criteria

The used evaluation methods and criteria are suitable and common in the related fields. The used benchmarks are commonly used in existing works.

Theoretical Claims

Although the authors provide a theoretical explanation of the asymmetric phenomena, I believe this theorem does not contribute significantly to the overall paper. Instead, I suggest moving this content to the supplementary material and replacing it with Algorithm 1, which is much more helpful for understanding the core concepts of the paper.

Experimental Design and Analysis

This paper conducts thorough evaluations on both deepfake and AIGC detection benchmarks, and most analysis experiments are insightful and carefully designed. However, I noticed in Table 6 that lower values of r yield better generalization results. This observation seems counterintuitive, and I believe the authors should provide a clear and reasonable explanation for this phenomenon.

Supplementary Material

I don’t find any obvious issue in the supplementary.

Relation to Existing Literature

I believe the proposed method in this paper is quite general and may not be limited strictly to AIGC detection. I encourage the authors to provide a detailed discussion on how their approach could be applied to other related fields, which would further highlight its versatility and broader impact.

Missing Essential References

Most related and essential works are properly cited in this paper.

Other Strengths and Weaknesses

Strengths:

  • This paper provides an in-depth and reasonable analysis for the failure reason of existing detectors. The proposed analysis approach is insightful and new to me.

  • This paper proposes a novel SVD-based method that can explicitly ensure the orthogonality between pretrained knowledge and deepfake-specific knowledge.

  • This paper conducts comprehensive evaluations on both deepfake detection and AIGC detection benchmarks, achieving high generalization performance over existing methods.

Weaknesses:

  • The two proposed constrained losses seem not strictly necessary, as the improvement shown in the ablation study is limited. Instead, why not impose a constraint on the effective rank, since rank is a central concept in the paper?

  • The paper does not provide an intuitive visualization of the residual and principal components using PCA. It would provide some new insights and findings.

  • Some important implementation details of fine-tuning and evaluations are missing in the paper. For instance, how many layers are fine-tuned? What kinds of data augmentations are used in this paper?

Other Comments or Suggestions

Overall, I think the quality of the paper is high, with insightful analysis and new effective method proposed. My comments and suggestions are summarized below.

  • Moving Algorithm 1 to the main body of the paper would greatly enhance understanding, as it provides a clear and practical insight into the key methodology.

  • The observation in Table 6 that lower-rank SVD achieves better generalization results is intriguing. The authors should provide a detailed explanation for this phenomenon to clarify its implications.

  • The analysis of Figure 6 is particularly interesting and insightful. I encourage the authors to include more visualizations of existing detectors, as this could serve as a critical analytical tool for the entire field. It effectively reveals the "discrimination dimension" in deepfake detection, which is a valuable contribution.

  • To provide a more intuitive understanding, I suggest the authors include visualizations of the principal and residual components. For example, what specific regions or features do the residual components focus on? This would help readers better grasp the underlying mechanisms of the proposed approach.

  • Add more implementation details to the main paper or supplementary material, as they help readers better understand the technical details of the method.

Author Response

We sincerely thank Reviewer j2Rb for the constructive comments, insightful questions, and useful suggestions. We greatly appreciate and are encouraged by the reviewer's recognition of our insightful and in-depth analysis, methodological novelty, and comprehensive experiments with high generalization performance. Additionally, the reviewer raised several important concerns and questions, which we address in detail below.

Q1. The paper does not provide an intuitive visualization of residual and principal components using PCA. It would provide some new insights and findings.

R1. We genuinely appreciate the suggestion. We perform the visualization of the attention maps before and after our proposed orthogonal training on the UniversalFakeDetect dataset. Specifically, for each block of the vision transformer, we visualize the attention map, which represents the attention coefficient matrix calculated between the cls token and the patch tokens.

  • Specifically, the attention map is computed across the multiple heads on average and is presented alongside the principal weights, residual weights, and total weights, respectively.
  • We have three discoveries in general:
      1. The attention map of principal weights is almost identical to the attention map of the total weights for each block;
      1. Before and after training, the attention maps of the residual weights in the earlier blocks do not respond (e.g., for ViT-L of CLIP model, the first 22 blocks do not respond);
      1. Only the attention maps in the last blocks of the residual weights respond, which means they contain the real/fake discriminating information (e.g., for ViT-L of CLIP model, the last 2 blocks respond).

Following the reviewer's suggestion, we will add these visualization results and the corresponding analysis into our revision.
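The head-averaged cls-to-patch attention map described above could be computed roughly as follows; the tensor layout and the assumption that token 0 is the cls token are ours, not taken from the paper:

```python
import numpy as np

def cls_to_patch_attention(attn):
    # attn: attention weights of one ViT block, assumed shape
    # (num_heads, seq_len, seq_len), with token 0 as the cls token.
    # Returns the cls-to-patch attention row averaged across heads,
    # matching the head-averaged maps described in the response.
    cls_row = attn[:, 0, 1:]      # cls token attending to each patch token
    return cls_row.mean(axis=0)   # average over heads
```

Applying this per block to the principal weights, residual weights, and total weights would reproduce the three-way comparison the authors describe.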


Q2. Some important implementation details of fine-tuning and evaluations are missing in the paper. For instance, how many layers are fine-tuned? What kinds of data augmentations are used in this paper?

R2. Thanks for the suggestion. We list the implementation details below.

  • Fine-tuned layers: By default, we fine-tune all query (Q), key (K), value (V), and output (Out) components across every layer of the given architecture, such as CLIP.
  • Data augmentation strategy: We adopt standard data augmentation techniques commonly used in each specific benchmark. For instance, we follow [1, 2] for deepfake detection, [3] for the UniversalFakeDetect benchmark, and [4] for the GenImage benchmark.

Following the reviewer's suggestion, we will present a detailed description in the revision.

[1] LSDA. CVPR 2024. [2] DeepfakeBench. NeurIPS 2023. [3] UnivFD. CVPR 2023. [4] NPR. CVPR 2024.


Q3. The observation in Table 6 that lower-rank SVD achieves better generalization results is intriguing. The authors should provide a detailed explanation for this.

R3. We genuinely appreciate the suggestion. Our choice of using a lower rank (specifically, "n-r"=1) for fine-tuning DFD is primarily motivated by two critical factors:

  • First, the nature of the real-fake classification task itself makes it relatively straightforward. Specifically, fake samples in existing training sets tend to exhibit a limited number of distinctive forgery patterns (FF++ contains only four forgery types), each with relatively simple and consistent characteristics.
    • Due to this simplicity and limited diversity, a low-rank adaptation with a small rank (e.g., "n-r"=1, 4, or 16) is sufficient for the model to effectively learn these forgery patterns. As demonstrated by Table 5 in our paper, choosing ranks of 1, 4, or 16 yields very similar performance results. Given this observation, we prioritize efficiency and parameter economy, making rank 1 the optimal choice.
  • Second, the inherent characteristics of binary classification further justify selecting a smaller rank. Binary classification tasks typically do not require the model to learn extensive and nuanced patterns, but rather to identify just enough distinctive features to separate the two classes, making the learned feature space inherently constrained. Thus, binary classification inherently simplifies the complexity of the learning problem, meaning that employing a higher rank would not provide significant additional benefit.
    • In contrast, more complex tasks like multi-class classification (as discussed in our response R1 of the Reviewer cn1X) necessitate a higher rank (value of "n-r"). Indeed, our experimental results indicate that increasing the value of "n-r" in multi-class scenarios leads to improved performance, highlighting the relationship between task complexity and optimal rank selection.

Q4. The improvement of the two proposed constrained losses shown in the ablation study is limited.

R4. Thanks for your comment. Please refer to our response R4 to Reviewer 4w2k.

Reviewer Comment

Thanks for the authors' detailed response. My concerns are all well addressed. I maintain my score and recommend accepting this paper.

Review (Rating: 4)

This paper investigates the failure of generalization in AI-generated image detection, identifying an asymmetry phenomenon where detectors overfit to limited fake patterns, resulting in a low-rank and constrained feature space. To mitigate this, the authors leverage the vision foundation models and propose a novel SVD-based tuning approach that freezes the principal components while adapting the remaining components, explicitly ensuring orthogonality to maintain a higher feature rank, thereby alleviating the overfitting problem.

Questions for Authors

No.

Claims and Evidence

The claims in this paper are well-supported by evidence and detailed explanations. Specifically, Figure 1 validates the existence of the asymmetry phenomenon and shortcut overfitting in AIGI detection. Figure 2 illustrates that the constrained feature space is predominantly influenced by fake data. Figures 3 and 5 further confirm that a baseline model trained naively on the AIGI dataset tends to be highly low-ranked, whereas the proposed approach effectively preserves most of the pre-trained knowledge.

方法与评估标准

The evaluation criteria and chosen benchmarks/datasets are appropriate for the problem. The paper demonstrates the effectiveness of the proposed method, achieving SOTA performance on both deepfake and AIGC detection benchmarks (Tables 1, 2, and 3). Also, the evaluation protocols are consistent with those used in many existing studies, ensuring a fair comparison.

Theoretical Claims

The proposed theoretical claims are presented in Theorem 3.2, where the authors define the covariance similarity between real and fake images and prove that this similarity has a lower bound under certain assumptions. I did not identify any obvious errors or issues in the validity of the proof.

Experimental Design and Analysis

Overall, the experimental designs are suitable. However, I have identified the following issues: (1) The authors did not conduct an ablation study on the weights of the orthogonality loss and singular value loss, so it is unclear how the weights are allocated in the loss function; (2) The authors primarily use CNNs as baselines. Why not consider ViT-based baselines, such as ViT models trained on ImageNet-1K? (3) The main tables (Tables 1, 2, and 3) lack bold or underlined results, which affects readability.

Supplementary Material

I have reviewed all contents in the supplementary. I have identified the issues below: (1) I believe Algorithm 1 is critical for the reader to understand the workflow of the proposed approach. I highly recommend that the authors put this into their main paper, not just as supplementary. (2) For results in the GenImage benchmark, the authors don’t provide the implementation details of how they implement and obtain these results.

Relation to Existing Literature

This paper introduces the concept of effective rank to quantify the overfitting problem in AIGI detection, which I find interesting and relevant to key concepts in other fields. Notably, effective rank is widely used in continual learning [1] and signal processing [2], where it mainly helps assess how much knowledge a model retains for given tasks. In this paper, the authors cleverly use this concept to compute the “dimensionality” of the feature space, which makes sense to me.

[1] Loss of plasticity in deep continual learning. Nature (2024).

[2] The Effective Rank: A Measure of Effective Dimensionality. European Association for Signal Processing (2007).
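For reference, the entropy-based effective rank defined in [2] can be sketched in a few lines; this is the standard formula from that reference, not code from the paper under review:

```python
import numpy as np

def entropy_effective_rank(matrix):
    # Effective rank of Roy & Vetterli (2007): the exponential of the
    # Shannon entropy of the normalized singular-value distribution.
    s = np.linalg.svd(matrix, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]                  # drop exact zeros before taking the log
    return float(np.exp(-np.sum(p * np.log(p))))
```

It equals the usual rank when all nonzero singular values are equal, and degrades smoothly as the spectrum becomes concentrated, which is why it suits measuring how "collapsed" a feature space is.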

Missing Essential References

I hope the authors can cite references [1] and [2] and include a brief discussion on their relevance.

Other Strengths and Weaknesses

I have listed the strengths and weaknesses in my comments above, so no additional comments here.

Other Comments or Suggestions

  •   The authors did not conduct an ablation study on the weights of the orthogonality loss and singular value loss. How are these weights allocated in the loss function?
    
  •   The baselines primarily consist of CNN models. Why were ViT-based baselines, such as ViT models trained on ImageNet-1K, not considered?
    
  •   How can the authors implement their method on the GenImage benchmark? It seems the authors don't provide any implementation details for that.
    
  •   Also, can the detection method be applied to detect more complex and advanced fakes, such as the talking-head generation contents?
    
  •   Furthermore, can this method be applied to broader fields beyond deepfake and AIGC detection? How does it perform in areas such as domain generalization and anomaly detection? I hope the authors can further clarify the broader applicability of their model.
    

I suggest and hope the authors can address my above concerns one-by-one carefully.

Author Response

We sincerely thank Reviewer 4w2k for the constructive comments, insightful questions, and useful suggestions. We greatly appreciate and are encouraged by the reviewer's recognition of our motivation with sufficient and reasonable evidence, methodological novelty, and interesting analysis method. Additionally, the reviewer raised several important concerns and questions, which we address in detail below.

Q1. What about the performance of ViT models trained on ImageNet-1K and other ViT-based VFMs?

R1. Thank you for raising this. Following the reviewer's suggestion, we conduct additional experiments comparing ViT models pre-trained on ImageNet-1K and SigLIP using different adaptation methods. Results are summarized clearly in Table 2 below:

Table 2: Results of other ViT-based baseline models.

| Pre-trained Model | Tuning Method | Mean Accuracy (%) |
| --- | --- | --- |
| SigLIP | SVD (ours) | 90.46 |
| SigLIP | LoRA | 83.42 |
| SigLIP | FFT | 81.23 |
| CLIP | SVD (ours) | 95.19 |
| CLIP | LoRA | 93.03 |
| CLIP | FFT | 86.22 |
| DINOv2 | SVD (ours) | 85.46 |
| DINOv2 | LoRA | 83.42 |
| DINOv2 | FFT | 79.52 |
| ViT-ImageNet-1K | SVD (ours) | 72.77 |
| ViT-ImageNet-1K | LoRA | 70.47 |
| ViT-ImageNet-1K | FFT | 68.41 |

From these results, we observe that our SVD-based adaptation method consistently achieves the best performance over the LoRA-based adaptation and full fine-tuning (FFT) across different pre-training models.


Q2. How can the authors implement their method on the GenImage benchmark?

R2. We sincerely appreciate the kind comment. Our experiments primarily follow the implementation details and experimental settings described in recent SOTAs [1, 2]. This alignment ensures consistency and transparency, making our results directly comparable with existing studies.

[1] NPR. CVPR 2024. [2] FatFormer. CVPR 2024


Q3. Also, can the detection method be applied to detect more complex and advanced fakes, such as the talking-head generation contents?

R3. Thank you for your question. Following your suggestion, we have extended our evaluation to include highly realistic talking-head deepfake content, i.e., HeyGen, selected from the DF40 dataset. To ensure a fair comparison, we evaluate four SOTA detection methods under identical experimental conditions.

The results, summarized clearly in Table 3 below, demonstrate that our method achieves superior performance compared to the latest SOTA detectors, even on advanced commercial talking-head manipulations.

Table 3: Evaluation results on the advanced talking-head generation fake contents.

| LSDA (CVPR'24) | ProDet (NIPS'24) | FSFM (CVPR'25) | UDD (AAAI'25) | Ours |
| --- | --- | --- | --- | --- |
| 46.7 | 41.0 | 70.8 | 75.4 | 79.7 |

From the results above, we find that some detectors trained on face-swapping data fail to generalize to talking-head content. However, most previous works are conducted using only face-swapping deepfake data. Inspired by the reviewer's comment, we plan to add this additional evaluation to the revision and further enlarge our evaluation in the future.


Q4. The authors did not conduct an ablation study on the weights of the orthogonality loss and singular value loss. How are these weights allocated in the loss function?

R4. Thanks for the comment. In our experiments, we set the weights for both the orthogonality loss and singular value loss terms to 1.0 by default. To further explore the influence of these loss terms, we adjust each hyper-parameter across a wide range.

As shown in Table 4 below, varying these parameters results in only slight fluctuations in performance, indicating the stability of our method. Importantly, the inclusion of these loss terms consistently improves performance compared to models without them.

Table 4: Ablation studies regarding different weights of the loss terms.

| Orthogonality Loss Weight | Singular Value Loss Weight | AUC on SimSwap (%) |
| --- | --- | --- |
| 0.0 (no loss, SVD only) | 0.0 (no loss, SVD only) | 94.0 |
| 1.0 (default) | 1.0 (default) | 95.6 |
| 0.5 | 1.0 | 95.5 |
| 1.0 | 0.5 | 95.0 |
| 0.5 | 0.5 | 94.5 |
| 2.0 | 2.0 | 95.1 |
| 0.1 | 0.1 | 94.4 |
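A minimal sketch of how the weighted objective might be assembled; the function names, the cross-Gram form of the orthogonality penalty, and treating the singular-value term as a given scalar are all our assumptions, not the paper's definitions:

```python
import numpy as np

def orthogonality_loss(U_principal, U_residual):
    # Penalize overlap between the frozen principal subspace and the
    # trainable residual subspace via the squared Frobenius norm of
    # their cross-Gram matrix (zero iff the subspaces are orthogonal).
    cross = U_principal.T @ U_residual
    return float(np.sum(cross ** 2))

def total_loss(task_loss, orth_loss, sv_loss, w_orth=1.0, w_sv=1.0):
    # Default weights of 1.0 for both auxiliary terms match the
    # setting stated in the rebuttal.
    return task_loss + w_orth * orth_loss + w_sv * sv_loss
```

The ablation in Table 4 corresponds to sweeping `w_orth` and `w_sv` around these defaults.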

Q5. I hope the authors can cite references [1] and [2] and include a brief discussion on their relevance.

R5. Thanks for the valuable suggestion. Both [1] and [2] introduce the concept of effective rank to quantitatively measure the dimensionality of the feature space, which is similar to our case. We will provide a detailed discussion in our revision. Thank you again.

Reviewer Comment

I appreciate the authors' responses to my questions. The answers satisfactorily addressed all of my concerns. Therefore, I will maintain my initial rating.

Review (Rating: 4)

The paper proposes a novel approach for detecting AI-generated images (AIGI), particularly deepfake and synthetic images. It highlights that existing detectors suffer from poor generalization ability when encountering unseen forgery methods, primarily due to overfitting to forgery patterns in the training set, resulting in a constrained and low-rank feature space.

To address this issue, the authors introduce a singular value decomposition (SVD)-based method that decomposes the feature space into two orthogonal subspaces: one for retaining pre-trained knowledge and the other for learning forgery-related patterns. This approach demonstrates superior generalization performance across multiple benchmark evaluations.

Update after rebuttal: I thank the authors for carefully preparing the responses. My concerns have been addressed during the rebuttal, and I would therefore be willing to raise my score.

Questions for Authors

The paper treats all forgery methods as a single category during training. Has the consideration of the specificity and generalization of different forgery methods been explored? Could this approach result in the loss of unique characteristics for different forgery types, potentially degrading detection performance? For instance, compared to methods like MoE, could this lead to a lack of specialized knowledge?

In practical applications, there are often many adversarial scenarios and a variety of robustness tests. How well does the method perform under these conditions?

Claims and Evidence

The paper’s primary claim is that orthogonal subspace decomposition enhances the generalization ability of AI-generated image (AIGI) detection. This claim is well-supported by experimental results, particularly in cross-dataset and cross-forgery evaluations, where the proposed method significantly outperforms state-of-the-art approaches.

The results demonstrate that the method effectively preserves pre-trained knowledge while simultaneously learning forgery patterns, leading to improved generalization performance in synthetic image detection.

Methods and Evaluation Criteria

The proposed method is based on SVD decomposition, where the principal components are frozen, and the remaining components are adjusted to preserve pre-trained knowledge while learning forgery patterns. This approach is theoretically sound and has been empirically validated through experiments.

The evaluation metrics include cross-dataset and cross-forgery tests, using AUC and accuracy as key indicators. These evaluation criteria are reasonable and align with existing research in AIGI detection.

Theoretical Claims

The paper presents a theoretical analysis explaining the asymmetry phenomenon in AIGC detection and demonstrates, through covariance spectrum analysis, the inevitable failure of symmetric classifiers in this task. The theoretical analysis is well-grounded and aligns with the experimental results.

Experimental Design and Analysis

The experimental design is well-structured, covering multiple datasets and forgery methods, effectively validating the generalization ability of the proposed approach.

The authors also conducted ablation studies, confirming the effectiveness of the SVD-based method and loss constraints. Experimental results indicate that the SVD method is the primary driver of performance improvement, while orthogonality constraints and singular value constraints further optimize generalization performance.

Supplementary Material

The paper provides supplementary materials, including additional experimental results and ablation studies, further validating the effectiveness and robustness of the proposed method.

The supplementary materials also include a detailed description of the algorithm and theoretical proofs, enhancing the transparency and rigor of the study.

Relation to Existing Literature

The paper is closely related to the existing literature on AIGI detection, particularly studies addressing generalization challenges. The authors cite a wide range of relevant works and highlight the limitations of existing methods.

Building upon prior research, the proposed approach introduces an innovative solution by leveraging orthogonal subspace decomposition to enhance generalization performance.

Missing Essential References

"A Sanity Check for AI-Generated Image Detection" is a high-quality dataset that could serve as an additional benchmark for evaluating the proposed method. Testing on this dataset would further validate the generalization ability of the approach and provide a more comprehensive comparison with existing methods.

其他优缺点

Strengths: The paper proposes an innovative approach that addresses the generalization problem in AIGI detection through orthogonal subspace decomposition. The experimental design is comprehensive, covering multiple datasets and forgery methods, effectively validating the method's effectiveness. The theoretical analysis is insightful, providing an explanation for the asymmetry phenomenon in AIGI detection.

Weaknesses: The paper treats all forgery methods as a single category during training, which may overlook the specificities and commonalities of different forgery techniques. The proposed method may face additional challenges in real-world applications, such as shifts in data distribution and the emergence of new forgery techniques.

其他意见或建议

The overall quality of the paper is high, with thorough experimental and theoretical analysis. It is recommended that the authors include ablation studies on backbones and robustness testing in the experimental section, as well as add results from more high-quality datasets to further strengthen the findings.

伦理审查问题

NA.

作者回复

We sincerely thank Reviewer cn1X for the constructive comments, insightful questions, and useful suggestions. We greatly appreciate and are encouraged by the reviewer's recognition of our motivation, thorough experimental and theoretical analysis, methodological novelty, and superior experimental performance. Additionally, the reviewer raised several important concerns and questions, which we address in detail below.

Q1. The paper treats all forgery methods as a single category during training. Has the consideration of the specificity and generalization of different forgery methods been explored?

R1. Thanks for your very insightful question. As the reviewer highlights, treating all forgery methods as a single category in binary classification may risk losing specificity and generalization, a concern we have not yet verified.

However, please note that this limitation is inherent to binary classification tasks in general, rather than specific to the SVD-based method proposed in this work (our SVD-based approach is designed specifically to adapt to new forgery types while preserving pre-trained knowledge).

Following the reviewer's suggestion, we plan to propose an extended version (multi-task) based on our current binary framework, aiming explicitly at enhancing the balance between specificity (handling known forgeries effectively) and generalization (detecting unknown or unseen forgeries).

The preliminary idea of our future multi-task learning framework features:

  • (1) Dual-head structure: Two separate heads are employed, one dedicated to multi-class classification (specific head) and another to binary classification (general head).
    • Specific Head (multi-class): Learns fine-grained differences among known forgery types, focusing on fitting to the training distribution (specificity, IID).
    • General Head (binary): Captures shared features across forgery types, enabling the detection of unseen or novel manipulations (generality, OOD).
  • (2) Adaptive Dynamic Inference: Combining predictions from both heads based on confidence, balancing specificity and generality dynamically during inference.

Specifically, our dynamic inference strategy is as follows:

  • Compute the prediction probability (i.e., confidence) according to the specific head's logits.
  • If the maximal prediction confidence across classes is below a threshold (indicating uncertainty or a "flat" distribution), the decision is made by the binary general head; otherwise, it is made by the specific multi-class head.

This adaptive strategy effectively maintains specificity for known forgery methods while providing robust generalization against unseen or evolving forgeries.
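The adaptive inference rule above can be sketched as follows. This is a minimal illustration of the described strategy, not the authors' implementation: the logit values, class layout, and threshold of 0.7 are assumed for demonstration.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def adaptive_decision(specific_logits, general_logits, threshold=0.7):
    """Route the prediction between the two heads based on confidence.

    specific_logits: multi-class logits over known forgery types (+ real).
    general_logits:  binary logits (real vs. fake).
    Returns (predicted class index, which head decided).
    """
    specific_probs = softmax(specific_logits)
    if specific_probs.max() >= threshold:
        # Confident (peaked) distribution: trust the fine-grained specific head.
        return int(np.argmax(specific_probs)), "specific"
    # Flat/uncertain distribution: fall back to the binary general head,
    # which generalizes better to unseen forgeries.
    return int(np.argmax(softmax(general_logits))), "general"

# Peaked specific-head distribution: the specific head decides.
print(adaptive_decision(np.array([5.0, 0.1, 0.1]), np.array([0.2, 1.0])))
# Flat specific-head distribution: the binary general head decides.
print(adaptive_decision(np.array([0.5, 0.4, 0.6]), np.array([0.2, 1.0])))
```

With these illustrative inputs, the first call is resolved by the specific head and the second falls back to the general head, matching the two-branch rule described in the rebuttal.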

Due to the limited space of the rebuttal, we cannot provide exhaustive experimental details at this stage. However, following the reviewer's suggestion, we evaluate our binary model and multi-task model on the Chameleon dataset (see R2 below for details).


Q2. Chameleon is a high-quality dataset. Testing on this dataset would further validate the generalization ability.

R2. Thanks for introducing and highlighting the high-quality and challenging Chameleon dataset. Following the suggestion, we have conducted additional evaluations on this dataset. We follow the setting proposed in the original paper, i.e., training on GenImage (whole) and testing on Chameleon.

As shown in Table 1 below, our method achieves superior generalization performance compared to baseline methods and other SOTA detectors. Moreover, the proposed extended version, i.e., the multi-task framework, further improves the results, alleviating the potential loss of specificity and generality raised by the reviewer.

Table 1: Evaluation results on Chameleon. All detectors are trained on the GenImage dataset.

| CNNSpot | FreDect | Fusing | GramNet | LNP | UnivFD | DIRE | Patch | NPR | AIDE | Ours (binary) | Ours (multi-task) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 60.89 | 57.22 | 57.09 | 59.81 | 58.52 | 60.42 | 57.83 | 55.70 | 57.81 | 65.77 | 70.27 | 72.62 |

Furthermore, all evaluated methods experience a notable performance decline on Chameleon, highlighting the dataset's significant difficulty. Motivated by this, we plan to conduct an in-depth analysis of Chameleon in future research. Again, we greatly appreciate your suggestion.


Q3. About the robustness evaluation.

R3. Thank you for your question. We have already acknowledged this concern and have performed a robustness evaluation on the deepfake detection benchmark, as presented in Figure 7 of our appendix. Following [1,2], we evaluate three types of image degradation: block-wise distortion, contrast changes, and JPEG compression. This experiment confirms the model's robustness against various perturbations.

[1] LSDA, CVPR 24.

[2] LipForensics, CVPR 21.

最终决定

This paper has received three accept and one strong accept scores. All the reviewers acknowledge reading the rebuttals. Nothing more needs to be said.