DiffusionFake: Enhancing Generalization in Deepfake Detection via Guided Stable Diffusion
Abstract
Reviews and Discussion
This paper proposes DiffusionFake, which harnesses the power of Stable Diffusion to guide the forgery detector in learning the disentangled source and target features inherent in Deepfakes. The features of the detection network are processed through the target and source transformation modules and then injected into Stable Diffusion to reconstruct the source and target images. Through this strategy, DiffusionFake enhances the detector's ability to handle unseen forgeries without compromising efficiency.
Strengths
The idea of using SD-based reconstruction to enhance detection accuracy is rational.
Weaknesses
- The weighting module is supervised by the similarity between the latent representations. However, the latent space of Stable Diffusion is not linear, so I think the assumption behind this constraint is not always correct.
- There is no analysis of the influence of the loss weights in Eq. 13.
- There is no cross-model evaluation, i.e., testing on samples generated by a method that has not been seen during training.
- The evaluation is only conducted on low-resolution images.
Questions
- How do you decide the optimal loss weights in Eq. 13?
- What is the performance under cross-model evaluation?
- What is the performance when the input image has high resolution?
Limitations
If attackers know the training details of the detector and use it as a metric to improve their deepfakes, those fakes will become harder to detect.
We sincerely appreciate your positive acknowledgment of our method's rationality, as well as the meaningful questions you've raised. Below are our responses to your specific inquiries:
Q1: Concern about weight module.
Thanks for your question. We agree that Stable Diffusion's latent representation is non-linear. However, using the cosine similarity of non-linear features to measure data relationships is common, especially in face recognition, where features from non-linear models are routinely used for similarity comparisons. Moreover, the Stable Diffusion paper [1] notes that the compressed latent space preserves input details and allows good reconstruction. Thus, we argue that similarities in this compressed space can represent similarities between the original images, serving as a supervision signal for our weight module.
To support this assertion, we direct your attention to Figure 7 in our main paper. This figure demonstrates that the trained weight module derived from our method largely aligns with intuitive expectations of image similarity. This alignment suggests that, despite the non-linear transformation, our approach captures meaningful relationships between images.
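To make the supervision signal concrete, below is a minimal sketch of the idea under our own naming (the function, tensor shapes, and the [0, 1] rescaling are illustrative assumptions, not the exact implementation in the paper): cosine similarity is computed between the VAE latent of the fake image and the latents of its source and target ground truths.

```python
import torch
import torch.nn.functional as F

def weight_targets(z_fake: torch.Tensor, z_src: torch.Tensor, z_tar: torch.Tensor):
    """Cosine similarity between flattened latents, mapped to [0, 1] as weight targets."""
    def flat(z):
        return z.flatten(start_dim=1)
    sim_src = F.cosine_similarity(flat(z_fake), flat(z_src), dim=1)
    sim_tar = F.cosine_similarity(flat(z_fake), flat(z_tar), dim=1)
    return (sim_src + 1) / 2, (sim_tar + 1) / 2

# Toy latents with the typical SD VAE layout (B, 4, H/8, W/8); random values here.
z_fake, z_src, z_tar = (torch.randn(2, 4, 28, 28) for _ in range(3))
w_src, w_tar = weight_targets(z_fake, z_src, z_tar)
print(w_src.shape, w_tar.shape)  # torch.Size([2]) torch.Size([2])
```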
We acknowledge that there is room for more rigorous assumptions in future work. Potential improvements could include using non-linear space metric functions or simpler, direct image similarity metrics such as SSIM or PSNR.
We appreciate your insightful comment, which has highlighted an important area for future refinement in our approach.
[1] High-Resolution Image Synthesis with Latent Diffusion Models.
Q2: Influence of the loss weight
We appreciate your observation regarding the loss weights in Equation 13. To address this, we conducted comprehensive ablation studies to determine the optimal values of the target and source loss weights.
Following the ControlNet setup, and considering that target reconstruction is relatively stable, we initially fixed the target weight at 1 and varied the source weight over 0.1, 0.3, 0.5, 0.7, and 1.0. Our experiments showed that a source weight of 0.7 yielded the best average performance across the five test datasets. The quantitative results are shown in the table below:
| Source weight | AVG-AUC | AVG-EER |
|---|---|---|
| 0.1 | 77.13 | 29.01 |
| 0.3 | 78.99 | 27.51 |
| 0.5 | 80.27 | 26.36 |
| 0.7 | 81.88 | 25.97 |
| 1.0 | 79.31 | 26.77 |
We then fixed the source weight at 0.7 and conducted ablation studies on the target weight, finding peak performance at a value of 1.0. The experimental results are summarized in the following table:
| Target weight | AVG-AUC | AVG-EER |
|---|---|---|
| 0.1 | 75.95 | 30.75 |
| 0.3 | 77.30 | 28.15 |
| 0.5 | 79.25 | 27.77 |
| 0.7 | 81.09 | 26.15 |
| 1.0 | 81.88 | 25.97 |
| 1.2 | 80.38 | 26.98 |
These results align with our intuition. The source image often differs significantly from the fake image, so a slightly smaller loss weight for the source helps maintain training stability. We observed that if the source weight is too large, the loss becomes difficult to minimize.
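For clarity, here is a minimal sketch of the weighted combination we assume Equation 13 uses (the variable names below are ours, not the paper's notation):

```python
import torch

# Stand-in scalars for the three terms; in training these would be the classification
# loss and the target/source denoising (reconstruction) losses.
loss_cls = torch.tensor(0.45)
loss_rec_target = torch.tensor(0.12)
loss_rec_source = torch.tensor(0.31)

w_target, w_source = 1.0, 0.7  # best-performing setting from the ablations above
total_loss = loss_cls + w_target * loss_rec_target + w_source * loss_rec_source
print(float(total_loss))  # ~0.787
```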
We will include this detailed analysis in our revised manuscript to provide a clearer understanding of our model's behavior and optimization.
Q3: Concern about cross-model evaluation.
We appreciate your question. As mentioned in lines 246-250 of our paper, our method primarily aims to enhance model generalization. Consequently, the majority of our experiments are cross-dataset evaluations.
In Table 1, following the protocols established in papers such as DCL[2] and DeepfakeBenchmark[3], we train our model on the FFpp dataset and evaluate it on other test sets. This approach ensures that the model is tested on samples generated by methods not seen during training.
Furthermore, we have included additional cross-manipulation evaluation experiments in the Appendix (kindly refer to Table 4 and Table 5), which test attack types different from those used during training.
We acknowledge that we could have made this experimental setup clearer in the main text. In our revised manuscript, we will provide a more explicit explanation of our cross-dataset and cross-model evaluation strategy to avoid any confusion.
[2] Dual contrastive learning for general face forgery detection.
[3] DeepfakeBench: A Comprehensive Benchmark of Deepfake Detection.
Q4: Concern about high resolution image.
Our experimental setup follows the standard protocol used in many previous works, where images are uniformly resized to 224x224 for both training and testing. Following your suggestion, we conducted additional experiments at higher resolutions of 384x384 and 512x512. The results, shown in the table below, demonstrate that our method still achieves improvements at these higher resolutions:
| Resolution | Method | AVG-AUC | AVG-EER |
|---|---|---|---|
| 384 | en-b4 | 76.31 | 30.25 |
| 384 | en-b4+ours | 82.34 | 25.41 |
| 512 | en-b4 | 77.89 | 29.31 |
| 512 | en-b4+ours | 83.11 | 25.03 |
These findings indicate that our approach remains effective at larger image resolutions.
Furthermore, we have tested our method on high-quality images using the DiffSwap dataset, which consists of high-resolution images (1024x1024) processed through diffusion-based face swapping. Our method showed a relative improvement of approximately 6% on this dataset, further demonstrating its generalization capability under high-quality, high-resolution conditions.
In our revised manuscript, we will include these additional experiments to provide a more comprehensive evaluation of our method's performance across various image resolutions and quality levels. Thank you for this valuable suggestion, which has helped us to more thoroughly validate our approach.
Dear Reviewer NDSD,
Thank you for your invaluable efforts and constructive feedback on our manuscript. We greatly appreciate your positive evaluation of our paper's rationale. We have endeavored to provide comprehensive responses to the concerns you raised in your review.
As the discussion period draws to a close, we eagerly await your thoughts on our response. We sincerely hope that our revisions align with your expectations. If there are any remaining concerns or aspects that require clarification, we are ready to address them as soon as possible.
Best regards,
The Authors
Some of my concerns have been resolved, and I decide to raise my score.
We sincerely appreciate the reviewer's recognition of our work. We are grateful for your careful consideration of our rebuttal and the time you've invested in evaluating our research.
This paper introduces DiffusionFake to address the challenge of generalization in face forgery detection by revisiting the generative process of deepfakes. DiffusionFake reverses the generative process by injecting features into a pre-trained Stable Diffusion model to reconstruct source and target images. The plug-and-play framework integrates with existing detectors, improving cross-domain generalization without adding inference parameters. Experiments demonstrate significant performance improvements across various architectures.
Strengths
- This paper presents a novel approach by designing a method from the deepfake creation perspective, introducing a reverse method, and utilizing pre-trained Stable Diffusion knowledge to address information loss during reconstruction. It effectively extracts features using a guide module to obtain source-related and target-related information through reconstruction, which is logically sound.
- The paper proposes a plug-and-play functionality that enhances the generalization ability of multiple detection networks without adding parameters, facilitating deployment.
- This paper is well-written. The authors offer an easy-to-follow presentation with a well-structured format, while important visualizations and figures are clearly provided.
Weaknesses
- The training set seemingly contains instances where the source and target ground truth are identical for the same image. The paper should address how this trade-off is handled.
- The discussion on the weight module could be clearer and more detailed.
- In the visualization figures (Fig. 3), the reconstructed source images appear to differ from the ground truth (even the training sample). The authors should explain the reason for this discrepancy.
- More ablation studies should be conducted. For example, the structure of the feature-transform module is insufficiently evaluated.
Questions
- Would the performance improve and convergence speed up if the parameters of Stable Diffusion were unfrozen during training?
- How significant is the impact of Stable Diffusion's pre-trained knowledge on the overall method? Have other pre-trained diffusion methods been tested?
For more questions please refer to the Weaknesses.
Limitations
The authors have adequately addressed the limitations.
We sincerely appreciate your positive evaluation of our paper, particularly your comments that our work is "logically sound," "novel," and "easy-to-follow." We are committed to further improving our manuscript based on your valuable suggestions. Below, we address each of your questions in detail:
Q1: Concern about the same ground truth.
Thank you for this excellent question. Indeed, as the FFpp dataset contains four different attack methods, there are instances where different samples have the same reconstruction ground truth. However, the reconstruction weights differ significantly in these cases, which is one of the reasons we designed the weight module.
Figure 7 in our paper demonstrates the varying weights corresponding to different forgery types. As shown, even when the ground truths are identical, the dependence on source and target conditions differs. This variation leads to diversity during training.
We appreciate you bringing this to our attention. We will include this discussion in the main text of our revised paper to provide a clearer explanation of how our method handles this scenario.
Q2: Concern about quality of reconstruction.
We appreciate your perceptive question regarding the quality of reconstruction. Our method takes on the unique challenge of reconstructing source and target images from fake images, which inherently involves significant information loss. This loss can lead to inaccuracies in expression or blurring, as you noted in Figure 3A of our main paper.
It's important to emphasize that the primary aim of our DiffusionFake framework is not perfect reconstruction, but rather to compel the detection model to extract source-related and target-related features. This extraction process enhances the model's generalization capabilities, which is our ultimate goal. The quality of reconstruction serves as a means to this end, rather than being the end itself.
Interestingly, we've discovered that the input noise significantly influences the fine-grained expression control in reconstructed images. To illustrate this, we've included Figure 2 in our rebuttal PDF. This figure shows target images reconstructed using five different noise patterns, along with their corresponding PSNR and SSIM scores relative to the target ground truth. The final noise pattern, in particular, yields images that closely match the target ground truth in both expression and detail.
Q3: Concern about ablation studies.
Thank you for your question. Following your suggestion, we conducted ablation studies on our feature-transform module. Specifically, we ablated three components: space attention, channel attention, and the self-attention module used for fusion. The results are shown in the table below:
| Channel-Att | Space-Att | Self-Att | AVG-AUC | AVG-EER |
|---|---|---|---|---|
| ✗ | ✗ | ✗ | 77.88 | 27.94 |
| ✓ | ✗ | ✗ | 79.33 | 27.10 |
| ✗ | ✓ | ✗ | 79.81 | 26.75 |
| ✓ | ✓ | ✗ | 80.25 | 26.38 |
| ✓ | ✓ | ✓ | 81.88 | 25.97 |
As the table shows, both spatial attention and channel attention bring improvements over not using them, indicating that feature selection along both dimensions is crucial. Additionally, we found that using self-attention for fusion performs better than direct fusion, because the attention-based fusion allows for more comprehensive feature integration. Finally, the combination of all three components achieved the best performance.
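For reference, here is a minimal PyTorch sketch of a feature-transform block with these three components; it is an illustrative re-implementation under our own naming and layer sizes, not the exact module used in the paper:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.gate(x)  # re-weight each channel

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)
        mx, _ = x.max(dim=1, keepdim=True)
        attn = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * attn  # re-weight each spatial location

class FeatureTransform(nn.Module):
    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()
        self.fuse = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, feat):
        x = self.sa(self.ca(feat))                    # channel + spatial selection
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)         # (B, H*W, C)
        fused, _ = self.fuse(tokens, tokens, tokens)  # attention-based fusion
        return fused.transpose(1, 2).reshape(b, c, h, w)

if __name__ == "__main__":
    block = FeatureTransform(256)
    print(block(torch.randn(2, 256, 14, 14)).shape)  # torch.Size([2, 256, 14, 14])
```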
Q4: Question about unfrozen Stable Diffusion.
Thank you for your question. Following your suggestion, we attempted to fine-tune the decoder part of the Stable Diffusion model during training. The results are shown in the table below:
| Method | AVG-AUC | AVG-EER |
|---|---|---|
| Frozen | 81.88 | 25.97 |
| Unfrozen | 81.97 | 25.71 |
We observed that while training time and GPU memory usage increased, the performance improvement was not significant. This may be because the pre-trained knowledge in the Stable Diffusion model is vast, and the dataset for this task is insufficient to adjust the entire Stable Diffusion model effectively. Moreover, our primary goal is to force the encoder to extract source-related and target-related features to improve generalization through reconstruction. Therefore, fine-tuning the Stable Diffusion part doesn't provide substantial benefits.
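For reference, the two settings compared above differ only in which Stable Diffusion parameters receive gradients. Below is a minimal sketch of the switch (the helper name is ours, and the `up_blocks` attribute assumes the diffusers `UNet2DConditionModel` layout):

```python
import torch.nn as nn

def set_sd_trainable(unet: nn.Module, unfreeze_decoder: bool = False) -> None:
    """Freeze the pre-trained diffusion UNet; optionally let its decoder half adapt."""
    for p in unet.parameters():
        p.requires_grad_(False)   # "frozen" setting: only detector + guide modules train
    if unfreeze_decoder:
        # "unfrozen" variant: fine-tune the decoder (up-sampling) blocks as well
        for p in unet.up_blocks.parameters():
            p.requires_grad_(True)
```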
We appreciate your suggestion as it has led to these valuable insights. This experiment has helped us better understand the role and limitations of fine-tuning in our approach.
Q5: Concern about impact of Stable Diffusion.
Thank you for your insightful question. Due to time constraints, we experimented with two settings in addition to the SD 1.5 model used in the paper. First, we used a randomly initialized diffusion model without SD pre-training as a baseline. Second, we employed the SD 2.1 model for training. The results are shown in the table below:
| SD model | AVG-AUC | AVG-EER |
|---|---|---|
| w/o Pretrain | 73.57 | 33.12 |
| sd1.5 | 81.88 | 25.97 |
| sd2.1 | 82.30 | 25.81 |
As we can observe, without using the pre-trained SD model, information loss cannot be adequately compensated, leading to unsuccessful reconstruction. This results in unstable training and minimal performance gains.
On the other hand, using the latest SD 2.1 model slightly outperforms SD 1.5. This improvement may be attributed to SD 2.1's enhanced capabilities, which provide the DiffusionFake framework with stronger information recovery abilities. Consequently, this allows the encoder to extract more generalizable features.
Dear Reviewer b2gp,
We greatly appreciate your valuable feedback and the time you've taken to review our manuscript. We are sincerely grateful for your positive evaluation of our paper. Additionally, we would like to express our appreciation for the constructive experimental suggestions you've provided, which will undoubtedly make our paper more comprehensive.
As we approach the end of the discussion period, we eagerly await your thoughts on our response. We sincerely hope that our revisions align with your expectations. If there are any remaining concerns or aspects that require clarification, we are ready to address them as soon as possible.
Best regards,
The Authors
Thanks for the response. The authors have adequately addressed my concern. I wish the analysis and experiments in the rebuttal could be appropriately incorporated into the manuscript to make it more comprehensive and compact. I look forward to seeing the updated results and analysis in the camera-ready version.
We appreciate your recognition of our work and rebuttal. We will further improve the paper based on your suggestions. Thank you for your efforts in reviewing our paper.
This paper adopts a novel plug-and-play framework that reverses the generative process of face forgeries to enhance the generalization of detection models. Extensive experimental results from several datasets demonstrate that this method has achieved very competitive performance.
Strengths
- The motivation of this paper is clear and the approach to implementing it is straightforward and understandable.
- This paper attempts to introduce Stable Diffusion into deepfake detection, unifying the generation model and the detection model end-to-end, which is very interesting and innovative, and provides a new perspective for this field.
- The overall structure of this paper is relatively reasonable and the context of the paper is relatively clear.
Weaknesses
- Some recently published papers about enhancing generalization in deepfake detection are not cited and discussed, such as:
[1] Dong, Shichao, et al. "Implicit identity leakage: The stumbling block to improving deepfake detection generalization." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
[2] Yu, Bingyao, et al. "Discrepancy-aware meta-learning for zero-shot face manipulation detection." IEEE Transactions on Image Processing (2023).
[3] Chen, Liang, et al. "Self-supervised learning of adversarial example: Towards good generalizations for deepfake detection." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022.
- Many of the experimental results in Table 1 are inconsistent with, or lower than, the results reported in the original papers. The authors should provide a detailed, clear, and convincing explanation.
- To better understand the performance of the proposed method, it would be best if the authors could provide visualizations of mispredicted samples together with a corresponding analysis.
- Please unify the format of the references. At a minimum, make sure the reference formats for conferences and journals are consistent.
- If the authors can answer the relevant questions and revise accordingly in the final version, I will consider raising my score in the next round.
Questions
Please standardize the capitalization of English letters in the references. Many abbreviations of proper nouns are incorrect, e.g., "Aunet" should be "AUNet".
Limitations
The authors have adequately addressed the limitations and potential negative societal impact of their work.
We sincerely appreciate your recognition of our work and your valuable suggestions. Your feedback is precious to us, and we will carefully revise our paper based on your comments, especially addressing the references you mentioned and correcting any capitalization issues. Regarding your specific question, we offer the following response:
Q1: Insufficient comparative discussion:
Thank you for raising this point. We agree that the papers you mentioned are indeed significant contributions to this field. In our revised paper, we will thoroughly compare and discuss these articles, clearly articulating the differences and connections between our method and the approaches presented in these works. This will provide a more comprehensive context for our research within the current state of the field.
Q2: Concern about inconsistent results.
We appreciate you highlighting this important point. The discrepancies in experimental results arise from varying implementation details across studies. These differences include face extraction methods, cropping ratios, frame sampling, augmentation techniques, and testing protocols. Additionally, many papers lack results for certain test sets like DiffSwap and Wild Deepfake.
To ensure a fair comparison, we retrained all models with publicly available code under identical conditions. We strictly adhered to the original hyperparameters and settings specified in the open-source code for each reimplemented algorithm, ensuring a faithful performance comparison.
In our revised paper, we will provide a detailed explanation of our experimental setup and clearly indicate that the reported results are from our reimplementation under consistent conditions. We will also make the training configurations for other methods publicly available in our open-source code.
Q3: Visualization about misprediction results.
We appreciate your suggestion for a deeper analysis of our method's performance. Upon examining our misprediction results, we identified two main categories of errors, as illustrated in Fig. 4 of the rebuttal PDF:
1) Profile-view images: these are difficult to reconstruct into source and target images due to significant information loss during training, which results in misclassification during inference.
2) Low-quality images: our method encourages the detector to decouple source-related and target-related features to improve generalization, but low-quality, blurry images hinder the network's ability to extract these features effectively, leading to misclassification.
In future work, we will focus on handling these two types of images better, for example by increasing the reconstruction weight of low-quality data during training and by using data augmentation to supplement profile-view faces in the training data.
We will include this analysis and the corresponding visualizations in our revised paper to provide a more comprehensive understanding of our method's strengths and limitations.
Dear Reviewer VbH6,
We are deeply grateful for your thorough review and insightful comments on our paper. We greatly appreciate your evaluation of our work and assure you that we will comprehensively incorporate your feedback into our revised paper, enhancing its content and quality.
As the discussion period draws to a close, we eagerly await your thoughts on our response. We sincerely hope that our revisions align with your expectations. If there are any remaining concerns or aspects that require clarification, we are ready to address them promptly.
Best regards,
The Authors
We sincerely appreciate the reviewer's recognition of our work. Your positive feedback is greatly encouraging. We are grateful for your careful consideration of our rebuttal and the time you've invested in evaluating our research.
The paper investigates the task of deepfake (especially, face swapping) identification. It proposes to utilize current image generation model to reconstruct source and target profiles from embedded features. The authors did thorough experiments and prove quantitatively that the proposed DiffusionFake method outperforms previous baselines.
Strengths
- The problem definition is clear and well-motivated. The proposed solution is also quite simple and intuitive.
- All the figures and diagrams are well-designed. Specifically, the first few figures explain the pipeline clearly. Figure 4 and 5 visualize the results in a nice way.
- Quantitative results show that the proposed method significantly outperforms the baselines.
Weaknesses
- Field of application is limited to face swapping with two identities, while claiming a "DeepFake detector".
- The Related Work section is not very clear in connecting the current work with previous ones. See questions.
- Writing quality is in general good. It might benefit from omitting some details (how many layers?) and providing more intuitive explanations (why does this design work?).
- Experiments are thorough, but only on two detection metrics. It would be great to instead see something measuring reconstruction quality.
Questions
- Why DFD dataset's best performance happens with a different model architecture than all others? Is there anything special about this dataset and its performance?
- In Figure 3A, even with training samples, we can see that the expression is not quite well reconstructed. The target GT expression (mouth slightly open) is "recovered" as source expression. Is this a common observation? How well does the model preserve expression?
- Using AUC and EER as metrics are good (and classical) regarding this task. However, as in the ablation study, it would be interesting to see some numerical metrics measuring the reconstruction quality. Is this quality highly-correlated with the detection accuracy?
- Is this the first ever work that takes advantage of generative model for face forgery detection? What are some of the most important baselines? Qualitatively how does the current methods outperform the baselines? (i.e. what are some characteristics of those challenging cases that baselines cannot do but this approach can?)
Limitations
The authors already mentioned a list of limitations at the end of their paper.
- Not applicable to blending of multiple identities (>2) or partial manipulation.
- Require paired images, not self-supervised.
- Could be used as a discriminator to develop harder to distinguish deepfake images.
And also two more:
- The dataset they use might have limited diversity -- looks like all the examples in the paper are white. A bit concerned about the generalizability to all ethnic groups, genders, ages, etc.
- The word "Deepfake" might have a slightly broader meaning than face swapping these days. This paper is only detecting face swapping, not "generated faces" in general. It might worth being a bit conservation in claiming the contribution.
We sincerely appreciate your positive feedback on our paper, particularly your comments that our work is "well-motivated" and "simple and intuitive". We are committed to further improving our manuscript based on your constructive suggestions. Below, we address each of your questions in detail:
Q1: Question about DFD dataset performance:
As you noted, the DFD dataset indeed shows higher performance compared to other datasets. Our analysis reveals that DFD images exhibit noticeable inconsistencies, as shown in Fig. 1 of the rebuttal PDF, which closely resemble the inconsistency patterns in the FFpp training set. This similarity stems from using the same forgery method and post-processing.
Furthermore, UIA-VIT performs exceptionally well on the DFD dataset (94.68%). This is because UIA-VIT leverages an Unsupervised Patch Consistency Learning module, which effectively models these inconsistencies by learning forgery location maps between real and fake image pairs. This approach is particularly effective for detecting inconsistencies similar to those in the training set.
Moreover, our detector uses the simplest EN-B4 and VIT-B models without any special design. Even in this scenario, it shows significant relative improvements without increasing parameters or inference time. Compared to the baselines, we achieved a 4.14% improvement over EN-B4 and a 5.98% improvement with the VIT-B backbone on the DFD dataset. In addition, on other higher-quality datasets, our method outperforms UIA-VIT by about 4% without increasing parameters.
Q2: Concern about expression reconstruction:
Thanks for your insightful question. Our approach uniquely reconstructs source and target images from fake ones, facing challenges due to information loss. This can lead to expression inaccuracies or blur, as seen in Figure 3A of the main paper. However, DiffusionFake's primary goal is to compel the detection model to extract source-related and target-related features, enhancing generalization. Reconstruction quality serves as a means to this end, not the ultimate objective.
Moreover, we've observed that fine-grained expression control in reconstructed images is closely related to the input noise. With suitable input noise, we can achieve better reconstruction results. In Fig. 2 of the rebuttal PDF, we visualize target images reconstructed under five different noise patterns, along with their PSNR and SSIM scores compared to the target GT. Notably, the last noise pattern produces images with expressions and details closely matching the target GT. This finding provides valuable insights into our method's capabilities and potential for improvement. We will add these results in the revised version.
Q3: Question about reconstruction metrics and performance:
We appreciate your suggestion on reconstruction quality metrics. Following your recommendation, we calculated PSNR and SSIM for each model in the ablation study from Table 2. For each model, we used 10 random noise sets and their corresponding target GT, then averaged the values. The results are shown below:
| SD | Filter | Weight | Celeb-DF AUC | DFDC-P AUC | SSIM | PSNR |
|---|---|---|---|---|---|---|
| ✗ | ✗ | ✗ | 71.87 | 71.78 | 0.11 | 10.91 |
| ✗ | ✓ | ✗ | 73.87 | 72.41 | 0.15 | 11.35 |
| ✓ | ✗ | ✗ | 77.35 | 75.69 | 0.62 | 17.83 |
| ✓ | ✓ | ✗ | 80.79 | 76.17 | 0.64 | 18.53 |
| ✓ | ✗ | ✓ | 78.67 | 76.59 | 0.63 | 18.22 |
| ✓ | ✓ | ✓ | 83.17 | 77.35 | 0.67 | 19.95 |
Our analysis reveals a positive correlation between reconstruction quality and detection performance. Models with better reconstruction quality generally demonstrated higher detection accuracy. Notably, when the SD pre-trained model is not used, the generation quality is very poor, corresponding to significantly worse results. This finding supports the intuition that better reconstruction ability contributes to more effective feature extraction, which in turn leads to improved detection performance. We will add these results to the revised paper.
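For completeness, here is a minimal sketch of how the SSIM/PSNR columns above can be computed (using scikit-image; the array layout, value range, and toy data below are illustrative assumptions):

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def reconstruction_quality(recons, target_gt):
    """Average PSNR/SSIM of several reconstructions (HxWx3 floats in [0, 1]) vs. the target GT."""
    psnrs = [peak_signal_noise_ratio(target_gt, r, data_range=1.0) for r in recons]
    ssims = [structural_similarity(target_gt, r, channel_axis=-1, data_range=1.0)
             for r in recons]
    return float(np.mean(psnrs)), float(np.mean(ssims))

# Toy stand-ins for 10 reconstructions obtained with different noise seeds.
gt = np.random.rand(224, 224, 3)
recs = [np.clip(gt + 0.05 * np.random.randn(*gt.shape), 0.0, 1.0) for _ in range(10)]
print(reconstruction_quality(recs, gt))
```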
Q4: Comparison with generative methods:
Our work is not the first to use generative models for face forgery detection. Previous methods, such as RECCE, have used Auto-Encoder approaches to improve detector generalization by reconstructing real samples.
Our approach differs in motivation and reconstruction targets. Previous methods typically reconstruct samples directly to learn data distribution without information loss. In contrast, our motivation is to recover source and target images from fake images, enhancing the detection model's ability to capture decoupled features. This task is challenging due to the significant disparity between fake and original counterparts, often resulting in substantial information loss. We address this by incorporating a pre-trained Stable Diffusion model to compensate for lost information.
However, traditional generative methods like RECCE, using simple Auto-Encoder architectures, cannot handle such complex reconstruction tasks. Fig. 3 in our rebuttal PDF compares RECCE's and our method's reconstruction of source and target images. We can observe that RECCE fails to reconstruct due to information loss, while our method, leveraging the SD model, achieves better restoration. This demonstrates the key difference between our approach and other reconstruction-based methods.
We'll include these results and visualizations in our revised paper.
Q5: Concern about "Deepfake" Term.
We appreciate your insight regarding the term "Deepfake." We agree that the scope of Deepfake techniques has expanded beyond simple face swapping. To avoid confusion, we will clearly specify the applicable range of our method in the main text of our paper.
I'm happy with the additional clarifications made by the authors. I'm in general satisfied with this paper and raised my score.
We are deeply grateful for your reconsideration and the increased score for our work. Thank you for your time, expertise, and the constructive dialogue throughout this review process.
Dear Reviewer tkET,
We are deeply grateful for your thorough review and insightful comments on our paper. Your positive evaluation has been highly encouraging. We also appreciate your constructive suggestions, which have significantly improved our paper's depth and completeness.
As the discussion period draws to a close, we eagerly await your thoughts on our response. We sincerely hope that our revisions align with your expectations. If there are any remaining concerns or aspects that require clarification, we are ready to address them promptly.
Best regards,
The Authors
We thank all reviewers for their positive and constructive feedback, which will definitely help us improve the quality of this paper. We wish to address their concerns as follows. We have included a PDF with some visualization results mentioned in the rebuttal for your reference.
All four reviewers (Reviewer tkET, Reviewer VbH6, Reviewer NDSD, Reviewer b2gp) recommended acceptance of this work. The ethics reviewer psNv raised concerns regarding the security and privacy issues of training AI models on datasets containing human images used without the subjects' consent. The authors have responded with ways to mitigate this issue.
Based on the overall assessment from all the reviewers, I am inclined to accept this paper based on the technical contribution and solid experiments. The authors need to address both the technical concerns and ethical concerns from the reviewers in their final version.