PaperHub
3.7 / 10
Rejected · 3 reviewers
Ratings: 3, 5, 3 (avg. 3.7 · min 3 · max 5 · std. dev. 0.9)

ICLR 2024

AntifakePrompt: Prompt-Tuned Vision-Language Models are Fake Image Detectors

OpenReview · PDF
Submitted: 2023-09-20 · Updated: 2024-02-11

Keywords
DeepFake detection · vision-language models · prompt tuning · diffusion models · GANs

Reviews & Discussion

Review
Rating: 3

In this paper, the authors propose to apply a VLM to detect fake images. They add a pseudo-word S* to the template prompt and guide the VLM to respond 'Yes' or 'No' for real and fake images, respectively.

In general, detecting fake images with a generalizable detector is a popular topic, and the authors have made a nice attempt at it. However, several concerns remain, including issues with the main contribution, experiments, datasets, and baselines. Please see the Weaknesses.

Strengths

  1. Fake image detection is a trending topic, and utilizing VLMs to detect deepfakes is a commendable attempt.
  2. The method is straightforward and appears to be effective.
  3. The authors have conducted extensive validation, providing evidence of the effectiveness of their approach.

Weaknesses

  1. The contributions are somewhat limited. The prompt tuning technique employed in this paper is not a highly original or advanced method, and it lacks sufficient adaptation to and analysis of the downstream task, i.e., fake image detection. In other words, the proposed method could be applied to various other visual tasks, raising doubts about its specific contribution to fake image detection.
  2. The experiments lack comparison with SOTA baselines, such as [1], which also focuses on developing a general diffusion detector. Furthermore, there are numerous methods in the field of deepfake detection that specialize in generalization across various forgery types, and these should also be considered in this paper.
  3. Although the authors create their own dataset, it is advisable to validate their method on other benchmark datasets, such as the one introduced by De-fake. Additionally, the fifth category in the dataset constructed in this paper, namely "Deeperforensics," should be accurately labeled as "Deepfakes" or "Face Swap."

Considering that the proposed approach lacks novelty and suffers from shortcomings in the experimental setup, I currently lean toward a rejection.

Questions

Enriching the experiments can enhance the quality of this paper. However, for a detection method, the innovativeness of the approach must be a crucial consideration. I hope the authors can delve deeper into the analysis of the characteristics of forged images to guide the judgment of the VLM.

Ethics Review Details

n/a

Comment

We thank all the reviewers for their constructive suggestions, which help improve the completeness of our submission; the revised parts of the paper are marked in red. We are encouraged that the reviews are positive in the following four respects:

  • The paper is "well written" (Reviewer NiiK).
  • The problem we are researching is “a topic of active research interest” (Reviewer eMV8) and “trending” (Reviewer 4Y5R).
  • Our solution is “novel”, "highly generalizable" (Reviewer NiiK), “innovative” (Reviewer eMV8), “straightforward” and “effective” (Reviewer 4Y5R).
  • Our experiments are “extensive” (Reviewer 4Y5R).

We now address individual concerns of Reviewer 4Y5R below.

1. [The contributions are somewhat limited.]

  • This comment differs from those of all the other reviewers, who agree that our contributions are "novel" (Reviewer NiiK) and "innovative" (Reviewer eMV8).
  • First, we think there may be a misunderstanding of our goal. Our goal is to find a more general solution to the deepfake detection problem, and we then chose a VLM as the backbone model. Hence, the remark that "the proposed method could be applied to various other visual tasks" falls outside our original motivation, making it not a strong reason for claiming limited contribution.
  • It is true that the prompt tuning technique is not a highly original or advanced method. However, we are the first to employ prompt tuning on a VLM for this problem, and we propose a simple yet effective solution to deepfake detection, which is a "novel attempt in this area" (NiiK). As stated in the Abstract of the revised submission, our method demonstrates better performance on held-out datasets compared to the baselines (61.79 to 92.72) with far fewer learnable parameters (4.96K vs. ~4M), highlighting the resource efficiency and effectiveness of our method. This is a non-trivial and significant contribution on its own.
  • We respectfully disagree with the comment that "it lacks sufficient adaptation and analysis in the downstream task". As mentioned in Section 3 of the original submission, we carefully formulate deepfake detection as a visual question answering problem with clear definitions and equations. Also, in Section 4.3 of the original submission, we conducted and analyzed several ablation studies (e.g., position of the pseudo-word, prompt tuning for the Q-Former or the LLM, number of training images), demonstrating our extensive experiments and complete analysis.

2. [The experiments lack comparison with SOTA baselines]

  • Thank you for your comment. We sincerely hope that you can provide the reference for [1] so that we can compare it with our method as soon as possible.
  • We chose Wang 2020 and DE-FAKE as the baselines due to their emphasis on generalizability and their representativeness in this field. Furthermore, following the comments by Reviewer eMV8, we have introduced the results of other deepfake detection methods in Table 1 of the revised submission.

3. [Validate our method on other benchmark datasets]

  • Thank you for the advice. As the exact testing dataset of DE-FAKE is not open-sourced, in our original submission we did our best to follow their experimental setting by adopting COCO, Flickr, LD, SD, and DALLE in our testing dataset, except for GLIDE. To better fulfill the reviewer's suggestion, we have further introduced GLIDE into our testing dataset and provide the corresponding results in Table 1 of the revised submission.
  • What's more, both DE-FAKE and our method are trained on COCO+SD2, while our method is tested on more difficult and more diverse types of datasets (e.g., the held-out testing datasets from image inpainting, super resolution, image attacks, etc.), which makes our superior experimental results more conclusive and convincing.
Comment

Dear Reviewer 4Y5R,

Thank you so much for your valuable time in reviewing this submission. This is a friendly reminder that the final discussion ends soon. We have tried our best to address your concerns in our responses, which, hopefully, answer your questions. If you have any further concerns, please feel free to let us know.

Regards,

Authors of Submission2329

Comment

The revision partially addressed the reviewer's concerns, including experiments with new baselines and deepfake datasets. However, the reviewer still holds concerns about the contribution. Admittedly, the attempt to employ a VLM as a foundational model for generalized deepfake detection is commendable. Yet, the proposed method lacks significant technical contribution (the reviewer concurs with Reviewer NiiK’s observation that the proposed method is "quite straightforward"). Deepfake detection has evolved over many years. A robust deepfake detection study should, at a minimum, either unearth unique visual features of deepfake videos [1][2] or identify blind spots in current detection models [3]. The proposed approach merely involves feeding deepfake videos into the VLM. Similarly, the authors have not demonstrated deepfake characteristics in the text prompts. The method could be easily transferred to other tasks by altering the text prompts, for instance, changing "Is this photo real" to "Are these people playing football," thus prompting the VLM to switch to an action recognition model. Given the lack of in-depth analysis of deepfake characteristics, the reviewer believes that this paper does not yet meet the publication standards of ICLR.

[1] Li et al. "Face x-ray for more general face forgery detection." CVPR 2020.

[2] Wang et al. "DIRE for Diffusion-Generated Image Detection." ICCV 2023.

[3] Dong et al. "Implicit Identity Leakage: The Stumbling Block to Improving Deepfake Detection Generalization." CVPR 2023.

P.S. It should be noted that the reference [1] in the initial comments pertains to DIRE. Many apologies for any confusion.

Comment

Dear Reviewer 4Y5R,

We sincerely appreciate your thoughtful responses and the acknowledgment of our paper. Here are our replies addressing your concerns:

1. [Lack of significant technical contribution.]

We understand this is a subjective judgement, and the debate could go on endlessly. We would rather demonstrate our contributions by arguing for the novelty of the proposed method using Prof. Michael Black’s interpretation of novelty from "Novelty in Science". We quote some of it:

  • About the novelty of our connection between two established fields, visual question answering and deepfake detection, Prof. Michael Black says
    • “The novelty arose from the fact that nobody had put these ideas together before.”
    • “To see the connections for the first time, before others saw them, was like breathing for the first time.”
    • “The resulting paper embodies the translation of the idea into code, experiments, and text. In this translation, the beauty of the spark may be only dimly glimpsed. My request of reviewers is to try to imagine the darkness before the spark.”
  • We are the first to formulate the deepfake detection problem as a visual question answering problem; no prior work has put these two ideas together, making the proposed method an impressive "spark" in this field.
  • About the novelty of our simple and reasonable reuse of VLM and prompt tuning on deepfake detection problem, Prof. Michael Black says
    • “I value simplicity over unnecessary complexity; the simpler the better. Taking an existing network and replacing one thing is better science than concocting a whole new network just to make it look more complex.”
    • “If a paper has a simple idea that works better than the state of the art, then it is most likely not trivial. The authors are onto something and the field will be interested.”
    • “The inventive novelty was to have the idea in the first place. If it is easy to explain and obvious in hindsight, this in no way diminishes the creativity (and novelty) of the idea.”
  • The proposed method simply adds a pseudo-word and optimizes its word embedding, and then reaches SOTA performance on various types of generated images; this makes it not a trivial idea, while also demonstrating the simplicity and effectiveness of the idea.
  • Additionally, the proposed idea is “straightforward” and “effective” (by Reviewer 4Y5R), which supports the value of the proposed method as well.

We therefore appeal for the reviewer’s re-evaluation of the “contribution” rating and the final rating.

2. [Lack of in-depth analysis of deepfake characteristics.]

  • Thank you for your comment. Actually, we have conducted some analysis on the prompt-tuned text (pseudo-word); a sketch of this analysis appears after this list. We first used t-SNE to reduce the dimensionality of every word embedding into a 2-dimensional representation. Then, we found the nearest neighbors of the pseudo-word embedding using the Euclidean distance metric, trying to figure out the closest meaning of the pseudo-word. However, most of the neighbors were meaningless symbols or punctuation marks such as "-", "of", "and", or even unrelated words in other languages.
  • We also conducted some analysis on the visual features. We used GradCAM to analyze the important regions in the images. Although we did observe that AntifakePrompt focused on certain regions in some samples to consider them fake, we have not found any significant or consistent visual characteristic throughout the GradCAM results of our samples to draw a concrete conclusion.
  • What's more, since the soft prompt tuning technique tunes a continuous embedding for simplicity of optimization, it is hard, and arguably meaningless, to find a corresponding human-readable word for the tuned embedding.
  • Therefore, following [1][2], we demonstrate the effectiveness of the proposed method by analyzing the outstanding experimental results instead of the meaning of the pseudo-word.
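For concreteness, here is a minimal sketch of the embedding analysis described above. It assumes access to the tuned pseudo-word embedding and the LLM's vocabulary embedding matrix; the names `s_star`, `vocab_embeds`, and `id_to_token` are illustrative, not from the paper's code:

```python
import numpy as np
from sklearn.manifold import TSNE

def nearest_words(s_star, vocab_embeds, id_to_token, k=10):
    """Nearest vocabulary tokens to the tuned pseudo-word (Euclidean distance)."""
    dists = np.linalg.norm(vocab_embeds - s_star[None, :], axis=1)
    return [id_to_token[i] for i in np.argsort(dists)[:k]]

def tsne_2d(s_star, vocab_embeds, sample_size=2000, seed=0):
    """Project a vocabulary subsample plus S* into 2-D for visualization."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(vocab_embeds), size=sample_size, replace=False)
    points = np.vstack([vocab_embeds[idx], s_star[None, :]])
    return TSNE(n_components=2, metric="euclidean").fit_transform(points)
```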

[1] Chen, Lichang, et al. "InstructZero: Efficient Instruction Optimization for Black-Box Large Language Models." arXiv preprint arXiv:2306.03082 (2023).

[2] Zou, Andy, et al. "Universal and transferable adversarial attacks on aligned language models." arXiv preprint arXiv:2307.15043 (2023).

Best, Authors of Submission2329

Review
Rating: 5

This paper proposes a new method to improve the generalizability of fake image detection models by taking advantage of large pretrained vision-language models. Specifically, the proposed method reformulates the fake image detection task as a VQA task, i.e., asking the vision-language model to answer whether the input image is real. To achieve this, the authors propose to insert learnable task-specific embeddings into the pretrained vision-language model and train these newly inserted parameters with a prompt tuning algorithm. Experiments on real image datasets and some model-generated fake images show the superiority of the proposed model over existing models.
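To make the mechanics concrete, here is a minimal PyTorch sketch of the idea just summarized: freeze the VLM and learn only the embedding of one pseudo-word spliced into the question. All names (`vlm`, `embed_tokens`, the forward signature) are illustrative stand-ins for an InstructBLIP-like model, not the authors' released code:

```python
import torch
import torch.nn as nn

class PseudoWordPrompt(nn.Module):
    """Freeze a VLM and learn only the embedding of one pseudo-word S*."""

    def __init__(self, vlm, embed_dim: int):
        super().__init__()
        self.vlm = vlm
        for p in self.vlm.parameters():          # keep the backbone frozen
            p.requires_grad = False
        # The single trainable tensor: the soft prompt for S*.
        self.s_star = nn.Parameter(torch.randn(1, 1, embed_dim) * 0.02)

    def forward(self, images, question_ids):
        tok = self.vlm.embed_tokens(question_ids)      # (B, T, D) question embeds
        s = self.s_star.expand(tok.size(0), -1, -1)    # (B, 1, D)
        # "postfix" variant: splice S* just before the final "?" token;
        # the "prefix"/"replace" variants splice it at other positions.
        prompt = torch.cat([tok[:, :-1], s, tok[:, -1:]], dim=1)
        return self.vlm(images=images, inputs_embeds=prompt)

# Training: cross-entropy on the "Yes" (real) / "No" (fake) answer tokens,
# with an optimizer that sees only the pseudo-word embedding:
# optimizer = torch.optim.AdamW([model.s_star], lr=1e-3)
```

The point of the design is that the optimizer touches a single small tensor, which is consistent with the few-thousand-parameter budget discussed later in this thread.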

Strengths

(1) This paper is well written. Most of the technical details are clearly presented; it would be easy for readers to understand the proposed method and for followers to reproduce or improve it.

(2) Formulating the fake image detection task as a visual question answering task is a quite novel attempt in this area. Even though such an idea has been adopted in other areas, I believe the attempt in this paper should be encouraged.

(3) By applying the proposed method to a pretrained vision-language model, the resulting model achieves superior fake image detection performance on a wide range of tasks over existing models, and it also has high generalizability, as shown in the experiments.

Weaknesses

Novelty: although the idea of adopting vision-language models for fake image detection may be novel in this area, the idea itself is quite straightforward.

The experiments are not enough to validate the superiority of the proposed method; they only validate the superiority of the resulting model. To be specific, (1) the authors compared the proposed method with existing methods (i.e., Wang 2020 and DE-FAKE), the pretrained vision-language model without finetuning, and the pretrained model finetuned with LoRA. However, since Wang 2020 is trained on different datasets, the comparison is not fair. On the other hand, the DE-FAKE model in this experiment has a quite different backbone, so the comparison between the proposed method and DE-FAKE is also unfair. As a result, it is not clear how much the pretrained image encoder contributes to the superior performance of the resulting model. It is likely that Wang 2020 and DE-FAKE could achieve similar fake image detection performance by adopting the training data and backbone used by the proposed method. However, this assumption is not evaluated in the experiments. (2) The LoRA-finetuned alternative performs poorly on the three attack tasks, while it performs better than the proposed method on the other data (93.04 vs. 91.09). The authors did not give enough insight into this phenomenon, and it suggests that the superior performance of the proposed method is not very solid.

Questions

(1) It would be necessary to give the results of Wang 2020 and DE-FAKE with the same training data and backbone as the proposed method. (2) It would be nice to see more discussion of the possible reasons for the good performance of the proposed method. For example, where does the performance gain come from: the pretrained image encoder, the training algorithm, or some other factor? (3) As shown in Table 1, except for the three attack tasks, the LoRA-finetuned model achieves better performance than the proposed method. What is the possible reason for this phenomenon? Is it possible to obtain a better model by just replacing LoRA with some other finetuning algorithm? (4) It seems that directly using average accuracy over different datasets and tasks might not be a proper metric, since different tasks have different numbers of test images and might have different importance in real-world applications. It would be better if the authors could give some results in other metrics, e.g., weighted average accuracy, AUC, etc. (5) This is just for discussion: it seems that the proposed method is not limited to the preset question (i.e., "Is this photo real") adopted in this manuscript. If we provide more specific information in the question (e.g., "Is this photo real or generated by deep learning models"), is it possible that the model can attend to more task-specific visual features and achieve better performance? Or is it possible that simpler finetuning techniques might achieve similar performance as the proposed one in this situation?

Comment

(Continued)

3. [Why the LoRA-finetuned model achieves better performance than the proposed method except for the three attack tasks?]

  • We explained this in the last paragraph of Section 4.3 in the original paper. The LoRA-finetuned model introduces more learnable parameters (around 4 million) into the LLM, so it is more likely to overfit to artifacts of the held-in dataset. The images from the three attacks are edited from real images by small pixel-level differences, so their artifacts differ from those of images generated by other generative models, including the held-in dataset. Hence, we conclude that the accuracy drop of the LoRA-finetuned model when applied to the images from the three attacks (whose traits differ from the held-in dataset) can be attributed to overfitting.
  • Here is another high-level insight. We can consider the prompt tuning technique as "extracting" useful information already inside the LLM. On the other hand, the LoRA-finetuned model can be seen as adding extra components to "adapt" to the specific task, leading to better performance on the training set but worse generalizability. A rough parameter-count sketch follows this list.
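To make the overfitting argument quantitative, here is a back-of-the-envelope comparison of the two trainable-parameter budgets. The dimensions below are assumptions for a Vicuna-like LLM, not values from the paper; they are chosen only to show that the two approaches differ by three orders of magnitude, consistent with the ~4M vs. ~5K figures quoted in this thread:

```python
# All dimensions below are assumptions for a Vicuna-like LLM, not paper values.
hidden = 4096   # LLM hidden size
layers = 32     # transformer layers
rank = 8        # typical LoRA rank

prompt_params = hidden                           # one S* embedding: ~4K
lora_params = layers * 2 * (2 * hidden * rank)   # A+B factors on q,v: ~4.2M

print(f"prompt tuning: {prompt_params:,} trainable params")   # 4,096
print(f"LoRA (r=8)   : {lora_params:,} trainable params")     # 4,194,304
```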

4. [Is it possible to obtain a better model by just replacing LoRA with some other finetuning algorithms?]

  • We chose LoRA as one of our baselines because it is one of the most popular and widely used parameter-efficient finetuning methods for LLMs.
  • Nevertheless, we are happy to follow your suggestion by replacing LoRA with another finetuning algorithm, AdaLoRA [1]. The results are shown below:
| Method | COCO | Flickr | SD2 | SDXL | IF | DALLE2 | SGXL |
|---|---|---|---|---|---|---|---|
| LoRA | 95.73 | 91.83 | 98.03 | 96.33 | 86.60 | 99.57 | 97.67 |
| AdaLoRA | 96.10 | 88.73 | 97.00 | 95.23 | 88.67 | 99.10 | 98.93 |
| AntifakePrompt | 95.37 | 91.00 | 97.83 | 97.27 | 89.73 | 99.57 | 99.97 |

| Method | GLIDE | ControlNet | DeeperForensics | DFDC | FF++ | LaMa | SD2(IP) |
|---|---|---|---|---|---|---|---|
| LoRA | 95.90 | 92.87 | 98.80 | 90.03 | 94.70 | 59.50 | 93.03 |
| AdaLoRA | 96.27 | 90.63 | 98.53 | 99.60 | 95.90 | 38.23 | 84.40 |
| AntifakePrompt | 99.17 | 91.47 | 97.90 | 100.00 | 97.43 | 39.03 | 85.20 |

| Method | LTE | SD2(SR) | AdverAtk | BackdoorAtk | DataPoisonAtk | Avg. Acc |
|---|---|---|---|---|---|---|
| LoRA | 99.53 | 99.97 | 64.30 | 53.40 | 50.87 | 87.30 |
| AdaLoRA | 99.33 | 99.83 | 70.53 | 78.30 | 76.57 | 89.05 |
| AntifakePrompt | 99.90 | 99.93 | 96.70 | 93.00 | 91.57 | 92.74 |
  • As can be observed from the results, the AdaLoRA-finetuned model also suffers (as LoRA does) from an accuracy drop when applied to the three attack tasks, which again underscores the generalizability of our proposed AntifakePrompt.

5. [Using average accuracy over different datasets and tasks might not be a proper metric.]

  • We respectfully disagree. First, it should be clarified that the numbers of images used in every task are actually the same, so the accuracies can be averaged without any weights. Secondly, the importance of each task cannot be formally defined, so we consider every task to have the same importance. Lastly, since we take a VLM as our backbone, which directly outputs text instead of a probability distribution over classes, the AUC metric may not be suitable in our case.
  • Still, we really appreciate your comment on the relation between the importance of tasks and weighted accuracy. Therefore, we provide the weighted accuracy, including "Real Accuracy" (average over all real datasets), "Fake Accuracy" (average over all fake datasets), and "Weighted Accuracy" (average of Real Accuracy and Fake Accuracy), assigning the same importance to real datasets and fake datasets (see the sketch after this list).
| Method | Real Acc | Fake Acc | Weighted Acc |
|---|---|---|---|
| Wang 2020 | 99.95 | 11.02 | 55.49 |
| DE-FAKE | 88.32 | 57.78 | 73.05 |
| AntifakePrompt | 93.19 | 92.69 | 92.94 |
  • No matter which metric is used, our method shows superior performance over the baselines.
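For clarity, the weighting scheme just described amounts to the following sketch (the per-dataset accuracy lists are the inputs; the function name is ours):

```python
def group_weighted_accuracy(real_accs, fake_accs):
    """Equal weight to the real group and the fake group as wholes,
    regardless of how many datasets each group contains."""
    real_acc = sum(real_accs) / len(real_accs)
    fake_acc = sum(fake_accs) / len(fake_accs)
    return real_acc, fake_acc, (real_acc + fake_acc) / 2

# e.g. the Wang 2020 row above: Real Acc 99.95, Fake Acc 11.02 -> 55.49 weighted
```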

[1] Zhang, Qingru, et al. "Adaptive budget allocation for parameter-efficient fine-tuning." arXiv preprint arXiv:2303.10512 (2023).

Comment

We thank all the reviewers for their constructive suggestions, which help improve the completeness of our submission; the revised parts of the paper are marked in red. We are encouraged that the reviews are positive in the following four respects:

  • The paper is "well written" (Reviewer NiiK).
  • The problem we are researching is “a topic of active research interest” (Reviewer eMV8) and “trending” (Reviewer 4Y5R).
  • Our solution is “novel”, "highly generalizable" (Reviewer NiiK), “innovative” (Reviewer eMV8), “straightforward” and “effective” (Reviewer 4Y5R).
  • Our experiments are “extensive” (Reviewer 4Y5R).

We now address individual concerns of Reviewer NiiK below.

1. [Unfair to compare the two baselines.]

  • The main goal of our work is to propose a solution to deepfake detection with stronger generality, and thus we mainly focus on the performance on the held-out datasets. Therefore, we chose the two baselines (Wang 2020 and DE-FAKE) due to their highlighted generalizability. At test time, the held-out datasets are unseen for both our method and the baselines, so we think the comparison between our method and the two baselines is fair.
  • We respectfully disagree with the idea of changing backbones. Wang 2020 and DE-FAKE use ResNet as their backbone model, while we choose InstructBLIP (a VLM) as ours. Because of the differences in input and output between ResNet and InstructBLIP, we think the backbone is not transferable.
  • Still, we really appreciate your suggestion about fairness, so we have trained the model of Wang 2020 and our model with a subset (due to insufficient time) of the training data (COCO+SD2, not including SD2IP). We did not train DE-FAKE as its training code is not open-sourced; however, please note that the official DE-FAKE was actually trained on COCO+SD2, which is the same as our training dataset. The results are shown below:
| Method | COCO | Flickr | SD2 | SDXL | IF | DALLE2 | SGXL |
|---|---|---|---|---|---|---|---|
| Wang 2020 | 99.83 | 99.97 | 66.10 | 35.30 | 30.43 | 5.83 | 12.03 |
| DE-FAKE | 85.97 | 90.67 | 97.10 | 90.50 | 99.20 | 68.97 | 56.90 |
| AntifakePrompt | 97.73 | 95.20 | 98.33 | 98.13 | 96.57 | 88.30 | 88.00 |

| Method | GLIDE | ControlNet | DeeperForensics | DFDC | FF++ | LaMa | SD2(IP) |
|---|---|---|---|---|---|---|---|
| Wang 2020 | 10.83 | 6.97 | 0.37 | 0.60 | 0.63 | 0.10 | 1.27 |
| DE-FAKE | 76.50 | 63.97 | 86.97 | 56.13 | 78.90 | 13.03 | 16.00 |
| AntifakePrompt | 75.57 | 80.50 | 97.13 | 95.10 | 94.47 | 6.37 | 28.73 |

| Method | LTE | SD2(SR) | AdverAtk | BackdoorAtk | DataPoisonAtk | Avg. Acc |
|---|---|---|---|---|---|---|
| Wang 2020 | 0.63 | 4.30 | 0.27 | 0.40 | 1.90 | 19.88 |
| DE-FAKE | 9.97 | 29.70 | 60.40 | 22.23 | 55.87 | 61.00 |
| AntifakePrompt | 23.80 | 60.47 | 54.10 | 44.07 | 55.07 | 72.51 |
  • Following the suggestion of Reviewer eMV8, we test the methods on more fake datasets (DFDC, FF++). The results demonstrate that Wang 2020 suffers from insufficient generalizability, leading to poor performance on held-out datasets (datasets other than COCO & SD2). We think the reason is that Wang 2020 chooses ResNet as the backbone, which may result in overfitting to the training data due to the large number of trainable parameters (in contrast, our proposed method does not finetune the entire VLM but only optimizes the pseudo-word S*).

2. [Reasons for our good performance]

  • Thank you for your comment. As mentioned in Section 2.3 of the original submission, a VLM supports multimodal comprehension by utilizing the strong generality of LLMs. Specifically, the strong generality of an LLM comes from the large scale of its training data (e.g., corpus), enabling the LLM to answer questions it has never seen before. Therefore, a VLM can also achieve strong generality with the help of its LLM component. This detailed discussion has been added in red to Paragraph 5 of Section 1 and Paragraph 4 of Section 4.2 in the revised submission.
Comment

(Continued)

6. [Other preset questions may achieve better performance?]

  • The influence of the preset question is quite small. Actually, we tried different preset questions in Section 4.3 of the original paper, namely "prefix" ("S* Is this photo real?"), "postfix" ("Is this photo real S*?"), and "replace" ("Is this photo S*?"). The experimental results of "postfix" and "replace", which have different preset questions, show that the accuracy (88.94 / 91.59) is not sensitive to the preset question or the position of the pseudo-word.
  • Still, we also conducted more experiments on other preset questions during the rebuttal, in which we set the question to "Is this photo real or generated by deep learning models S*?" (denoted as "Extended" in the results below) and "S*" (denoted as "One word" in the results below). Here are the results:
| Prompt | COCO | Flickr | SD2 | SDXL | IF | DALLE2 | SGXL |
|---|---|---|---|---|---|---|---|
| Extended | 95.30 | 91.50 | 98.10 | 97.73 | 92.13 | 98.70 | 99.43 |
| One word | 95.77 | 89.70 | 97.10 | 96.40 | 89.13 | 99.40 | 99.40 |
| AntifakePrompt | 95.37 | 91.00 | 97.83 | 97.27 | 89.73 | 99.57 | 99.97 |

| Prompt | GLIDE | ControlNet | DeeperForensics | DFDC | FF++ | LaMa | SD2(IP) |
|---|---|---|---|---|---|---|---|
| Extended | 99.17 | 92.53 | 99.83 | 99.97 | 99.40 | 38.80 | 84.47 |
| One word | 98.73 | 89.47 | 100.00 | 100.00 | 98.30 | 36.07 | 80.03 |
| AntifakePrompt | 99.17 | 91.47 | 97.90 | 100.00 | 97.43 | 39.03 | 85.20 |

| Prompt | LTE | SD2(SR) | AdverAtk | BackdoorAtk | DataPoisonAtk | Avg. Acc |
|---|---|---|---|---|---|---|
| Extended | 99.97 | 99.93 | 85.60 | 84.33 | 74.73 | 91.14 |
| One word | 99.97 | 99.80 | 96.67 | 86.67 | 83.70 | 91.45 |
| AntifakePrompt | 99.90 | 99.93 | 96.70 | 93.00 | 91.57 | 92.74 |
  • The difference among the three average accuracies is less than 2%, which implies that the performance is not sensitive to the preset question.
Comment

Dear Reviewer NiiK,

Thank you so much for your valuable time in reviewing this submission. This is a friendly reminder that the final discussion ends soon. We have tried our best to address your concerns in our responses, which, hopefully, answer your questions. If you have any further concerns, please feel free to let us know.

Regards,

Authors of Submission2329

Comment

I partially agree with the authors' response. Specifically, I am satisfied with the answers in parts 2, 3, and 4. As for the other parts:

(1) In part 1, it seems that Wang 2020 with the VLM backbone performs even worse than the ResNet-based version. The results support the superiority of the proposed method. However, it is still likely that Wang 2020 could achieve much better performance with careful tuning of the training schedule. I suggest the authors explore this possibility in future work.

(2) In part 5, the authors gave results under more metrics, which I appreciate. However, more metrics should be considered, e.g., false positive rate (FPR), false negative rate (FNR), etc. Such metrics could help users better understand the possible problems of the current method. On the other hand, I still believe that average accuracy is not a good metric for comparing different methods. To be specific, the choice of evaluation datasets might severely affect the results. For example, if we don't care about the model's performance on adversarial attack images and choose not to evaluate the models on such datasets, the conclusion could be totally different. Or, if we have more image datasets generated by diffusion models, the average accuracy metric would put more weight on such datasets, and the final metric would be biased. Therefore, more consideration should be taken.

(3) In part 6, it is no wonder that the proposed question ("Is this photo real?") achieves the best performance, since the model and the additional token S* are trained with this question. However, I believe the results in this part are enough to validate that the model is insensitive to the specific question, which addresses my concern well enough.

Comment

Dear Reviewer NiiK,

We sincerely appreciate your thoughtful responses and the acknowledgment of our paper. Here are our replies addressing your concerns:

1. [Acknowledging the superiority of our model and suggestions about Wang2020]

  • We are delighted that you recognized the superiority of our model through the additional comparisons we made in part 1. It is important to clarify that we only re-trained our model and Wang 2020 (with its ResNet backbone) for the comparisons, without changing the backbone of Wang 2020 to a VLM.
  • Please note that, in order to reproduce the best results of the Wang 2020 baseline as reported in the original paper, we adopted the original hyperparameter settings of Wang 2020 (provided in their official source code); the experimental results of Wang 2020 presented in our rebuttal, our original submission, and our revised submission all follow exactly the same setting. Nevertheless, we thank the reviewer for the suggestion and will try to tune the hyperparameters of Wang 2020, and we will update the numbers in our final version if a better result shows up.

2. [Concerns about choosing average accuracy as evaluation metric]

  • We acknowledge that the choice of testing datasets can largely affect the results, so we carefully included six types of fake image sources in our paper to cover as many kinds of fake images as we might encounter in real life. By evaluating our method and the baseline methods on such diverse datasets, we are able to draw more convincing comparisons and conclusions without biasing towards specific types of fake images or misleading our readers.

  • The reasons we adopted average accuracy as our evaluation metric are to align with the metric used in DIRE [1] and to serve as a straightforward indicator for deepfake detection. Since our model directly outputs the predicted text ("Yes" or "No") rather than a probability for each class, metrics requiring thresholds or confidence values (such as AUC, AP, or mAP) might not be suitable in our case. Nevertheless, we appreciate your suggestion and have provided FPR, FNR, and F-score for all combinations of real and fake datasets here:

  • COCO + (fake dataset)

| (FPR / FNR / F-score) | SD2 | SDXL | IF | DALLE2 | SGXL | GLIDE |
|---|---|---|---|---|---|---|
| Wang 2020 | 0.03 / 1.00 / 0.00 | 0.03 / 1.00 / 0.00 | 0.03 / 0.81 / 0.31 | 0.03 / 0.97 / 0.06 | 0.03 / 0.21 / 0.87 | 0.03 / 0.83 / 0.29 |
| DE-FAKE | 0.14 / 0.03 / 0.92 | 0.14 / 0.10 / 0.88 | 0.14 / 0.01 / 0.93 | 0.14 / 0.31 / 0.75 | 0.14 / 0.43 / 0.67 | 0.14 / 0.24 / 0.80 |
| DIRE | 0.18 / 0.96 / 0.06 | 0.18 / 0.82 / 0.27 | 0.18 / 0.93 / 0.11 | 0.18 / 0.98 / 0.03 | 0.18 / 0.55 / 0.55 | 0.18 / 0.95 / 0.07 |
| LASTED | 0.25 / 0.41 / 0.64 | 0.25 / 0.49 / 0.58 | 0.25 / 0.42 / 0.64 | 0.25 / 0.42 / 0.64 | 0.25 / 0.36 / 0.68 | 0.25 / 0.46 / 0.61 |
| J. Ricker 2022 | 0.04 / 0.19 / 0.87 | 0.04 / 0.00 / 0.98 | 0.04 / 0.07 / 0.94 | 0.04 / 0.48 / 0.67 | 0.04 / 0.00 / 0.98 | 0.04 / 0.16 / 0.89 |
| QAD | 0.34 / 0.62 / 0.44 | 0.34 / 0.54 / 0.51 | 0.34 / 0.65 / 0.42 | 0.34 / 0.61 / 0.45 | 0.34 / 0.70 / 0.37 | 0.34 / 0.44 / 0.59 |
| InstructBLIP | 0.01 / 0.60 / 0.57 | 0.01 / 0.77 / 0.37 | 0.01 / 0.79 / 0.34 | 0.01 / 0.58 / 0.58 | 0.01 / 0.31 / 0.81 | 0.01 / 0.62 / 0.55 |
| InstructBLIP + LoRA | 0.04 / 0.02 / 0.97 | 0.04 / 0.04 / 0.96 | 0.04 / 0.13 / 0.91 | 0.04 / 0.00 / 0.98 | 0.04 / 0.02 / 0.97 | 0.04 / 0.04 / 0.96 |
| AntifakePrompt | 0.05 / 0.02 / 0.97 | 0.05 / 0.03 / 0.96 | 0.05 / 0.10 / 0.92 | 0.05 / 0.00 / 0.98 | 0.05 / 0.00 / 0.98 | 0.05 / 0.01 / 0.97 |
Comment

(Continued)

| (FPR / FNR / F-score) | ControlNet | Deeperforensics | DFDC | FF++ | LaMa | SD2(IP) |
|---|---|---|---|---|---|---|
| Wang 2020 | 0.03 / 0.89 / 0.20 | 0.03 / 1.00 / 0.01 | 0.03 / 1.00 / 0.01 | 0.03 / 0.95 / 0.10 | 0.03 / 0.93 / 0.14 | 0.03 / 1.00 / 0.00 |
| DE-FAKE | 0.14 / 0.36 / 0.72 | 0.14 / 0.13 / 0.87 | 0.14 / 0.44 / 0.66 | 0.14 / 0.21 / 0.82 | 0.14 / 0.87 / 0.20 | 0.14 / 0.84 / 0.25 |
| DIRE | 0.18 / 0.90 / 0.15 | 0.18 / 1.00 / 0.01 | 0.18 / 0.40 / 0.67 | 0.40 / 0.18 / 0.35 | 0.18 / 0.87 / 0.20 | 0.18 / 0.89 / 0.18 |
| LASTED | 0.25 / 0.49 / 0.58 | 0.25 / 0.14 / 0.82 | 0.25 / 0.30 / 0.72 | 0.30 / 0.25 / 0.72 | 0.25 / 0.40 / 0.65 | 0.25 / 0.43 / 0.63 |
| J. Ricker 2022 | 0.04 / 0.25 / 0.84 | 0.04 / 0.86 / 0.24 | 0.04 / 0.53 / 0.62 | 0.53 / 0.04 / 0.33 | 0.04 / 0.36 / 0.76 | 0.04 / 0.41 / 0.72 |
| QAD | 0.34 / 0.64 / 0.43 | 0.34 / 0.37 / 0.64 | 0.34 / 0.23 / 0.73 | 0.23 / 0.34 / 0.82 | 0.34 / 0.63 / 0.43 | 0.34 / 0.66 / 0.41 |
| InstructBLIP | 0.01 / 0.66 / 0.50 | 0.01 / 0.86 / 0.24 | 0.01 / 0.86 / 0.24 | 0.86 / 0.01 / 0.61 | 0.01 / 0.89 / 0.19 | 0.01 / 0.56 / 0.61 |
| InstructBLIP + LoRA | 0.04 / 0.07 / 0.94 | 0.04 / 0.01 / 0.97 | 0.04 / 0.10 / 0.93 | 0.10 / 0.04 / 0.95 | 0.04 / 0.41 / 0.73 | 0.04 / 0.07 / 0.94 |
| AntifakePrompt | 0.05 / 0.09 / 0.93 | 0.05 / 0.02 / 0.97 | 0.05 / 0.00 / 0.98 | 0.01 / 0.05 / 0.96 | 0.05 / 0.61 / 0.54 | 0.05 / 0.15 / 0.90 |
| (FPR / FNR / F-score) | LTE | SD2(SR) | AdverAtk | BackdoorAtk | DataPoisonAtk |
|---|---|---|---|---|---|
| Wang 2020 | 0.03 / 0.85 / 0.26 | 0.03 / 0.99 / 0.03 | 0.03 / 0.95 / 0.09 | 0.03 / 0.85 / 0.26 | 0.03 / 0.99 / 0.02 |
| DE-FAKE | 0.14 / 0.90 / 0.16 | 0.14 / 0.70 / 0.41 | 0.14 / 0.40 / 0.69 | 0.14 / 0.78 / 0.33 | 0.14 / 0.44 / 0.66 |
| DIRE | 0.18 / 0.88 / 0.19 | 0.18 / 0.97 / 0.05 | 0.18 / 0.98 / 0.03 | 0.18 / 0.98 / 0.03 | 0.18 / 0.99 / 0.02 |
| LASTED | 0.25 / 0.28 / 0.73 | 0.25 / 0.40 / 0.65 | 0.25 / 0.41 / 0.64 | 0.25 / 0.47 / 0.59 | 0.25 / 0.48 / 0.59 |
| J. Ricker 2022 | 0.04 / 0.69 / 0.45 | 0.04 / 0.26 / 0.83 | 0.04 / 0.92 / 0.15 | 0.04 / 0.66 / 0.50 | 0.04 / 0.93 / 0.12 |
| QAD | 0.34 / 0.62 / 0.44 | 0.34 / 0.68 / 0.39 | 0.34 / 0.68 / 0.38 | 0.34 / 0.66 / 0.40 | 0.34 / 0.65 / 0.42 |
| InstructBLIP | 0.01 / 0.03 / 0.98 | 0.01 / 0.31 / 0.81 | 0.01 / 0.95 / 0.10 | 0.01 / 0.97 / 0.06 | 0.01 / 0.98 / 0.03 |
| InstructBLIP + LoRA | 0.04 / 0.01 / 0.98 | 0.04 / 0.00 / 0.98 | 0.04 / 0.36 / 0.76 | 0.04 / 0.47 / 0.68 | 0.04 / 0.49 / 0.66 |
| AntifakePrompt | 0.05 / 0.00 / 0.98 | 0.05 / 0.00 / 0.98 | 0.05 / 0.03 / 0.96 | 0.05 / 0.07 / 0.94 | 0.05 / 0.08 / 0.93 |
  • Flickr + (fake dataset)
| (FPR / FNR / F-score) | SD2 | SDXL | IF | DALLE2 | SGXL | GLIDE |
|---|---|---|---|---|---|---|
| Wang 2020 | 0.03 / 1.00 / 0.00 | 0.03 / 1.00 / 0.00 | 0.03 / 0.81 / 0.31 | 0.03 / 0.97 / 0.06 | 0.03 / 0.21 / 0.87 | 0.03 / 0.83 / 0.29 |
| DE-FAKE | 0.09 / 0.03 / 0.94 | 0.09 / 0.10 / 0.91 | 0.09 / 0.01 / 0.95 | 0.09 / 0.31 / 0.77 | 0.09 / 0.43 / 0.68 | 0.09 / 0.24 / 0.82 |
| DIRE | 0.23 / 0.96 / 0.06 | 0.23 / 0.82 / 0.26 | 0.23 / 0.93 / 0.11 | 0.23 / 0.98 / 0.03 | 0.23 / 0.55 / 0.54 | 0.23 / 0.95 / 0.07 |
| LASTED | 0.24 / 0.41 / 0.64 | 0.24 / 0.49 / 0.59 | 0.24 / 0.42 / 0.64 | 0.24 / 0.42 / 0.64 | 0.24 / 0.36 / 0.68 | 0.24 / 0.46 / 0.61 |
| J. Ricker 2022 | 0.04 / 0.19 / 0.88 | 0.04 / 0.00 / 0.98 | 0.04 / 0.07 / 0.94 | 0.04 / 0.48 / 0.67 | 0.04 / 0.00 / 0.98 | 0.04 / 0.16 / 0.89 |
| QAD | 0.35 / 0.62 / 0.44 | 0.35 / 0.54 / 0.51 | 0.35 / 0.65 / 0.42 | 0.35 / 0.61 / 0.45 | 0.35 / 0.70 / 0.37 | 0.35 / 0.44 / 0.59 |
| InstructBLIP | 0.00 / 0.60 / 0.57 | 0.00 / 0.77 / 0.37 | 0.00 / 0.79 / 0.34 | 0.00 / 0.58 / 0.59 | 0.00 / 0.31 / 0.82 | 0.00 / 0.62 / 0.55 |
| InstructBLIP + LoRA | 0.08 / 0.02 / 0.95 | 0.08 / 0.04 / 0.94 | 0.08 / 0.13 / 0.89 | 0.08 / 0.00 / 0.96 | 0.08 / 0.02 / 0.95 | 0.08 / 0.04 / 0.94 |
| AntifakePrompt | 0.09 / 0.02 / 0.95 | 0.09 / 0.03 / 0.94 | 0.09 / 0.10 / 0.90 | 0.09 / 0.00 / 0.95 | 0.09 / 0.00 / 0.96 | 0.09 / 0.01 / 0.95 |
Comment

(Continued)

| (FPR / FNR / F-score) | ControlNet | Deeperforensics | DFDC | FF++ | LaMa | SD2(IP) |
|---|---|---|---|---|---|---|
| Wang 2020 | 0.03 / 0.89 / 0.20 | 0.03 / 1.00 / 0.01 | 0.03 / 1.00 / 0.00 | 0.03 / 0.95 / 0.10 | 0.03 / 0.93 / 0.14 | 1.00 / 0.03 / 0.00 |
| DE-FAKE | 0.09 / 0.36 / 0.74 | 0.09 / 0.13 / 0.89 | 0.09 / 0.44 / 0.68 | 0.09 / 0.21 / 0.84 | 0.09 / 0.87 / 0.21 | 0.84 / 0.09 / 0.26 |
| DIRE | 0.23 / 0.90 / 0.15 | 0.23 / 1.00 / 0.00 | 0.23 / 0.40 / 0.66 | 0.23 / 0.75 / 0.34 | 0.23 / 0.87 / 0.19 | 0.89 / 0.23 / 0.17 |
| LASTED | 0.24 / 0.49 / 0.58 | 0.24 / 0.14 / 0.82 | 0.24 / 0.30 / 0.72 | 0.24 / 0.29 / 0.73 | 0.24 / 0.40 / 0.66 | 0.43 / 0.24 / 0.63 |
| J. Ricker 2022 | 0.04 / 0.25 / 0.84 | 0.04 / 0.86 / 0.24 | 0.04 / 0.53 / 0.62 | 0.04 / 0.80 / 0.33 | 0.04 / 0.36 / 0.76 | 0.41 / 0.04 / 0.72 |
| QAD | 0.35 / 0.64 / 0.43 | 0.35 / 0.37 / 0.64 | 0.35 / 0.23 / 0.73 | 0.35 / 0.06 / 0.82 | 0.35 / 0.63 / 0.43 | 0.66 / 0.35 / 0.41 |
| InstructBLIP | 0.00 / 0.66 / 0.51 | 0.00 / 0.86 / 0.24 | 0.00 / 0.86 / 0.25 | 0.00 / 0.56 / 0.61 | 0.00 / 0.89 / 0.20 | 0.56 / 0.00 / 0.61 |
| InstructBLIP + LoRA | 0.08 / 0.07 / 0.92 | 0.08 / 0.01 / 0.95 | 0.08 / 0.10 / 0.91 | 0.08 / 0.05 / 0.93 | 0.08 / 0.41 / 0.71 | 0.07 / 0.08 / 0.92 |
| AntifakePrompt | 0.09 / 0.09 / 0.91 | 0.09 / 0.02 / 0.95 | 0.09 / 0.00 / 0.96 | 0.09 / 0.03 / 0.94 | 0.09 / 0.61 / 0.53 | 0.15 / 0.09 / 0.88 |
| (FPR / FNR / F-score) | LTE | SD2(SR) | AdverAtk | BackdoorAtk | DataPoisonAtk |
|---|---|---|---|---|---|
| Wang 2020 | 0.03 / 0.85 / 0.26 | 0.03 / 0.99 / 0.03 | 0.03 / 0.95 / 0.09 | 0.03 / 0.85 / 0.26 | 0.03 / 0.99 / 0.02 |
| DE-FAKE | 0.09 / 0.90 / 0.17 | 0.09 / 0.70 / 0.43 | 0.09 / 0.40 / 0.71 | 0.09 / 0.78 / 0.34 | 0.09 / 0.44 / 0.68 |
| DIRE | 0.23 / 0.88 / 0.19 | 0.23 / 0.97 / 0.04 | 0.23 / 0.98 / 0.03 | 0.23 / 0.98 / 0.03 | 0.23 / 0.99 / 0.02 |
| LASTED | 0.24 / 0.28 / 0.73 | 0.24 / 0.40 / 0.65 | 0.24 / 0.41 / 0.65 | 0.24 / 0.47 / 0.60 | 0.24 / 0.48 / 0.60 |
| J. Ricker 2022 | 0.04 / 0.69 / 0.45 | 0.04 / 0.26 / 0.83 | 0.04 / 0.92 / 0.15 | 0.04 / 0.66 / 0.50 | 0.04 / 0.93 / 0.12 |
| QAD | 0.35 / 0.62 / 0.44 | 0.35 / 0.68 / 0.39 | 0.35 / 0.68 / 0.38 | 0.35 / 0.66 / 0.40 | 0.35 / 0.95 / 0.42 |
| InstructBLIP | 0.00 / 0.03 / 0.98 | 0.00 / 0.31 / 0.82 | 0.00 / 0.95 / 0.10 | 0.00 / 0.97 / 0.06 | 0.00 / 0.98 / 0.03 |
| InstructBLIP + LoRA | 0.08 / 0.01 / 0.96 | 0.08 / 0.00 / 0.96 | 0.08 / 0.36 / 0.75 | 0.08 / 0.47 / 0.66 | 0.08 / 0.49 / 0.64 |
| AntifakePrompt | 0.09 / 0.00 / 0.96 | 0.09 / 0.00 / 0.96 | 0.09 / 0.03 / 0.94 | 0.09 / 0.07 / 0.92 | 0.09 / 0.08 / 0.91 |
  • Overall average
| Method | Avg. FPR (↓) | Avg. FNR (↓) | Avg. F-score (↑) |
|---|---|---|---|
| Wang 2020 | 0.03 | 0.89 | 0.16 |
| DE-FAKE | 0.12 | 0.42 | 0.64 |
| DIRE | 0.21 | 0.87 | 0.17 |
| LASTED | 0.25 | 0.39 | 0.66 |
| J. Ricker 2022 | 0.04 | 0.44 | 0.64 |
| QAD | 0.35 | 0.55 | 0.49 |
| InstructBLIP | 0.01 | 0.66 | 0.45 |
| InstructBLIP + LoRA | 0.06 | 0.13 | 0.89 |
| AntifakePrompt | 0.07 | 0.07 | 0.92 |
  • We designate the label "fake" as positive and the label "real" as negative; a counting sketch of these metrics follows this list.
  • In the overall average table, InstructBLIP demonstrates the lowest average FPR (0.01), yet its average FNR is quite high (0.66). This indicates that InstructBLIP tends to classify most images as real, resulting in its relatively low F-score. In contrast, our AntifakePrompt strikes a better balance between detecting real images and detecting fake images, and thus shows the best average FNR and average F-score. This again highlights the superiority of AntifakePrompt.
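Since the model emits "Yes"/"No" text rather than scores, these metrics reduce to counting. Here is a sketch with fake as the positive class, as stated above; the answer-to-label mapping ("Yes" = real, "No" = fake) follows the convention described earlier in this thread:

```python
def fpr_fnr_fscore(answers, labels):
    """answers: VLM outputs, "Yes" (real) / "No" (fake);
    labels: ground truth "real"/"fake"; fake is the positive class."""
    preds = ["fake" if a == "No" else "real" for a in answers]
    tp = sum(p == "fake" and y == "fake" for p, y in zip(preds, labels))
    fp = sum(p == "fake" and y == "real" for p, y in zip(preds, labels))
    fn = sum(p == "real" and y == "fake" for p, y in zip(preds, labels))
    tn = sum(p == "real" and y == "real" for p, y in zip(preds, labels))
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    prec = tp / (tp + fp) if (tp + fp) else 0.0
    rec = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
    return fpr, fnr, f1
```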

3. [The model is insensitive to the specific question]

  • We are pleased that we successfully addressed your concern :)

[1] Wang, Zhendong, et al. "DIRE for Diffusion-Generated Image Detection." arXiv preprint arXiv:2303.09295 (2023).

Best,

Authors of Submission2329

Review
Rating: 3

The paper explores the potential of using a visual question answering (VQA) model as a deepfake detector and proposes soft-prompt tuning as an efficient finetuning method for this purpose. Specifically, the authors finetune InstructBLIP, a VQA model, using soft-prompt tuning to improve its deepfake detection capabilities. The paper shows that the deepfake detection performance of this finetuned model is quite good in various use cases involving generative diffusion models.

Strengths

  1. The paper innovatively uses soft-prompt tuning to improve deepfake detection performance in a VQA model without altering the original parameters.

  2. The paper addresses the issue of deepfake detection across a wide range of applications using diffusion models, currently a topic of active research interest.

  3. The paper provides a formal framework for utilizing a VQA model for deepfake detection, presents the potential of using a VQA model as a deepfake detector, and offers a viable finetuning technique for this purpose.

Weaknesses

  1. Deepfake detection is a subject of extensive research with many related papers available. Research works that address performance degradation on diffusion models and across various cross-dataset settings are not incorporated [1,2,3,4,5]. There are also studies that deal with the detection of low-quality, low-resolution deepfakes [6]. These should be considered and compared against in this paper.

  2. The test dataset used in the paper is biased towards diffusion-model-generated data. It would be great to evaluate the performance on other well-known deepfake datasets such as DFC [7], DFDC [8], and FF++ [9].

  3. The paper could benefit from leveraging unique features of VQA models beyond merely using them as large-scale detectors. For example, experiments that visualize or explain the detection reasoning using VQA's capabilities could offer a significant contribution to the community.

[1] Ma, Ruipeng, et al. "Exposing the Fake: Effective Diffusion-Generated Images Detection." arXiv preprint arXiv:2307.06272 (2023).

[2] Wu, Haiwei, et al. "Generalizable Synthetic Image Detection via Language-guided Contrastive Learning." arXiv preprint arXiv:2305.13800 (2023).

[3] Wang, Zhendong, et al. "DIRE for Diffusion-Generated Image Detection." arXiv preprint arXiv:2303.09295 (2023).

[4] Lorenz, Peter, et al. "Detecting Images Generated by Deep Diffusion Models using their Local Intrinsic Dimensionality." arXiv preprint arXiv:2307.02347 (2023).

[5] Ricker, Jonas, et al. "Towards the Detection of Diffusion Model Deepfakes." arXiv preprint arXiv:2210.14571 (2022).

[6] Le, Binh M., and Simon S. Woo. "Quality-Agnostic Deepfake Detection with Intra-model Collaborative Learning." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.

[7] benpflaum, et al. "Deepfake Detection Challenge." Kaggle, 2019. https://kaggle.com/competitions/deepfake-detection-challenge

[8] Dolhansky, Brian, et al. "The DeepFake Detection Challenge (DFDC) Dataset." arXiv preprint arXiv:2006.07397 (2020).

[9] Rossler, Andreas, et al. "FaceForensics++: Learning to Detect Manipulated Facial Images." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019.

Questions

Please address the comments and questions in the Weaknesses section.

Comment

We thank all the reviewers for their constructive suggestions, which help improve the completeness of our submission; the revised parts of the paper are marked in red. We are encouraged that the reviews are positive in the following four respects:

  • The paper is "well written" (Reviewer NiiK).
  • The problem we are researching is “a topic of active research interest” (Reviewer eMV8) and “trending” (Reviewer 4Y5R).
  • Our solution is “novel”, "highly generalizable" (Reviewer NiiK), “innovative” (Reviewer eMV8), “straightforward” and “effective” (Reviewer 4Y5R).
  • Our experiments are “extensive” (Reviewer 4Y5R).

We now address individual concerns of Reviewer eMV8 below.

1. [Compare analysis with other studies]

  • Thank you for suggesting these related works. We chose the two baselines (Wang 2020 & DE-FAKE) because both works highlighted their generalizability, and they were well-known or SOTA methods in this field before the submission.
  • We investigated the listed related works: [1] proposes a detection method using the reverse and denoising computation errors of intermediate steps; [2] proposes a method via language-guided contrastive learning; [3] proposes a method based on the diffusion reconstruction error at the initial timestep; [4] proposes a method using local intrinsic dimensionality; [5] conducts extensive experiments on many different datasets; [6] proposes a collaborative learning framework for detecting deepfakes of different qualities.
  • We follow the reviewer's suggestion to include comparisons with the methods of [2,3,5,6]; the experimental results and the corresponding analysis are added to Table 1 and Section 4.2, respectively, in the revised submission. We did not run experiments on [1,4] since they are regrettably not open-sourced, but we will add those experimental results to our final version once their code is published.

2. [Test dataset is biased towards diffusion model-generated data.]

  • Thank you for pointing this out. Since diffusion-based models are known for generating more genuine and photorealistic images than GAN-based models, we naturally incorporated more diffusion-based models to generate various types of fake images for more persuasive evaluations of our method and the baseline methods. Moreover, we chose DeeperForensics over other deepfake datasets because it is newer and thus more difficult than the DFDC and FF++ datasets, which makes our performance more convincing. For similar reasons, we selected SGXL to represent GAN-based models: SGXL can generate images with lower FID and higher IS scores than other GAN-based methods.
  • Also, we have followed your suggestion to test the methods on DFDC and FF++. The results are presented in Table 1 in the revised submission.
  • The reason we did not test on DFC is that the link for DFC directs us to the DFDC dataset, so we considered DFC and DFDC to be the same dataset. If there is any misunderstanding, please let us know.

3. [Leverage unique features of VQA models beyond merely using them as detectors.]

  • We sincerely appreciate the idea! Actually, we have conducted some experiments trying to leverage the unique features of the VLM.
  • First, we used GradCAM to analyze the important regions in the images (a generic sketch of this procedure follows this list). Although we did observe that AntifakePrompt focused on certain regions in some samples to consider them fake, we have not found any significant or consistent visual characteristic throughout the GradCAM results of our samples to draw a concrete conclusion.
  • Secondly, we tried to ask the VLM to give additional "reasons" when classifying the images by asking the question "Is this photo real S*? Why?". However, the response still contains only "Yes" or "No" without any extra reason, which may result from the significant influence of the prompt-tuned question ("Is this photo real S*?").
  • Despite not getting promising results from the aforementioned experiments, we believe there is more abstract yet valuable reasoning hidden behind the predictions. We will delve further into this in future work, thanks to your constructive suggestions.
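The GradCAM probe mentioned above is not shown in this thread; for readers unfamiliar with the technique, here is a generic Grad-CAM sketch on a torchvision ResNet-50 classifier. This is an illustrative stand-in, not the authors' InstructBLIP pipeline:

```python
import torch
from torchvision.models import resnet50, ResNet50_Weights

model = resnet50(weights=ResNet50_Weights.DEFAULT).eval()
store = {}

# Capture the last conv block's activations and their gradients.
model.layer4[-1].register_forward_hook(
    lambda m, i, o: store.update(act=o))
model.layer4[-1].register_full_backward_hook(
    lambda m, gi, go: store.update(grad=go[0]))

def grad_cam(x, class_idx):
    """x: (1, 3, H, W) normalized image; returns an (h, w) heatmap in [0, 1]."""
    logits = model(x)
    model.zero_grad()
    logits[0, class_idx].backward()
    weights = store["grad"].mean(dim=(2, 3), keepdim=True)   # GAP of gradients
    cam = torch.relu((weights * store["act"]).sum(dim=1))    # weighted channel sum
    return (cam / cam.max().clamp(min=1e-8)).squeeze(0)
```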
Comment

Dear Reviewer eMV8,

Thank you so much for your valuable time in reviewing this submission. This is a friendly reminder that the final discussion ends soon. We have tried our best to address your concerns in our responses, which, hopefully, answer your questions. If you have any further concerns, please feel free to let us know.

Regards,

Authors of Submission2329

AC Meta-Review

This work explores the problem of deepfake detection (distinguishing between real and fake images) on unseen data. To address this problem, the authors propose to use vision-language models and specifically their zero-shot abilities and formulate deepfake detection as a visual question answering problem. The authors propose to learn task-specific embeddings with prompt tuning and use InstructBLIP in their evaluations. While this work presents an interesting approach to an important problem, there are some shortcomings that should be addressed before it can be accepted. As reviewers pointed out, including relevant work and a more careful experimental evaluation that controls for training datasets and compares to state-of-the-art models would significantly improve this work.

Why not a higher score

Missing related work and issues with experimental evaluation

Why not a lower score

N/A

Final Decision

Reject