Guard Me If You Know Me: Protecting Specific Face-Identity from Deepfakes
Abstract
Reviews and Discussion
The paper proposes a novel and highly relevant approach that redefines personalized deepfake detection using fine-grained detection mechanisms together with MLLMs. The authors also introduce a new benchmark, VIPBench, for evaluating these identity-informed deepfakes, along with ablation studies covering multiple types of deepfake generation.
The paper redefines deepfake detection as a fine-grained face recognition task specifically tailored for "VIP individuals" whose authentic facial data is known. This contrasts with most existing general-purpose deepfake detectors that do not leverage identity-specific prior knowledge.
VIPGuard Framework: The authors also introduce VIPGuard, a unified multimodal framework designed for personalized deepfake detection and explanation. It incorporates both global identity information and detailed structural facial attributes of the VIPs.
Three-Stage Training Pipeline: VIPGuard operates through a progressive fine-tuning process:
- Facial Attribute Learning: An MLLM (specifically Qwen-2.5-VL-7B) is fine-tuned on a Facial Attribute Description Dataset (DFA) to enhance its understanding of detailed and structural facial attributes.
- Identity Discrimination Learning: The model undergoes further fine-tuning using an Identity Discrimination Dataset (DID) to learn to distinguish subtle differences between highly similar faces, including real and fake variations, by reasoning over global and local facial priors.
- User-Specific Customization: A lightweight, learnable VIP token is introduced to model the unique characteristics of a target VIP's face identity, enabling personalized and explainable deepfake detection. This stage refines the model's ability to perceive and distinguish the target VIP user.
VIPBench Benchmark: To rigorously evaluate the method, the paper introduces VIPBench, a comprehensive identity-aware benchmark. This dataset fundamentally differs from conventional deepfake benchmarks by explicitly focusing on scenarios where prior knowledge of the target individual is available. VIPBench includes 22 unique target identities and a total of 80,080 images generated by 14 state-of-the-art manipulation techniques (7 face-swapping and 7 entire-face synthesis methods), providing a diverse and realistic evaluation setting.
Strengths and Weaknesses
Strengths:
Research Quality:
Novelty and Originality: The paper introduces a novel approach to deepfake detection by focusing on "VIP individuals" and leveraging prior knowledge of their authentic facial identities. This is a significant departure from most existing general-purpose deepfake detection methods that overlook this valuable identity-specific information. It uniquely reforms forgery detection as a fine-grained face recognition task.
Technical Soundness:
The proposed VIPGuard framework employs a well-structured three-stage pipeline: Stage 1: Facial Attribute Learning fine-tunes a Multimodal Large Language Model (MLLM) (specifically Qwen-2.5-VL-7B) to understand detailed facial attributes. Stage 2: Identity Discrimination Learning enables the model to distinguish subtle differences between highly similar faces, including real and fake variations, by reasoning over global and local facial priors. Stage 3: User-Specific Customization models unique characteristics of the target face identity using a lightweight VIP token, facilitating personalized and explainable deepfake detection. This progressive fine-tuning approach appears logically sound and addresses a clear gap identified in existing methods.
Strong Experimental Results:
The paper presents extensive experiments demonstrating that VIPGuard significantly outperforms existing general deepfake detectors and other ID-aware methods.
It shows superior detection performance for both face-swapping and entire face synthesis techniques on the newly introduced VIPBench dataset, achieving high AUC and low EER values.
The proposed framework is shown to consistently outperform other MLLM-based methods in accuracy.
The ablation studies confirm the effectiveness and necessity of each of the three training stages, highlighting their synergistic contribution to VIPGuard's high performance.
The model also demonstrates explainability, providing human-understandable reasoning for its predictions.
Reproducibility: The authors commit to providing comprehensive experimental details in the supplementary material. The appendix indeed includes detailed implementation settings such as the backbone model, input image size, optimizer, learning rate, batch sizes, gradient accumulation, training epochs, and the use of mixed-precision computation within the Swift framework.
New Dataset (VIPBench):
The creation of VIPBench is a significant contribution, providing a comprehensive, identity-aware benchmark specifically designed for personalized deepfake detection. The appendix thoroughly details its construction, including the Facial Attributes Description Dataset (DFA), Identity Discrimination Dataset (DID), and VIPEval, along with data sources, preprocessing steps, VQA data construction, and the diverse range of 14 state-of-the-art generation techniques used.
Weaknesses:
Statistical Significance: The submitted paper checklist indicates "No" for statistical significance (error bars) and justifies it by space constraints, promising full variability analysis in the supplementary material and public code release. However, as noted in our conversation, the provided appendix does not yet contain these standard deviations or error bars for ablation studies.
Safeguards Checklist vs. Content: The checklist surprisingly marks "NA" for safeguards, implying no high-risk assets requiring them. This is directly contradicted by the included appendix which details explicit safeguards for the VIPBench dataset, including a request form and manual review to prevent misuse. This inconsistency between the checklist answer and the appendix's content should be reconciled.
Compute Resources Information: In the paper's checklist, in response to the question "For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?", the authors marked "No".
The NeurIPS guidelines recommend that papers indicate the type of compute workers (CPU or GPU), internal cluster or cloud provider, relevant memory and storage, and the amount of compute required for individual experimental runs and total compute. This information is currently lacking in the submission.
Questions
- Computational Resource Reporting:
Question/Suggestion: The NeurIPS Paper Checklist indicates "No" for providing sufficient information on computer resources needed to reproduce experiments, with the justification that "Detailed compute specifications (hardware type, memory usage) are not included in this submission" due to space considerations. To enhance reproducibility, please provide comprehensive details on the computational resources used for training and evaluation. It is worth trimming some of the currently included content to make room for this detail, in order to facilitate reproducibility.
Proposal for improvement: This should include the specific type of GPUs (e.g., NVIDIA A100, V100, number of cards), total GPU memory used, and an estimate of the training time per stage (in hours/days) and total inference time for the VIPBench dataset. This level of detail is crucial for other researchers to gauge the feasibility of reproducing your work.
- Statistical Significance and Variability Analysis:
Question/Suggestion: The paper's checklist states "No" for reporting error bars, noting that full variability analysis with standard deviations will be in the supplementary material. However, the included appendix does not appear to contain these standard deviations or error bars for the ablation studies or main performance tables. Please include standard deviations or appropriately defined error bars for all key experimental results, especially in Table 1 (ablation study) and Tables 1-3 in the main paper.
Proposal for improvement: Clearly state the number of independent runs used to calculate these statistics and how the error bars are defined (e.g., standard deviation or standard error). This will provide a more robust assessment of the method's consistency and reliability.
- Discrepancy in Safeguards Documentation:
Question/Suggestion: The NeurIPS Paper Checklist indicates "NA" for safeguards, stating that the work "does not release any pretrained models or scraped datasets with high misuse risk". However, Section A of the appendix explicitly details safeguards for the VIPBench dataset, including a request form and manual review to prevent misuse. This presents a contradiction between the checklist and the appendix's content. Please clarify this inconsistency and update the checklist to accurately reflect the responsible data release mechanism you have implemented for VIPBench.
Proposal for improvement: If safeguards are in place for the dataset, the answer to the checklist question "Does the paper describe safeguards...?" should be "Yes", referencing the detailed explanation in the appendix.
- Usage and Selection of Facial Recognition Models:
Question/Suggestion: In Stage 2 (Identity Discrimination Learning), the paper mentions employing "face recognition models to provide face similarity scores". It's unclear from the text whether all three referenced models (CosFace, ArcFace, TransFace) are used simultaneously, or if a specific one was chosen, or how they were combined if multiple were utilized. Could you please specify which facial recognition model(s) were ultimately used to generate the "face similarity scores" and provide a brief rationale for this choice or combination? Clarifying the specific model and its integration will provide better insight into the technical implementation of Stage 2 and its reliance on external facial priors.
- Novelty of "Fine-Grained Face Recognition" Formulation:
Question/Suggestion: The paper states it "reforms forgery detection as a fine-grained face recognition task" for the first time, and also mentions existing "identity-aware detection methods", including ICT-Ref, which it critiques for "fail[ing] to fully utilize the detailed identity-specific information". Could you further elaborate on the specific aspects that make your reformulation of deepfake detection as a fine-grained face recognition task fundamentally novel and distinct from previous "identity-aware" approaches?
Proposal for improvement: Explain precisely how your "fine-grained" approach differs in its conceptualization or technical execution from prior identity-aware methods that also aimed to leverage identity information. This will help clarify the unique contribution of your core problem formulation beyond simply utilizing more detailed facial attributes.
Limitations
Yes.
Final Justification
The paper's clarity issues were addressed by the authors satisfactorily. However, the main concern with this paper, which led to the initial score, was not sufficiently addressed: novelty. The authors contend that simply combining an identity-aware technique with a semantic layer from MLLM use constitutes a major leap forward, but the empirical results do not support that conclusion.
Additionally, if the use of generative workflows is the major intervention over existing techniques, the authors should center the paper on that discovery rather than on an ensemble.
Formatting Issues
None observed.
We sincerely thank the reviewer for the positive evaluation of our work, including its novelty, technical soundness, strong experimental results, and the contribution of the VIPBench dataset. Below, we provide detailed responses to address the concerns raised and clarify any remaining questions.
Q1:
Which face-recognition model(s) did the authors actually use to compute similarity, and why were they chosen?
R1:
Thank you for your insightful comment. Table D1 reports the comparative results across the tested face-recognition models. Our ablation study shows that TransFace gives the best performance, so we adopted it empirically. We will add this explanation to the revised manuscript.
TABLE D1: Performance (AUC %) of VIPGuard using different face models as components.
| Face Model | BlendFace | InSwap | Arc2Face | PuLID | Average |
|---|---|---|---|---|---|
| CosFace | 95.12 | 83.32 | 83.31 | 91.74 | 88.37 |
| ArcFace | 98.14 | 83.17 | 94.41 | 97.88 | 93.40 |
| TransFace | 99.48 | 96.40 | 98.05 | 98.96 | 98.23 |
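For concreteness, below is a minimal sketch (not the authors' released code) of how a face-recognition backbone such as TransFace could supply the similarity score consumed by VIPGuard. The `embed` function is a placeholder for whatever embedding model is plugged in, and the rescaling of cosine similarity to a 0-100 range is an assumption rather than the paper's exact formula.

```python
# Minimal sketch: face similarity from an embedding backbone (placeholder inference call).
import numpy as np

def embed(face_crop: np.ndarray) -> np.ndarray:
    """Placeholder: run the chosen face-recognition backbone (e.g., TransFace) on an aligned crop."""
    raise NotImplementedError("plug in the TransFace / ArcFace / CosFace inference call here")

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(np.dot(a, b))

def face_score(reference_crop: np.ndarray, query_crop: np.ndarray) -> float:
    # Assumption: map cosine similarity in [-1, 1] onto the 0-100 scale used in the prompt template.
    sim = cosine_similarity(embed(reference_crop), embed(query_crop))
    return round((sim + 1.0) / 2.0 * 100.0, 1)
```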
Q2:
What is truly novel about casting deepfake detection as a fine-grained face-recognition task, and how does this differ from earlier identity-aware methods such as ICT-Ref?
R2:
Thank you very much for your insightful comment. Our method differs from previous identity-aware detection approaches in the following key aspects:
- User-specific identity modeling: Our method uniquely models the protected user's face by directly leveraging full facial information for detection. In contrast, prior methods focus on inconsistencies introduced by face-swapping operations rather than on the specific identity itself. For example, ICT-Ref detects discrepancies between inner and outer face identity embeddings, while Diff-ID compares differences before and after self-swapping. However, fully generative forgery methods often preserve identity consistency, causing these inconsistency-based methods to degrade significantly. As shown in Table 2 of our paper, our method maintains strong performance even under such conditions, highlighting its generalization ability across different forgery types.
- Use of local facial attributes: Unlike previous approaches that treat the face as a whole and rely mostly on global features, our method explicitly learns and utilizes local facial attributes. This enables it to capture fine-grained details that forgery methods struggle to replicate, leading to improved detection performance through the combination of global and local cues.
- Leveraging MLLMs for semantic reasoning: By leveraging MLLMs to model high-level semantic facial attributes, our method effectively captures identity-related features and benefits from the strong generalization ability of large models. Moreover, the textual description of facial attributes enhances our model's interpretability, which previous methods do not achieve.
Q3:
Could the authors provide the compute-resource details for reproducibility?
R3:
Thank you for highlighting this point. In the revision, we will add a table describing our hardware (multi-GPU workstation, multi-core CPU, and system memory) and software environment (recent Linux distribution, mainstream deep-learning framework with the corresponding CUDA/cuDNN stack).
Q4:
The appendix still lacks error bars or standard deviations for the experiments.
R4:
Thank you for your comment. In the revision, we will add error bars or standard deviation values to the experiments and briefly describe the procedure used to compute them. This will make the reported results more transparent and statistically sound.
Q5:
The checklist lists safeguards as “N/A,” yet the appendix outlines explicit safeguards for VIPBench, creating a contradiction that needs clarification.
R5:
Thank you for catching this inconsistency. In the revised manuscript, we will update the checklist to "Yes" and ensure its description matches the safeguards already detailed in the appendix.
Dear Reviewer jxQd,
Thank you very much for your valuable and constructive feedback. We have carefully reviewed your comments and have tried our best to address your concerns thoroughly.
Below is a brief summary of our responses to the questions you raised:
- Selection of Face Models: We clarify that TransFace is the face recognition model used in our method, and we have provided a clear explanation of our selection criteria. Additionally, we have included an ablation study to demonstrate the performance differences when using different face models. (R1)
- Elaborating the Novelty of Our Fine-Grained Reformulation: We have thoroughly described the innovative perspective of our method, which reformulates personalized deepfake detection as a fine-grained face recognition problem. Specifically, in terms of Identity Modeling, Local Detail Awareness, and Semantic Attribute Reasoning, we have clarified the key distinctions between our approach and existing methods. (R2)
- Checklist Modification Suggestions: We will revise the manuscript in accordance with your suggestions regarding the checklist. (R3, R4, R5)
We are happy to address any additional concerns you may have, and we hope our responses have addressed your comments satisfactorily.
Best regards,
The Authors
For Q1, Q3, Q4, Q5: thank you for your responses. Please ensure that these changes are made upon resubmission of the manuscript.
For Q5, I understand your explanation of the differentiation from existing identity-aware techniques; however, none of the three points sufficiently explains why these techniques are a step forward over the SoTA. I maintain my score.
Thank you for your feedback and question! Below, we provide further clarification and an additional experiment to answer your question.
First, we list the key reasons why the three proposed strategies give our method superior performance:
- Explicit user-specific facial identity modeling enables us to detect more realistic fake faces: Existing ID-aware detection methods such as ICT-Ref focus solely on the mismatch between inner and outer faces (a specific, limited pattern), without comprehensively considering the structural facial information. This oversight results in a notable performance drop when detecting realistic fake faces with better consistency between inner and outer faces (e.g., Arc2Face and PuLID). In contrast, our method explicitly and fully leverages the facial identity information for detection, learning more comprehensive discriminative patterns and thus achieving large improvements.
- Learning detailed local facial attributes enables us to mine more fine-grained patterns: Our method effectively captures the flaws left by generative models in fine-grained facial attributes, enabling better detection. Generative models struggle to perfectly replicate subtle facial details, such as small wrinkles or slightly enlarged noses, as shown in Figure 1 of our paper. This means that local, user-specific facial attributes play a crucial role in enhancing detection, particularly for fully generated faces.
- Leveraging MLLMs for semantic reasoning enables us to combine all cues for more robust detection: Our method leverages the strong semantic reasoning capability of MLLMs to flexibly process faces in different states, integrating information such as specific facial conditions and image content. By combining these elements, we are able to make more accurate and holistic judgments on dynamic and diverse faces.
Second, we provide an additional experiment to demonstrate the effectiveness of each proposed strategy:
- (Setting-1): "User-specific identity modeling" and "Use of local facial attributes".
- (Setting-2): Setting-1 w/ additional "Leveraging MLLMs for semantic reasoning".
Compared to ICT-Ref, Ours (Setting-1) achieves a significant advantage, with consistent detection performance across both types of realistic fake faces: face-swap (FS) and entire-face synthesis (EFS). Furthermore, by incorporating training-based semantic reasoning, Ours (Setting-2) further improves performance, resulting in more robust detection.
We hope our clarification and experiment can address your questions, and feel free to ask for further details if needed. Thank you once again for the valuable comment.
Table D5. Detection performance (AUC (%)) of our method on face-swap (FS) and entire-face synthesis (EFS) under different settings. ICT-Ref is selected for comparison, as it relies on specific patterns without leveraging detailed local facial attributes or semantic reasoning.
| Method | BlendFace (FS) | InSwap (FS) | Arc2Face (EFS) | PuLID (EFS) | Average |
|---|---|---|---|---|---|
| ICT-Ref | 88.67 | 84.34 | 70.27 | 72.36 | 78.91 |
| Ours (1, 2) | 98.45 | 92.91 | 94.35 | 97.72 | 95.86 |
| Ours (1, 2, 3) | 99.48 | 96.40 | 98.05 | 98.96 | 98.22 |
This paper proposes a new deepfake detection framework, called VIPGuard. Unlike existing methods that typically focus on general-purpose detection, VIPGuard emphasizes individualization. Several modules are thus designed, including a fine-tuned multimodal large language model (MLLM), an identity-level discriminative learner, and a user-specific customization step. Experimental evaluations show clear advantages of the proposed framework over existing baselines.
Strengths and Weaknesses
Strengths:
- Developing effective deepfake detection techniques is a timely and important topic.
- The paper tackles the problem from an individualized perspective, which is new.
- The paper makes good efforts in constructing a comprehensive benchmark for identity-aware deepfake detection, which can be of independent interest for future research.
- Experimental results show impressive improvement over existing deepfake detection methods.
Weaknesses:
- It is unclear how well the proposed VIPGuard method generalizes to unseen forgery schemes.
- It is unclear how the proposed framework can be used for a new identity.
- The technical novelty of the proposed method seems limited.
- The paper does not discuss how robust the detection method is when there are image transformations involved.
- There is a lack of discussion about the failure cases.
Questions
Below, I elaborate on my comments regarding weaknesses:
- In Step 2 of the proposed framework, the identity discrimination function is learned using a large dataset of positive and negative face pairs, which essentially assumes knowing the forgery scheme that will be employed by adversaries. In my opinion, this is a strong assumption. How well does your method perform against a new forgery scheme (or an unknown deepfake generation method)?
- The paper focuses on 22 unique identities in the dataset and evaluation. How can your method be applied to a new identity beyond the 22 considered identities? Do we need to retrain the whole model? If the identity-specific characteristics are different, how can you make sure the learned detection scheme generalizes to unseen identities?
- The proposed framework consists of several modules adapted from existing literature. It is unclear how novel the proposed method is. Please clarify your work's key technical contributions.
- It is unclear whether the proposed detection method can still perform well when there are natural image transformations (i.e., blurring, Gaussian noise, compression, etc.). Please clarify.
- Are there any failure modes you observed that are worth mentioning? Please discuss potential adaptive forgery strategies that may render the proposed detection method less effective.
Limitations
Yes.
Final Justification
The authors' rebuttal addressed most of my primary concerns, so I will increase my evaluation from "borderline reject" to "borderline accept". The authors are suggested to include the additional experiments and discussions in the revised version of their work.
Formatting Issues
The paper is generally well-written. I didn't spot a particular paper formatting issue.
Thank you for your valuable feedback. We appreciate your recognition of the novelty of our problem and the uniqueness of our approach to addressing it, as well as the strength of our method, and the contribution of our benchmark. Below, we respond to your comments point by point:
Q1:
- This paper assumes knowing the forgery scheme that will be employed by adversaries, which is a strong assumption.
- How does the proposed method perform against an unknown forgery method?
R1:
Thank you for your comment. We list our response as follows:
- We have already presented experiments on unknown forgery detection in our paper (the generalization setting), with no leakage of identity information or prior knowledge of the forgery methods. Specifically, we evaluated our method on the following 12 unknown forgery methods: BlendFace, Ghost3, HifiFace, InSwap, MobileSwap, UniFace, ConsistentID, PuLID, GPT-4o, Jimeng AI, Tongyi, and Kling AI. Only SimSwap and Arc2Face were used to generate the fake training images for our method.
- Based on the proposed method, we enable MLLM to learn facial priors and reason with fine-grained facial information. Tables 1–4 in our paper show that our method consistently achieves the best generalization performance across all evaluated forgery methods.
Q2:
The paper focuses on 22 unique identities in the dataset and evaluation.
- How can your method be applied to a new identity beyond the 22 considered identities?
- Do we need to retrain the whole model?
- If the identity-specific characteristics are different, how can you make sure the learned detection scheme generalizes to unseen identities?
R2:
Thank you for your comment. We respond to your question as follows:
- To apply VIPGuard to a new user, we need to train a user-specific VIP token using a few of his/her authentic images. We generate positive and negative face pairs from these images to train the VIP token, while keeping the rest of the model frozen (a minimal enrollment sketch is given below, after the references).
- No, the whole model remains frozen. Only the lightweight VIP token (0.11M parameters) is trained for each new identity, highlighting excellent adaptability across different identities.
- Our paper focuses on ID-aware deepfake detection, which is defined as detecting facial forgeries by effectively utilizing known identity information, in line with [1,2]. Therefore, handling fully unseen identities conflicts with this problem setting. For VIPGuard, at least one authentic image of the user is required for forgery detection. As the number of available images increases, the model's detection performance improves (see Figure 1 in the appendix), demonstrating the scalability of our method.
[1] Diff-id, TDSC, 2024.
[2] ICT-Ref, CVPR, 2022.
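As a rough illustration of this enrollment step, the following sketch freezes the MLLM and optimizes only a small VIP-token embedding on the user's positive/negative pairs. All names, shapes, and the call signature of the frozen model are hypothetical, not the released implementation.

```python
# Minimal sketch: enroll a new VIP by training only a lightweight token while the MLLM stays frozen.
import torch
import torch.nn as nn

class VIPToken(nn.Module):
    """A small bank of learnable token embeddings standing in for the user's identity."""
    def __init__(self, hidden_dim: int, num_tokens: int = 16):
        super().__init__()
        # hidden_dim should match the MLLM's hidden size; the exact token count in the paper
        # is not reproduced here (the rebuttal only states ~0.11M trainable parameters).
        self.embeddings = nn.Parameter(torch.randn(num_tokens, hidden_dim) * 0.02)

    def forward(self) -> torch.Tensor:
        return self.embeddings

def enroll_new_vip(frozen_mllm: nn.Module, vip_token: VIPToken, pair_loader, steps: int = 500):
    for p in frozen_mllm.parameters():
        p.requires_grad_(False)                      # backbone stays frozen
    optimizer = torch.optim.AdamW(vip_token.parameters(), lr=1e-3)
    for step, (batch, labels) in zip(range(steps), pair_loader):
        # Assumption: the frozen MLLM accepts the VIP token where reference-face features
        # would go and returns real/fake logits for the user's positive/negative pairs.
        logits = frozen_mllm(batch, vip_tokens=vip_token())
        loss = nn.functional.cross_entropy(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return vip_token
```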
Q3:
The proposed framework consists of several modules adapted from existing literature. It is unclear how novel the proposed method is. Please clarify your work's key technical contributions.
R3:
Thank you for your comment. We provide a detailed explanation to clarify our novelty as follows:
- Firstly, to better address your concern, we sincerely ask if you could provide more specific details about the mentioned existing literature, such as relevant references or method names.
- Secondly, we would like to re-emphasize that our work introduces novel contributions that have not been proposed in previous research. The key aspects of our novelty are as follows:
  - Task-Specific Formulation: We reformulate deepfake detection as a face comparison task guided by identity priors—a paradigm not explored in prior MLLM-based detection methods. This enables the model to detect subtle forgery artifacts by comparing a query image against the known facial characteristics of a specific individual ("VIP"). This shift in perspective is itself a conceptual innovation.
  - Comprehensive and Custom Data Construction Pipeline: Unlike existing methods that rely on off-the-shelf LLMs or generic annotations, we introduce a dedicated facial annotation model trained to extract precise identity, attribute, and similarity signals from faces. This allows us to generate high-quality, identity-centric training data that explicitly teaches the MLLM to reason about facial consistency—a critical capability missing in standard training setups.
  - Progressive Learning Strategy: We design a three-stage progressive learning framework—facial attribute learning → general face comparison → VIP-specific customization—that gradually equips the MLLM with the fine-grained facial understanding necessary for personalized detection. This structured curriculum addresses the MLLM's initial lack of facial expertise and ensures effective adaptation to the target task. As demonstrated in Table C3, naive end-to-end training (i.e., standard MLLM fine-tuning) performs significantly worse, confirming that our improvements are not due to scale or data alone, but to our tailored methodology.
- Finally, while our method is partially inspired by Cross Attention [3, 4] and Yo'LLaVA [5] in the design of certain modules, there is no overlap with these works in terms of target tasks or pipeline. We clarify as follows:
  - Cross Attention [3, 4]: We use cross-attention to compare two facial images. However, this does not lessen our method's novelty, since cross-attention is just a standard deep-learning building block.
  - Yo'LLaVA [5]: We propose the use of a VIP token to capture specific user identity information, which is used to discriminate subtle identity shifts between AI-generated and real faces. In contrast, Yo'LLaVA focuses on enhancing model performance for conversational tasks (such as certain personalized conversational tasks). To further clarify the distinction, we provide experimental results in Table C3, where Qwen-2.5-VL-7B serves as the baseline. The significant improvement we observe strongly verifies the novelty and effectiveness of our approach.
TABLE C3: A comparison (AUC (%)) between our method, naively training MLLM, and Yo’LLaVA. Our method shows the best results of detection.
| Method (AUC (%)) | BlendFace | InSwap | Arc2Face | PuLID | Average | Improvement (+) |
|---|---|---|---|---|---|---|
| Qwen-2.5-VL-7B (Baseline) | 51.00 | 49.79 | 49.85 | 50.30 | 50.24 | + 0.0 |
| Naive Training MLLM | 71.96 | 59.10 | 62.12 | 68.18 | 65.34 | + 15.10 |
| Yo'LLaVA [5] | 65.32 | 56.56 | 62.85 | 66.61 | 62.84 | + 12.60 |
| VIPGuard (Ours) | 99.48 | 96.40 | 98.05 | 98.96 | 98.23 | + 47.99 |
[3] Transformer, NIPS 2017.
[4] CAT, ICME 2022.
[5] Yo'llava. NIPS 2024.
Q4:
Does our detector stay robust after typical image degradations—blur, Gaussian noise, and compression?
R4:
Thank you for the insightful comment. We have added a robustness study (Table C4-1) showing that VIPGuard remains highly effective under common degradations. **Because the detector relies on high-level semantic features such as facial information, it is inherently less sensitive to these low-level distortions.** The degradation settings are listed in Table C4-2.
TABLE C4-1: VIPGuard's performance (AUC %) under image degradations. Higher levels indicate stronger degradation. The results represent the average performance of VIPGuard in detecting BlendFace, InSwap, Arc2Face, and PuLID.
| Level | Gaussian Noise (Color) | Gaussian Blurring | JPEG Compression |
|---|---|---|---|
| None | 98.23 | 98.23 | 98.23 |
| 1 | 97.07 | 98.17 | 98.03 |
| 2 | 96.53 | 98.16 | 98.10 |
| 3 | 94.12 | 98.05 | 97.78 |
TABLE C4-2: Degradation settings include Gaussian color noise applied in YCbCr space, Gaussian blur defined by kernel size and standard deviation, and JPEG compression with quality factor (QF).
| Level | Gaussian Noise (Color) | Gaussian Blurring | JPEG Compression |
|---|---|---|---|
| 1 | | | QF 90 |
| 2 | | | QF 60 |
| 3 | | | QF 30 |
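For reference, here is a minimal sketch of the three degradation families with OpenCV. The parameter values are illustrative only and do not reproduce the exact Table C4-2 levels (whose noise and blur parameters are not listed above).

```python
# Minimal sketch: the degradation operations used in the robustness study (illustrative parameters).
import cv2
import numpy as np

def add_ycbcr_gaussian_noise(img_bgr: np.ndarray, sigma: float) -> np.ndarray:
    ycrcb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2YCrCb).astype(np.float32)
    ycrcb += np.random.normal(0.0, sigma, ycrcb.shape)       # additive Gaussian noise in YCbCr space
    ycrcb = np.clip(ycrcb, 0, 255).astype(np.uint8)
    return cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)

def gaussian_blur(img_bgr: np.ndarray, ksize: int, sigma: float) -> np.ndarray:
    return cv2.GaussianBlur(img_bgr, (ksize, ksize), sigma)

def jpeg_compress(img_bgr: np.ndarray, quality: int) -> np.ndarray:
    ok, enc = cv2.imencode(".jpg", img_bgr, [int(cv2.IMWRITE_JPEG_QUALITY), quality])
    assert ok
    return cv2.imdecode(enc, cv2.IMREAD_COLOR)

# Example: the strongest JPEG level reported in Table C4-2 uses quality factor 30.
# degraded = jpeg_compress(image, quality=30)
```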
Q5:
Are there any failure modes you observed that are worth mentioning?
Please discuss potential adaptive forgery strategies that may render the proposed detection method less effective.
R5:
Thank you for your valuable comment. We provide a brief explanation below.
- Significant variations in head pitch and yaw angles, such as noticeable rotation of more than 50°, affect facial features and identity information. However, this limitation is shared by all face-based methods, including face recognition systems.
- Significant age differences between training and test data. For example, training is based on younger facial images, but testing involves older facial images. However, this issue, stemming from notable identity shifts, can be effectively mitigated by expanding the dataset.
Inspired by your comment, we plan to address these potential failure cases in future work by incorporating 3D facial information, learning identity-related temporal cues from videos, and exploring other complementary sources of information. We will include additional analysis related to these aspects in the revised version.
I sincerely thank the authors for their detailed responses to my review, which have addressed my primary concerns. I've updated my score to "borderline accept" to reflect it. Please include the additional experiments and discussions provided in the rebuttal in the final version of your paper.
We sincerely appreciate your efforts and insightful comments, and we are pleased to have addressed your concerns! We will thoughtfully incorporate the suggested experiments and discussions into the revised manuscript to enhance the quality of our paper.
This paper presents VIPGuard, a framework designed to detect and explain Deepfakes when the user's identity is known. The authors detail a comprehensive data preparation pipeline and a three-stage training framework that employs an MLLM. This framework integrates crucial aspects of facial recognition, comparisons between arbitrary pairs of faces, and user-specific customization. Additionally, a novel evaluation benchmark is proposed to assess identity-aware Deepfake detection methods. The findings suggest promising potential for real-world applications, providing an effective solution for Deepfake detection in scenarios where the individual's identity is known. Overall, this work introduces sufficient novelty and contributes to the advancement of identity-aware Deepfake detection technologies.
Strengths and Weaknesses
Strengths
- This paper introduces a novel framework, VIPGuard, designed to detect deepfake images by leveraging identity information. It effectively uses an MLLM (Qwen-2.5-VL-7B) to analyze facial semantic features and assess global similarities among different face models.
- It presents a well-structured three-stage training pipeline that gradually adapts the MLLM for detailed facial analysis. This progression from general facial recognition to specific individual identification is both logical and technically sound.
- A new dataset, VIPBench, for ID-aware deepfake detection is proposed. Diverse manipulation approaches are considered in collecting the dataset, including recently released commercial methods, which better align with real-world complexity.
- Experimental results demonstrate that the proposed framework achieves competitive performance compared to existing approaches.
Weaknesses
- Stage 3 of VIPGuard still relies on VIP user annotation via Gemini or similar APIs, which may hinder practical deployment. An experiment using only image inputs would enhance the evaluation.
- Some errors in Eq. (5) and Eq. (6) lead to inconsistencies with Fig. 4. In Eq. (5), the symbol g represents the features produced by the Cross-Attention module; however, its definition is missing in the equation. Moreover, the symbol g should also be included in Eq. (6).
Questions
- How does the number of available authentic images of each VIP user affect the final performance?
- It is suggested that the authors move the problem formulation from the appendix to the main content, which could help readers better understand the specific setting of this work.
- It is suggested that the authors detail the inference process of the method in the manuscript and also report the inference time.
Limitations
Yes
Final Justification
The authors' response addresses my initial concern. I have no further questions.
Formatting Issues
No paper formatting concerns.
We sincerely appreciate your positive feedback on the novelty and logic of our method, the proposed benchmark, and the overall performance. Below, we provide detailed responses to the concerns you raised.
Q1:
Assessing VIPGuard without relying on textual annotations during training is helpful for understanding its real-world applicability.
R1:
Thank you for your valuable comment. As shown in the newly added Table B1, VIPGuard performs well in Stage 3 using only image inputs. Despite being trained without textual annotations, it shows only a slight performance drop, demonstrating its practicality for real-world use. We will include this experiment in the revised paper to further strengthen it.
TABLE B1: Performance (AUC(%)) of VIPGuard under different settings in the Stage 3. Images + Annotation indicates the joint use of both images and their corresponding textual annotations during training. Only Image refers to training with images alone, without accompanying annotations.
| Variant | BlendFace | InSwap | Arc2Face | PuLID | Average |
|---|---|---|---|---|---|
| Only Images | 98.45 | 92.91 | 94.35 | 97.72 | 95.86 |
| Images + Annotation | 99.48 | 99.43 | 98.05 | 98.96 | 98.98 |
Q2:
There are some symbol errors in Eq. (5) and Eq. (6) of the paper, which should be fixed.
R2:
Thank you for your comment. We apologize for the errors in Eq. (5) and Eq. (6). They will be corrected in the revised version as follows:
Eq. (5): $L(\theta) = -\sum_{i=1}^{N} \log\left[\, p_{\theta}\left(x_i \mid x_{i-1},\, f_{ref},\, f_{query},\, g\right)\right]$
Eq. (6): $L(\theta) = -\sum_{i=1}^{N} \log\left[\, p_{\theta}\left(x_i \mid x_{i-1},\, t_{vip},\, f_{query},\, g\right)\right]$, where $t_{vip}$ denotes the learnable VIP token that takes the place of the reference features $f_{ref}$ in Stage 3.
Moreover, we provide the equation that describes the computation of the feature representation $g$ from the Cross-Attention module, as follows:
$g = \mathrm{Softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right)V$,
where $K$ and $V$ are the visual features of the reference image in Eq. (5) or the VIP token in Eq. (6), after the linear projection layer, and $Q$ is the visual feature of the query image, also after the linear projection layer.
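A minimal single-head PyTorch sketch of this cross-attention computation is given below; the class name, dimensions, and single-head form are illustrative and not the exact module in the paper.

```python
# Minimal sketch: reference/VIP features supply keys and values, query-image features supply queries.
import torch
import torch.nn as nn

class FaceCrossAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)

    def forward(self, query_feats: torch.Tensor, ref_feats: torch.Tensor) -> torch.Tensor:
        # query_feats: visual tokens of the query image                     (B, Nq, dim)
        # ref_feats:   visual tokens of the reference image, or the VIP token (B, Nr, dim)
        q = self.q_proj(query_feats)
        k = self.k_proj(ref_feats)
        v = self.v_proj(ref_feats)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v  # g: reference-conditioned features passed to the MLLM
```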
Q3:
How does the number of available authentic images of each VIP user affect the final performance?
R3:
We appreciate your comment. We would like to clarify that this experiment was already included in Figure 1 of our appendix. In this experiment, using just three authentic images per VIP user leads to a significant improvement compared to the one-shot setting (i.e., one authentic image per user, equivalent to using only the Stage 1+2 model). Our results show a progressive performance gain as the number of available authentic images increases from 3 to 20. In conclusion, providing more authentic images per VIP user enhances the model’s ability to detect fake images.
Q4:
It is suggested that the authors detail the inference process of the method in the manuscript and also report the inference time.
R4:
Thank you for your comment. To perform detection, VIPGuard will use the following pre-defined template. In this template, FACE_SCORE is the facial similarity score and <|face_pad|> is the placeholder of the output of the cross attention module.
Please determine whether the person in the input image is VIP user. The face similarity of these two face is {FACE_SCORE}/100. The face tokens are shown as follows, <|face_pad|> . You should directly answer 'yes' or 'no' without any explanation.
For each VIP user, we input a set $S$ of authentic images into the face model $F(\cdot)$ and calculate a centric vector $c$ by the following formulation:
$c = \frac{1}{|S|}\sum_{x \in S} F(x)$,
where $x$ is the input image and $|S|$ is the size of the set $S$.
At inference time, we calculate the facial similarity score between the feature of the current input image $x_{query}$ and $c$, as follows:
$s = \cos\left(F(x_{query}),\, c\right)$.
Finally, the input image, facial similarity score, VIP token, and textual template are fed into the MLLM to produce the final output. Moreover, when Qwen-2.5-VL-7B is selected as the backbone, the inference time is about 2.09 seconds per sample without any acceleration techniques.
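Putting the pieces together, a minimal sketch of this inference flow could look as follows. The model calls are placeholders, and the rescaling of cosine similarity to a 0-100 value for the prompt is an assumption rather than the paper's exact rule; the prompt string itself is quoted from the template above.

```python
# Minimal sketch: centroid of authentic embeddings, similarity score, and prompt assembly.
import numpy as np

PROMPT = (
    "Please determine whether the person in the input image is VIP user. "
    "The face similarity of these two face is {FACE_SCORE}/100. "
    "The face tokens are shown as follows, <|face_pad|> . "
    "You should directly answer 'yes' or 'no' without any explanation."
)

def centric_vector(authentic_embeddings: list) -> np.ndarray:
    # Mean of the face-model features over the enrolled set of authentic images.
    return np.mean(np.stack(authentic_embeddings), axis=0)

def similarity_score(query_embedding: np.ndarray, c: np.ndarray) -> float:
    q = query_embedding / np.linalg.norm(query_embedding)
    c = c / np.linalg.norm(c)
    sim = float(np.dot(q, c))                         # cosine similarity
    return round((sim + 1.0) / 2.0 * 100.0, 1)        # assumed mapping to the 0-100 scale

def build_prompt(query_embedding: np.ndarray, c: np.ndarray) -> str:
    return PROMPT.format(FACE_SCORE=similarity_score(query_embedding, c))

# The prompt, the query image, and the VIP token are then passed to the fine-tuned MLLM.
```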
Q5:
It is suggested that the authors place the problem formulation in the appendix to the main content, which could help readers better understand the specific setting of this work.
R5:
Thank you for your valuable suggestion. We will revise the manuscript accordingly.
Thank you for the response. You have addressed my major concerns. Please make sure to include the rebuttal content in the revised paper.
Thank you very much for your valuable feedback! We are pleased that our responses have addressed your concerns, and we will carefully incorporate the rebuttal content into the revised manuscript to ensure clarity and completeness.
This paper proposes VIPGuard, a multimodal large language model (MLLM) designed for personalized and explainable deepfake detection, specifically for known individuals. It reformulates deepfake detection as a fine-grained face recognition task based on detailed facial attributes. The authors also present VIPBench, a comprehensive identity-aware deepfake benchmark to evaluate personalized deepfake detection models. Based on this benchmark, VIPGuard shows superior deepfake detection performance.
Strengths and Weaknesses
Strengths:
- VIPGuard demonstrates superior deepfake detection capabilities, outperforming existing methods.
- VIPGuard offers human-understandable explanations for its deepfake detections.
- This paper introduces VIPBench, a comprehensive identity-aware benchmark built to evaluate personalized deepfake detection.
Weaknesses:
- The novelty of the proposed method appears limited. VIPGuard follows the standard training process of MLLMs.
- The experimental results seem insufficient. It would be more compelling if VIPGuard were applied to other MLLMs, such as LLaMA-3.2-Vision, Qwen2.5-VL-3B, etc.
- The ablation studies provided in D.3 seem insufficient to fully explain the individual contribution of each stage. Additional results would be beneficial. (for Stage 2, Stage 3, and Stage 2, 3)
Questions
- Fine-grained facial attributes, such as wrinkles and skin tone, are compared between the first image and the second image in the explanation. These attributes can vary when conditions such as lighting, cosmetics, etc. change. Therefore, the proposed method seems to be sensitive to the first image input. Could you provide an analysis of this?
- VQA data for VIPBench, which is used to train VIPGuard, is generated by GPT-4o and Gemini 2.5 Pro. If those models are already able to explain visual attributes well, why are their deepfake detection capabilities lower than VIPGuard's? Is it possible to get better results from those models by prompt engineering?
Limitations
- It would be inefficient to utilize an MLLM with a parameter size of 7-8B.
Final Justification
Although the rebuttal addresses most questions, the method's novelty still seems limited to a standard MLLM fine-tuning procedure. Moreover, the use of VIP tokens also raises practicality concerns: if multiple tokens are trained for different VIPs, VIPGuard cannot select the correct one when a VIP image is given. The original paper seems to assume that the user should 1) determine whether the input belongs to a previously seen VIP or a new one, and 2) set an appropriate VIP token manually or decide to train a new VIP token. Such assumptions do not reflect practical scenarios.
Regarding the remaining issues, the rating remains unchanged at borderline reject.
Formatting Issues
N/A
We sincerely appreciate the reviewer's detailed feedback and are encouraged by the positive assessment of the SOTA detection performance and explainability of our method, and the novelty of the proposed personalized deepfake detection benchmark.
Below, we provide point-by-point responses to the concerns raised by the reviewer.
Q1:
VIPGuard follows standard training process of MLLMs, making the novelty limited.
R1:
We sincerely appreciate the reviewer’s feedback. However, we would like to clarify that, our approach is fundamentally not a naive application of standard MLLM training pipelines, but rather a purpose-built framework specifically designed for the unique challenges of our task, i.e., personalized deepfake detection.
While MLLMs provide a foundational architecture, directly applying standard MLLM training paradigms to this task leads to poor performance, as shown in our ablation studies (Table A1). This is because personalized deepfake detection requires fine-grained understanding of facial identity priors—such as subtle anatomical structures and local attributes—that vanilla MLLMs, trained on general vision-language tasks, inherently lack.
To overcome this limitation, we propose VIPGuard, a novel and comprehensive framework that goes far beyond standard MLLM training in both design and execution. Our key innovations include:
- Task-Specific Formulation: We reformulate deepfake detection as a face comparison task guided by identity priors—a paradigm not explored in prior MLLM-based detection methods. This enables the model to detect subtle forgery artifacts by comparing a query image against the known facial characteristics of a specific individual ("VIP"). This shift in perspective is itself a conceptual innovation.
- Comprehensive and Custom Data Construction Pipeline: Unlike existing methods that rely on off-the-shelf LLMs or generic annotations, we introduce a dedicated facial annotation model trained to extract precise identity, attribute, and similarity signals from faces. This allows us to generate high-quality, identity-centric training data that explicitly teaches the MLLM to reason about facial consistency—a critical capability missing in standard training setups.
- Progressive Learning Strategy: We design a three-stage progressive learning framework—facial attribute learning → general face comparison → VIP-specific customization—that gradually equips the MLLM with the fine-grained facial understanding necessary for personalized detection. This structured curriculum addresses the MLLM's initial lack of facial expertise and ensures effective adaptation to the target task (a compact sketch of this curriculum is given after Table A1 below). As demonstrated in Table A1, naive end-to-end training (i.e., standard MLLM fine-tuning) performs significantly worse, confirming that our improvements are not due to scale or data alone, but to our tailored methodology.
In summary, our work does NOT naively apply MLLMs to the task. Instead, we identify the limitations of standard MLLM training in the context of personalized forgery detection and introduce a series of task-driven innovations—from data construction to model training—to overcome them. The substantial performance gains in Table A1 and the ablation studies strongly support the necessity and effectiveness of our proposed framework.
TABLE A1: Comparison (AUC %) between VIPGuard and Naive MLLM Training across four forgery methods.
| Method | BlendFace | InSwap | Arc2Face | PuLID | Average | Improvement (+) |
|---|---|---|---|---|---|---|
| Qwen-2.5-VL-7B (Baseline) | 51.00 | 49.79 | 49.85 | 50.30 | 50.24 | - |
| Naive MLLM Training | 71.96 | 59.10 | 62.12 | 68.18 | 65.34 | + 15.10 |
| VIPGuard (Ours) | 99.48 | 96.40 | 98.05 | 98.96 | 98.23 | + 47.99 |
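For clarity, here is a compact sketch of the progressive curriculum described above. The dataset keys and the `finetune` trainer interface are hypothetical placeholders, not the released training scripts.

```python
# Minimal sketch: three-stage curriculum; Stages 1-2 fine-tune the MLLM, Stage 3 trains only the VIP token.
STAGES = [
    {"name": "stage1_facial_attribute_learning",   "data": "DFA",       "trainable": "mllm"},
    {"name": "stage2_identity_discrimination",     "data": "DID",       "trainable": "mllm"},
    {"name": "stage3_user_specific_customization", "data": "vip_pairs", "trainable": "vip_token"},
]

def run_curriculum(mllm, vip_token, loaders, finetune):
    for stage in STAGES:
        trainable = mllm.parameters() if stage["trainable"] == "mllm" else vip_token.parameters()
        # `finetune` is a placeholder for the supervised fine-tuning loop used at each stage.
        finetune(model=mllm, trainable_params=trainable, loader=loaders[stage["data"]], tag=stage["name"])
```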
Q2:
It would be more compelling if the VIPGuard could be applied to other MLLMs, such as LLaMA-3.2-Vision, Qwen2.5-VL-3B, etc.
R2:
We appreciate your valuable comment. Following your suggestion, we conducted an additional ablation study that swaps different MLLM backbones into VIPGuard.
Table A2 shows that our method consistently achieves strong detection performance on personalized deepfake detection across various MLLMs. These results will be incorporated into the revised manuscript to strengthen its overall quality.
TABLE A2: Performance Comparison (AUC (%)) of Our Method with Different MLLMs.
| MLLM | BlendFace | InSwap | Arc2Face | PuLID | Average |
|---|---|---|---|---|---|
| Qwen-2.5-VL-7B (Baseline) | 51.00 | 49.79 | 49.85 | 50.30 | 50.24 |
| LLaMA-3.2-Vision-11B | 96.27 | 88.97 | 99.00 | 99.52 | 95.94 |
| Qwen-2.5-VL-3B | 90.71 | 87.85 | 87.82 | 97.94 | 91.08 |
| Qwen-2.5-VL-7B | 99.48 | 96.40 | 98.05 | 98.96 | 98.23 |
Q3:
More comprehensive ablation studies would be beneficial. (for Stage 2, Stage 3, and Stage 2, 3)
R3:
Thank you for your valuable comment. We have added additional results in Table A3 to demonstrate the effectiveness of Stage 2, Stage 3, and Stages 2,3. Due to MLLMs’ limited understanding of faces, directly applying Stage 3 results in poor performance. In contrast, enhancing the model’s basic perception through Stage 2 significantly improves performance, further demonstrating the effectiveness of the proposed progressive framework. These results will be included in the revised version of the paper, and we believe they will strengthen our work.
Table A3. An ablation study on the performance (AUC (%)) of different components in VIPGuard
| Variant | BlendFace | InSwap | Arc2Face | PuLID |
|---|---|---|---|---|
| Baseline | 51.00 | 49.79 | 49.85 | 50.30 |
| +Stage 2 | 95.29 | 73.57 | 97.24 | 94.33 |
| +Stage 3 | 71.96 | 59.10 | 62.12 | 68.18 |
| +Stage 2,3 | 99.63 | 93.99 | 98.10 | 98.76 |
Q4:
The proposed method seems to be sensitive to the first image input. Would you suggest analysis about it?
R4:
Thank you for your comment. The first input image serves as the reference (see Figure 4 of our paper), representing the facial image of the protected user. In our method, we replace it with a VIP token to mitigate the sensitivity issue.
- Challenges may arise when detection relies solely on a single reference image without training the VIP token, a setting referred to as OneShot. Under this setting, changing conditions may alter the image attributes, making detection sensitive to variations in the initial input.
- In contrast, our method replaces the first image with a learned VIP token obtained through training. This token encodes the user's identity from multiple images, resulting in a more robust and stable representation. Table A4 shows that VIPGuard consistently outperforms the OneShot setting, confirming its robustness and effectiveness.
- Figure 1 in the appendix shows that performance improves as more authentic images are encoded into VIP tokens, highlighting their effectiveness in mitigating sensitivity.
TABLE A4. Performance (AUC (%)) comparison between the OneShot setting and VIPGuard, which utilizes a learned identity token trained from multiple images.
| Method | BlendFace | InSwap | Arc2Face | PuLID |
|---|---|---|---|---|
| One Shot | 90.82 | 73.45 | 77.25 | 71.08 |
| VIPGuard | 99.48 | 96.40 | 98.05 | 98.96 |
Q5:
VQA data for VIPBench, which is used to train VIPGuard, is generated by GPT-4o and Gemini 2.5 Pro.
- If those models are already able to explain visual attributes well, why are their deepfake detection capabilities lower than VIPGuard's?
- Is it possible to get better results from those models by prompt engineering?
R5:
- The key difference lies in the information available to the model when functioning as a Captioner versus as a Detector. This distinction stems from the differing roles played by Gemini 2.5 Pro (and similar models) in each case:
- Captioner: Generates textual annotations for MLLM training. During this process, we explicitly provide rich prior information (e.g., category labels), and the Captioner formats it into the desired textual structure.
- Detector: Performs detection based on the input image and question, but is not allowed to access any prior information at inference time.
During annotation generation, Gemini 2.5 Pro, acting as a Captioner, was given full prior information (formalized in Eq. (2) of our paper), where the priors include the category label, the facial similarity, and the facial attributes. Gemini 2.5 Pro and other commercial models merely format this information into text; they do not perform any detection themselves.
- These models cannot detect effectively during testing—when such priors are still unavailable—even with prompt engineering. In contrast, VIPGuard is trained to perceive this information and perform detection independently, without relying on external priors.
Thank you for providing the rebuttal. Although it answers most questions, the method's novelty still seems limited to a standard MLLM fine-tuning procedure, and its practicality remains doubtful. The use of VIP tokens also raises practicality concerns: if multiple tokens are trained for different VIPs, how does VIPGuard select the correct one when a VIP image is given? The paper seems to assume that the user should 1) determine whether the input belongs to a previously seen VIP or a new one, and 2) set an appropriate VIP token manually or decide to train a new VIP token. Such assumptions do not reflect practical scenarios.
Thank you for your prompt reply. We appreciate your feedback and are pleased that our previous reply has addressed most of your concerns. We reply to your new concerns as follows:
Regarding the novelty of our methodology, we would like to clarify the following:
- Standard MLLM training is insufficient: Simply and directly following the standard MLLM training procedure—without any modifications or optimizations—results in very low performance (65.34% AUC in Table 1), while our proposed training paradigm achieves 98.23% (+32.89% over standard MLLM training) in Table 1. This significant performance improvement highlights the necessity of developing a new and more effective training strategy rather than directly following the standard one.
- Key novel contributions of our approach: Specifically, our approach introduces the following novel components that go beyond standard MLLM training practices:
  - We tackle personalized deepfake detection from a new perspective by reframing forgery detection as fine-grained facial identity recognition.
  - Our framework consists of three stages designed to progressively train MLLMs to recognize subtle identity differences of the VIP users, which differs from the standard training procedure.
  - To the best of our knowledge, this is the first method to introduce a face comparison mechanism within MLLMs, applied in both Stage 2 and Stage 3.
We hope that our latest responses have clearly addressed your questions and concerns. We truly appreciate your thoughtful feedback and the opportunity to improve our work. Thank you!
TABLE 1: Comparison (AUC %) between VIPGuard and Standard MLLM Training across four forgery methods.
| Method | BlendFace | InSwap | Arc2Face | PuLID | Average | Improvement (+) |
|---|---|---|---|---|---|---|
| Qwen-2.5-VL-7B (Baseline) | 51.00 | 49.79 | 49.85 | 50.30 | 50.24 | - |
| Standard MLLM Training | 71.96 | 59.10 | 62.12 | 68.18 | 65.34 | + 15.10 |
| VIPGuard (Ours) | 99.48 | 96.40 | 98.05 | 98.96 | 98.23 | + 47.99 |
Regarding the practicality of our method, we would like to clarify the following:
- Even when multiple VIP tokens are present, our method can accurately verify image authenticity—crucially, it does so without any manual selection of VIP tokens. Specifically, we simply select the VIP token according to the facial similarity score and call this variant "Adaptive VIPGuard". The highly similar results of the original VIPGuard and Adaptive VIPGuard, as shown in Table 2, provide strong evidence of their effectiveness in practical scenarios. The details of Adaptive VIPGuard are as follows (a minimal selection sketch is also given after Table 2 below):
  - Adaptive VIPGuard first calculates the similarity between the input image and each enrolled VIP identity, selecting the corresponding VIP token based on this similarity; the second stage then applies the selected VIP token for detection, following the procedure outlined in the paper. No additional training steps or modules are introduced in this process.
- Our method already meets real-world application needs. For high-profile individuals such as politicians and celebrities, the required detection accuracy is much higher, as even a single undetected counterfeit image can have serious societal consequences. Our method achieves the very high detection performance needed to satisfy this demand.
- Traditional deepfake detection methods fall short of this level of accuracy and currently offer no user-specific protection. Our method significantly outperforms existing approaches in terms of accuracy for VIP users, offering robust protection and effectively addressing a critical gap in the field.
TABLE 2: Comparison of AUC (%) for adaptive VIP token selection in VIPEval, where Adaptive VIPGuard refers to the automatic selection of the VIP token without manual intervention.
| Method | BlendFace | InSwap | Arc2Face | PuLID | Average |
|---|---|---|---|---|---|
| VIPGuard | 99.48 | 96.40 | 98.05 | 98.96 | 98.22 |
| Adaptive VIPGuard | 99.31 | 96.14 | 97.63 | 98.95 | 98.01 |
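For completeness, a minimal sketch of the adaptive token selection described above is shown below. The interfaces (the `enrolled` dictionary layout and the `vipguard_detect` call) are hypothetical placeholders, not the released code.

```python
# Minimal sketch: pick the enrolled VIP whose centroid is most similar to the query face,
# then run the usual VIPGuard detection with that identity's VIP token.
import numpy as np

def select_vip(query_embedding, enrolled):
    """enrolled: {vip_id: (centric_vector, vip_token)} for every protected identity."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    scores = {vip_id: cos(query_embedding, c) for vip_id, (c, _) in enrolled.items()}
    best_id = max(scores, key=scores.get)
    return best_id, enrolled[best_id][1], scores[best_id]

def adaptive_vipguard(query_image, query_embedding, enrolled, vipguard_detect):
    vip_id, vip_token, sim = select_vip(query_embedding, enrolled)
    # `vipguard_detect` stands in for the second-stage detection call with the chosen VIP token.
    return vip_id, vipguard_detect(query_image, vip_token=vip_token, similarity=sim)
```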
This paper proposes a personalised and explainable deepfake detection framework called VIPGuard tailored for known individuals. The proposed method leverages a multimodal large language model to capture fine-grained facial attributes, perform identity-level discriminative learning, and incorporate user-specific customisation for robust detection. To evaluate performance, the paper also proposes VIPBench, which is a benchmark for identity-aware deepfake detection covering multiple generation techniques. Experiments demonstrate that the proposed method outperforms existing detectors.
Four expert reviewers initially raised several concerns including limited novelty, insufficient experimental results and ablation studies, heavy reliance on VIP user annotation via Gemini or similar APIs, lack of clarity on whether the proposed method generalizes to unseen forgery schemes and new identities, and the lack of failure case analysis. The rebuttal addressed most of these concerns resulting in two borderline accept, one accept and one borderline reject. However, the main concern about novelty has not been fully addressed and two out of four reviewers still consider the contribution to have limited novelty. Despite this, considering the positive aspects of the paper (the introduction of a new benchmark for personalized deepfake detection and demonstrated improvements over existing methods) the AC believes that the strengths outweigh the weaknesses and recommends acceptance.