PaperHub
Overall score: 7.8/10
Poster · 4 reviewers
Ratings: 4, 5, 5, 5 (min 4, max 5, std. dev. 0.4)
Average confidence: 4.0
Novelty: 3.0 · Quality: 3.0 · Clarity: 2.8 · Significance: 3.3
NeurIPS 2025

Unmasking Puppeteers: Leveraging Biometric Leakage to Expose Impersonation in AI-Based Videoconferencing

OpenReview · PDF
Submitted: 2025-05-12 · Updated: 2025-10-29

Abstract

Keywords
Multimedia Forensics, Talking Head Videos, Talking Head Videoconferencing, Puppeteering Attacks

Reviews and Discussion

Review
Rating: 4

The paper proposes a defense against puppeteering attacks, which identifies potential attacks by projecting the latent representation $z_t$ of each frame of a synthetic video, together with a fixed reference image $R$, into the EBL space and computing the similarity of their identity features. Meanwhile, contrastive learning facilitates the decoupling of identity-relevant information from irrelevant pose-expression information, enhancing the identification of puppeteering attacks. Experimental results demonstrate the accuracy and robustness of the proposed method. It holds some application value for AI-based low-bandwidth talking-head videoconferencing, capable of defending against potential attacks.

Strengths and Weaknesses

Strengths:

This paper proposes a novel defense method against puppeteering attacks, aiming to protect AI-based talking-head videoconferencing. The article ingeniously designs the EBL space to amplify identity discrepancies caused by puppeteering attacks, while theoretically verifying the boundary guarantees of contrastive learning in decoupling identity and pose-expression information.

Weaknesses:

  1. The writing and structure of the paper still have some issues, such as: Fig. 1 is too small. Section 3.3 may be redundant. Fig. 2 is not referenced in the text.
  2. There are some logical inconsistencies in certain parts of the article. For example: Discrepancy between Equation (2) and its description. From the perspective of the entire paper, a puppeteering attack should involve an attacker using their own identity information to integrate the victim's pose-expression information, thereby synthesizing a fake video. However, lines 111-112 state that the synthetic video is created using the victim's identity information and the attacker's pose-expression information. This seems contradictory to both Equation (2) and the overall description of puppeteering attacks presented in the paper.
  3. Some of the arguments presented in the paper lack enough evidence. For example: Insufficient detail about the images in Fig. 3: The paper discusses the transformation to the EBL space to amplify sensitivity to identity features. However, Fig. 3 does not provide detailed information about the source of the images. Are they collected by the authors themselves, or are they from a public dataset? Additionally, the identity features of the image pairs shown in Fig. 3 do not appear to differ significantly, and attributes such as clothing and hairstyle are also quite similar. Did the authors use the image pairs shown in Fig. 2 (which have significant differences in both identity and attribute features) to compare cosine similarity? Using these image pairs would better support the paper's argument: even when identity differences are substantial, if pose-expression features are consistent, the cosine similarity in the latent space of pose-expression features remains similar.

Questions

The paper has the following issues:

Writing:

  1. Fig. 1 is too small, which affects readability. The authors should consider enlarging the image.
  2. For Section 3.3, I believe it could be considered for deletion since a similar discussion has already been presented in Section 1.
  3. The paper includes Fig. 2, but there is no mention of it in the text, suggesting that there might be an omission in referencing Fig. 2.

Content:

  1. There might be an error in Equation (2). The unauthorized $R^k$ should be $R^\ell$, as the attacker only replaces the identity but still needs to mimic the speaker's pose-expression. Based on the above discussion, the statement "the adversary’s own video $V_t^\ell$ is used to derive the pose and expression vectors $z_t^\ell$ that are transmitted to the receiver. The receiver then uses the generator $g$ alongside the unauthorized reference $R^k$ and $z_t^\ell$ to reconstruct the video" might be inconsistent with Equation (2).

  2. The image pairs shown in Fig. 3 illustrate the distribution of cosine similarity scores, but the image pairs displayed appear very similar in terms of attributes such as clothing, skin tone, and age. Therefore, I am wondering whether ordinary cosine similarity can effectively distinguish between image pairs with significant attribute differences (for example, different identities but the same pose, as in the example in Fig. 1).

Limitations

  1. It does not consider the impact of potential adaptive attacks, such as designing targeted adversarial samples to evade this defense mechanism.
  2. The paper chooses to forgo a comparison with Avatar Fingerprinting, but I believe it would be beneficial to present the original experimental results and compare them in terms of time or computational overhead. This would strengthen the paper's claim of being "efficient and real-time".
  3. The paper could include the use of different time-series-related neural network models, such as RNN (Recurrent Neural Network)[1][2], TPA (Temporal Pattern Attention)[3], GRU (Gated Recurrent Unit)[4], etc., to enhance the generalizability of the proposed approach across various model architectures.

[1] J. L. Elman. Finding structure in time. Cognitive Science, 14(2):179–211, 1990.

[2] M. I. Jordan. Serial order: A parallel distributed processing approach. In Advances in Psychology, vol. 121, pp. 471–495. North-Holland, 1997.

[3] S.-Y. Shih, F.-K. Sun, and H. Lee. Temporal pattern attention for multivariate time series forecasting. Machine Learning, 108:1421–1441, 2019.

[4] K. Cho, B. van Merriënboer, C. Gulcehre, et al. Learning phrase representations using RNN encoder–decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.

Final Justification

This paper proposes a novel defense method against puppeteering attacks, aiming to protect AI-based talking-head videoconferencing. The article ingeniously designs the EBL space to amplify identity discrepancies caused by puppeteering attacks, while theoretically verifying the boundary guarantees of contrastive learning in decoupling identity and pose-expression information.

Questions are well addressed by authors' rebuttal. I hold my opinion on accepting the paper.

Formatting Issues

None.

Author Response

Comment (Weakness 1 & Writing 1,2,3): The writing and the structure of the paper have some issues, such as Fig. 1 being too small and Section 3.3 having redundancies.

We thank the reviewer for pointing out the issues with the paper’s writing and structure. We will address these in the camera-ready version by ensuring that all figures, including Fig. 1, are resized and reformatted for clarity and that Section 3.3 is revised to remove redundancies and improve conciseness. We will also review the full manuscript to improve clarity, ensure consistent formatting, and enhance overall readability.

Comment (Weakness 2): There are logical inconsistencies in certain parts of the paper, especially in relation to Equation 2 and the lines leading up to that equation.

Thank you for catching the mistake in Equation 2 of our main paper. We sincerely apologize for the confusion this oversight may have caused and will correct it in the camera-ready version.

The correct text in lines 108–109 should read:

An adversary, speaker $\ell$, obtains a target speaker’s representations $R^k$ and substitutes the driving video for the target speaker $k$, $V_t^k$, with their own video $V_t^\ell$, thus replacing the target speaker’s pose and expression vectors $z_t^k$ with their own pose and expression vector $z_t^\ell$.

Accordingly, the correct form of Equation 2 is:

$\hat{V}_t^{k \rightarrow \ell} = g(f(V_t^\ell), R^k) = g(z_t^\ell, R^k)$

We thank the reviewer for pointing out this issue and will ensure it is clearly corrected and explained in the camera-ready version.
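For readers who prefer code to notation, a minimal sketch of the corrected attack model is shown below. The encoder `f`, generator `g`, tensor shapes, and embedding size are hypothetical placeholders used only to illustrate Equation 2; they are not the paper's actual models.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(video_frame):
    # Hypothetical encoder: maps a driving frame to a pose/expression embedding z_t.
    return rng.standard_normal(128)

def g(z_t, reference):
    # Hypothetical generator: reconstructs a frame from (z_t, reference R).
    return np.tanh(z_t.mean() + reference.mean()) * np.ones((64, 64, 3))

# Legitimate self-reenactment by speaker k:  V_hat_t^k = g(f(V_t^k), R^k)
V_t_k = rng.standard_normal((64, 64, 3))   # speaker k's own driving frame
R_k = rng.standard_normal((64, 64, 3))     # speaker k's authorized reference
self_reenacted = g(f(V_t_k), R_k)

# Puppeteering attack (corrected Eq. 2): the adversary, speaker l, drives the call
# with their own frame V_t^l, so the receiver renders
#   V_hat_t^{k <- l} = g(f(V_t^l), R^k) = g(z_t^l, R^k)
V_t_l = rng.standard_normal((64, 64, 3))   # adversary's driving frame
puppeteered = g(f(V_t_l), R_k)
```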

Comment (Weakness 3 - Part A): Fig. 3 does not provide detailed information about the sources of the images. Are they collected by the authors themselves, or are they from a public dataset?

Thank you for this comment. We apologize for inadvertently omitting this information. The images shown in Fig. 3 are taken from the NVFAIR dataset, which is publicly available. Similarly, the data used to generate the distributions shown in Fig. 3 is also from the publicly available NVFAIR dataset. We will revise the camera-ready manuscript to include this information.

Comment (Weakness 3 - Part B): Additionally, the identity features of the image pairs shown in Fig. 3 do not appear to differ significantly, and attributes such as clothing and hairstyle are also quite similar. Did the authors use the image pairs shown in Fig. 2 (which have significant differences in both identity and attribute features) to compare cosine similarity?

Thank you for raising this issue. The specific images used in Fig. 3 were only chosen to visually illustrate pose/expression and identity matches or mismatches. They are only a single datapoint, and as such do not capture the significant variation in facial attributes, clothing, hairstyle, gender, ethnicity, etc. that was present in the larger dataset used to generate the distributions shown in Fig. 3.

To create the distributions shown in Fig. 3, we used a dataset of 500 unique video-level pairs of self-reenacted and puppeteered videos from the NVFAIR dataset, corresponding to over 125,000 frame-level cosine similarity comparisons. These comparisons encompassed a wide variety of facial attribute variation, demonstrating that even when identity differences are substantial, the cosine similarity between pairs of embeddings with closely aligned pose and expression remains similar.

We apologize for the confusion that has resulted from omitting this information from our paper. We will revise the camera-ready version to include this information and improve clarity.
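As a rough illustration of the frame-level comparison described above, the sketch below computes a distribution of per-frame cosine similarities for one video pair. The embeddings are random stand-ins rather than NVFAIR data, and the dimensions are assumptions for illustration only.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)

# Stand-ins for per-frame pose/expression embeddings of one self-reenacted /
# puppeteered video pair; in the paper these come from the reenactment encoder.
frames_self = rng.standard_normal((250, 128))
frames_puppet = frames_self + 0.3 * rng.standard_normal((250, 128))  # pose-matched, different identity

# Frame-level cosine similarities; aggregating these over many video pairs
# yields the kind of distribution plotted in Fig. 3.
scores = [cosine(a, b) for a, b in zip(frames_self, frames_puppet)]
print(f"mean similarity: {np.mean(scores):.3f} (std {np.std(scores):.3f})")
```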

Comment (Limitation 1): Paper doesn’t consider the impact of potential adaptive attacks such as adversarial examples to evade this defense mechanism

Thank you for this comment. We agree that understanding this system’s vulnerability to adversarial countermeasures is important. However, this involves a detailed theoretical investigation, as well as thorough accompanying experimental analysis. As a result, this is beyond the scope of this work.

Specifically, this would involve developing new attack strategies, creating new datasets, and running multiple experiments under several conditions to effectively understand this system’s vulnerability to attack. Unfortunately, we do not have the space to include this in this work.

We do, however, believe that this is an important topic of study and plan to address this in a separate future work.

Comment (Limitation 2): Lack of comparison with Avatar Fingerprinting in terms of time or computational overhead.

Thank you for your comment. We agree that a computational complexity comparison between our method and Avatar Fingerprinting (AF) would help the reader understand trade-offs and advantages of using each method, specifically with respect to computational complexity and runtime.

Unfortunately, at the time of this writing, Avatar Fingerprinting’s code and trained model are not publicly available. Furthermore, its authors did not report any results regarding its computational complexity or runtime.

However, we note that AF’s proposed method used a deep 3D Temporal CNN to process facial data from a number of video frames. This neural architecture has significantly higher computational complexity than our method’s MLP layers. Additionally, AF can only perform detection post-hoc, meaning it produces a single score for an entire video clip. This means that this method is unlikely to be able to perform detection in real time.

Comment (Limitation 3): The paper would benefit from using different time-series-related neural networks such as RNNs.

We appreciate the reviewer’s comment. The primary focus of our work is on the discovery of biometric leakage from the pose-and-expression embedding, the design of the enhanced biometric leakage (EBL) representation, and the associated training protocol and loss function. While temporal information fusion is an important component, it is not the core novelty of our approach.

We did evaluate different temporal fusion strategies during development and found that LSTM-based fusion provided stronger performance gains than simple temporal moving average methods. Importantly, the LSTM was able to achieve real-time performance on consumer hardware, aligning well with our design goals for practical deployment.

Given the strong overall system performance and our emphasis on the primary contributions noted above, we prioritized our research efforts toward those core components rather than conducting a comprehensive comparative study of temporal modeling approaches. We agree, however, that exploring alternative temporal modeling strategies could be an interesting direction for future work, and we will note this in the revised version of the paper.

Comment

Thank you for your response. Most of my concerns are now resolved.

Review
Rating: 5

This paper focuses on preventing adversarial attacks in AI-based video conferencing (where essentially the pose-expression vectors control the generation of a video on the receiver side, thus reducing bandwidth). The main advantage of their method is that it doesn't require any video being sent to the receiver, and it generalizes to new people (old methods do not), i.e. the new person controlling the video doesn't need to have pre-recorded data about themselves to eliminate the likelihood of an adversarial attack. To do this, they learn a latent space that extracts biomarkers from pose-expression data via a novel pose-matched negative pair creation method, and compare the biomarkers of the person controlling it to the person being generated to ensure consistency. Detection happens in real-time.

Strengths and Weaknesses

Strengths:

  1. Neat way of identifying mismatched controller / generated persons on new people by leveraging the same information that the classifier-based method used, simply by being smart about the negative pairs used in the contrastive learning loss.
  2. The paper is well written, and well motivated.
  3. There are smart engineering decisions, such as the compactness of the representation, so it's latency-free (i.e., real-time).

Weaknesses:

  1. The authors decided to remove the main baseline, Avatar Fingerprinting, citing that "the setup is incompatible with our framework, as removing enrollment disables its core mechanism and renders comparison unfair". However, I think it's pretty crucial to understand when to use which method. For instance, I would recommend that they use the enrollment process with Avatar Fingerprinting for a portion of people, and compare against their method (how well does this paper's method compare to Avatar Fingerprinting when it has the advantage of the enrollment data?). Of course, when you don't have enrollment data, this paper (so far) is the only option. But it is still interesting to know how well the two methods do relative to each other when you do have that data.

Questions

  1. I don't follow exactly what the deepfake detectors were tasked to do and why they were included? As mentioned in the paper, all videos are fake? I don't think this measures the same thing as the puppeteering attack, of course.

  2. It might be the case that if you were to do contrastive learning with a sufficiently large dataset of identities, both the biomarkers and the identity information in the way that people move would be sufficient to generalize to new people. I didn't find any discussion on this, but this could be just me missing some information.

  3. How robust is this method against adaptive adversaries who might use generative models or adversarial perturbations to transform their pose-expression embeddings to mimic the target's biometric signature? Have you considered attacks where someone trains a model to map their own biometric leakage patterns to match the victim's distribution? How would this method have to change when users start doing that?

Limitations

While technical limitations were addressed, social limitations were not.

Final Justification

I maintain my recommendation of accept.

Formatting Issues

n/a

Author Response

Comment (Weakness 1): Request inclusion of a comparison with Avatar Fingerprinting and the analysis of use cases for both Avatar Fingerprinting and proposed approach.

We appreciate the reviewer’s suggestion to compare with Avatar Fingerprinting (AF). We agree that a side‑by‑side comparison can further clarify the benefits (or potential weaknesses) of our enrollment‑free approach.

To fully address the reviewer's concern, we have conducted an additional experiment in which we allowed Avatar Fingerprinting to have an “oracle enrollment period,” meaning all users were enrolled ahead of the evaluation period. We note that this experiment potentially unfairly penalizes our algorithm, since due to its problem setup, it is not allowed to exploit information that could be learned during this enrollment period.

Since Avatar Fingerprinting is not publicly available or accessible, we could not retrain it on our datasets presented in the main paper to make a direct comparison. However, we were able to make a comparison by benchmarking our method’s performance against Avatar Fingerprinting on an identical dataset. Specifically, we evaluated our method’s performance on NVIDIA-VC data generated using the generator LIA used in Avatar Fingerprinting’s main experiment.

In this experiment, our proposed method was able to achieve a detection AUC of 0.87. This outperforms AF’s reported AUC of 0.84, demonstrating that our system can offer benefits over AF in terms of both detection performance and lack of an enrollment requirement. We note that while our method's AUC in this experiment is lower than our average reported numbers in Tables 1 and 2, LIA is an older generator. It only transmits a limited set of facial landmarks during operation, yielding noisier embeddings than the rich embedding vectors used by modern reenactment engines that were presented in our paper. We will amend the supplemental materials in the camera-ready version to reflect this comparison.

Comment (Question 1): Why are comparisons with deepfake detectors included considering the fact that all videos are fake.

We thank the reviewer for the question. We agree that traditional deepfake detectors are not designed to detect puppeteering attacks.

We included these experimental comparisons to help readers unfamiliar with this problem to understand the difference between puppeteering attack detection and deepfake detection. As our experiments show, deepfake detectors perform roughly on par with a random guess when performing puppeteering detection. This helps readers from outside this research area understand that differentiating between real videos and deepfake videos is a distinct problem from differentiating between self-reenacted and puppeteered videos, both of which are fully synthetic.

Comment (Question 2): If you were to do contrastive learning with a sufficiently large dataset of identities, would both the biomarkers and the identity information in the way that people move be sufficient to generalize to new people?

Thank you for this comment. Given enough reference training data from a particular identity, it is in fact possible to learn the unique biomarkers and the identity information in how people move. This is the approach explicitly taken by Avatar Fingerprinting. The experimental results presented in that paper demonstrate that movement biometrics can help expose puppeteering attacks.

However, this approach explicitly requires an enrollment period where a significant amount of movement data is recorded and used to train an identity-specific detector. Without this enrollment period, Avatar Fingerprinting and techniques based on movement biometrics cannot work.

Our method is designed to operate without an enrollment period requirement, thus enabling it to be applicable in a wider set of videoconferencing scenarios.

Comment (Question 3): How robust is the paper against adaptive adversarial attacks that specifically transform their pose-expression embeddings to mimic target biometric signature, or a model trained to map their own biometric leakage patterns to match victim’s distribution?

Thank you for this comment. We agree that understanding this system’s vulnerability to adversarial countermeasures is important. However, this involves a detailed theoretical investigation, as well as thorough accompanying experimental analysis. As a result, this is beyond the scope of this work.

Specifically, this would involve developing new attack strategies, creating new datasets, and running multiple experiments under several conditions to effectively understand this system’s vulnerability to attack. Unfortunately, we do not have the space to include this in this work.

We do, however, believe that this is an important topic of study and plan to address this in a separate future work.

Comment

Thank you for the rebuttal. My comments have been addressed satisfactorily.

Comment

Thank you for your insightful and thorough comments. We’re excited to integrate your suggestions and the additional experiments into our final manuscript.

Review
Rating: 5

This paper proposes a method to defend against puppeteering attacks in bandwidth-constrained videoconferencing systems. In a puppeteering attack, an adversary on the sender side transmits an unauthorized representation of a different target speaker when a video call is initiated. As a result, the identity reconstructed by the receiver differs from the actual person controlling the video stream. The proposed approach detects such attacks by verifying whether the identity information contained in the pose-and-expression vector is consistent with the identity in a reference frame.

Strengths and Weaknesses

Strengths:
(1) The proposed defense against puppeteering attacks in bandwidth-constrained videoconferencing systems is intuitive and appears to be effective. The method is well-aligned with the practical requirements and constraints of such systems, and the design choices are clearly justified.
(2) The experimental results provided in the paper demonstrate the effectiveness of the proposed method in defending against puppeteering attacks.
(3) The paper is well-structured and clearly organized.
Weaknesses:
(1) It appears that the variables k and p may have been swapped in Equation 2.
(2) The proposed defense method essentially formulates the problem as a face recognition task, with the main difference being that, instead of using RGB images as input (as is common in standard recognition settings), the input in this scenario is a vector representing pose and expression. However, the paper lacks a comparison with standard recognition methods as baselines. In particular, it would be helpful to include results using established approaches such as CosFace or similar methods, to better contextualize the performance of the proposed approach.
(3) In Section 4.2.1, the authors state that using two separate heads leads to better performance. However, the ablation study only compares the presence and absence of the head module, and does not provide a direct comparison between using a single head and two heads.
(4) Using LSTM for temporal information fusion is a standard practice in the field. The paper would benefit from a discussion of why LSTM was chosen over other temporal modeling approaches.

Questions

The proposed defense method essentially formulates the problem as a face recognition task, with the main difference being that, instead of using RGB images as input (as is common in standard recognition settings), the input in this scenario is a vector representing pose and expression. I am curious whether similar defense performance could be achieved by using a standard recognition training approach, simply replacing the RGB images with pose–expression vectors, combined with the hard negative data synthesis method proposed in the paper. It would be helpful if the authors could discuss or provide experimental results on this baseline to better isolate the contribution of their proposed method.

Limitations

yes

Final Justification

After reading the authors' response, I decided to raise my score to accept.

Formatting Issues

/

Author Response

Comment (Weakness 1): Equation 2’s variables may have been swapped

Thank you for catching the mistake in Equation 2 of our main paper. We sincerely apologize for the confusion this oversight may have caused and will correct it in the camera-ready version.

The correct text in lines 108–109 should read:

An adversary, speaker $\ell$, obtains a target speaker’s representations $R^k$ and substitutes the driving video for the target speaker $k$, $V_t^k$, with their own video $V_t^\ell$, thus replacing the target speaker’s pose and expression vectors $z_t^k$ with their own pose and expression vector $z_t^\ell$.

Accordingly, the correct form of Equation 2 is:

$\hat{V}_t^{k \rightarrow \ell} = g(f(V_t^\ell), R^k) = g(z_t^\ell, R^k)$

We thank the reviewer for pointing out this issue and will ensure it is clearly corrected and explained in the camera-ready version.

Comment (Weakness 2 & Question): Direct comparison with standard facial recognition baselines

Thank you for pointing out that a direct comparison with standard facial recognition baselines like CosFace can help in contextualizing our method’s performance and better highlight our paper’s novel contributions.

To accomplish this, we conducted an additional experiment in which we evaluate the puppeteering detection performance of CosFace and ArcFace in two scenarios: (i) applied directly to the same pose-and-expression embeddings that our method utilizes, and (ii) further adapted these methods to be compatible with our proposed training protocol. For both of these scenarios, we utilized the same experimental setup as in Table 2 of our main paper where we trained only on the NVIDIA-VC subset and evaluated on the CREMA-D and RAVDESS subsets. We then reported the average detection performance over both subsets. The result of this experiment is shown in the table below:

| Method | Scenario | Avg. AUC |
|---|---|---|
| CosFace | Direct application | 0.61 |
| ArcFace | Direct application | 0.56 |
| CosFace | Adapted to use our training protocol | 0.93 |
| ArcFace | Adapted to use our training protocol | 0.91 |
| Ours | Our training protocol | 0.97 |

From these results, we observe that a direct application of standard facial recognition baselines like CosFace and ArcFace obtains substantially lower performance than our proposed method. While their performance is improved when CosFace and ArcFace are adapted to utilize our training protocol, it is still lower (0.93 vs. 0.97) than our method. These results show that (1) facial recognition methods do not distinguish between identities when directly examining the pose-and-expression embeddings, (2) our proposed training protocol provides substantial performance gains when applied to multiple systems due to its ability to enable networks to control for pose and expression differences, and (3) our method’s Enhanced Biometric Leakage space and network design are better at detecting puppeteering attacks than competing facial recognition approaches. Thank you for requesting this additional validation. We will include these important results and discussions in the camera-ready version of our main paper.

Comment (Weakness 3): The ablation study only compares the presence and absence of the head module, and does not provide a direct comparison between using a single head and two heads.

Thank you for your comment. We believe this concern may stem from a misunderstanding due to a lack of clarity in how we described our ablation study. We address this below and will revise our camera-ready version to resolve the issue.

Our method operates using two pose/expression embeddings that contain biometric information: Head 1 corresponds to the sender driving the video, and Head 2 corresponds to the reference frame used to reconstruct the video at the receiver. Puppeteering is detected by identifying biometric mismatches between these two heads. As stated in Section 4.2.1, two heads are essential—biometric mismatch cannot be identified without comparing biometric information between them. Information from a single head is insufficient to detect a mismatch between the speaker at the sender and the reconstructed identity at the receiver.

The confusion may have arisen from the way we described our ablation, specifically when we wrote: “The results in Tab. 4 show that removing the biometric leakage embedding modules (h1, h2) resulted in a substantial drop in performance.” This wording unintentionally implies that we removed the head modules entirely and attempted to measure biometric mismatch without using any heads, which could also be interpreted as suggesting that a single-head configuration could be evaluated.

In reality, our ablation did not remove the heads. Instead, it removed the two MLPs that project the information from the two heads into a common enhanced biometric leakage space. Without this projection module, the embeddings from the two heads were compared directly. The goal of this ablation was to demonstrate the importance of the enhanced biometric leakage space projection, not to imply that our method can operate without two heads.

We apologize for the confusion this wording caused and will clarify this point in the camera-ready version.
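To make the two-head design concrete, here is a minimal sketch assuming simple MLP projectors and a fixed cosine threshold; the layer sizes, names, and threshold are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EBLProjector(nn.Module):
    """Projects a pose/expression embedding into a shared EBL space (sizes are illustrative)."""
    def __init__(self, in_dim=128, ebl_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, ebl_dim))

    def forward(self, x):
        return F.normalize(self.mlp(x), dim=-1)

h1 = EBLProjector()   # head 1: embedding from the sender driving the video
h2 = EBLProjector()   # head 2: embedding tied to the reference frame at the receiver

z_sender = torch.randn(1, 128)      # placeholder inputs
z_reference = torch.randn(1, 128)

# Puppeteering is flagged when the biometric content of the two heads disagrees.
similarity = F.cosine_similarity(h1(z_sender), h2(z_reference)).item()
is_puppeteered = similarity < 0.5   # hypothetical threshold; the paper fuses scores over time
```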

Comment (Weakness 4): Discussion regarding choice of temporal information fusion

We appreciate the reviewer’s comment. The primary focus of our work is on the discovery of biometric leakage from the pose-and-expression embedding, the design of the enhanced biometric leakage (EBL) representation, and the associated training protocol and loss function. While temporal information fusion is an important component, it is not the core novelty of our approach.

We did evaluate different temporal fusion strategies during development and found that LSTM-based fusion provided stronger performance gains than simple temporal moving average methods. Importantly, the LSTM was able to achieve real-time performance on consumer hardware, aligning well with our design goals for practical deployment.

Given the strong overall system performance and our emphasis on the primary contributions noted above, we prioritized our research efforts toward those core components rather than conducting a comprehensive comparative study of temporal modeling approaches. We agree, however, that exploring alternative temporal modeling strategies could be an interesting direction for future work, and we will note this in the revised version of the paper.

Comment

We sincerely thank you for your thoughtful feedback and positive recommendation. We are pleased to hear that our revisions addressed your concerns. We will incorporate the suggested changes into the final version as recommended.

Comment

The authors have addressed all my concerns. I highly recommend authors to include relevant results in either the main paper or the supplementary material. Thanks !

Review
Rating: 5

The paper addresses the problem of puppeteering attacks in AI-based talking-head videoconferencing, where an adversary hijacks a legitimate user’s reference embedding to impersonate them in real time. The authors propose to exploit subtle biometric cues inadvertently encoded in the pose and expression latent, learning a low-dimensional embedding (in the EBL space) via a pose-conditioned contrastive loss and an LSTM-based temporal fusion that flags identity mismatches via a cosine test, without ever reconstructing RGB frames or enrolling users. They report high AUC (> 0.95) on three datasets (NVIDIA-VC, CREMA-D, RAVDESS) and claim real-time operation (75 FPS).

Strengths and Weaknesses

Strengths:

  1. Motivation of the paper is clear and is the first work to use latent representations without reconstructing RGB frames to detect puppeteering attacks.
  2. The authors claim real-time performance (75 FPS), which is a drastic improvement from 32 FPS of the previous baseline.

Weaknesses:

  1. The core idea of applying a contrastive loss to disentangle identity from pose/expression is essentially a direct application of well-established hyperspherical embedding methods (such as CosFace, ArcFace).
  2. The real-time FPS is over-claimed.
  3. In-the-wild results are missing.
  4. Some hyper-parameter exploration is missing.

Questions

  1. Limited Novelty: The core idea of the paper is a direct application of margin-based embeddings such as CosFace and ArcFace. There is little substantial technical innovation beyond simply matching positives and "pose-matched" negatives (Eq. 5). The pose-conditioned large-margin contrastive loss is only superficially novel and merely re-weights standard contrastive objectives. The margin guarantee shown in Proposition 1 is a trivial consequence of spherical geometry and adds little theoretical insight.
  2. The experiment to show cross-domain robustness trains the model on the NVIDIA-VC subset and tests it on CREMA-D/RAVDESS. However, the drop in AUC from 0.966 to 0.925 (Table 2) is non-trivial, and without benchmarks on truly "in-the-wild" videoconferencing data (e.g., Zoom/Teams logs), claims of real-world generalization are overstated. Can you include some "in-the-wild" results?
  3. The paper omits Avatar Fingerprinting from its comparisons on the grounds that it "requires enrollment," but this seems to dismiss a key alternative approach. A more balanced evaluation would include it, perhaps under an “oracle enrollment” assumption, to better contextualize the proposed method.
  4. The authors claim real-time FPS of 75 on an RTX 3090, but modern videoconferencing must run on mobile/CPU devices. Performance on embedded or CPU-only hardware is not evaluated. Moreover, the LSTM temporal window of 40 frames (~1.3 s at 30 FPS) imposes a non-negligible detection latency, which is skipped in the "real-time" discussion (Section 7/8).
  5. Some hyperparameters, such as λ in Equation (6), γ in Equation (7), and the embedding dimension, are not systematically explored.

I request the authors to resolve my above queries for the rebuttal. I am inclined towards the positive side but still have doubts regarding the over-claimed FPS and novelty.

Limitations

Yes, they have included in the supplementary.

Final Justification

The paper provides a real-time method for leveraging biometric leakage to detect puppeteering. I had some queries, which the authors have clarified in the rebuttal. I therefore changed my rating accordingly.

Formatting Issues

You could have included some figure explaining the whole method visually in the main paper.

Author Response

Comment (Question 1): Concerns regarding novelty. Specifically, the reviewer suggests that the core idea closely follows existing margin-based embedding methods such as CosFace and ArcFace, with limited novelty in the proposed pose-conditioned large-margin contrastive loss.

We appreciate the reviewer’s question regarding novelty. We agree that our method leverages a margin-based embedding framework and that contrastive objectives have been used in other prior works (e.g., ArcFace, CosFace). However, as with those prior works, the contribution is not simply using a contrastive loss but rather what we apply it to and how we make it work for this problem.

Our novelty is not that we trivially adapt a margin-based embedding. It is that we identify a previously unrecognized biometric side channel in pose-expression embeddings, and we design a training protocol and inference pipeline that make it possible to reliably and efficiently exploit that side channel for real-time puppeteering detection.

Specifically, our key novel contributions are:

(1) Identifying and exploiting a new biometric side channel for protecting against puppeteering attacks.

This is novel because such leakage had previously been considered a liability (a cause of identity entanglement), whereas we repurpose it as a security signal. In this paper, we show for the first time that pose-and-expression embeddings used in AI-based talking-head videoconferencing systems unintentionally leak persistent biometric identity cues. We then demonstrate that this “biometric leakage” can be reliably extracted and used to detect puppeteering attacks - without decoding RGB frames or enrolling users.

(2) A training protocol that makes biometric leakage usable for authentication.

Directly applying a standard contrastive loss to pose-expression embeddings fails because differences in pose/expression dominate the subtle biometric signal (see Figure 3). Our protocol introduces a pose-matched positive/negative pairing strategy that actively controls for pose/expression during training.

This is technically important because it requires generating pose-matched negatives and positives in a way that is essentially impossible to achieve by simply visually matching real data. Without such control, neither our method nor standard ArcFace/CosFace-style training is effective in this domain.

To demonstrate this point, we conducted an experiment where we retrained ArcFace and CosFace directly on pose-expression embeddings from the NVIDIA‑VC dataset using both standard training, as well as our proposed training protocol. Below, we present the AUCs obtained by averaging performance over 5 SOTA talking head generation systems.

| Method | Standard Training | + Our Training Protocol |
|---|---|---|
| CosFace | 0.60 AUC | 0.92 AUC |
| ArcFace | 0.56 AUC | 0.91 AUC |
| Ours | 0.64 AUC | 0.97 AUC |

These results show that without our proposed training protocol, both ArcFace and CosFace perform poorly in detecting puppeteering. With our protocol, both improve substantially. This confirms that the novelty lies in how we adapt the margin-based embedding framework to this problem. Our full method further outperforms these adapted versions because it incorporates additional design choices (e.g. extreme-pose filtering, temporal fusion) tailored to maximize extraction of biometric leakage.
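To make the pairing strategy described above concrete, the sketch below shows one plausible form of a margin-based contrastive objective over pose-matched pairs. The batch construction, margin value, and function names are assumptions for illustration and do not reproduce the paper's exact loss (Eq. 5).

```python
import torch
import torch.nn.functional as F

def pose_matched_contrastive_loss(anchor, positive, pose_matched_negative, margin=0.3):
    """
    anchor:   EBL embeddings of speaker A driving their own avatar
    positive: embeddings of the same speaker A under a different pose/expression
    pose_matched_negative: embeddings of a *different* speaker whose pose/expression
        was matched to the anchor, so that only identity cues separate the pair.
    All inputs are assumed L2-normalized with shape (batch, dim).
    """
    pos_sim = F.cosine_similarity(anchor, positive)
    neg_sim = F.cosine_similarity(anchor, pose_matched_negative)
    # Pull same-identity pairs together; push pose-matched impostors apart by a margin.
    return F.relu(neg_sim - pos_sim + margin).mean()

# Toy usage with random stand-in embeddings.
a, p, n = (F.normalize(torch.randn(8, 256), dim=-1) for _ in range(3))
print(pose_matched_contrastive_loss(a, p, n).item())
```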

Comment (Question 2): Request for benchmarking our system’s performance using “in-the-wild” videoconferencing data.

Thank you for this comment. We agree that demonstrating robustness on “in‑the‑wild” videoconferencing streams is important. 

To this end, we conducted an additional “in-the-wild” benchmarking experiment in which four members of our lab each recorded five one-minute Zoom calls under natural lighting and bandwidth fluctuations. For every call we produced both self-reenacted and puppeteered videos using five SOTA talking head generators (3DFaceShop, MCNet, EmoPortraits, SDFR, LivePortraits). This yielded 100 self-driven and 100 puppeteered videos that were never seen during training.

Next, we used our system as described in Section 5.2.2 of our paper and evaluated its puppeteering detection performance on this new dataset. Importantly, this was exactly the system used to produce the results in Table 2 of our paper and was trained solely on the NVIDIA VC subset. The “in-the-wild” performance of our system is presented in the table below:

| Generator | AUC |
|---|---|
| 3DFaceShop | 0.94 |
| MCNet | 0.97 |
| EmoPortraits | 0.92 |
| SDFR | 0.97 |
| LivePortraits | 0.98 |
| Average | 0.96 |

Despite codec differences, camera motion, and variable compression, our system retains ≥ 0.92 AUC for every generator and averages 0.955, outperforming the cross‑domain result (0.925) already reported in Table 2. These numbers bolster the claim that EBL features generalize to real videoconference traffic without fine‑tuning. We will incorporate this dataset description and the above table into the supplementary materials in the camera‑ready. 

Comment (Question 3): Request for comparison to Avatar fingerprinting under “oracle enrollment” in which all users would be enrolled prior to evaluation.

We appreciate the reviewer’s suggestion to compare with Avatar Fingerprinting (AF). We agree that a side‑by‑side comparison can further clarify the strengths and benefits of our enrollment‑free approach.

Our original decision to omit AF was motivated by problem‑setting asymmetry: AF requires a per‑user enrollment session during which a unique fingerprint is extracted and stored. By contrast, our defense is designed to operate in “zero setup” conditions where new/unknown users can use videoconferencing without pre-enrollment. Including AF in Table 2 would therefore have been unfair in both directions - it would penalize the enrollment‑free baselines yet still undervalue AF by denying it the enrollment it requires.

To fully address the reviewer's concern, we have conducted an additional experiment in which we allowed AF to have an oracle enrollment period (meaning all users were enrolled ahead of the evaluation period). Since AF is not publicly available or accessible, we cannot retrain it on our datasets presented in the main paper. However, we were able to make a fair comparison by benchmarking our method’s performance against AF on the identical set of data generated using LIA in AF’s main experiment.

In this experiment, our proposed method was able to achieve a detection AUC of 0.87. This outperforms AF’s reported AUC of 0.84, demonstrating that our system can offer benefits over AF in terms of both detection performance and lack of an enrollment requirement. We note that while our method's AUC in this experiment is lower than our average reported numbers in Tables 1 and 2, LIA is an older generator. It only transmits a limited set of facial landmarks during operation, yielding noisier embeddings than the rich embedding vectors used by modern reenactment engines that were presented in our paper. We will amend the supplemental materials in the camera-ready version to reflect this comparison.

Comment (Question 4): Request to evaluate FPS performance on mobile/CPU devices and to address detection latency introduced by LSTM.

Thank you for your comment. We agree that benchmarking our runtime on mobile/CPU devices will provide more insight into our method’s ability to run in real time.

To address this comment, we benchmarked our method (including the 40‑frame LSTM) on two consumer laptops released in 2021. One laptop had no dedicated GPU and utilized an AMD Ryzen 9 5900HX CPU. The other laptop had the same CPU, but was fitted with a mobile NVIDIA RTX 3070 Ti GPU. The results are presented in the table below:

| Platform | Inference Speed (FPS) |
|---|---|
| CPU Only | 27 |
| CPU + Mobile GPU | 64 |

These results show that our detector obtained 27 FPS on the CPU‑only machine, matching typical videoconference frame rates (≈ 25 FPS on low‑bandwidth Zoom). Using an older mobile‑grade GPU it reaches 64 FPS. This is within 15 % of the 75 FPS reported on a desktop RTX 3090, demonstrating favourable compute scaling. We further note that our code has not been optimized for runtime, and further speed gains can likely be achieved using a more efficient implementation.

We agree that the latency introduced by the LSTM should be addressed. We will note in the paper that the LSTM incurs a one-time 1.33 s delay before detection begins, after which detection lags by 1.33 s due to the LSTM window. Since puppeteering typically persists throughout a video rather than occurring in isolated segments, this latency should not significantly impact real-world applicability.
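For reference, the sketch below shows how a 40-frame sliding window translates into the startup and detection latency discussed above; the LSTM dimensions and scoring head are placeholder assumptions, and only the window arithmetic reflects the paper.

```python
import torch
import torch.nn as nn

WINDOW = 40                      # frames fused by the LSTM
FPS = 30                         # typical videoconference frame rate
startup_delay_s = WINDOW / FPS   # ~1.33 s before the first decision is available

lstm = nn.LSTM(input_size=256, hidden_size=128, batch_first=True)  # sizes are illustrative
scorer = nn.Linear(128, 1)

buffer = []                                   # per-frame EBL features (placeholders here)
for t in range(120):                          # simulate an incoming stream
    buffer.append(torch.randn(256))
    if len(buffer) >= WINDOW:
        window = torch.stack(buffer[-WINDOW:]).unsqueeze(0)   # shape (1, 40, 256)
        _, (h_n, _) = lstm(window)
        score = torch.sigmoid(scorer(h_n[-1])).item()         # per-window puppeteering score
```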

Comment (Question 5): Request to systematically explore hyperparameters.

Thank you for your comment. We did perform a systematic exploration of our hyperparameters during system development to select values that yielded the best performance, but we regret that these results were not included in the original manuscript. We apologize for this oversight.

To identify optimal values for λ, γ, and d, we conducted experiments in which we fixed two of these hyperparameters while varying the third. This process was repeated for each hyperparameter, allowing us to evaluate their individual effects on performance. The results of these experiments are provided below and will be incorporated into the supplemental materials for the camera-ready version.

| λ Value | AUC |
|---|---|
| 0.1 | 0.95 |
| 0.3 | 0.96 |
| 0.5 | 0.97 |
| 0.7 | 0.97 |
| 0.9 | 0.97 |

| γ Value | AUC |
|---|---|
| 0.1 | 0.96 |
| 1 | 0.97 |
| 10 | 0.97 |
| 50 | 0.96 |

| d Value | AUC |
|---|---|
| 64 | 0.94 |
| 128 | 0.96 |
| 256 | 0.97 |
| 512 | 0.96 |

From these results, we found that the optimal performance of our system was achieved with the hyperparameter configuration reported in our submission (λ = 0.5, γ = 1, d = 256).
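The sweep itself is a standard one-at-a-time search; a schematic version is given below, where `evaluate()` is a stand-in for retraining the detector under a given configuration and returning its validation AUC, not part of the paper's code.

```python
# One-at-a-time sweep: hold two hyperparameters at their defaults, vary the third.
defaults = {"lam": 0.5, "gamma": 1.0, "d": 256}
grid = {
    "lam": [0.1, 0.3, 0.5, 0.7, 0.9],
    "gamma": [0.1, 1.0, 10.0, 50.0],
    "d": [64, 128, 256, 512],
}

def evaluate(config):
    """Placeholder: train the detector with `config` and return its validation AUC."""
    return 0.0  # dummy score; replace with actual training and evaluation

results = {}
for name, values in grid.items():
    for v in values:
        config = {**defaults, name: v}   # vary one hyperparameter at a time
        results[(name, v)] = evaluate(config)
```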

Comment

Thank you again for your thoughtful and detailed feedback. We look forward to incorporating your feedback and the additional experiments into the final version of the paper.

Comment

The authors have addressed all my queries. I highly recommend authors to include relevant results in either the main paper or the supplementary material. Thanks !

Final Decision

This paper presents a real-time method to prevent adversarial attacks in AI-based video conferencing, where avatars are driven by pose-expression vectors. The core contribution is a technique that learns a latent space of "biomarkers" from this pose data. By comparing the biomarkers of the controller to those of the generated avatar, the system can detect a mismatch. A key innovation is a novel pose-matched negative pair creation strategy for contrastive learning, which allows the method to generalize to new users without requiring any pre-recorded data.

The paper is well-written, well-motivated, and proposes a clever solution to a practical problem.

The initial submission had some weaknesses regarding comparisons with prior work, the extent of the ablation studies, and clarity on certain implementation details. However, these concerns were thoroughly and satisfactorily addressed by the authors during the rebuttal period.

The decision to accept is based on the paper's novel and practical contribution to securing AI-driven communication. The method's ability to generalize to unseen users is a key advance for the field. The technical approach is sound and elegant, and the authors successfully addressed all reviewer concerns in their rebuttal, resulting in a strong and well-supported paper.