Playmate: Flexible Control of Portrait Animation via 3D-Implicit Space Guided Diffusion
Abstract
Reviews and Discussion
This paper proposes a novel audio-driven, DiT-based portrait animation pipeline with customized emotion control and driving-video control. The major contributions are 1) a motion-decoupled module with perceptual loss and adaptive normalization, 2) an emotion-control module with DiT blocks, and 3) an implicit 3D decoupled face representation. The experimental results and video visualizations demonstrate the superiority of the Playmate framework.
Questions For Authors
Please see the "Methods And Evaluation Criteria" and "Experimental Designs Or Analyses" sections.
Claims And Evidence
The claims are supported by qualitative visualizations such as videos and figures, as well as quantitative comparison and ablation analysis.
Methods And Evaluation Criteria
The two-stage framework is well explained, and the evaluation provides a comprehensive comparison with SOTA methods on both the HDTF benchmark and the self-collected dataset. However, it is unclear how the head pose and facial dynamics transfer and the perceptual loss improve the disentanglement; is there any detailed explanation for this?
Theoretical Claims
I have checked the theoretical formulas in the method section for both the perceptual loss and Adaptive Normalization.
Experimental Designs Or Analyses
I appreciate the authors for providing both video visualizations and benchmark comparison tables. However, I still have a few minor concerns:
- Alongside the ablations on CFG scales (Table 2) and the Adaptive Norm (Figure 7), it would be beneficial to include a quantitative analysis of each proposed module—specifically, the perceptual loss and the emotion-control module.
- How does the method handle scenarios in which both the audio and the driving video provide conditions for localized lip movement? Is there a mechanism to balance or integrate these dual inputs?
- Although the supplementary website offers visual comparisons, a user study could further substantiate the effectiveness of the proposed Playmate framework in comparison to baseline methods. (Optional)
Supplementary Material
I have reviewed the supplementary materials in both the PDF and the anonymous website.
Relation To Existing Literature
The proposed method contributes to portrait animation with more flexible control.
Missing Essential References
I believe all essential references have been included already.
Other Strengths And Weaknesses
Please see the "Methods And Evaluation Criteria" and "Experimental Designs Or Analyses" sections.
Other Comments Or Suggestions
In general, I believe this is a well-written paper in good shape. The existing experimental results are convincing and demonstrate expressive facial motions and head movements. The method is novel and offers flexible control.
Thank you for your feedback. We appreciate your recognition of our method's innovation and applicability. Here are our responses to your comments.
Q1: However, it's unclear how the head pose and facial dynamics transfer and the perceptual loss improve the disentanglement; is there any detailed explanation for this?
Our transfer loss is primarily inspired by DPE [1] and VASA-1 [2]. DPE introduces a bidirectional cyclic training strategy, akin to CycleGAN's A→B→A' and B→A→B' pathways, to achieve the disentanglement of pose and expression. Our approach is similar but constructs bidirectional, non-cyclic pathways. Specifically, we randomly sample two frames $I_a$ and $I_b$ from a video clip; these frames are characterized by their expression and pose attributes: $(e_a, p_a)$ for frame $I_a$ and $(e_b, p_b)$ for frame $I_b$. After applying attribute transfer (e.g., transferring pose $p_b$ onto $I_a$ and expression $e_a$ onto $I_b$), we obtain two synthesized images $\hat{I}_a$ and $\hat{I}_b$. If the attribute transfer is perfect, these should be two portrait images with identical expressions and poses. To achieve this goal, we apply a perceptual loss to make the synthetic results appear more realistic.
[1] Pang Y., Zhang Y., Quan W., et al. "DPE: Disentanglement of Pose and Expression for General Video Portrait Editing."
[2] Xu S., Chen G., Guo Y. X., et al. "VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time."
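For intuition, below is a minimal sketch of how such a pairwise transfer loss could be computed. The interfaces (`motion_encoder`, `generator`, `perceptual_net`) and the choice of comparing the two transferred results in a perceptual feature space are illustrative assumptions, not the exact implementation in the paper.

```python
import torch.nn.functional as F

def pairwise_transfer_loss(frame_a, frame_b, motion_encoder, generator, perceptual_net):
    """Illustrative pairwise transfer loss (all interfaces are assumed).

    motion_encoder(img) -> (expression_code, pose_code)
    generator(src_img, expression_code, pose_code) -> re-rendered portrait
    perceptual_net(img) -> list of feature maps (e.g., VGG-style features)
    """
    exp_a, pose_a = motion_encoder(frame_a)
    exp_b, pose_b = motion_encoder(frame_b)

    # Two non-cyclic transfer paths: both results should carry
    # frame A's expression and frame B's pose (same identity, same clip).
    img_from_a = generator(frame_a, exp_a, pose_b)  # render frame A with B's pose
    img_from_b = generator(frame_b, exp_a, pose_b)  # render frame B with A's expression

    # If pose and expression are well disentangled, the two renderings should
    # match; penalize their distance in a perceptual feature space.
    loss = 0.0
    for feat_a, feat_b in zip(perceptual_net(img_from_a), perceptual_net(img_from_b)):
        loss = loss + F.l1_loss(feat_a, feat_b)
    return loss
```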
Q2: Alongside the ablations on CFG scales (Table 2) and the Adaptive Norm (Figure 7), it would be beneficial to include a quantitative analysis of each proposed module—specifically, the perceptual loss and the emotion-control module.
Thank you very much for the reminder.
(1) Regarding the transfer loss: its purpose is to achieve more precise disentanglement of facial attributes. Due to time constraints, we calculated the APD (Average Pose Distance) metric of Playmate on two datasets (HDTF and our dataset), as shown in the table below.
| Dataset | APD-yaw | APD-pitch | APD-roll |
|---|---|---|---|
| HDTF | | | |
| Our Dataset | | | |
(2) Regarding the emotion-control module: we compared our method with several approaches, as shown in the table below, demonstrating the effectiveness of our expression control and its superior generation quality in terms of emotional expression.
| Methods | FID | FVD | LPIPS | Acc(Emo) |
|---|---|---|---|---|
| EAMM | 111.710 | 210.275 | 0.223 | 0.160 |
| EAT | 95.085 | 166.316 | 0.138 | 0.450 |
| DreamTalk | 119.032 | 199.962 | 0.246 | 0.350 |
| EDTalk | 135.215 | 221.897 | 0.289 | 0.460 |
| Playmate | 68.234 | 149.837 | 0.112 | 0.550 |
Q3: How does the method handle scenarios in which both the audio and the driving video provide conditions for localized lip movement? Is there a mechanism to balance or integrate these dual inputs?
Since in most cases users expect to use audio to drive lip movements, when both audio and a driving video are available as inputs, we default to using audio for driving lip movements and expressions, while using the driving video to control the pose. In fact, we support multiple driving modes. For example, audio can drive lip movements, while the driving video controls the expression and pose, and vice versa. As shown in video1 and video2, the first row displays the reference image, the first column shows the pose control mode, and the remaining cells present the generated results. The lip sync is synchronized with the audio input, whereas the pose control is achieved through various driving mechanisms (e.g., driving video, preset mode, fixed pose).
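To make the default behavior concrete, here is a small sketch of how decoupled motion codes from the two sources could be recombined at inference. The dictionary keys and mode names are illustrative assumptions, not the actual interface.

```python
def compose_motion(audio_driven, video_driven, mode="audio_lips_video_pose"):
    """Recombine decoupled motion codes from two sources (illustrative only).

    Both inputs are assumed to be dicts of per-frame codes with keys
    'lip', 'expression', and 'pose' produced by the decoupled motion space.
    """
    if mode == "audio_lips_video_pose":
        # Default: audio drives lips and expression, the driving video controls pose.
        return {"lip": audio_driven["lip"],
                "expression": audio_driven["expression"],
                "pose": video_driven["pose"]}
    if mode == "audio_lips_video_exp_pose":
        # Alternative: audio drives lips only; the video controls expression and pose.
        return {"lip": audio_driven["lip"],
                "expression": video_driven["expression"],
                "pose": video_driven["pose"]}
    raise ValueError(f"unknown driving mode: {mode}")
```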
Q4: Although the supplementary website offers visual comparisons, a user study could further substantiate the effectiveness of the proposed Playmate framework in comparison to baseline methods.
Thank you for your suggestion. Due to time constraints, we conducted a user study involving 10 participants who rated videos using the MOS (Mean Opinion Score) rating method, on a scale of 1 to 5, across four metrics: Lip Sync (LS), Video Definition (VD), Naturalness (N), and Visual Appeal (VA). As illustrated in the table below, Playmate has a notable advantage in the VD and VA metrics. While the LS and N metrics are slightly lower than Sonic's, they still outperform those of other methods, showcasing Playmate's strong competitiveness.
| Methods | LS | VD | N | VA |
|---|---|---|---|---|
| JoyVASA | 2.500 | 2.286 | 1.714 | 1.929 |
| Hallo | 2.964 | 2.929 | 3.071 | 2.893 |
| Hallo2 | 3.036 | 2.929 | 2.893 | 2.786 |
| MEMO | 3.321 | 3.036 | 3.179 | 3.143 |
| Sonic | 3.821 | 3.071 | 3.750 | 3.500 |
| Playmate | 3.750 | 3.857 | 3.464 | 3.643 |
After reading the other reviewers’ comments and the authors’ rebuttal, I sincerely appreciate the authors’ effort in providing additional visualizations for pose and emotion control videos, more comparison results, as well as a user study. The ablation study in Q2 has effectively resolved my concerns regarding the proposed modules.
While I find the user study useful, the sample size of only 10 participants limits its statistical significance. I encourage the authors to consider expanding the study to a larger scale in the final version.
Overall, my concerns regarding the experimental design and analysis have been sufficiently addressed. I will maintain my original rating of accept.
Thank you for your recognition and encouragement! We are delighted to receive your suggestions and will implement them in our revision and future work. We have organized a larger user study involving 50 participants, but due to time constraints, we were unable to complete the experiment and compile the data before the response deadline for this submission. We will continue this work and update the results in the final version. Furthermore, we will keep working on this in our future endeavors and keep the community updated on our progress and milestones.
The work introduces Playmate, a diffusion-transformer-based talking face generation model. Playmate is able to generate talking heads (portrait animation) given a reference image and an audio signal, as well as an emotion signal. It splits training into two stages, first training the talking face generation model (diffusion transformer backbone), and then training an emotion-control module while keeping the backbone fixed. Playmate makes use of the same 3D motion representation as face-vid2vid and LivePortrait. The results demonstrate competitive quantitative and qualitative performance of the method, with good video generation quality.
Questions For Authors
- I am confused about your "Adaptive Normalization" - from the definition, using pre-computed global and private mean/std values is not adaptive, and is instead fixed. Adaptive Normalization, as the term is normally used, involves learnable parameters in the network, as is commonly used for DiT models. Can you please explain this further, or make a distinction between the two concepts?
- Further, at inference, which private mean/std is used? How does this work for unseen speakers?
Claims And Evidence
- precise motion decoupling: This claim is used throughout the paper; however, I found its explanation (a) confusing and (b) lacking evidence. Namely, the method is described as decoupling expression and head pose, yet Section 4.3 does not explicitly investigate this. Section 4.3 performs a qualitative image analysis of certain failure cases with and without the adaptive norm, but I fail to see how these disjoint artifacts relate to improved pose and expression decoupling. The lack of video in the supplementary material makes further investigation impossible.
- state-of-the-art lip synchronization/superior video quality: as shown both quantitatively and qualitatively, this is not the case, and the strength of this claim is not validated. Sonic achieves superior performance in this regard. Further, the qualitative videos in the supplementary material still demonstrate, in my opinion, that Sonic has better video quality than Playmate, given its overall expression and pose quality.
Methods And Evaluation Criteria
In general, yes. The metrics and datasets follow related art, and the baselines are sound choices.
Theoretical Claims
The paper does not make any theoretical claims.
Experimental Designs Or Analyses
In general, the experimental designs are sound, with the exception of the following:
- It is common practice in this field to include a user study. While quantitative results are good, and performing well on benchmarks is important, it is not the full picture. A user study enables quantifying the qualitative results of the work in terms of user preference, which is the intended use case of this technology, and is critical to evaluating the quality of the method.
- While quantitative results show good performance on image/video metrics, they fall short in lip synchronization. This is in contrast to the repeated claim and analysis that Playmate achieves state-of-the-art performance in terms of lip synchronization (abstract/L50/col#2, conclusion/L414/col#2) - note, however, that the caption of Figure 2 makes the alternative "competitive" claim. This conflicting language is misleading regarding the actual performance of the model.
- Further, the analysis that video quality is superior to that of other methods, while shown quantitatively, is not supported qualitatively. Primarily, the method falls short in lip sync and expression realism, mainly compared to Sonic. In talking head work like this, video quality is highly correlated with animation quality, and I am not convinced by Playmate's results. Playmate also suffers from odd artifacts and expressions, as shown in the videos on the supplementary material website, which often have the mouth ajar or eyes bulging.
- These artifacts also reduce the impact of the analyses regarding "precise motion decoupling"
- This further highlights the importance of a user study
Supplementary Material
Yes, all of it, and all videos are present. These are appreciated.
Relation To Existing Literature
The contributions of this work are primarily scoped to talking head generation. The techniques used in this work pull from the broader literature, however the contributions do not generalize beyond their scope. This statement is made objectively, and does not demote the work itself.
Missing Essential References
Related art appears sound.
Other Strengths And Weaknesses
- Strength: The paper does introduce a novel orchestration of components into a novel training framework and architecture. This is a strength, and is a good combination of prior work with a few new components.
- Strength: the emotion control does seem to work well.
- Weakness: The writing, particularly around claims and the introduction of components, is often embellished. The language used is often "gratuitously grandiose" - I use this language primarily to illustrate my meaning. This embellishment feeds the narrative of overselling the work's contributions. This is a minor concern and could be addressed in revision. For example:
- "meticulously designed": overly embellished language
- "specialized Proj modules": these appear to be just linear layers
- The authors often say they "introduce" a component, which is misleading terminology that would indicate this is a novel introduction of this technique, when it is in fact not
- Weakness: the strength of the novelty of the work, however, is not proven in the experiments, in my opinion, and the qualitative results do not convince me. While emotion control seems to work well, the overall expression quality is not great and still a little uncanny.
Other Comments Or Suggestions
see other.
We are grateful for your review and valuable comments, and we hope our response fully resolves your concerns.
Q1: precise motion decoupling... (a) confusing and (b) lacking evidence...
Regarding motion decoupling: we emphasize it because it is the foundation for Playmate to achieve various flexible driving combinations. When multiple control conditions exist, we can provide multiple driving modes. For example, a driving video can be used to control the pose, while audio drives the lip movements. This is also the most significant distinction between Playmate and methods like Sonic and EMO. Sonic only takes a reference image and audio as inputs, with the generated results (lips, pose, and expressions) all driven by the audio, without allowing users to specify different driving methods. We have uploaded multiple pose control videos and emotion control videos to our anonymous project website (video1 and video2). The lip sync is synchronized with the audio input, whereas the pose control is achieved through various driving mechanisms (e.g., driving video, preset mode, fixed pose).
Q2: state-of-the-art lip synchronization/superior video quality... Sonic achieves superior performance in this regard...
Thank you very much for the reminder. Although our quantitative metrics for lip synchronization on the two test sets are marginally lower than Sonic's, they still outperform those of the other compared methods. Furthermore, across the remaining four comparison metrics, our method consistently exceeds the performance of all other competing approaches. We have uploaded more comparison videos with Sonic to our anonymous project website (video3, video4, video5, and video6). These videos show that, qualitatively, our lip synchronization is not far behind Sonic's. Moreover, in terms of video clarity, we are significantly better, especially in areas like the teeth. Regarding the differences from Sonic, let us briefly explain here. Sonic is a purely audio-driven algorithm that generates all features of the portrait, including lip movements, expressions, pose, etc., based on audio. This indicates that its driving flexibility is limited. In contrast, we achieve multiple controllable portrait-driving modes by constructing a precise attribute-disentanglement space, providing users with various flexible driving options. The implementation difficulty of this decouple-then-drive approach is higher than that of a simple audio-driven method. This is also the reason why we emphasize "Flexible Control" in our paper title.
Q3: About the user study.
Thank you for your suggestion. This question was also raised by Reviewer SMci. Please refer to our response to Reviewer SMci's question Q4.
Q4: While quantitative results show good performance on image/video metrics... This conflicting language is misleading to the actual performance of the model.
Thank you very much for the reminder. We will correct these issues in the revised version and will also carefully revise the full text.
Q5: Further, analysis that video quality is superior to that of other methods, while...
For details on the comparison with Sonic, see the response provided in Q2.
Q6: the strength of the novelty of the work however is not proven...
We appreciate your scrutiny. In this response, we have added multiple tests (including those in our responses to other reviewers) and uploaded more visual results to the anonymous website, hoping that these results will dispel your doubts and concerns in this regard.
Q7: About the Adaptive Normalization.
Regarding Adaptive Normalization, our approach focuses on adapting to the dimensions of facial attributes. We apply distinct means and standard deviations for pose and expression, which provides additional prior information and reduces the learning complexity for the model. This facilitates more flexible control over the generated outputs. In the inference stage, we have the flexibility to derive these means and standard deviations from various available sources, thereby enabling more precise and controllable driving effects.
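As a rough sketch of what we mean (attribute names and statistics handling are simplified for illustration), the pose and expression codes are normalized with their own precomputed statistics and denormalized with the inverse mapping:

```python
EPS = 1e-6

def normalize_motion(pose, expression, stats):
    """Attribute-wise normalization with precomputed (mean, std) statistics.

    `stats` is assumed to hold separate statistics for pose and expression,
    e.g. estimated over the training set; names are illustrative.
    """
    pose_n = (pose - stats["pose_mean"]) / (stats["pose_std"] + EPS)
    exp_n = (expression - stats["exp_mean"]) / (stats["exp_std"] + EPS)
    return pose_n, exp_n

def denormalize_motion(pose_n, exp_n, stats):
    """Inverse mapping applied to the generated motion codes."""
    pose = pose_n * (stats["pose_std"] + EPS) + stats["pose_mean"]
    expression = exp_n * (stats["exp_std"] + EPS) + stats["exp_mean"]
    return pose, expression
```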
Q8: At inference, which private mean/std is used...
This question was also raised by Reviewer 54Zx. Please refer to our response to Reviewer 54Zx's question Q3.
W1: The writing, particularly around claims and introduction of...
We will correct these issues in the revised version and will also carefully revise the full text.
W2: the strength of the novelty of the work however is not proven in experiments in my opinion, and...
Due to the character limit on responses, please refer to our response to Q6.
The rebuttal sufficiently addresses some of my more pressing concerns about the performance comparison to methods like Sonic. While falling short in certain areas, the argument for flexible control makes sense, and I appreciate the clarification. Pending the improvements to the language mentioned, I am raising my score to a weak accept.
We are delighted to receive your response and suggestions. Thank you for raising your evaluation and for your support. Due to the limitations of the discussion phase, we are unable to directly modify the submitted PDF file or submit a new PDF file at this stage to showcase our revised paper. However, we have thoroughly reviewed and amended the paper. The key updates include:
- Revised the statements regarding lip synchronization performance (abstract/L50/col#2, conclusion/L414/col#2) to "exhibiting strong competitiveness in lip synchronization".
- Revised "specialized Proj modules" in Section 3.2 (L235/col#1) to "Proj modules".
- Revised "introduce face representation techniques" in Section 2.3 (L113/col#2) to "utilize face representation techniques".
- Revised "introduce the pairwise head pose and facial dynamics transfer loss" in Section 3.1 (L197/col#1) to "utilize the pairwise head pose and facial dynamics transfer loss".
- Revised "introduce a self-attention module" in Section 3.2 (L230/col#1) to "utilize a self-attention module".
The aforementioned modifications will be reflected in the final version. Additionally, we will continue to strive to optimize the overall performance of Playmate in our future work and keep the community updated on our progress and milestones.
This work aims to generate lifelike talking videos for arbitrary identities, guided by a speech clip. Emotional and pose conditions are carefully devised to control the talking status. Specifically, a motion-decoupled module and an emotion-control module are designed to enhance the performance.
Questions For Authors
- How is the training stability of applying the head pose and facial dynamics transfer loss in Equation 4? Is there any strategy to evaluate its effectiveness?
- Is there any possibility to evaluate the accuracy of pose conditional control? For instance, facial reconstruction algorithms for the head pose angles.
- For the motion-decoupled module, the authors introduce adaptive normalization. In the inference stage, how does denormalization operate?
- About the collected dataset: it seems the overall performance becomes worse for most approaches. What is the difference between the collected dataset and the HDTF dataset? How many videos are involved in the collected dataset? Will this dataset be released?
- The presented videos show high image quality, but the poses do not showcase too much variability.
- It works surprisingly well with only two DiT blocks inserted before the MLP head for emotional control; any explanations?
Claims And Evidence
The authors claim superior talking head performance and compare it with state-of-the-art approaches. However, for pose control and emotion control, the comparison and experimental validation are missing.
Methods And Evaluation Criteria
The overall approach is composed of two components: one is the enhanced disentanglement in the latent space, while the other is the audio- and emotion-conditioned diffusion transformer. Both are adapted to accomplish their respective functions. The talking-video evaluation criteria follow previous approaches and are plausible.
Theoretical Claims
There are no theoretical claims involved in this paper.
Experimental Designs Or Analyses
- The authors design emotional control strategies, but a comparison with emotion-conditioned state-of-the-art approaches seems to be lacking.
Supplementary Material
Authors include an appendix and an anonymous website, which provide valuable information.
Relation To Existing Literature
This approach introduces a diffusion transformer to the talking head generation field. It is interesting for the community to know such a design can enhance the talking video generation performance.
Missing Essential References
N.A
Other Strengths And Weaknesses
- The attached website does not include head pose control videos. It is hard to evaluate the pose-control performance.
- Line 215 includes some spelling mistakes: "to enhance".
Other Comments Or Suggestions
N.A
First, we would like to thank you for your careful reading and for providing numerous constructive comments! Below, we address the concerns mentioned in the review.
W1: The attached website does not include head pose control videos.
Thank you for pointing this out. We have uploaded multiple pose control videos and emotion control videos to our anonymous project website. For the pose control videos (video1 and video2), the lip sync is synchronized with the audio input, whereas the pose control is achieved through various driving mechanisms (e.g., driving video, preset mode, fixed pose). Regarding the emotion control videos, we have compared our method with several approaches (video3 and video4). Additionally, we conducted quantitative comparisons on emotion control accuracy and video quality; please refer to our response to Reviewer SMci's Q2.
W2: Line 215 includes some spelling mistakes: "to enhance".
Thank you very much for the reminder. We will correct these issues in the revised version and also carefully revise the full text.
Q1: The training stability of the transfer loss.
Regarding training stability: since we use the pre-trained model from LivePortrait, which inherently possesses face-attribute disentanglement capabilities, fine-tuning with the transfer loss ensures both training stability and rapid convergence. We have uploaded a sample image (image1) to the website to demonstrate the disentanglement between head pose and facial dynamics.
Q2: Evaluate the accuracy of pose conditional control.
In the field of image animation, Average Keypoint Distance (AKD) and Average Pose Distance (APD) are commonly used to evaluate pose control performance. We calculated the APD metric of Playmate on two datasets (HDTF and our dataset), as shown in the table below.
| Dataset | APD-yaw | APD-pitch | APD-roll |
|---|---|---|---|
| HDTF | | | |
| Our Dataset | | | |
Additionally, we have uploaded sample pose visualization videos to our anonymous project website (video5 and video6), demonstrating that Playmate achieves good generation quality even in pure pose-driven scenarios.
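For clarity, below is a minimal sketch of how an APD-style metric can be computed per head-rotation angle; the angle estimator and the exact aggregation used in our evaluation are assumptions.

```python
import numpy as np

def average_pose_distance(pred_angles, ref_angles):
    """APD-style metric sketch: mean absolute per-frame angle difference.

    Both inputs are arrays of shape (T, 3) holding per-frame head rotation
    angles (yaw, pitch, roll) estimated by a 3D face reconstruction model.
    """
    pred = np.asarray(pred_angles, dtype=np.float64)
    ref = np.asarray(ref_angles, dtype=np.float64)
    per_angle = np.abs(pred - ref).mean(axis=0)  # average over frames
    return {"yaw": per_angle[0], "pitch": per_angle[1], "roll": per_angle[2]}
```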
Q3: In the inference stage, how does denormalization operate?
In the inference stage, for expression, the mean and standard deviation are the same as those used during training, calculated from all training data. For head pose, the mean and standard deviation are optional and can be computed from user-provided driving videos. If not provided, they can either be set to default parameters (e.g., computed from forward-looking videos) or calculated from randomly selected videos in the dataset.
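A small sketch of this choice at inference time (the names are illustrative; the selected pose statistics feed the denormalization step described above):

```python
def select_pose_stats(driving_video_poses=None, default_stats=None):
    """Choose the pose mean/std used for denormalization at inference.

    `driving_video_poses` is an optional (T, 3) tensor of head-pose angles
    extracted from a user-provided driving video; otherwise we fall back to
    default statistics (e.g. computed from forward-looking clips).
    """
    if driving_video_poses is not None:
        return {"pose_mean": driving_video_poses.mean(dim=0),
                "pose_std": driving_video_poses.std(dim=0)}
    return default_stats
```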
Q4: About the collected dataset.
Our dataset contains about 39k clips, featuring more complex scenes, diverse styles, higher clarity, and greater challenges, which leads to worse overall performance for most approaches. We will try our best to release the collected dataset, but we hope you understand that we, like the community, need to exercise caution when releasing rich datasets. This caution stems from concerns regarding potential risks, particularly those related to individual privacy and likeness rights.
Q5: The poses do not showcase too much variability.
Thank you for your interest in this section. In fact, we support multiple pose driving methods. As shown in our newly uploaded sample videos, under the same audio, different pose control conditions result in different poses generated by Playmate. We support various methods to enhance pose variability, such as using videos with significant pose variations as driving videos, or manually adjusting the rotation value of the pose.
Q6: It works surprisingly well with only two DiT blocks inserted before the MLP head for emotional control, any explanations?
Thank you for your affirmation of our emotion control module. We believe its effectiveness is due to two main reasons: (1) Precise latent space construction and attribute disentanglement, enabling the model to effectively correlate emotion control with emotion features after decoupling attributes like pose and expression, leading to effective learning; (2) A two-stage training approach, where we first stabilize audio-driven training, then train the emotion control module separately while keeping most weights fixed, reducing training complexity. Training all parameters simultaneously might prioritize other labels, rendering the emotion control signals ineffective.
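For illustration, below is a minimal sketch of a lightweight emotion-conditioning head inserted before the MLP output head of a frozen backbone. The block internals (standard transformer encoder layers with a prepended emotion token) are simplifying assumptions and not the exact DiT-block design used in Playmate.

```python
import torch
import torch.nn as nn

class EmotionControlHead(nn.Module):
    """Two extra attention blocks before the MLP head of a frozen backbone."""

    def __init__(self, dim=512, n_heads=8, num_blocks=2, num_emotions=8):
        super().__init__()
        self.emotion_embed = nn.Embedding(num_emotions, dim)
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
            for _ in range(num_blocks)
        ])
        self.mlp_head = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim))

    def forward(self, tokens, emotion_id):
        # tokens: (B, T, dim) features from the frozen, pre-trained backbone.
        cond = self.emotion_embed(emotion_id).unsqueeze(1)  # (B, 1, dim)
        x = torch.cat([cond, tokens], dim=1)                # prepend emotion token
        for blk in self.blocks:
            x = blk(x)
        return self.mlp_head(x[:, 1:])                      # drop the emotion token
```

In a second training stage of this kind, only the added head would be optimized while the backbone parameters stay frozen (e.g., via `requires_grad_(False)`).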
This paper presents Playmate, a novel DiT-based framework for portrait animation with fine-grained emotion and motion control. The work makes contributions in: 1) Motion-Decoupled Module – leveraging perceptual loss and adaptive normalization to disentangle motion dynamics effectively; 2) Emotion-Control Module – integrating DiT blocks for precise emotional expression manipulation. The rebuttal addressed some concerns, with one reviewer score rising from 2 to 3. After reading the whole review process and careful discussion, the AC recommends acceptance.
The authors are encouraged to incorporate all new materials from the rebuttal into the paper. In particular, the performance difference with Sonic is noticeable both quantitatively (also in the user study added during the rebuttal) and visually. The gap was noticed by both the reviewers and the AC. The authors' explanation of the controllability features of this method still makes the AC lean towards Accept.