VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time
Abstract
Reviews and Discussion
The paper presents VASA, a new method for talking head video generation. The method is built as a diffusion model with a Transformer backbone. To improve performance and allow more control over the generated video, the authors use the representation from [1] and learn to disentangle its components. With this, the model can control the gaze, expression, and camera position. The method achieves impressive qualitative results and outperforms the methods it is compared with.
[1] Nikita Drobyshev, Jenya Chelishev, Taras Khakhulin, Aleksei Ivakhnenko, Victor Lempitsky, and Egor Zakharov. Megaportraits: One-shot megapixel neural head avatars. In Proceedings of the 30th ACM International Conference on Multimedia, pages 2663–2671, 2022
Strengths
The qualitative results are very impressive and probably better than the state of the art. The idea of disentangling the existing representation of [1] is nice. The model is very fast at inference on a consumer-grade GPU. The paper is well written and easy to understand.
[1] Nikita Drobyshev, Jenya Chelishev, Taras Khakhulin, Aleksei Ivakhnenko, Victor Lempitsky, and Egor Zakharov. Megaportraits: One-shot megapixel neural head avatars. In Proceedings of the 30th ACM International Conference on Multimedia, pages 2663–2671, 2022
Weaknesses
My main issue with the paper is that, although the work is presented as reproducible in the checklist, I feel that far too many details are missing to actually reproduce the method or the experiments:
- The newly proposed CAPP score lacks details needed for reproducibility, e.g., the training procedure. Sharing this model would also be of interest to the community (and does not raise the ethical concerns that could justify keeping the code private). As it stands, it is difficult to know whether the metric is actually sound.
- It is unclear whether the entire VoxCeleb dataset is used and how many clips remain after preprocessing.
- Lack of details on the new OneMin-32 dataset: size, type of videos, resolution, origin of the videos...
- The paper says that the model is trained on "4 NVIDIA RTX A6000 GPUs" and that the authors "train on massive talking face videos from a large number of identities". How long does the training take, and how much data is actually used? If all of VoxCeleb plus the new unreleased dataset are used, the training could be very long. More details are required here.
- It is not clear how the conditions are used in the network. Are they simply concatenated with the motion latents, or used via cross-attention inside the Transformer?
- It is not entirely clear whether the architecture of [2] is used out of the box to obtain the facial latents or whether it was modified for the disentanglement.
Assuming that the dataset used is very large (>10e6 samples), is the comparison against the other methods, which use 50k-100k samples for training, fair? An ablation with training on a dataset of that scale would have been interesting. Without it, the impressive qualitative results of the method could simply be due to the huge amount of data.
The novelty is limited: the paper mostly reuses existing modules, and the innovation is mainly in the disentanglement.
The comparison against the state of the art is limited; the most recent method compared with, SadTalker, is from 2022.
The FVD score on VoxCeleb should have been shown anyway; other methods from the literature report it.
The method apparently uses [1] for gaze direction estimation. However, [72] appears to be a method for classifying gaze between different modes (e.g., fixation, quick motion...). Was the method modified to obtain the gaze direction g that is used in the paper?
The ablation only presents results for the gaze and audio conditions. It would also have been interesting to see the effect of the expression condition.
[1] Raimondas Zemblys, Diederick C Niehorster, and Kenneth Holmqvist. gazenet: End-to-end eye-movement event detection with deep neural networks. Behavior research methods, 51:840–864, 2019.
[2] Nikita Drobyshev, Jenya Chelishev, Taras Khakhulin, Aleksei Ivakhnenko, Victor Lempitsky, and Egor Zakharov. Megaportraits: One-shot megapixel neural head avatars. In Proceedings of the 30th ACM International Conference on Multimedia, pages 2663–2671, 2022.
Questions
- How do the authors deal with the resolution difference between methods when computing FVD?
- With the CAPP score, are head motions related to speech semantics measured (e.g., a head shake when saying "no")?
Limitations
The authors address the limitations in the appendix.
Response to W1 (CAPP model sharing): We will soon release the CAPP model, which we believe fills a missing piece (an audio-pose alignment metric) in talking face generation research and will be valuable to the community.
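For readers unfamiliar with this family of metrics, the following is a minimal, hedged sketch of a CLIP-style contrastive objective for audio-pose alignment in the spirit of CAPP. The encoder architectures, dimensions, and class names are illustrative placeholders, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioPoseContrastive(nn.Module):
    def __init__(self, audio_dim=768, pose_dim=6, embed_dim=256):
        super().__init__()
        # Hypothetical sequence encoders; any temporal model could be used here.
        self.audio_enc = nn.GRU(audio_dim, embed_dim, batch_first=True)
        self.pose_enc = nn.GRU(pose_dim, embed_dim, batch_first=True)
        self.logit_scale = nn.Parameter(torch.tensor(2.0))

    def forward(self, audio_feats, pose_seq):
        # audio_feats: (B, T, audio_dim); pose_seq: (B, T, pose_dim)
        _, a = self.audio_enc(audio_feats)           # final hidden state: (1, B, D)
        _, p = self.pose_enc(pose_seq)               # final hidden state: (1, B, D)
        a = F.normalize(a.squeeze(0), dim=-1)
        p = F.normalize(p.squeeze(0), dim=-1)
        logits = self.logit_scale.exp() * a @ p.t()  # pairwise similarities (B, B)
        targets = torch.arange(a.size(0), device=a.device)
        # Symmetric InfoNCE: matched audio/pose windows are positives.
        loss = 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))
        return loss, logits.diag()  # diagonal similarities act as a CAPP-style score
```

At evaluation time, the diagonal similarity between an audio window and the generated pose window would serve as the alignment score; the released model may differ in architecture and training details.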
Response to W2 (VoxCeleb training data): We used the entire training set of VoxCeleb2 for our model training. After filtering out invalid or low-quality videos, we ended up with approximately 170K training clips.
Response to W3 (OneMin-32 data details): It contains 32 one-minute video clips of 17 subjects (Line 229-232). They are mostly educational lectures and coaching sessions sourced from online video sharing platforms. The resolution is 512x512.
Response to W4 (training time): Our face latent model (encoder and decoder) takes around 7 days' training on a 4xA6000 workstation, and the diffusion transformer takes ~3 days. The total data used for training is ~500k clips (about 2-10 seconds each).
Response to W5 (condition usage): Yes, the condition signals are simply concatenated with noise along the temporal dimension as the input to the transformer. We'll clarify this in the paper.
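As a reading aid, here is a minimal sketch (with hypothetical shapes, names, and layer sizes) of the conditioning scheme described above: condition signals are projected to token form and concatenated with the noisy motion latents along the temporal axis before entering the transformer. This is an illustration under stated assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class ConditionedDenoiser(nn.Module):
    def __init__(self, motion_dim=256, cond_dim=512, d_model=512, layers=8):
        super().__init__()
        self.motion_proj = nn.Linear(motion_dim, d_model)
        self.cond_proj = nn.Linear(cond_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.out = nn.Linear(d_model, motion_dim)

    def forward(self, noisy_motion, cond_tokens):
        # noisy_motion: (B, T, motion_dim) noisy motion-latent sequence
        # cond_tokens:  (B, C, cond_dim)   audio/gaze/emotion/etc. tokens
        x = torch.cat([self.cond_proj(cond_tokens),
                       self.motion_proj(noisy_motion)], dim=1)  # concat on the time axis
        h = self.backbone(x)
        # Only the positions corresponding to motion tokens are decoded.
        return self.out(h[:, cond_tokens.size(1):])
```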
Response to W6 (changes to [2]): Our latent model architecture is the same as [2]. We did not change the architecture but modified the training loss functions (Lines 149-160), which are critical to achieving disentanglement (Figure A.8 and Lines 344-350).
Response to W7 (data scale ablation): As mentioned above, our data size is ~500K clips (not ">10e6"). In the attached one-page PDF, we add a data-size ablation study and comparison with other methods, as per the reviewer's suggestion. We trained a model using only 10% of the data (i.e., 50k clips). As shown in Table I, this model achieves comparable audio-lip and audio-pose synchronization to the full-data model, though the FVD and p metrics are not as good. This shows that our method performs well even with much less data, and that more data enhances the motion diversity. Moreover, it still significantly outperforms other methods on all metrics assessing synchronization, motion intensity, and video quality.
Response to W8 (novelty): First, our primary motivation is to model human conversational behavior (facial dynamics and head movements) holistically using a diffusion model, in a latent space that is agnostic to ID and appearance. This is our core innovation and, to the best of our knowledge, no previous method has done this (it differs from the trend of further factor disentanglement and direct image generation; see our discussion in Lines 47-55 and 103-111). Second, in pursuit of this goal, we did find the 3D-aided representations to be promising, especially in terms of expressiveness, and hence chose to leverage them. However, they cannot meet our requirement of effective disentanglement. We made some insightful and provably critical modifications (Lines 149-160, 344-350, and Figure A.8), without which we could never reach the current generation quality, especially the liveliness with nuanced emotions. We may have underemphasized the importance and contribution of these modifications, and we will revise our presentation in the revision.
Apart from the two main contributions, our paper also offers others, such as the design of face-factor-conditioned diffusion training and the CAPP model, which fills a missing piece (a pose-audio alignment metric); these are also novel and valuable to the community.
Response to W9 (limited comparison to SOTA?): To the best of our knowledge, there are no other published methods that can generate both audio-driven head poses and facial movements from a single image. We mentioned some concurrent unpublished works in our paper, and have added a visual comparison with a concurrent work, EMO, in the one-page PDF. We would appreciate it if the reviewer could point out specific papers that we should compare with.
Regarding the added comparison with EMO, we provide our results on some samples from EMO's official website (we are unable to provide video links per the rebuttal policy). As shown in Figure I, our method works consistently well and delivers vivid talking head videos. It is obvious that EMO has smaller head motion compared to ours. Also, EMO seems less robust than ours in some cases, with artifacts such as abrupt expression changes, inaccurate lip sync, and subtle texture-detail flickering occasionally appearing upon close inspection (note that their reported average lip-sync score is significantly lower than ours). On the other hand, EMO's video quality is slightly higher than ours in terms of sharpness, owing to their use of a large and powerful image generation foundation model.
Response to W10 (FVD on VoxCeleb2): As shown in Table I of the attached PDF, we provide the FVD scores of different methods on VoxCeleb2. However, it should be noted that the video quality of VoxCeleb2 varies widely and is often low (see Figure II of the PDF). Hence, the FVD score may not accurately reflect the true generation quality, as mentioned in our paper.
Response to W11 (gaze estimation): Thank you for your careful reading. We found that we inadvertently cited the wrong paper: we actually used L2CS-Net [a] to extract gaze directions. We will fix this error in our revised paper.
[a] Ahmed A. Abdelrahman, Thorsten Hempel, Aly Khalifa, and Ayoub Al-Hamadi. L2CS-Net: Fine-Grained Gaze Estimation in Unconstrained Environments. 2022.
Response to Q1: The results of different methods are resized to the same resolution (224x224) for FVD evaluation.
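Below is a small, illustrative preprocessing step for this (assuming frames are stored as float tensors; function and argument names are hypothetical): every method's output is resampled to a common 224x224 resolution before being fed to the FVD backbone, so resolution differences do not bias the metric.

```python
import torch.nn.functional as F

def to_common_resolution(video, size=224):
    # video: (B, T, C, H, W) -> (B, T, C, size, size)
    b, t, c, h, w = video.shape
    frames = video.reshape(b * t, c, h, w)
    frames = F.interpolate(frames, size=(size, size),
                           mode="bilinear", align_corners=False)
    return frames.reshape(b, t, c, size, size)
```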
Response to Q2: This is an interesting question. We have not yet conducted an in-depth analysis of whether the CAPP score captures the semantic relationship between speech and head pose. We will explore this further in future work; thank you for the suggestion.
We hope we have addressed your questions. If not, please let us know your remaining concerns during the discussion period.
Thank you for your acknowledgment of our response and the additional comments.
Yes, our method can easily be adapted to generate facial dynamics only. Another, simpler way to achieve this is to directly replace the generated head poses with predefined ones before face image decoding. However, we should point out that if the given head poses do not match the emotion or rhythm of the audio, the realism of the generated talking face video can degrade significantly (e.g., calm head movement with intense speech, or rhythmic nodding with smooth speech, would look odd). Generating realistic poses is one of the key contributing factors to our high-quality results. That said, we will try to add comparisons with and more discussion of this type of method in our revised paper; thank you again for the suggestion.
After reading the extensive rebuttal, I see that the authors have responded to most of my concerns. I see no reason to reject this paper and have changed my rating to accept. Some of the explanations from the rebuttal should be included in the final version. It would have been interesting to see comparisons with more recent methods, even if they do not generate head poses. If the head pose is controllable, shouldn't it be possible to freeze it to match that of the other methods?
The paper presents a method for generating highly realistic talking head avatars that combines the diversity of facial expressions with the real-time generation speed. It provides a practical and commercially valuable approach to the field of talking head generation.
Strengths
- The overall structure of the paper is very clear and coherent, with a well-defined problem statement.
- By decoupling the information in dynamic faces, better control over the expressiveness of the generated faces can be achieved, meeting the needs of the users.
- The visual presentation of the video is excellent and leaves a lasting impression.
Weaknesses
- The paper does not explain why the proposed method can achieve real-time generation; the use of a diffusion transformer structure might actually lead to a decrease in speed.
- There are some unclear configurations in the implementation section of the method, such as the scale of the video used for training.
Questions
- Will this project be open-sourced? It could actively promote progress in the field of talking head generation. If it is not open-sourced, it is suggested that more implementation details be provided.
- Why is there no comparison with the recent EMO[1] method, for which there are already corresponding implementations in the open-source community?
[1] Tian L., Wang Q., Zhang B., et al. EMO: Emote Portrait Alive: Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions. arXiv preprint arXiv:2402.17485, 2024.
Limitations
- Although VASA-1 has made significant progress in generating realistic facial dynamics and head movements, the paper mentions that it currently only handles the portrait area up to the torso and does not extend to the full body. The coordination of full-body movements is necessary to achieve more natural and realistic virtual characters.
- The paper mentions that, despite using a 3D latent representation, a more explicit 3D facial model is not employed, which may lead to some artifacts caused by neural rendering, such as texture adhesion issues.
Response to W1: We achieve real-time efficiency because of our framework design, i.e., (diffusion-based) motion generation in the latent space plus (CNN-based) decoding in the image space. Our diffusion transformer works in the latent space and is small (only 29M parameters), so it runs very fast. The CNN image decoder is also small (55M parameters) and runs efficiently. (Note that we only need to run the face encoder once to generate a video, so its time can be neglected or counted as part of the starting latency.)
We simply evaluated the running efficiency with our whole method naively deployed in PyTorch without any special speed-up strategy. We believe there's still room for improvement with sophisticated implementation optimization.
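To make the efficiency argument above concrete, here is an illustrative inference loop (all function names and the chunking strategy are hypothetical): the appearance encoder runs once per video, the small latent diffusion transformer generates motion latents, and only the lightweight CNN decoder runs per output frame.

```python
def generate_video(face_image, audio_features, encoder, diffusion, decoder):
    appearance = encoder(face_image)            # run once; startup latency only
    frames = []
    for audio_chunk in audio_features:          # processed clip-by-clip
        motion_latents = diffusion.sample(audio_chunk)   # cheap: ~29M-parameter transformer
        for z in motion_latents:                # one motion latent per output frame
            frames.append(decoder(appearance, z))        # cheap: ~55M-parameter CNN decoder
    return frames
```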
Response to W2: Regarding the scale of video data, we trained our model on approximately 500k clips (2-10 seconds each). In the attached one-page PDF, we also provide an additional ablation study for training data scale. As shown in Table I, the model trained with 10% of the data achieves comparable audio-lip and audio-pose synchronization to the full-data model, though the FVD and p metrics are not as good. This shows that our method performs well even with much less data, and more data enhances the motion diversity.
Response to Q1: We will try our best to release the source code of our project in the future. However, we hope you understand that, due to significant concerns regarding the potential risks, particularly those related to deepfakes and fraud, we (as well as the community) need to be very cautious about releasing a powerful model. In fact, due to RAI considerations, our team faced great difficulty in getting approval from our organization for open-sourcing, unlike any other project we have done before. While we explore the possibility of open-sourcing, we will also add more implementation details, such as those suggested by the reviewers, to the revised paper. Also note that we will soon release the CAPP model, which we believe fills a missing piece (an audio-pose alignment metric) in talking face generation research and will be valuable to the community.
Response to Q2: EMO is a concurrent work (published February 27th on arXiv) relative to our submission (May 22nd), and there was no public implementation at the time, so we did not compare with it. However, we did mention it with some discussion in our paper (Lines 41-44, 115-118). EMO uses an image diffusion model based on Stable Diffusion to generate talking face videos, which is a significantly different technique. It can generate high-quality videos but suffers from heavy computation and slow generation speed compared to ours.
In the attached one-page PDF, we provide our results, including animations, on some samples from EMO's official website (we are unable to provide video links per the rebuttal policy). As shown in Figure I, our method works consistently well on EMO's demonstrated cases and delivers vivid talking head videos. It is obvious that EMO has smaller head motion compared to ours, perhaps due to the constraint of the face region mask it uses. Also, EMO seems less robust than ours in some cases, with artifacts such as abrupt expression changes, inaccurate lip sync, and subtle texture-detail flickering occasionally appearing upon close inspection (note that their reported average lip-sync score is significantly lower than ours). On the other hand, EMO's video quality is slightly higher than ours in terms of sharpness, owing to their use of a large and powerful image generation foundation model.
Response to Limitations: Thank you for the comments. We plan to handle upper-body/full-body generation and explore more explicit 3D representations in our future work (both projects are ongoing).
We hope we have addressed your questions. If not, please let us know your remaining concerns during the discussion period.
After reading the authors' rebuttal, most of my doubts have been eliminated. I would like to ask how scalable the VASA-1 method is and whether it can be applied to full-body generation. Compared to face generation, generating a natural full body is more complex and difficult.
Thank you for the further comments; we are glad to see that our response eliminated your doubts. Regarding scaling VASA-1 to body generation, the problem is indeed more complex and difficult. However, we believe the idea of VASA, i.e., generating conversational human behavior holistically in a compact, ID-agnostic latent space and then generating the images, applies to the body as well. We will pursue this in future work and keep the community updated on progress and milestones.
This paper introduces a two-stage talking head method that can generate impressive talking faces. It includes 1) a diffusion-based model to generate implicit facial dynamics and head movements from audio and additional conditions, and 2) a modified 3D-aided face reenactment decoder for generating faces from the latent space. The method delivers high video quality with fast inference speed.
Strengths
- Although many works utilize diffusion models to map audio to intermediate facial features, VASA-1 demonstrates excellent engineering and generation capabilities, achieving appealing results.
- The method surpasses real-time speed at a resolution of 512x512, with a low startup cost and fast ID switching, leaving an impressive effect.
- The method outperforms existing comparative methods in terms of visual effects and numerical results for video realism and audio-visual synchronization.
Weaknesses
- The 3D-aided face reenactment framework stage should be crucial for the overall method. However, some descriptions are too brief and vague, making them hard to follow.
- The paper's explanation of the fusion method for condition signals in the Diffusion Transformer is confusing and needs more specific details.
- The comparison methods in the paper lack implementation details. Considering the different scales of training data for various methods, are the comparison results in the table fair?
Questions
- How does the number of layers in the 8-layer transformer encoder affect the results in the paper?
- Will the proposed CAPP in the paper be open-sourced?
- Does the 3D-aided face reenactment part of the method use distillation to speed up? How can the inference speed of MegaPortraits be accelerated?
- What are the parameter counts for each stage of the model?
Limitations
The supplementary materials of the paper include relevant discussions.
Response to W1: Thank you for the comment. We will add more details of the 3D-aided face latent model to the main paper and appendix. Our network architecture follows MegaPortraits [18], where details can be found. To achieve disentanglement, we modified the loss functions and incorporated the cross-transfer consistency loss and the ID-disentanglement loss (Lines 149-160). These loss modifications are critical (Lines 344-350 and Figure A.8), without which we could never reach the current generation quality, especially the liveliness with nuanced emotions. Again, we will add more details and further improve the clarity as per your suggestion.
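For intuition, here is a hedged sketch of what cross-transfer-style disentanglement losses can look like; the exact formulation in the paper (Lines 149-160) may differ. The module names (latent encoder E, decoder G, face-ID embedder F_id) are placeholders for illustration only.

```python
import torch.nn.functional as F

def cross_transfer_consistency(E, G, img_a, img_b):
    # Two frames of the SAME identity: swapping their motion latents and
    # re-encoding the result should recover the swapped motion code.
    app_a, mot_a = E(img_a)
    app_b, mot_b = E(img_b)
    swapped_ab = G(app_a, mot_b)                 # a's appearance, b's motion
    _, mot_ab = E(swapped_ab)
    return F.l1_loss(mot_ab, mot_b)

def id_disentanglement(E, G, F_id, img_a, img_c):
    # Frames of DIFFERENT identities: transferring c's motion onto a should
    # not change a's identity embedding.
    app_a, _ = E(img_a)
    _, mot_c = E(img_c)
    transferred = G(app_a, mot_c)
    return 1.0 - F.cosine_similarity(F_id(transferred), F_id(img_a)).mean()
```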
Response to W2: Different conditional signals are directly concatenated with noise along the temporal dimension as the input to the transformer.
Response to W3: First, please note that different methods in the literature may have used different data for training. We are unable to compare with these methods using exactly the same data and have simply followed the practice of running the trained models for comparison.
In terms of training data scale, our model is trained on ~500k clips (2-10 seconds each). To validate the data scale influence and compare it with previous methods at similar scales, we additionally trained a model using only 10% of the data (i.e., 50k clips). As shown in Table I in the attached PDF, the model trained with 10% of the data achieves comparable audio-lip and audio-pose synchronization to the full-dataset model, though the FVD and p metrics are not as good. This shows that our method performs well even with much less data, and more data enhances the motion diversity.
Regarding the compared methods, Audio2Head used >70k data clips for training but clearly underperformed compared to our model trained with 50k clips. MakeItTalk and SadTalker used very small subsets of VoxCeleb for training, but there is no clear evidence that increasing their data would improve their performance significantly or even bring any positive consequence; we explain the reasons as follows. MakeItTalk uses an LSTM to map audio features to landmark offsets deterministically, which may struggle to model complex data distributions and one-to-many mappings as training data increases. SadTalker assigns a style code to each identity to generate head poses, but more data will introduce more diverse head motion patterns for the same identity, which a shallow VAE with a condition code might not be able to model effectively. Our model with 10% of the data still significantly outperforms these methods on all metrics assessing synchronization, motion intensity, and video quality.
Response to Q1: We set the transformer layer number to 8 as we found it produces good results while enabling the whole algorithm to run in real time on a consumer-grade GPU. We didn't explore more layers or larger model size because real-time efficiency is a key factor we want to achieve. We presume that a larger model size will further improve the performance because our current model is still small, and we'll further explore this in our future work.
Response to Q2: Yes, we will soon release the CAPP model, which we believe fills a missing piece (an audio-pose alignment metric) in talking face generation research and will be valuable to the community.
Response to Q3: No, we did not use distillation or any other strategy to speed up the 3D-aided face encoder and decoder. These models are naturally small and run very fast. Note that we only need to run the encoder once, so essentially only the decoder needs to run to generate each video frame.
Response to Q4: The parameter counts of our 3D-aided face latent model and diffusion transformer model are about 200M and 29M respectively.
We hope we have addressed your questions. If not, please let us know your remaining concerns during the discussion period.
This paper aims to effectively and efficiently generate high-fidelity audio-driven talking head videos. To improve performance and efficiency, the authors have designed a Diffusion Transformer model within the latent space of motion signals, encompassing facial dynamics and head movements. Additionally, they propose a data-driven metric named Contrastive Audio and Pose Pretraining.
Strengths
- The paper applies the diffusion model to the task of generating audio-driven talking head videos, innovatively defining the diffusion model within the latent features of motion rather than those of the image, which is quite interesting.
- The paper is well-written and easy to follow, with detailed experiments that convincingly demonstrate the effectiveness of the proposed method.
Weaknesses
The primary concern is the paper's contribution, as the realism and liveliness of the generated videos could be attributed to the performance of MegaPortraits. MegaPortraits' encoders effectively learn latent motion and appearance representations, supported by robust 3D warping generators and an image generator that ensures high-quality outputs. VASA-1, in a way, learns to generate latent motion representations akin to those in MegaPortraits through audio inputs. Despite this dependency, the method performs well overall. Therefore, my overall assessment leans towards accepting it, albeit with some reservations.
Questions
- At inference time, how does the model generate condition signals like the main eye gaze direction and head-to-camera distance, given that the driving signal is only audio?
- This does not decrease the novelty of this work. However, a quantitative comparison between VASA-1 and EMO would be quite interesting. Given that EMO's code is unavailable, leveraging the image and audio from their officially provided videos for comparison is encouraged.
Limitations
Yes.
Response to W1 (contributions): Thank you for the comment. In response to your question on our technical contributions and relationship with MegaPortraits, we would like to emphasize two aspects. First, our primary motivation is to model human conversational facial behavior (facial dynamics and head movements) holistically using a diffusion model, in a latent space that is agnostic to ID and appearance. This is our core innovation and, to the best of our knowledge, no previous method has done this (it differs from the trends of further factor disentanglement and direct image generation; see our discussion in Lines 47-55, 103-111). Second, in pursuit of this goal and to build a disentangled latent space, we did find the 3D-aided representations to be promising, especially in terms of expressiveness, and hence chose to leverage them. However, they cannot meet our requirement of effective disentanglement. We made some insightful and provably critical modifications (Lines 149-160, 344-350, and Figure A.8), without which we could never reach the current generation quality, especially the liveliness with nuanced emotions. We may have underemphasized the importance and contribution of these modifications, and we will revise our presentation in the revision.
Apart from the two main technical contributions, our paper also offers others, such as the design of face-factor-conditioned diffusion training and the CAPP model, which fills a missing piece (a pose-audio alignment metric); these are also novel and valuable to the community.
Response to Q1: The extra condition signals, such as the main eye gaze direction and head-to-camera distance, are optional and are provided by users. If not given, we can either set them to default values (e.g., a forward-looking eye gaze and the average head-to-camera distance of the training data; see Lines 220-222) or leave them blank for unconditional generation.
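A tiny illustration of this fallback behavior follows; all names and default values are hypothetical, chosen only to show how optional condition signals could be resolved at inference time when the user supplies only audio.

```python
def resolve_conditions(gaze=None, distance=None, train_avg_distance=0.6):
    return {
        # default to a forward-looking gaze direction (assumed convention)
        "gaze": gaze if gaze is not None else (0.0, 0.0),
        # fall back to an assumed training-set average head-to-camera distance
        "distance": distance if distance is not None else train_avg_distance,
    }
```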
Response to Q2: Thank you for the suggestion. We have run our method on the images and audio clips from EMO's official website as per your suggestion. Some visual results, including animations, can be found in the one-page PDF provided on this page (we are unable to provide video links per the rebuttal policy). EMO is a concurrent work, which we mentioned and discussed in the related work section. It uses an image diffusion model based on Stable Diffusion to generate talking face videos, which is a significantly different technique. EMO can generate high-quality videos but suffers from heavy computation and slow generation speed compared to ours.
As shown in Figure I of the attached PDF, our method works consistently well on EMO's demonstrated cases and delivers vivid talking head videos. It is obvious that EMO has smaller head motion compared to ours, perhaps due to the constraint of the face region mask it uses. Also, EMO seems less robust than ours in some cases, with artifacts such as abrupt expression changes, inaccurate lip sync, and subtle texture-detail flickering occasionally appearing upon close inspection (note that their reported average lip-sync score is significantly lower than ours). On the other hand, EMO's video quality is slightly higher than ours in terms of sharpness, owing to their use of a large and powerful image generation foundation model.
We hope we have addressed your questions. If not, please let us know your remaining concerns during the discussion period.
Thank you to the authors for their feedback and efforts. After reviewing the rebuttal, I note that the authors have addressed some of my concerns, which leads me to maintain my initial rating. However, I recommend that the final version of the paper include more detailed explanations, particularly regarding the contributions and the reasons behind the statement, "We did find the 3D-aided representation to be promising." These details will enhance the clarity and impact of the work.
Thank you for your further comment and suggestion. We will incorporate more details including those suggested by the reviewers.
We sincerely thank all reviewers for the valuable comments and suggestions. We are encouraged by the reviewers' acknowledgments that our paper: "innovatively defining the diffusion model within... which is quite interesting", "convincingly demonstrate the effectiveness..." (Reviewer WJD9); "demonstrates excellent engineering and generation capabilities", "leaving an impressive effect" (Reviewer wMXj); "visual presentation is excellent and leaves a lasting impression", "has made significant progress ..." (Reviewer 1diB); "qualitative results are very impressive" (Reviewer xKzZ).
We'd like to reiterate our novelty and contributions here:
- We propose diffusion-based holistic modeling of human conversational facial behavior (facial dynamics and head movements), in a latent space that is agnostic to ID and appearance. This is our core innovation and, to the best of our knowledge, no previous method has done this. It is a new approach that differs from the recent trends of further facial factor disentanglement and direct image generation (see our discussion in Lines 47-55, 103-111, 115-118).
- We build a highly disentangled latent space to achieve the aforementioned goal. Although we leveraged existing 3D-aided representations and models due to their high expressiveness, they cannot meet our requirement of effective disentanglement. We made some insightful and provably critical modifications (Lines 149-160, 344-350, and Figure A.8), without which we could never reach the current generation quality, especially the liveliness with nuanced emotions.
- We offer a few other supporting contributions including a controllable diffusion framework that enables flexible control of different face properties, and a new data-driven metric CAPP score for evaluating the alignment between audio and head pose.
- We advance audio-driven talking face generation to a new level of realism and liveliness not achieved before. Our work marks the dawn of real-time lifelike avatars which have the potential to reshape human-human and human-AI interactions across broad application domains.
We address each reviewer's questions and concerns under their respective reviews.
The attached one-page PDF contains the following figure and table contents:
- Visual comparison with EMO on EMO's official videos (Figure I)
- Sampled images from VoxCeleb2 to demonstrate the varied video quality and explain why we did not evaluate the FVD on it (Figure II)
- Training data scale ablation of our method and the requested FVD score on VoxCeleb2 (Table I)
Claimed contributions
- Diffusion-based learning of a latent space for human facial dynamics and head movements, built on the MegaPortraits backbone with specially designed losses to disentangle facial and head motion and to disentangle identity from motion
- Demonstration of the above with audio-driven human face video generation from a single image, in real time
Strengths
- Superb results, both quantitatively and qualitatively
- Novel insights into human face understanding with the proposed disentanglement of the face and head movements
- New metric "CAPP" to evaluate the alignment of audio and head poses
- Real-time performance with a compact latent diffusion model
Weaknesses
- Largely depends on MegaPortraits
Even with its dependence on MegaPortraits, the reviewers and I all agree that the paper is insightful and impactful for human face video generation and human face understanding, and we thus recommend acceptance as an oral.