PaperHub
Rating: 6.1/10
Poster · 4 reviewers (scores: 3, 3, 4, 3; min 3, max 4, std 0.4)
ICML 2025

MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance

OpenReview · PDF
Submitted: 2025-01-20 · Updated: 2025-07-24

Abstract

Keywords
Human Motion Video Generation

Reviews and Discussion

Review (Rating: 3)

This paper presents MimicMotion, a generative approach for animating a human in a single image using a 2D pose sequence. It builds on a pre-trained video diffusion model that is largely kept fixed, training only select components, including PoseNet and U-Net modules. The PoseNet processes the 2D pose sequence, and its output is integrated into the U-Net to combine pose information with the human appearance from the image. The paper introduces several technical contributions: 1) confidence-aware pose guidance to account for the reliability of estimated 2D poses, 2) hand region enhancement to emphasize the loss around hand regions, and 3) progressive latent fusion for generating long videos.
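
For concreteness, here is a minimal sketch (not the authors' implementation) of how confidence-aware pose conditioning could be realized: each keypoint is rendered as a Gaussian heatmap whose amplitude is scaled by the estimator's confidence, so unreliable joints contribute weaker guidance. The function name and the Gaussian width are hypothetical.

```python
import numpy as np

def render_confidence_heatmaps(keypoints, confidences, height, width, sigma=3.0):
    """Render one Gaussian heatmap per 2D keypoint, scaled by detection confidence.

    keypoints:   (K, 2) array of (x, y) pixel coordinates
    confidences: (K,) array of scores in [0, 1] from the pose estimator
    Returns a (K, H, W) array that a PoseNet-style encoder could consume.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    heatmaps = np.zeros((len(keypoints), height, width), dtype=np.float32)
    for k, ((x, y), conf) in enumerate(zip(keypoints, confidences)):
        dist2 = (xs - x) ** 2 + (ys - y) ** 2
        # Low-confidence joints produce weaker guidance, so unreliable or
        # ambiguous detections have less influence on the generated frames.
        heatmaps[k] = conf * np.exp(-dist2 / (2.0 * sigma ** 2))
    return heatmaps

# Example: one confident joint and one uncertain joint
kps = np.array([[64.0, 48.0], [30.0, 90.0]])
scores = np.array([0.95, 0.35])
maps = render_confidence_heatmaps(kps, scores, height=128, width=128)
print(maps.shape)  # (2, 128, 128)
```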

update after rebuttal

The authors addressed most of my concerns. I raised my score.

Questions for Authors

Will the collected training and evaluation videos be public?

Claims and Evidence

Yes

Methods and Evaluation Criteria

Yes

Theoretical Claims

Yes

Experimental Designs Or Analyses

Yes

Supplementary Material

Yes

Relation to Broader Scientific Literature

They can be related to various human-related computer vision tasks.

Essential References Not Discussed

No

Other Strengths and Weaknesses

Strengths

The paper addresses practical challenges in diffusion-based human animation methods. The three proposed techniques—confidence-aware pose guidance, hand region enhancement, and progressive latent fusion—effectively enhance the quality of animated human videos.

Weaknesses

  1. Is confidence-aware pose guidance the optimal approach? While incorporating confidence information can mitigate issues like depth ambiguity and incorrect poses, alternative approaches might be more effective. For instance, integrating depth maps or part segmentations as additional inputs to PoseNet could address both depth ambiguity and incorrect pose estimation more comprehensively. With the great success of vision foundation models, we can relatively easily obtain depth maps and part segmentations from a single image using Depth Anything v2 and Sapiens. Since 2D poses lack depth information, relying solely on confidence-aware guidance may not be sufficient.

  2. Confusing impact of hand enhancement on overall metrics. In Table 2, the hand region enhancement improves all evaluation metrics, including those unrelated to hands. Since this method only assigns higher loss weights to hand regions without affecting other body parts, it is unclear why overall metrics such as PSNR see significant improvement. While Figure 8 demonstrates enhanced focus on hand regions, the broader impact on all human parts needs further clarification.

  3. Progressive latent fusion results in lower quantitative performance. Table 2 indicates that this technique performs worse in three out of four metrics. However, the paper lacks a detailed analysis of why this occurs.

  4. Post-processing methods may obscure actual performance. The supplementary material states that face swapping and frame interpolation were applied to the generated videos, which can significantly influence the authenticity and temporal smoothness of the results. The paper should explicitly state which face-swapping and frame-interpolation techniques were used. Furthermore, relying on post-processing might hide weaknesses of MimicMotion, such as difficulties in maintaining identity consistency—an inherent limitation of diffusion-based human animation. The authors should provide extensive qualitative results without any post-processing to ensure a fair evaluation. Additionally, the main manuscript's figures may also be affected by these techniques. For example, Figure 4 shows clothing color changes (white turning into blue), highlighting potential challenges in preserving identity.

Other Comments or Suggestions

Please see weaknesses.

Author Response

1. Is confidence-aware pose guidance the optimal approach?

We evaluated depth information as an additional input and found it does not significantly improve the resolution of front-back ambiguity or incorrect pose issues beyond what our confidence-aware approach already achieves. Depth estimation introduces its own prediction errors and computational overhead without commensurate benefits. While alternative guiding modalities have been explored in prior work (e.g., Champ), each presents inherent limitations—depth varies with camera distance, and part segmentations may compromise action template transferability across different contexts. The reviewer's suggestion, though theoretically plausible, lacks empirical substantiation. We maintain that our approach represents an effective balance of performance and efficiency within the scope of our research objectives.

2. Why hand region enhancement improves all evaluation metrics?

We respectfully note that our approach does not solely assign higher loss weights to hand regions. Rather, it strategically enhances weights for clear hand regions, encouraging the model to generate more defined hands. Given the iterative nature of the diffusion process, we observe that improved hand clarity appears to serve as a quality signal that positively influences overall video generation quality across successive denoising steps.
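
A hedged sketch of the kind of region-weighted training objective described above, assuming a soft hand mask derived from confident hand keypoints; the weighting scheme and the `hand_weight` value are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def region_weighted_diffusion_loss(noise_pred, noise_target, hand_mask, hand_weight=2.0):
    """Per-element MSE where confident hand regions receive a larger weight.

    noise_pred, noise_target: (B, C, T, H, W) predicted / target noise
    hand_mask:                (B, 1, T, H, W) soft mask in [0, 1], nonzero
                              around hand keypoints (scaled by confidence)
    """
    per_element = F.mse_loss(noise_pred, noise_target, reduction="none")
    # Base weight of 1 everywhere; hand regions are boosted toward hand_weight,
    # so nothing outside the mask is down-weighted.
    weight = 1.0 + (hand_weight - 1.0) * hand_mask
    return (weight * per_element).mean()
```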

3. Progressive latent fusion results in lower quantitative performance.

To address this, we have supplemented the website with comparison videos corresponding to Fig. 9's progressive latent fusion to better illustrate the qualitative differences.

The discrepancy in quantitative metrics stems from inherent limitations in evaluation metrics: SSIM and PSNR are inadequate for evaluating our progressive latent fusion approach, as they assess image-level quality rather than transition smoothness between denoising windows. Similarly, Fréchet distance-based metrics rely on video classification models pre-trained on datasets like Kinetics, which primarily leverage spatial rather than temporal features for classification. These datasets can often be classified accurately from single frames, indicating their limited sensitivity to temporal consistency issues.

4. Post-processing

Thank you for raising this concern. We have addressed this by replacing all videos on our website with versions without post-processing.

We emphasize that Figure 4 and all other comparison and ablation results in the paper and on the website are presented without post-processing. The identity preservation challenges visible in our results and the baselines represent the actual performance of these methods, not artifacts of face swapping or other post-processing techniques. The clothing color shift observed in Figure 4 occurs to varying degrees across multiple methods, including ours, confirming the fairness and transparency of our evaluation.

Review (Rating: 3)

This paper proposes an SVD-based human video animation framework. The major contributions of this work are 1) a confidence-aware strategy to reduce the negative impact of inaccurate pose skeleton detection, 2) a local hand region enhancement strategy to alleviate hand motion blur, and 3) a long-video generation strategy based on weighted frame fusion.

Questions for Authors

Please see the "Experimental Designs Or Analyses" section.

Claims and Evidence

The contributions summarized in the introduction section are clear and supported by experimental evaluation, including baseline comparisons and an ablation analysis of each module.

Methods and Evaluation Criteria

Different from prior works with ReferenceNet architecture and AnimateDiff motion modules, MimicMotion adopted SVD as the diffusion backbone model, which makes sense for human video animation, especially for the motivation of ensuring the temporal smoothness of generated videos. The benchmark dataset, the TikTok dataset, is broadly used in previous works in this area.

Theoretical Claims

The major theoretical claim in this paper, progressive latent fusion for long video generation, is supported by clear visualization in Figure 3 and detailed explanation in Algorithm 1.
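
For intuition, a rough sketch of the windowed-blending idea (not Algorithm 1 verbatim): latents from overlapping denoising windows are combined with position-dependent weights so that frames near a window boundary are a weighted mix of both windows. The linear ramp below is an assumed weighting shape.

```python
import torch

def fuse_overlapping_windows(window_latents, starts, total_frames, overlap):
    """Blend per-window latents, each of shape (F, C, H, W), into one sequence.

    Frames inside the overlap between consecutive windows get linearly ramped
    weights, so transitions across window boundaries stay smooth.
    """
    C, H, W = window_latents[0].shape[1:]
    fused = torch.zeros(total_frames, C, H, W)
    weight_sum = torch.zeros(total_frames, 1, 1, 1)
    for latents, start in zip(window_latents, starts):
        n = latents.shape[0]
        w = torch.ones(n)
        ramp = torch.linspace(1.0 / overlap, 1.0, overlap)
        w[:overlap] = ramp           # fade in at the head of the window
        w[-overlap:] = ramp.flip(0)  # fade out at the tail
        w = w.view(-1, 1, 1, 1)
        fused[start:start + n] += w * latents
        weight_sum[start:start + n] += w
    return fused / weight_sum  # normalize so each frame's weights sum to 1
```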

Experimental Designs Or Analyses

I appreciate the authors' efforts in providing baseline comparisons and visualizations in both the main paper and the anonymous website. However, I have several major concerns regarding the experimental design:

  1. Out-of-Domain Generalization: Section 4.1 states that the model is trained exclusively on human dancing videos, with the UNet backbone fine-tuned. For the out-of-domain visualizations featuring cartoon and animal identities, was the MimicMotion pipeline also fine-tuned on such data? If so, this should be explicitly mentioned in the Cross-Domain Results section. If not, is there any explanation for the observed generalizability?

  2. Missing Baseline Comparisons: The authors highlight contributions such as local hand pose enhancement and pose guidance using a confidence-aware strategy. However, a more thorough discussion and evaluation against prior works with similar motivations are needed. Comparisons with methods like DisPose (Sparse and Dense Motion Field), Champ (SMPL with depth/normal/semantic maps), and VividPose (SMPL-X-based pose) would strengthen the claims.

  3. Supplementary Website Demos: The website notes that videos are post-processed with face swapping and frame interpolation. However, these post-processing steps are not mentioned in the main paper. This should either be clearly discussed, or the authors should provide original videos generated by MimicMotion to demonstrate the out-of-domain real-human animations.

  4. Ablation Analysis of SVD and SD Backbone: MimicMotion employs SVD instead of the widely used ReferenceNet + AnimateDiff architecture in the baselines. This raises uncertainty about the effectiveness of the proposed progressive latent fusion. Is the performance gain attributable to the proposed strategy, or is it primarily due to the choice of video backbone? I appreciate the visualization in Figure 9, but this should be presented as a video comparison between 1) ReferenceNet + proposed strategy, 2) SVD, and 3) SVD + proposed strategy on the supplementary website to better observe the difference.

Supplementary Material

I've checked both the supplementary material in the PDF and the anonymous website.

Relation to Broader Scientific Literature

The key contribution of this paper is the SVD-based long-video generation pipeline with smooth motion consistency. This work explores the potential use of a pre-trained video diffusion backbone for human video animation, which is a slightly novel architectural design compared to previous works.

Essential References Not Discussed

Please see the "Experimental Designs Or Analyses" section.

Other Strengths and Weaknesses

Please see the "Experimental Designs Or Analyses" section.

Other Comments or Suggestions

Please see the "Experimental Designs Or Analyses" section. I'm more than willing to raise my score if my concerns are addressed.

Author Response

1. Out-of-Domain Generalization

The model is trained exclusively on human dancing videos. There are two reasons for this generalizability:

First, the SVD backbone can generate cartoon and animal videos. Our fine-tuning enhances the model with pose controllability, and since the pose feature space of humans, cartoons, and animals may share a common subspace that is preserved in the fine-tuned model, human control signals can also guide cartoons and animals. The results in AnimateAnyone and MagicPose also verify this point.

Second, we map the limb lengths of the pose template to the reference character following MuseV, making the pose guidance signal closer to the reference character. This helps the model preserve the character's appearance and also helps to adapt one pose to another with different body proportions, enhancing generalization.
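
A simplified sketch of this limb-length mapping under stated assumptions: each bone of the driving pose keeps its direction but is rescaled to the reference character's bone length, traversing the skeleton from the root outward. The skeleton ordering and function name are hypothetical, not taken from the paper or MuseV.

```python
import numpy as np

def retarget_limb_lengths(driving_kps, reference_kps, bones, root=0):
    """Rescale each driving-pose bone to the reference character's length.

    driving_kps, reference_kps: (K, 2) arrays of 2D keypoints
    bones: list of (parent, child) joint-index pairs ordered root-outward
    """
    out = driving_kps.copy()
    for parent, child in bones:
        direction = driving_kps[child] - driving_kps[parent]
        norm = np.linalg.norm(direction) + 1e-8
        ref_length = np.linalg.norm(reference_kps[child] - reference_kps[parent])
        # Keep the driving motion's bone direction, adopt the reference's bone length.
        out[child] = out[parent] + direction / norm * ref_length
    return out
```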

We thank the reviewer for raising these concerns and will add these explanations in the revision.

2. Missing Baseline Comparisons

We have enhanced our evaluation by adding comprehensive comparisons with Champ both quantitatively (see the table below) and qualitatively (refer to the Comparisons section on our website). Regarding DisPose, we respectfully note that it was released within 4 months of our submission date, which falls outside the comparison requirements according to the conference guidelines. For VividPose, despite our efforts, the lack of an open-source release makes faithful reproduction challenging.

Our method specifically focuses on 2D DWPose-guided video generation, which offers practical advantages by eliminating SMPL and other inference overhead, and better handling out-of-domain characters where SMPL-based approaches face significant limitations.

Method   FID-VID↓   FVD↓   SSIM↑   PSNR↑
Champ    17.3       736    0.686   15.0
Ours     9.3        594    0.795   20.1

3. Supplementary Website Demos

Thank you for the suggestion. We have replaced all demos on the website with their raw versions without post-processing. To clarify, no post-processing was applied to the comparison and ablation cases on the original web page; post-processing was used there only to showcase better visual effects in the demos and does not affect the contributions of this paper.

4. Ablation Analysis of SVD and SD Backbone

Thank you for your suggestion. We have added videos corresponding to Fig. 9's progressive latent fusion on the website.

Regarding the SD Backbone, we believe that progressive latent fusion would not provide significant improvement. Videos generated with the SD backbone exhibit high-frequency flickering within a single window (see baseline comparison on the website). Progressive latent fusion only enhances transition smoothness between denoising windows, without improving smoothness within windows. Therefore, the SD backbone would likely still suffer from flickering. We have added a Moore AnimateAnyone (SD backbone) + proposed strategy result on the website, which verifies this point.

Like other research in this field, our paper deliberately focuses on a specific backbone (SVD), which is both well-pretrained and open-source. The issues with the SD Backbone fall beyond the scope of our current research objectives, as it is standard practice to focus on optimizing a single backbone network.

Reviewer Comment

I appreciate the authors’ effort in providing additional visualizations for the ablation analysis and the quantitative comparison with Champ. However, after reviewing comments from other reviewers, I share the same concerns regarding identity preservation as Reviewer 1Stn. From the updated demo, it is evident that the generated face identities do not align well with those in the reference image. Could the authors provide further explanation on the potential causes of this issue, and discuss any possible solutions from a model architecture perspective (excluding post-processing)?

Additionally, I have a follow-up question regarding out-of-domain generalization. Given that the SVD UNet has been fine-tuned solely on human dancing videos, why does the backbone still retain a shared subspace that is preserved during fine-tuning? Prior works based on reference networks—such as Animate Anyone, MagicAnimate, MagicPose, and Champ—did not fine-tune the SD UNet backbone. As a result, I believe the generation results from those models may not verify the claimed point regarding subspace preservation.

I would be happy to consider raising my score if these concerns are adequately addressed.

Thanks.

Author Comment

Thank you for your timely and constructive feedback. Below is a concise response to the raised concerns:

1. Identity preservation

We believe the observed identity preservation issue is primarily due to the difficulty of the cases shown in the demonstration:

Face Size and Pose Variability: Note that compared to Animate Anyone, MagicPose, Magic Animate, and Champ, our cases feature more full-body images and large dance movements. In these scenarios, faces occupy very few pixels and temporal changes are significant, so the inherent difficulty of identity preservation is higher. On the other hand, our identity preservation is better for large faces and moderate movements (refer to the last row of Sec. Showcases: Dancing & Talking on the website). Additionally, the fair comparisons with baseline methods (see the Comparisons section on the website) do not show clear evidence of worse identity preservation.

Pose Template Leakage: The driving DWPose sequences used in our demo come from individuals different from those in the reference images. DWPose contains key facial landmarks that may cause the generated face shapes to resemble those of the pose template, deviating from the reference face. We would like to clarify that this issue is not unique to our approach.

Post-Processing Solution and Training Data Bias: Since identity preservation can be addressed through face-swapping, it was not the primary focus of our model tuning. Additionally, our training dataset contains 4,436 clips from a limited number of identities, which may introduce a slight bias towards these identities.

For possible solutions, we have the following thoughts:

  1. Introducing a dedicated face encoder (e.g., InstantID or IP Adapter) to better inject facial features into the denoising model may improve identity preservation.

  2. Applying the proposed regional loss enhancement on facial regions during training may also improve the model’s capability in face generation.

  3. Increasing the resolution or cropping the person out may also improve identity preservation, as it effectively makes the face larger and makes generation easier.

2. Out-of-domain generalization, subspace preservation, and fine-tuning

Clarification on Backbone Freezing: It is important to note that Animate Anyone does not freeze the SD UNet backbone during training. This can be verified in Section 3.3 of the original paper and in the 3rd-party implementation here. Additionally, Champ also fine-tunes the UNet backbone as described in Section 3.4 of their paper, and other methods tune injected temporal layers in the UNet. Therefore, contrary to the assertion that these methods do not fine-tune the backbone, they indeed adjust the weights of the backbone during training.

Impact of Fine-Tuning on Subspace Preservation: Our approach uses a very small learning rate (1e-5), which minimizes the disturbance to the pre-trained weights. To quantify this, we calculated the relative L2 norm of the difference between the pre-trained and fine-tuned model weights: $\|\theta_\text{ours} - \theta_\text{svd}\| / \|\theta_\text{svd}\| = 0.019$. This suggests that the fine-tuning process has only a minimal effect on the original model, preserving much of the learned subspace. Consequently, we hypothesize that this is the reason why the model retains the capacity for generating diverse identities and appearances.
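
The reported quantity is straightforward to reproduce; a minimal sketch, assuming both checkpoints load into PyTorch state dicts with matching keys (file names below are placeholders):

```python
import torch

def relative_weight_change(state_finetuned, state_pretrained):
    """Compute ||theta_ours - theta_svd|| / ||theta_svd|| over shared parameters."""
    diff_sq, base_sq = 0.0, 0.0
    for name, w_pre in state_pretrained.items():
        w_ft = state_finetuned[name].float()
        w_pre = w_pre.float()
        diff_sq += torch.sum((w_ft - w_pre) ** 2).item()
        base_sq += torch.sum(w_pre ** 2).item()
    return (diff_sq ** 0.5) / (base_sq ** 0.5)

# Hypothetical usage; checkpoint paths are placeholders:
# ratio = relative_weight_change(torch.load("unet_finetuned.pt"), torch.load("unet_svd.pt"))
```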

Parallel to LoRAs: This behavior aligns with the principles observed in fusing multiple LoRAs. Specifically, with a single LoRA applied, the weights of the UNet are adjusted, but the model still maintains its ability to generate diverse outputs and compatibility with other LoRAs / ControlNets. In our case, we observe similar behavior in terms of out-of-domain generalization.

We would be happy to incorporate these discussions into the manuscript. We hope this clarification adequately addresses your concerns. Should you have any further questions, please do not hesitate to reach out.

Review (Rating: 4)

This research focuses on the task of video generation, aiming to produce videos where the subject replicates a reference pose based on a single input image and specified pose conditions.

The paper introduces three significant advancements over existing baseline methods:

  1. Integrating confidence awareness to reduce the influence of uncertain poses, thus mitigating the detrimental impact of incorrect pose conditions during model training.
  2. Loss amplification for hand regions to enhance training weights in high-confidence situations.
  3. Leveraging progressive latent fusion to ensure smooth transitions across multiple video segments, enabling the generation of longer and more cohesive videos.

Comprehensive evaluations reveal that the proposed approach surpasses baseline methods both qualitatively and quantitatively on established public benchmarks.

update after rebuttal

I have reviewed the rebuttal text provided by the authors. Since there was a clear response to my concerns and my initial score was to recommend acceptance, I will maintain my score unchanged.

Questions for Authors

  • Is the image encoder in Figure 2 CLIP?
  • The authors mention in line 206 that "Pose guidance is not added to every U-Net block". What kind of degradation effects would occur if pose control were introduced in multiple layers of the UNet structure? Can comparative examples be provided?

Claims and Evidence

Yes

Methods and Evaluation Criteria

Yes

Theoretical Claims

No Theoretical Claims.

Experimental Designs Or Analyses

The paper presents a thorough and meticulous set of experiments, with a particular focus on the ablation studies examining the three proposed improvements. Both the qualitative and quantitative findings strongly validate the authors' claims, showcasing the effectiveness of the proposed approach.

Supplementary Material

Yes, I reviewed all supplementary material.

Relation to Broader Scientific Literature

None

Essential References Not Discussed

  • Scalable Diffusion Models with Transformers (DiT) [ICCV2023]
    • Although the architectures differ (U-Net vs. DiT), I still hope the authors mention or discuss this important foundation model and its DiT-based derivatives for controllable video generation in the Related Work section.

Other Strengths and Weaknesses

The article presents a clear motivation and outlines its proposed improvements in a well-defined manner. Each improvement point is effectively articulated, and the methodology is introduced coherently. Overall, the article is well-crafted, with a strong focus throughout, and the experimental results are highly persuasive. No significant weaknesses are apparent.

Other Comments or Suggestions

None

Author Response

1. Essential References Not Discussed: DiT

Thank you for your effort in reviewing our paper. We will revise our manuscript and add a discussion of DiT. Compared with U-Net, DiT is considered a newer and more promising foundation architecture, especially for video generation tasks, where it demonstrates strong scalability.

2. Is the image encoder in Figure 2 CLIP?

Yes, it is a typical CLIP image encoder.

3. The authors mention in line 206 that "Pose guidance is not added to every U-Net block". What kind of degradation effects would occur if pose control were introduced in multiple layers of the UNet structure? Can comparative examples be provided?

We apologize for the confusion. Here, we mean that pose guidance is added only to the output of the first convolution layer of the U-Net, as we found this to be sufficient for pose control in our experiments.

This design choice is based on two hypotheses. First, introducing pose sequence information extracted without temporal interaction could potentially disrupt the spatio-temporal representations within subsequent U-Net layers. Second, since our image-to-video base model was pre-trained without posture information, excessive integration of pose sequence data may degrade the pre-trained performance.
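
A hedged sketch of this injection point, assuming the PoseNet output is already shaped to match the first convolution's output; module names are placeholders rather than the actual MimicMotion code.

```python
import torch.nn as nn

class PoseGuidedFirstConv(nn.Module):
    """Inject pose features only at the output of the U-Net's first convolution."""

    def __init__(self, first_conv: nn.Module, pose_net: nn.Module):
        super().__init__()
        self.first_conv = first_conv
        self.pose_net = pose_net

    def forward(self, latent, pose_frames):
        h = self.first_conv(latent)
        pose_feat = self.pose_net(pose_frames)  # assumed to match h's shape
        # Guidance enters once, at the shallowest layer; deeper blocks are untouched.
        return h + pose_feat
```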

To ensure clarity, we will update the relevant section accordingly.

Reviewer Comment

Thank the authors for providing more information in the rebuttal. I still recommend accepting this paper. I also suggest that the authors incorporate the content promised in the rebuttal into the revised version.

Review (Rating: 3)

Based on the confidence-aware strategy, this work proposes hand region enhancement to alleviate hand distortion, which improves video generation performance. Additionally, the supplementary materials provide generation results in other scenes (such as animals and other styles), verifying the generalization ability of the algorithm. I believe this work has good practicability and the whole paper is easy to read; more ablations could be conducted to validate each part of the contribution.

Questions for Authors

See weaknesses.

Claims and Evidence

The claim of long-video generation should be supported by comparisons with other methods.

Methods and Evaluation Criteria

yes

Theoretical Claims

yes

Experimental Designs Or Analyses

The main results validate the SOTA performance, but more ablations on long-video generation and the effect of hand enhancement should be discussed (see weaknesses & strengths).

Supplementary Material

The supplementary materials provide generation results in other scenes (such as animals and other styles), verifying the generalization ability of the algorithm.

Relation to Broader Scientific Literature

This work shows strong potential for human-based video generation and has good practicability.

Essential References Not Discussed

-StableAnimator: High-Quality Identity-Preserving Human Image Animation

-ControlNeXt: Powerful and Efficient Control for Image and Video Generation

These similar works need to be discussed.

Other Strengths and Weaknesses

1. Based on the confidence-aware strategy, the proposed hand region enhancement alleviates hand distortion. However, I would like to see more detailed discussion beyond the ablation study, such as:

  • Does the hand region enhancement suppress the generation of other areas to some extent?
  • I hope to see whether videos without hand enhancement could achieve better results, and whether the non-hand regions in videos with enhanced hands still maintain good quality.

2. Several similar works are not mentioned; they should be discussed in the Related Work section and compared with the proposed approach in the experiments:

-StableAnimator: High-Quality Identity-Preserving Human Image Animation

-ControlNeXt: Powerful and Efficient Control for Image and Video Generation

3. Regarding the claim of long-video generation, I would like to see a comparison with other methods. For example:
  • How long can other methods generate high-quality videos?
  • It would be useful to show how the performance of the proposed method changes as the video length increases, both quantitatively and qualitatively.

Other Comments or Suggestions

no

Author Response

1. Essential References Not Discussed: StableAnimator, ControlNeXt

Thank you for your effort and valuable feedback during the review process. We will add a discussion of these two related works and revise our manuscript accordingly.

2. The influence of hand region enhancement.

Does the hand region enhancement suppress the generation of other areas to some extent? I hope to see whether videos without hand enhancement could achieve better results, and whether the non-hand regions in videos with enhanced hands still maintain good quality.

Thank you for your valuable suggestion. We have updated the supplementary website with full-frame and video comparisons (see Sec. Ablation). As demonstrated by multiple examples, we observe no significant quality degradation in non-hand regions while maintaining enhanced hand generation.

3. The claim of long-video generation.

How long can other methods generate high-quality videos? It would be useful to show how the performance of the proposed method changes as the video length increases, both quantitatively and qualitatively.

The video lengths supported by video generation base models are as follows: AnimateDiff: 2s, SVD: 3.6s, CogVideoX-5B: 6s, LTX-Video: 10.6s. We discuss long-video generation in a sub-section of the Related Work. Existing methods adopt a windowed generation approach to produce longer videos, which can in principle be extended to arbitrarily long videos; however, abrupt changes at window boundaries remain an issue. In contrast, our proposed progressive latent fusion mitigates this problem.

To further show how performance changes as video length increases, we have added comparative videos to the website. Our method has an independent success rate for each generation window: while failure cases may occur over time, the quality of individual video clips does not deteriorate as the total video duration increases.

Final Decision

This paper presents MimicMotion, a framework for high-quality human motion video generation using confidence-aware pose guidance, hand region enhancement, and progressive latent fusion. Reviewers found the method well motivated with good results. While there were initial concerns about identity preservation, ablation completeness, and comparisons, the authors provided comprehensive responses and updated demos without post-processing. Most reviewers ultimately raised their scores.