PaperHub
Overall rating: 6.0/10 · Poster · 3 reviewers (min 5, max 7, std 0.8)
Individual ratings: 6, 7, 5
Confidence: 3.7 · Correctness: 3.3 · Contribution: 3.0 · Presentation: 3.0
NeurIPS 2024

Motion Consistency Model: Accelerating Video Diffusion with Disentangled Motion-Appearance Distillation

OpenReview · PDF
Submitted: 2024-05-09 · Updated: 2024-11-06
TL;DR

Our motion consistency model not only accelerates text2video diffusion model sampling process, but also can benefit from an additional high-quality image dataset to improve the frame quality of generated videos.

Abstract

Keywords
consistency distillation, video diffusion models, diffusion distillation, text-to-video generation

Reviews and Discussion

Review (Rating: 6)

This work presents a new consistency-based framework for video diffusion model distillation. Specifically, an adversarial loss is leveraged to enhance video quality, and the consistency distillation loss is computed in a motion embedding space to learn video motion patterns effectively. In addition, the authors propose mixed trajectory distillation to ensure better alignment between the training and inference phases. The experimental results demonstrate that the proposed approach produces more visually pleasing results than previous distillation methods.

Strengths

  1. The proposed disentangled motion-appearance distillation is reasonable and effective.
  2. The generated results in Fig. 5 and the supplementary material are very promising.
  3. The quantitative comparisons in Tables 1 and 2 are convincing.

Weaknesses

  1. The adversarial loss is not stable; could the authors employ other alternatives, such as a perceptual loss?

Questions

Could the proposed algorithm achieve satisfactory performance on other video generation tasks, such as Stable Video Diffusion (image-to-video) and AnimateAnyone (human video generation)?

Limitations

The proposed algorithm is not evaluated on high-resolution video diffusion models (e.g., 1024x576 or 768x768).

Author Response

We are grateful for your recognition of our novel contributions and promising results. We follow your advice to conduct the following set of experiments.

Weakness 1: Perceptual loss instead of adversarial loss. Following your suggestion, we conducted experiments comparing a perceptual loss with our adversarial loss, using the ModelScopeT2V teacher on the WebVid mini val set. The results are summarized in the table below. Our observations are: the adversarial loss outperforms the perceptual loss at 1 and 2 sampling steps, benefiting low-step sampling by producing sharper details, while both losses perform similarly at 4 and 8 steps. We do not observe apparent training instability with the adversarial loss, thanks to the well-trained DINO v2 feature extractor and the discriminator gradient penalty loss. We will add more analysis and details in the revised version.

| Method           | FVD @ 1 step | 2 steps | 4 steps | 8 steps | CLIPSIM @ 1 step | 2 steps | 4 steps | 8 steps |
|------------------|--------------|---------|---------|---------|------------------|---------|---------|---------|
| Adversarial loss | 703          | 650     | 767     | 760     | 28.70            | 30.37   | 30.90   | 30.77   |
| Perceptual loss  | 826          | 693     | 748     | 742     | 27.31            | 29.32   | 30.47   | 30.71   |
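For context, below is a minimal PyTorch sketch of the kind of feature-space adversarial objective with a gradient penalty described above. It is an illustration under our own assumptions, not the authors' implementation: the discriminator head, the logistic loss, and the R1 penalty form are placeholder choices, and `frozen_extractor` stands in for a frozen backbone such as DINO v2 that returns [B, feat_dim] features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDiscriminator(nn.Module):
    """Small discriminator head operating on features from a frozen backbone."""
    def __init__(self, feat_dim: int = 768):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(feat_dim, 256), nn.SiLU(), nn.Linear(256, 1))

    def forward(self, feats):
        return self.head(feats)

def discriminator_loss(disc, frozen_extractor, real_frames, fake_frames, r1_weight=1.0):
    # Embed real and generated frames with a frozen feature extractor.
    real_feats = frozen_extractor(real_frames).detach().requires_grad_(True)
    fake_feats = frozen_extractor(fake_frames).detach()
    real_logits = disc(real_feats)
    fake_logits = disc(fake_feats)
    # Non-saturating logistic GAN loss (illustrative choice).
    d_loss = F.softplus(-real_logits).mean() + F.softplus(fake_logits).mean()
    # R1-style gradient penalty on the real branch, as a training stabilizer.
    grad, = torch.autograd.grad(real_logits.sum(), real_feats, create_graph=True)
    return d_loss + r1_weight * grad.pow(2).sum(dim=-1).mean()
```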

Question 1: Results on image-to-video and human video generation. Following your advice, we conducted an image-to-video distillation experiment using Stable Video Diffusion (SVD). The quantitative results on MSRVTT are listed below. We make the following observations: we outperform Euler and AnimateLCM on FVD across most sampling steps, and compared to AnimateLCM, which finetunes the entire SVD (1.6B parameters), we achieve better performance while training only 5.9% of the parameters using a lightweight LoRA (95M parameters).

We will add these results to our revised manuscript and make our SVD-based MCM checkpoint publicly available.

| Method     | # Params | FVD @ 1 step | 2 steps | 4 steps | 8 steps |
|------------|----------|--------------|---------|---------|---------|
| Euler      | -        | 1639         | 1633    | 1268    | 1043    |
| AnimateLCM | 1.62B    | 772          | 442     | 253     | 242     |
| MCM (ours) | 95M      | 749          | 463     | 246     | 235     |
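The parameter-efficiency argument above can be illustrated with a toy LoRA layer. The sketch below is a generic low-rank adapter; the rank, scaling, and wrapped layer size are illustrative assumptions, not the authors' SVD configuration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)            # base (teacher) weights stay frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)         # LoRA starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

base = nn.Linear(1024, 1024)
lora = LoRALinear(base, rank=8)
trainable = sum(p.numel() for p in lora.parameters() if p.requires_grad)
total = sum(p.numel() for p in lora.parameters())
print(f"trainable fraction: {trainable / total:.3%}")   # roughly 1.5% for this toy layer
```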

Due to limited rebuttal time, we demonstrate our method's compatibility with ControlNet for controllable human video generation. Examples are provided in the attached PDF. This integration highlights our MCM’s versatility and potential for broader applications. We will include additional results in the revised version.

Limitation 1: High-resolution video generation evaluation. Following your advice, we conducted high-resolution video generation evaluations at 768x768 resolution, using our AnimateDiff-based MCM on the WebVid mini val set. The results below demonstrate that our MCM achieves state-of-the-art performance in high-resolution video generation. We've included promising examples of these high-resolution outputs in the PDF attached to our general response.

| Method                | FVD @ 1 step | 2 steps | 4 steps | 8 steps | CLIPSIM @ 1 step | 2 steps | 4 steps | 8 steps |
|-----------------------|--------------|---------|---------|---------|------------------|---------|---------|---------|
| DDIM                  | 5364         | 2654    | 1378    | 973     | 20.23            | 20.53   | 23.67   | 28.93   |
| DPM++                 | 2371         | 1208    | 973     | 990     | 21.85            | 24.81   | 28.63   | 30.15   |
| LCM                   | 1273         | 1065    | 979     | 986     | 27.81            | 28.90   | 29.87   | 29.93   |
| AnimateLCM            | 1673         | 1348    | 1078    | 979     | 25.03            | 27.72   | 29.01   | 29.23   |
| AnimateDiff-Lightning | 1374         | 1367    | 1297    | 1370    | 28.34            | 29.22   | 29.78   | 29.89   |
| MCM (ours)            | 1108         | 1037    | 962     | 897     | 29.86            | 30.21   | 30.75   | 29.98   |
Comment

Thanks for the response. My concerns have been addressed well.

Review (Rating: 7)

This paper proposes a single-stage video diffusion distillation method that disentangles motion and appearance learning, thus improving frame appearance using various high-quality image data. The proposed mixed trajectory distillation mitigates training-inference differences in terms of video quality. Extensive experiments demonstrate superior performance in enhancing the frame quality of video diffusion models.

Strengths

  1. The proposed disentangled motion distillation and mixed trajectory distillation are intuitive and novel.
  2. The experiments are thorough. They are conducted across various datasets and show superior results in terms of video diffusion distillation. The ablation study shows the effectiveness of the proposed disentangled motion distillation and mixed trajectory distillation modules.
  3. The paper is well-written and easy to follow.

Weaknesses

  1. Motion jittering in the supplementary video. It is probably caused by the teacher model, but the authors could better discuss ways to alleviate it.
  2. In Fig. 6, there is no caption to indicate which result is from the proposed method and which is from the designed two-stage baseline. What are the differences between the first row and the second row?

Questions

  1. In Fig. 6, why do the "Ours w/ Webvid" results also have watermarks?

Limitations

The authors have discussed the limitations of this paper.

Author Response

Thank you for recognizing our MCM’s contributions and providing constructive feedback. We respond to your concerns below.

Weakness 1: Motion jittering. Thank you for the feedback! The motion jittering is largely inherited from the teacher model (the same phenomenon was also reported in AnimateDiff-Lightning [35]). To mitigate this, we could use the following methods.

  • Increase the hyperparameter $\lambda_{\text{real}}$ in mixed trajectory distillation, so that MCM learns more from real videos instead of the jittering teacher output.
  • Apply additional temporal losses to maintain stable video outputs, such as constraints on brightness or optical-flow changes (see the illustrative sketch after this list).

We will add additional discussion in the revised version.
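As a concrete illustration of the second point, here is a minimal PyTorch sketch of two simple temporal regularizers. These are hedged examples of what such constraints could look like, not losses from the paper; the frame-difference term is only a crude stand-in for an optical-flow constraint.

```python
import torch

def brightness_consistency_loss(video: torch.Tensor) -> torch.Tensor:
    """Penalize abrupt changes in mean frame brightness; video is [B, T, C, H, W]."""
    lum = video.mean(dim=(2, 3, 4))                 # per-frame mean luminance, [B, T]
    return (lum[:, 1:] - lum[:, :-1]).abs().mean()

def frame_difference_loss(video: torch.Tensor) -> torch.Tensor:
    """Crude stand-in for an optical-flow constraint: L1 between consecutive frames."""
    return (video[:, 1:] - video[:, :-1]).abs().mean()
```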

Weakness 2: Fig. 6 caption. Our Fig. 6 shows only our MCM results adapted to different image dataset styles, not two-stage results. The first and second rows represent the first and last video frames, respectively, similar to Fig. 5. We'll clarify this in the revision.

Question 1: Watermark in “Ours w/ WebVid” in Fig. 6. Thank you for noticing the watermark. This is because all WebVid videos contain “shutterstock” watermarks. We use WebVid for a fair comparison with ModelScopeT2V-based video diffusion distillation methods, such as DDIM, DPM++, and LCM. We will elaborate more on this in the revised version.

Comment

I appreciate the author's response. It has addressed my concerns.

Review (Rating: 5)

This paper proposes a video diffusion distillation method that disentangles motion and appearance learning. Basically, it proposes to enhance the appearance generation with high-quality image data and distill motion knowledge from the video teacher model.

Strengths

  1. The proposed method can distill motion knowledge from video diffusion models and improve the appearance quality through disentangled motion distillation.
  2. The mixed trajectory distillation is proposed to improve training-inference alignment and enhance generation quality.
  3. This paper is technically clear and the organization is good.

Weaknesses

  1. The introduction of the gaps between the training and inference distillation inputs is not straightforward.
  2. The introduction to related work needs to be significantly enhanced, especially in terms of the idea of decoupling appearance and motion, which is no longer uncommon and has many related works.
  3. It simply provides the conclusion that "learnable representation works the best" without giving specific analysis as to why. Such an analysis may be more helpful for following research.

Questions

Please refer to the weaknesses.

Limitations

Broader impacts and limitations have been discussed in the paper.

Author Response

Thank you for identifying our key strengths and providing insightful comments. We address your concerns point by point.

Weakness 1: Gaps in training and inference distillation inputs. Thank you for raising the concern! There exist two major gaps in the mixed trajectory distillation.

  • Distribution mismatch gap. During training, we assume all inputs are noisy low-quality videos; during inference, the input will be noisy high-quality videos.
  • Information leakage gap. During training, the original consistency distillation takes as input noisy videos that contain ground-truth video information; during inference, the sampling will start from pure noise, containing zero signal.

Our mixed trajectory distillation simultaneously addresses these two problems. We will make this clear in the revised version.
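To make the mixing concrete, the following PyTorch sketch shows one hedged interpretation of how the two kinds of distillation inputs could be sampled during training; the noise schedule, step list, and `student_step` API are illustrative assumptions, not the paper's exact algorithm.

```python
import torch

def forward_noise(x0: torch.Tensor, t: torch.Tensor, num_steps: int = 1000) -> torch.Tensor:
    """Illustrative DDPM-style forward process with a linear alpha-bar schedule."""
    alpha_bar = 1.0 - t.float() / num_steps
    alpha_bar = alpha_bar.view(-1, *([1] * (x0.dim() - 1)))
    return alpha_bar.sqrt() * x0 + (1.0 - alpha_bar).sqrt() * torch.randn_like(x0)

def sample_distillation_input(real_video, student_step, lambda_real=0.5, num_steps=1000):
    """Return (noisy input, timestep) for one consistency-distillation update.

    real_video: [B, T, C, H, W]; student_step(x, t) denoises x one step (assumed API).
    """
    batch = real_video.shape[0]
    if torch.rand(()) < lambda_real:
        # Training-style branch: noisy real (low-quality) video, which still leaks signal.
        t = torch.randint(1, num_steps, (batch,))
        return forward_noise(real_video, t), t
    # Inference-style branch: roll the student from pure noise, as at test time.
    x = torch.randn_like(real_video)
    for t_val in (999, 749, 499):                   # illustrative step schedule
        t = torch.full((batch,), t_val)
        x = student_step(x, t)
    return x, t
```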

Weakness 2: Related work on decoupled appearance and motion. Thank you for the suggestions! While disentangling appearance and motion is well established in video understanding, our work addresses unique challenges in video diffusion distillation.

  • Video diffusion models suffer from long sampling time due to the additional temporal dimension.
  • Popular open-source video datasets contain low-quality frames, such as watermarks, motion blur, and low resolution.
  • Our MCM simultaneously achieves video diffusion acceleration and frame quality improvement by leveraging additional high-quality image data.
  • We introduce disentangled motion distillation and mixed trajectory distillation to overcome specific challenges in video diffusion models.

We will expand our related work section to include additional references and contextualize our contributions in video diffusion distillation.

Weakness 3: Analysis on learnable motion representation. Thank you for the advice! Our Table 5 shows that most representations, except latent low-frequency components, reduce FVD and improve CLIPSIM at low sampling steps. This indicates that disentangling motion from the raw latent space for consistency learning enhances video quality. The learnable motion representation performs best, outperforming handcrafted ones. We attribute this to its ability to:

  • Adaptively learn optimal motion features, capturing complex motion patterns that generic handcrafted representations might miss.
  • Better separate motion from appearance, reducing conflicts in learning high-quality frames.

This learned disentanglement allows for more effective motion consistency modeling while preserving frame appearance quality. We will elaborate more on this in the final version.
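For illustration only, the sketch below shows one hypothetical form such a learnable motion representation could take: latent frame differences projected by a small trainable network, with the consistency loss computed on those motion features. The module shape and loss are our assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableMotionRepr(nn.Module):
    """Project latent frame differences into a learned motion feature space."""
    def __init__(self, latent_channels: int = 4, hidden: int = 64):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv3d(latent_channels, hidden, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv3d(hidden, latent_channels, kernel_size=3, padding=1),
        )

    def forward(self, latents):                          # latents: [B, C, T, H, W]
        motion = latents[:, :, 1:] - latents[:, :, :-1]  # frame-to-frame differences
        return self.proj(motion)

def motion_consistency_loss(repr_module, student_latents, target_latents):
    """Consistency loss computed on motion features instead of raw latents."""
    return F.mse_loss(repr_module(student_latents),
                      repr_module(target_latents).detach())
```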

Comment

The authors have addressed most of my concerns. However, the response to Weakness 2 is still very sketchy. I suggest the authors provide more detailed discussions in the revised paper. I decide to keep my initial positive rating.

General Author Response

We thank all reviewers for identifying our novel contribution in video diffusion distillation, promising qualitative and quantitative results, and well-written paper. We address all concerns in the individual responses below. Please find the attached PDF file for our additional qualitative results.

All training/inference code and model checkpoints used in our original manuscript and rebuttal will be made publicly available.

Final Decision

Reviewers either recommend acceptance or lean towards acceptance. Reviewers appreciated the novel technique and thorough experiments. Reviewers raised some concerns related to missing analysis and discussion, several of which are addressed in the rebuttal. The concerns raised are valid, and the authors are encouraged to address all of them to the best of their abilities in the final version.