PaperHub
Overall rating: 5.0 / 10 (withdrawn; 3 reviewers)
Ratings: 5, 5, 5 (min 5, max 5, std 0.0)
Average confidence: 4.0
ICLR 2024

Efficient architectural aspects for text-to-video generation pipeline

OpenReview | PDF
Submitted: 2023-09-24 · Updated: 2024-03-26
TL;DR

In this paper we propose a two-stage latent diffusion video generation architecture and a new MoVQ video decoding scheme. We also experimentally compare temporal blocks and temporal layers as architectural design choices.

Abstract

Keywords
text-to-video, video generation, temporal consistency, frames interpolation, Inception Score, CLIPSIM, MoVQ video decoder

Reviews and Discussion

Review 1
Rating: 5

This paper proposes a video generation pipeline based on a MoVQ video decoding scheme. It consists of two stages: keyframe synthesis and video frame interpolation. It also compares two temporal conditioning approaches and different configurations of MoVQ-based models.

Strengths

The contributions of this method are summarised below:

  1. an end-to-end text-to-video latent diffusion pipeline that consists of key frame generation and frame interpolation (see the pipeline sketch after this list)

  2. separate temporal blocks for temporal modelling

  3. temporal output masking and data augmentations for robust video frame interpolation (VFI)

  4. investigation of video decoders
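
To make these contributions concrete, below is a minimal PyTorch-style sketch of how a two-stage keyframe-then-interpolation pipeline with a separate video decoder could be wired together. All module names, shapes, and interfaces are hypothetical illustrations, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TwoStageT2V(nn.Module):
    """Hypothetical wrapper: keyframe diffusion -> latent interpolation -> decoding."""
    def __init__(self, keyframe_model: nn.Module,
                 interpolation_model: nn.Module,
                 video_decoder: nn.Module):
        super().__init__()
        self.keyframe_model = keyframe_model            # samples sparse keyframe latents from text
        self.interpolation_model = interpolation_model  # fills in latents between keyframes
        self.video_decoder = video_decoder              # e.g. a MoVQ-style decoder to pixel space

    @torch.no_grad()
    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        # Stage 1: generate a small set of keyframe latents conditioned on the text.
        key_latents = self.keyframe_model(text_emb)            # (B, K, C, H, W)
        # Stage 2: interpolate additional latents between consecutive keyframes.
        full_latents = self.interpolation_model(key_latents)   # (B, T, C, H, W), T > K
        # Decode every latent frame back to pixels with the video decoder.
        b, t, c, h, w = full_latents.shape
        frames = self.video_decoder(full_latents.reshape(b * t, c, h, w))
        return frames.reshape(b, t, *frames.shape[1:])
```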

Weaknesses

My main concern is the lack of novelty. The concrete comments are below:

  1. This pipeline is not new. Various methods have tried to utilize text-to-image diffusion models for video generation. For example, the proposed temporal conditioning scheme can be found in "AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning".

  2. Concatenating key frames along the channel dimension in the video frame interpolation part is similar to "Imagen Video: High Definition Video Generation with Diffusion Models" (see the concatenation sketch after this list).

  3. The introduction of temporal layers is similar to "Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models".

  4. What does "efficient" in the title mean?
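
For reference, the concatenation scheme mentioned in point 2 can be illustrated in a few lines of PyTorch; the shapes below are arbitrary examples, not taken from either paper.

```python
import torch

# Conditioning a frame-interpolation denoiser on keyframes by concatenating
# them along the channel dimension (the scheme compared to Imagen Video above).
B, C, H, W = 2, 4, 32, 32                      # batch, latent channels, spatial size
noisy_mid_latent = torch.randn(B, C, H, W)     # intermediate frame being denoised
left_keyframe = torch.randn(B, C, H, W)        # preceding keyframe latent
right_keyframe = torch.randn(B, C, H, W)       # following keyframe latent

# The interpolation U-Net then takes 3*C input channels instead of C.
unet_input = torch.cat([noisy_mid_latent, left_keyframe, right_keyframe], dim=1)
assert unet_input.shape == (B, 3 * C, H, W)
```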

Questions

See the weaknesses above.

Review 2
Rating: 5

This paper proposes a new text-to-video generation method. A new architecture combining keyframe generation and frame interpolation with a video decoder is proposed to handle the text-to-video scenario. Some state-of-the-art results are achieved in terms of the IS and CLIPSIM metrics.

Strengths

A novel scheme for video generation with customizations to ordinary text-to-video models. The paper evaluates several configurations of temporal blocks and the MoVQ-based decoder.

Weaknesses

The best results are not achieved in terms of the CLIPSIM metric, and the IS scores are not good compared with VideoGen.

Questions

Why is the IS score much lower than that of VideoGen?

Review 3
Rating: 5

The paper proposes a two-stage latent diffusion model, consisting of keyframe generation and frame interpolation, for text-to-video generation tasks. Starting from a pretrained text-to-image model, separate temporal blocks are used to model the temporal information. A video decoder based on MoVQGAN is also adopted to improve the generation quality.

Strengths

  1. The topic of this paper is significant.
  2. The paper is overall clear and well-formulated.
  3. The training of the proposed model seems resource-efficient compared to most text-to-video methods, as only 8-16 A100 GPUs and 120k data pairs were used for training.

Weaknesses

  1. More explanation and analysis of the proposed methods and results are needed. The authors should explain the possible reasons why separate temporal blocks perform better than temporal layers.
  2. Given the limitations of quantitative metrics, more visual results should be provided and compared with other text-to-video methods.
  3. The paper doesn't mention the number of parameters of the video generation model, and only 120k internal data pairs are used for training. The model may overfit to the training data. Since the data domain is also unknown, it's hard to judge whether the experimental results are convincing enough.

Questions

  1. Could the authors give more explanation of why separate temporal blocks are preferable to temporal layers? (A rough illustration of the two schemes is sketched after this list.)
  2. As mentioned in the introduction, the scarcity of open-source text-video datasets impedes the development of video generation. Is it possible to contribute the internal training data to the community?
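
For clarity on question 1, the following rough sketch contrasts the two schemes the reviews refer to: temporal layers interleaved inside pretrained spatial blocks versus separate temporal blocks added alongside them. The wiring is an assumption made for illustration only and may differ from the paper's actual design; both modules are assumed to preserve tensor shape.

```python
import torch
import torch.nn as nn

class TemporalLayerBlock(nn.Module):
    """'Temporal layers': temporal mixing is inserted inside each pretrained
    spatial block, so spatial and temporal processing interleave."""
    def __init__(self, spatial: nn.Module, temporal: nn.Module):
        super().__init__()
        self.spatial, self.temporal = spatial, temporal

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, C, H, W)
        b, t, c, h, w = x.shape
        x = self.spatial(x.reshape(b * t, c, h, w)).reshape(b, t, c, h, w)
        return self.temporal(x)                           # mixes information across the T axis

class SeparateTemporalBlock(nn.Module):
    """'Separate temporal blocks': the pretrained spatial block is left frozen,
    and a dedicated temporal block is added as a residual branch on top."""
    def __init__(self, spatial: nn.Module, temporal: nn.Module):
        super().__init__()
        self.spatial, self.temporal = spatial, temporal
        for p in self.spatial.parameters():
            p.requires_grad_(False)                       # only temporal weights are trained

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, C, H, W)
        b, t, c, h, w = x.shape
        h_sp = self.spatial(x.reshape(b * t, c, h, w)).reshape(b, t, c, h, w)
        return h_sp + self.temporal(h_sp)                 # temporal branch added residually
```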