PaperHub

Rating: 4.3/10 (withdrawn; 4 reviewers; min 3, max 6, std 1.3)
Individual ratings: 3, 6, 3, 5
Confidence: 4.0 · Soundness: 2.8 · Contribution: 2.0 · Presentation: 2.8
ICLR 2025

Video-Infinity: Distributed Long Video Generation

OpenReview · PDF
Submitted: 2024-09-26 · Updated: 2024-11-26
TL;DR

Generating extremely long videos via multi-GPU parallelism and Dual-scope attention

Abstract

Keywords

diffusion model, video generation

Reviews and Discussion

Review
Rating: 3

The paper introduces Video-Infinity, a distributed inference framework designed to enable efficient generation of long videos using diffusion models. It tries to address a challenge in video generation: the resource-intensive nature of long-form content, which often restricts video length and quality due to memory and computation limits. The proposed approach leverages two main innovations: Clip parallelism, which optimizes the distribution and synchronization of context information across multiple GPUs, and Dual-scope attention, which balances local and global temporal self-attention to maintain semantic coherence without requiring additional model training.
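To make the Dual-scope attention idea mentioned in this summary concrete, below is a minimal sketch of a temporal self-attention step that mixes local and global context. The function name, tensor shapes, and the way global context is obtained are illustrative assumptions, not the authors' implementation.

```python
import torch

def dual_scope_temporal_attention(q, k_local, v_local, k_global, v_global):
    """Illustrative sketch: each frame's query attends over the keys/values
    of its own clip (local scope) concatenated with a small set of
    keys/values shared across devices (global scope).
    Assumed shapes: (batch * spatial_positions, num_frames, head_dim)."""
    k = torch.cat([k_global, k_local], dim=1)   # prepend global context frames
    v = torch.cat([v_global, v_local], dim=1)
    scale = q.shape[-1] ** -0.5
    attn = torch.softmax((q @ k.transpose(-2, -1)) * scale, dim=-1)
    return attn @ v                             # same shape as q
```

In a multi-GPU setting, `k_global`/`v_global` would presumably hold a few frames of context gathered from the other devices, while `k_local`/`v_local` come from the clip held by the current device.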

Strengths

  1. The empirical results are persuasive, with Video-Infinity achieving a 10x improvement over comparable methods like FIFO-Diffusion and being significantly faster than alternatives like Streaming T2V.

  2. The paper is well-organized, clearly outlining the technical details, methodology, and communication strategies.

Weaknesses

  1. It looks like this work adopts the idea from DistriFusion [1]. While the authors claim to tackle a more challenging problem, the frame dimension is, from a technical standpoint, actually much simpler to manage than the problems addressed in DistriFusion.

  2. How does this method impact frame-to-frame continuity? I noticed that many of the generated videos in the Supplementary Material exhibit noticeable continuity issues, which the authors do not seem to have adequately addressed. Additionally, many of the other generated long videos only show repetitive motions and clips.

  3. The evaluation lacks comprehensiveness, as the authors have only demonstrated their method on a single model, VideoCrafter2. It remains unclear whether the approach is effective across a broader range of model architectures. For instance, how well does this method generalize to new architectures like DiT? Additionally, what is the performance impact on these models?

  4. This is more of an engineering work; the novelty and contribution are not strong enough.

I'm sure it would take dedicated effort to apply this method to every new model architecture.

[1] DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models, CVPR'24

Questions

Please refer to Weaknesses.

Review
Rating: 6

The paper presents Video-Infinity, a novel framework for generating long-form videos using distributed diffusion models across multiple GPUs. This approach aims to reduce the inference time and resource demands typically associated with long video generation. The paper proposes two techniques, Clip parallelism and Dual-scope attention, which optimize inter-GPU communication and temporal attention across frames, respectively. The methodology enables the generation of videos up to 2,300 frames in just 5 minutes, faster than existing methods.

Strengths

  1. The integration of Clip parallelism and Dual-scope attention is a novel approach that effectively addresses the scalability and efficiency challenges in video generation.

  2. The paper demonstrates the ability to generate longer videos much faster than current methods, achieving substantial reductions in generation time.

  3. Experiments are conducted to validate the performance, showcasing significant improvements over other methods in terms of speed and video length capabilities.

Weaknesses

  1. The method of synchronizing context across GPUs, which is crucial for maintaining temporal coherence, is not discussed in detail.

  2. While the framework improves efficiency, there is not much discussion of how these gains affect the qualitative aspects of the videos, such as resolution and realism, particularly under complex scene dynamics.

Questions

  1. How does synchronization latency affect the continuity and quality of the video, especially in dynamic scenes? Are there mechanisms in place to mitigate negative impacts if synchronization is delayed?

  2. How does the proposed model perform with high-motion sequences or videos requiring rapid scene changes, and what are the limitations of the current approach in handling such dynamics?

  3. The paper focuses on generating videos from a limited set of prompts and scenarios. How well does the approach generalize to a wider variety of video content or more complex scenes?

Review
Rating: 3

This paper presents Video-Infinity, a distributed inference pipeline for long-video generation using diffusion models. It introduces two techniques for the main challenges in long-video generation. Clip parallelism divides a long-clip generation task into several short clips to address the high GPU memory usage. Dual-scope attention gathers local and global context for temporal self-attention to generate a consistent long video. The method is compared with FreeNoise, StreamingT2V, and Open-Sora v1.1.

Strengths

  1. The paper is well-written and easy to follow.
  2. It is a training-free inference pipeline while extending the baseline model generation capacity.
  3. Dual-scope Attention provides a new view of gathering the global and local context for high-fidelity long video generation. The generation results are impressive. It might provide insight into the training scheme or new architecture design.

Weaknesses

  1. The novelty of Clip Parallelism is limited. The paper merely migrates DistriFusion [1] to the video diffusion model: DistriFusion splits a large image into patches, while this paper splits a long video into short clips. The distributed modules are similar to the sparse operations in DistriFusion [1], except for extending the sparse 2D convolution to 1D/3D temporal convolution with different padding schemes. The GroupNorm modification is also similar. Moreover, DistriFusion [1] further introduced corrected asynchronous GroupNorm, which is more efficient than this paper's implementation since the asynchronous communication can be pipelined into the computation (see the GroupNorm sketch after this list).
  2. The paper didn't compare the video quality with FIFO-Diffusion[2], which also focused on long-video generation. It is difficult to demonstrate the proposed method's advantage over the SOTA work.
  3. In the comparison of efficiency, comparing Open-Sora v1.1 and the proposed method is unfair because they use different model architectures (Spatial-Temporal DiT vs. VideoCrafterV2).
  4. There are several typos (e.g., the GPU index in Fig. 1, the captions in Fig. 3).
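To make the GroupNorm point in item 1 concrete, here is a minimal sketch of group normalization whose per-group statistics are aggregated across devices with an all-reduce, so that they match single-device inference. This is an illustration of the general technique under assumed tensor shapes, not the paper's or DistriFusion's actual code; DistriFusion's corrected asynchronous variant would additionally overlap this communication with computation, whereas this version blocks on the all-reduce.

```python
import torch
import torch.distributed as dist

def distributed_group_norm(x, num_groups, eps=1e-5):
    """Sketch of GroupNorm with statistics synchronised across devices,
    each device holding the latents of one clip.
    Assumed x shape: (batch, channels, frames, height, width).
    Affine weight/bias are omitted for brevity."""
    b, c, *rest = x.shape
    xg = x.reshape(b, num_groups, -1)
    # local sums; the all-reduce (default op: SUM) turns them into global sums
    cnt = torch.tensor([xg.shape[-1]], dtype=x.dtype, device=x.device)
    s = xg.sum(dim=-1)
    sq = (xg ** 2).sum(dim=-1)
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(cnt)
        dist.all_reduce(s)
        dist.all_reduce(sq)
    mean = s / cnt                      # (batch, num_groups)
    var = sq / cnt - mean ** 2
    xg = (xg - mean.unsqueeze(-1)) / torch.sqrt(var.unsqueeze(-1) + eps)
    return xg.reshape(b, c, *rest)
```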

[1] DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models. Muyang Li, Tianle Cai, et al. CVPR 2024

[2] FIFO-Diffusion: Generating Infinite Videos from Text without Training. Jihwan Kim, Junoh Kang, Jinyoung Choi, and Bohyung Han. NeurIPS 2024

Questions

  1. The paper focuses on video diffusion models with temporal self-attention layers. However, DiT-based architectures are popular in current work, and they tend to use 3D attention layers instead of temporal self-attention layers (e.g., CogVideoX-5B [3]). How does Dual-scope Attention perform with a 3D attention layer?
  2. How is the video quality compared with FIFO-Diffusion[2]?
  3. Would it be possible to extend Video-Infinity to OpenSora for the efficiency comparison?

[3] CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer. Zhuoyi Yang, Jiayan Teng, et al. arXiv:2408.06072, 2024

Review
Rating: 5

The paper proposes Video-Infinity, a distributed inference pipeline designed for long-form video generation using diffusion models. The framework leverages two main mechanisms: Clip parallelism, which distributes video segments across multiple GPUs to improve processing efficiency, and Dual-scope attention, which balances local and global temporal contexts across devices. Together, these components enable Video-Infinity to generate lengthy, coherent videos with reduced memory overhead. On an 8 × Nvidia 6000 Ada GPU setup, the framework can produce videos up to 2,300 frames in approximately 5 minutes.
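As a rough illustration of the clip-parallel idea described above (each GPU holds one clip and neighbouring devices exchange boundary context), here is a hedged sketch using torch.distributed point-to-point communication. The tensor layout, overlap size, and function name are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.distributed as dist

def exchange_boundary_frames(clip_latents, overlap=4):
    """Sketch: each rank holds the latents of one clip with shape
    (batch, channels, frames, height, width). Neighbouring ranks swap
    `overlap` boundary frames so temporal layers can see cross-clip
    context. Returns (frames from previous clip, frames from next clip),
    with None at the sequence boundaries."""
    rank, world = dist.get_rank(), dist.get_world_size()
    head = clip_latents[:, :, :overlap].contiguous()   # sent to rank - 1
    tail = clip_latents[:, :, -overlap:].contiguous()  # sent to rank + 1
    prev_tail = torch.empty_like(tail)                 # received from rank - 1
    next_head = torch.empty_like(head)                 # received from rank + 1
    ops = []
    if rank > 0:
        ops.append(dist.P2POp(dist.isend, head, rank - 1))
        ops.append(dist.P2POp(dist.irecv, prev_tail, rank - 1))
    if rank < world - 1:
        ops.append(dist.P2POp(dist.isend, tail, rank + 1))
        ops.append(dist.P2POp(dist.irecv, next_head, rank + 1))
    if ops:
        for req in dist.batch_isend_irecv(ops):
            req.wait()
    return (prev_tail if rank > 0 else None,
            next_head if rank < world - 1 else None)
```

In the framework as described, the exchanged frames would presumably feed the local scope of the temporal layers so that adjacent clips remain temporally coherent.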

Strengths

  1. This work brings incremental novelty by adapting distributed parallelism specifically for long-form video generation. It introduces a dual-scope attention mechanism to balance local and global temporal interactions, ensuring coherence across extended sequences. The clip parallelism approach further enables efficient processing of video clips across GPUs, effectively handling the unique scalability and memory demands of video data. These adaptations, including optimizations for temporal continuity, showcase Video-Infinity’s tailored application of distributed inference to the distinct challenges of generating coherent long videos.

  2. The speed-up is impressive. The proposed Clip parallelism and Dual-scope attention mechanisms optimize inter-device communication and memory requirements, leading to faster processing times and scalability for generating extended video sequences. They can reduce inference time by up to 52%.

Weaknesses

  1. Performance. In Table 2, under the 64-frame setting, although the proposed method achieves the highest overall score, it does not show a clear advantage over the other baselines.

  2. Results on longer contexts. This work claims the ability to generate much longer video clips, yet Table 2 only reports results for up to 192 frames. Since the paper emphasizes long video generation, I would suggest adding more quantitative results on longer videos.

  3. Results on memory usage. This work lacks a comparison of memory overhead to demonstrate the efficiency of the method.

Questions

Has “Video-Infinity” been tested on different types of video content, such as fast-moving scenes or varying lighting conditions, which might challenge the coherence of frame transitions? How robust is the model across these diverse scenarios?

Withdrawal Notice

I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.