PaperHub

Average rating: 4.7 / 10 (withdrawn; 3 reviewers)
Ratings: 5, 3, 6 (min 3, max 6, std. dev. 1.2)
Average confidence: 3.3 · Correctness: 2.7 · Contribution: 2.3 · Presentation: 2.7

ICLR 2025

Live2Diff: Live Stream Translation via Uni-directional Attention in Video Diffusion Models

OpenReview · PDF
Submitted: 2024-09-26 · Updated: 2024-11-23

Abstract

Keywords
Generative Model · Video Generation · Video Translation

Reviews and Discussion

Review
Rating: 5

The paper proposes Live2Diff, a video diffusion model that uses uni-directional temporal attention, together with additional warmup frames and depth information, for streaming video translation. Specifically, the uni-directional attention is conditioned on a sliding window of a few preceding frames, while the warmup attends to a small number of frames at the beginning of the video. To achieve real-time inference, Live2Diff uses three techniques: first, it replaces the vanilla diffusion sampler with LCM-LoRA to reduce the number of denoising steps; second, it uses a KV-cache in the implementation to avoid redundant computation; third, it designs a pipelined denoising approach to accelerate model inference. In the experiments, the author(s) compare Live2Diff with three video translation methods (StreamDiffusion, Rerender, and FreeNoise) and demonstrate that Live2Diff achieves higher CLIP scores and lower warp error while requiring lower latency.
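
To make the attention pattern in this summary concrete, the following is a minimal sketch of uni-directional temporal attention with a sliding window plus always-visible warmup frames. It is not the authors' implementation; the window size, warmup length, and tensor shapes are assumptions for illustration.

```python
# Minimal sketch of windowed, uni-directional temporal attention with warmup
# frames (illustrative only; window size, warmup length, and shapes are assumed).
import torch

def temporal_attention_mask(n_frames: int, warmup: int = 4, window: int = 8) -> torch.Tensor:
    """Frame i may attend to the warmup frames and to the `window` most recent
    frames up to and including itself -- never to future frames."""
    idx = torch.arange(n_frames)
    q, k = idx[:, None], idx[None, :]
    causal = k <= q                      # no access to future frames
    in_window = (q - k) < window         # sliding window over the recent past
    is_warmup = k < warmup               # warmup frames are always visible
    return causal & (in_window | is_warmup)

def temporal_attention(x: torch.Tensor, warmup: int = 4, window: int = 8) -> torch.Tensor:
    """x: (n_frames, dim) per-frame features for one spatial location/head."""
    n, d = x.shape
    q, k, v = x, x, x                    # projections omitted for brevity
    mask = temporal_attention_mask(n, warmup, window)
    scores = (q @ k.T) / d**0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return scores.softmax(dim=-1) @ v

out = temporal_attention(torch.randn(16, 64))
print(out.shape)  # torch.Size([16, 64])
```

Replacing this mask with full bidirectional attention recovers the standard offline video diffusion setting, which is the change the first weakness below characterizes as straightforward.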

Strengths

  1. The paper is well-written and easy to understand.
  2. The proposed method achieves high efficiency (16 FPS on a single RTX 4090) for streaming video translation.
  3. The ablation experiments are thorough and demonstrate the necessity of each component in Live2Diff.

Weaknesses

  1. I have some doubts about the novelty of this paper. Specifically, as the authors themselves mention, several works have already adopted an autoregressive approach for long video generation, e.g. ART-V, SEINE, Gen-L-Video, and Reuse and Diffuse. The idea of incorporating a small number of frames at the beginning is also not new [1]. The author(s) claim their major contribution is the uni-directional attention with warmup. However, this seems naive to me: the author(s) simply replace bidirectional attention with uni-directional attention because the problem setting already dictates that the target frame cannot see later frames and can only use information from preceding frames. Regarding the warmup, Table 2 shows that it does not provide significant improvements (Rows A, B, and C have similar values). Instead, the improvement comes from the additional depth priors rather than from the model architecture.
  2. The selected baselines are somewhat outdated. Similar work, such as StreamV2V (open source, arXiv, May 2024), which also targets real-time streaming video, should be included for comparison. State-of-the-art video translation methods like CoDeF and FlowVid, which may have better temporal consistency and editing performance but lower throughput, should also be compared. In particular, StreamV2V addresses exactly the same setting as this paper, and it compares against both CoDeF and FlowVid, both of which outperform the selected baselines (Rerender and FreeNoise) with much better temporal consistency.
  3. The literature review is not comprehensive enough; several video processing methods are missing. To the best of my knowledge, existing methods can mainly be classified into four categories: propagation-based, representation-based, canonical-based, and attention-based. These methods use different strategies to improve temporal consistency, such as propagation across frames [2,3], video deformation [4,5], transferring the video to a canonical space [6,7], or attention mechanisms. It is acceptable that the paper focuses only on attention-based methods, which are much closer to its scope, but even there many approaches that use diffusion models for video generation/editing are missing. Some are zero-shot methods that do not require model training, e.g. ControlVideo, TokenFlow, Control-A-Video, VideoControlNet, and Text2Video-Zero. Some further enhance temporal consistency with cross-attention, e.g. Tune-A-Video, Text2Video-Zero, FateZero, and Vid2Vid-Zero. Some works train models for video editing/generation, e.g. Imagen Video and Make-A-Video. In short, I suggest the author(s) provide a more comprehensive literature review in Section 2.
  4. The temporal consistency is poor in some of the demonstrated examples (Comparisons 1, 2, 3, and 6). To my knowledge, this is because the backgrounds in these examples contain more 'high-frequency' details; in other words, they are more complicated than the other 'dancing human' videos shot indoors with simple backgrounds. Since the paper already incorporates depth priors, is it possible to first separate the static background from the dynamic objects (e.g. human subjects) in the video and then apply a simple masking technique to avoid temporal inconsistency in the static background? (A minimal sketch of this masking idea is given after the references below.)

[1] W Harvey et al., Flexible Diffusion Modeling of Long Videos, NeurIPS 2022.

[2] O Jamriška, Stylizing video by example. ACM TOG 2019.

[3] X Wang et al., Learning Correspondence From the Cycle-Consistency of Time, CVPR 2019.

[4] V Ye et al., Deformable Sprites for Unsupervised Video Decomposition, CVPR 2022.

[5] Y Kasten et al., Layered neural atlases for consistent video editing, ACM TOG 2021.

[6] H Ouyang et al., CoDeF: Content Deformation Fields for Temporally Consistent Video Processing, CVPR 2024.

[7] T Chen et al., NaRCan: Natural Refined Canonical Image with Integration of Diffusion Prior for Video Editing, NeurIPS 2024.
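
As a concrete (and deliberately naive) version of the masking idea in weakness 4, one could threshold the already-available depth map into a foreground mask and composite the per-frame stylized foreground over a background that is stylized only once. The threshold, the tensor shapes, and the reuse of a single stylized background are assumptions for illustration; this is a sketch of the reviewer's suggestion, not part of the paper's method.

```python
# Naive sketch of the suggested masking: keep the static background temporally
# frozen and only update the (depth-thresholded) dynamic foreground per frame.
# Threshold, shapes, and the reuse of one stylized background are assumptions.
import torch

def composite_with_static_background(
    stylized_frame: torch.Tensor,       # (3, H, W) current stylized output
    stylized_background: torch.Tensor,  # (3, H, W) background stylized once
    depth: torch.Tensor,                # (H, W) predicted depth, assumed larger = closer
    fg_threshold: float = 0.6,
) -> torch.Tensor:
    """Foreground pixels come from the per-frame result, background pixels from
    a fixed stylized background, to suppress flicker in static regions."""
    fg_mask = (depth > fg_threshold).float().unsqueeze(0)   # (1, H, W)
    return fg_mask * stylized_frame + (1.0 - fg_mask) * stylized_background
```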

Questions

  1. Figures 3 and 6 do not provide a strong motivation for using depth priors. Since the baselines may not use depth priors, I am curious about a qualitative comparison between the baselines, uni-directional attention with warmup, and uni-directional attention with warmup and depth.
  2. Could you provide the video results of Figure 3? It seems that Figure 3(f), i.e. the full model with depth, has worse temporal consistency than Figure 3(e), i.e. the full model without depth, specifically in the regions of the human clothes and face. I expect serious jittering artifacts in Figure 3(f).
  3. Could you provide comparisons with more baselines, especially StreamV2V?
  4. It seems to me that all the generated results are very similar to the input videos, and the prompt must align with the input video. Is it possible for Live2Diff to perform editing tasks other than style transfer, for example changing the color of a person's clothes?
  5. See weaknesses.

Details of Ethics Concerns

Video-to-video translation could be used to generate fake videos of particular human subjects.

Review
Rating: 3

This paper presents a comprehensive method for streaming video processing, leveraging unidirectional attention over video frames (with warmup frames and a context window) and pipelined denoising (proposed in StreamDiffusion) with KV caching. For the streaming video-to-video application, depth information is extracted and shown to be very important for good performance.
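
For reference, the pipelined ("stream batch") denoising introduced in StreamDiffusion and reused here can be sketched as follows: the frames currently in flight sit at staggered denoising timesteps and are denoised together in one batched call, so the pipeline emits one finished frame per incoming frame at steady state. The function names and the 4-step schedule below are illustrative assumptions, not the paper's code.

```python
# Schematic sketch of pipelined ("stream batch") denoising: the newest frame
# enters at the noisiest timestep while older in-flight frames sit at less noisy
# timesteps, and all of them share a single batched denoising call per new frame.
# `denoise_step` and T_STEPS are illustrative placeholders.
import torch

T_STEPS = [999, 666, 333, 0]                  # assumed 4-step (LCM-style) schedule

def denoise_step(latents: torch.Tensor, timesteps: torch.Tensor) -> torch.Tensor:
    """Placeholder for one batched UNet + scheduler update at the given timesteps."""
    return latents * 0.9                       # stand-in computation

pipeline: list[torch.Tensor] = []              # in-flight latents, newest first

def push_frame(noisy_latent: torch.Tensor):
    """Feed one new frame latent; returns a fully denoised latent once the
    pipeline is full, i.e. one output per input at steady state."""
    pipeline.insert(0, noisy_latent)
    timesteps = torch.tensor(T_STEPS[: len(pipeline)])
    batch = denoise_step(torch.stack(pipeline), timesteps)
    for i in range(len(pipeline)):
        pipeline[i] = batch[i]
    if len(pipeline) == len(T_STEPS):
        return pipeline.pop()                  # oldest frame has passed every step
    return None

# Usage: stream 8 noisy latents; the first few calls fill the pipeline, after
# which every call returns one finished frame.
for t in range(8):
    out = push_frame(torch.randn(4, 32, 32))
    print(t, None if out is None else tuple(out.shape))
```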

Strengths

The problem this paper works on is important, as streaming and low-latency video processing and generation have broad applications. The approach proposed by this paper makes sense, since pipelined processing and temporal relation modeling are both important for decent performance under tight latency constraints.

Weaknesses

Overall, the writing of this paper is not good (e.g. "Absolute position encoding (i.e., sinusoidal position encoding) to fin before computing Eq. (2), to give the layer access to the temporal position of each feature vector."), and, more importantly, the proposed approach lacks technical novelty.

  1. The pipelined denoising was first proposed in StreamDiffusion.
  2. Uni-directional attention and KV caching are both common in the LLM community (see the sketch after this list); the warmup design also lacks insightful analysis.
  3. The depth information extraction seems highly heuristic and specialized to the video-to-video application.
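
For context, the LLM-style KV-caching pattern referred to in point 2, transplanted to frame-level temporal attention, looks roughly like the sketch below: each new frame's keys/values are computed once and cached, entries outside the sliding window are evicted, and warmup entries are pinned. All names and sizes are illustrative assumptions, not the paper's implementation.

```python
# Illustrative frame-level KV cache for uni-directional temporal attention:
# keys/values of past frames are stored so each new frame only computes its own
# K/V once; entries outside the sliding window are evicted, warmup entries kept.
import torch

class FrameKVCache:
    def __init__(self, warmup: int = 4, window: int = 8):
        self.warmup, self.window = warmup, window
        self.keys, self.values = [], []          # one (dim,) tensor per cached frame

    def append(self, k: torch.Tensor, v: torch.Tensor):
        self.keys.append(k)
        self.values.append(v)
        # Evict the oldest non-warmup entry once the recent window overflows.
        if len(self.keys) > self.warmup + self.window:
            del self.keys[self.warmup], self.values[self.warmup]

    def attend(self, q: torch.Tensor) -> torch.Tensor:
        k = torch.stack(self.keys)               # (n_cached, dim)
        v = torch.stack(self.values)
        scores = (k @ q) / q.shape[-1] ** 0.5    # (n_cached,)
        return scores.softmax(dim=0) @ v

# Usage: each new frame contributes its own K/V, then attends over the cache.
cache = FrameKVCache()
for _ in range(20):
    feat = torch.randn(64)                       # projections omitted for brevity
    cache.append(feat, feat)
    out = cache.attend(feat)
```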

Questions

  1. How does your method work in other application scenarios, for example text-to-video generation?
  2. "We measure inference speed as the total amount of time it takes each method to process an input stream of 100 frames" --- but is this not different from the end-to-end latency at the user's side? (See the timing sketch after this list.)
  3. Why is depth information so important? What happens if the depth is inaccurate?
  4. In Figure 5, the results are in different styles, which may lead to unfair comparisons.
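
To make the distinction in question 2 explicit: throughput averages total processing time over a 100-frame stream, whereas user-side latency is the time from a frame's arrival until its corresponding output leaves the pipeline, which with pipelined denoising also includes the pipeline-fill delay. Below is a generic timing sketch, with `process_frame` as an assumed placeholder for one step of a streaming model that may return None until its pipeline is filled.

```python
# Generic sketch distinguishing throughput from per-frame end-to-end latency.
# `process_frame` is an assumed placeholder; with pipelined denoising it may
# return None for the first few calls while the pipeline fills up.
import time

def measure(process_frame, frames):
    arrival, latencies = {}, []
    t0 = time.perf_counter()
    out_idx = 0                                   # index of the next expected output
    for i, frame in enumerate(frames):
        arrival[i] = time.perf_counter()
        result = process_frame(frame)
        if result is not None:                    # this output belongs to frame `out_idx`
            latencies.append(time.perf_counter() - arrival[out_idx])
            out_idx += 1
    total = time.perf_counter() - t0
    throughput = len(latencies) / total           # frames per second
    mean_latency = sum(latencies) / max(len(latencies), 1)
    return throughput, mean_latency
```
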
Review
Rating: 6

This paper introduces a novel approach to real-time video style transfer for streaming applications. The key innovation is adapting unidirectional attention mechanisms (similar to those used in Large Language Models) for video diffusion models, allowing frame-by-frame processing without requiring future frames. Additionally, they implemented warmup and KV-cache strategies.

Strengths

  • The paper is well-motivated, addressing an important practical problem of live video stream translation.
  • The proposed unidirectional attention mechanism is a natural solution inspired by successful LLM architectures.
  • Clear explanation of the architecture and mechanisms.
  • Multiple evaluation metrics and user-study results are reported, including structural consistency and temporal smoothness.

Weaknesses

Limited Analysis of Dynamic Scenes

  • The paper lacks a comprehensive evaluation on videos containing rapid motion or frequent scene changes. The authors are recommended to add comparative experiments between the warmup approach and simple unidirectional attention for several scenarios: videos with fast camera movements, videos with quick scene transitions, and action sequences with rapid object motion. In addition, I am concerned that the warmup window might become useless, or even perform worse than simple unidirectional attention, when the content is no longer relevant to the first few frames.

Depth Predictor Integration & Evaluation

  • The current depth Mean Square Error (MSE) metric creates an unfair comparison between methods. To address this limitation, I suggest adding structural similarity metrics between frames that are independent of depth, such as SSIM and LPIPS (a sketch of such a metric is given below). Furthermore, evaluations should be conducted both with and without depth conditioning to ensure a more equitable comparison.
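
A concrete version of this suggested depth-independent metric could average SSIM and LPIPS between consecutive output frames, e.g. with the `skimage` and `lpips` packages (both assumed to be available, with frames as float RGB arrays in [0, 1]). This is only a sketch of the proposed metric, not the paper's evaluation protocol.

```python
# Sketch of depth-independent temporal-consistency metrics between consecutive
# frames, using SSIM (skimage) and LPIPS (lpips package); both libraries and the
# frame format (float RGB in [0, 1], shape (T, H, W, 3)) are assumptions.
import numpy as np
import torch
import lpips
from skimage.metrics import structural_similarity

lpips_fn = lpips.LPIPS(net="alex")

def temporal_consistency(frames: np.ndarray):
    """Average frame-to-frame SSIM (higher is better) and LPIPS (lower is better)."""
    ssim_scores, lpips_scores = [], []
    for prev, curr in zip(frames[:-1], frames[1:]):
        ssim_scores.append(
            structural_similarity(prev, curr, channel_axis=-1, data_range=1.0)
        )
        # LPIPS expects (N, 3, H, W) tensors scaled to [-1, 1].
        a = torch.from_numpy(prev).permute(2, 0, 1)[None].float() * 2 - 1
        b = torch.from_numpy(curr).permute(2, 0, 1)[None].float() * 2 - 1
        with torch.no_grad():
            lpips_scores.append(lpips_fn(a, b).item())
    return float(np.mean(ssim_scores)), float(np.mean(lpips_scores))
```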

Visual Quality Issues

  • Several visual artifacts are noticeable in the results presented in the supplementary material. I suggest including detailed analyses of these failure cases to better understand the method's limitations and potential areas for improvement.

Questions

  • How does the model perform on videos with rapid scene changes or fast motion, where the early warmup frames might become less relevant?
  • Is there a justification for using the depth MSE metric to measure structural consistency? The use of an off-the-shelf depth predictor can be an advantage.
Withdrawal Notice

I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.