PaperHub
Overall rating: 5.3 / 10 (Poster; 4 reviewers, lowest 3, highest 8, std. dev. 1.9)
Individual ratings: 3, 6, 4, 8
Confidence: 4.8 · Correctness: 2.8 · Contribution: 2.5 · Presentation: 2.5
NeurIPS 2024

FIFO-Diffusion: Generating Infinite Videos from Text without Training

OpenReview · PDF
Submitted: 2024-05-13 · Updated: 2024-11-06
TL;DR

A technique for generating infinitely long videos without training, based on pretrained video diffusion models

Abstract

Keywords
generative models, diffusion, long video generation, tuning-free

Reviews and Discussion

Review
Rating: 3

This article proposes a training-free diagonal denoising method aimed at extending video diffusion models, which typically generate only short clips, to the production of long videos. The authors have validated their method across multiple baselines and demonstrated that it can be extended to different VDM architectures.

Strengths

  • The motivation for this work is commendable; generating long videos is a worthwhile topic to research, as past VDMs could only produce short videos of a few seconds.
  • Viewing the denoising of long videos as a FIFO queue is interesting. From the paper, it appears to work to some extent.
  • The discussion of related works is comprehensive. The overall description of the method is understandable and should be reproducible with some experience in video diffusion.

Weaknesses

  • What do the authors think about training models for longer video generation? SORA can generate videos up to 1 minute with remarkable temporal and even 3D consistency. An important rule in deep learning is scaling up. Methods that still work well or are meaningful when computation resources are upgraded are most valuable. The currently available video generation models are still not satisfactory. Applying them for training-free applications seems only to make the results worse and unsatisfactory.
  • Although the videos presented in the supplementary are long compared to current video generation models, most of them are not attractive enough. In most cases, there are just random motion alterations without any order or controllability. In this way, it does not seem sufficient for practical usage. The most important reason is that you never really have a global perception of the generation process, and therefore the motions and contents tend to be repeated or meaningless alterations. I believe the authors should try to add additional controllability to their method; otherwise, it will not be usable.

Questions

The method presented in this article is straightforward and provides some insights. The biggest concern might be that the actual results are very poor, with so-called long videos consisting mostly of simple, repetitive actions. At this stage, I can only give a borderline reject score, while looking forward to the opinions of other reviewers.

Limitations

Yes

Author Response

[W1] Scaling-up and FIFO-Diffusion

Thank you for your insightful comments. We agree that scaling up in deep learning is crucial, particularly for video generation. However, we argue that FIFO-Diffusion offers benefits even for scaled-up models as a universal inference technique, and may aid in scaling up dataset sizes.

First, FIFO-Diffusion is an inference method applicable to a wide range of video diffusion models regardless of architecture (e.g., DiT or UNet), as shown in Figure 6. This implies that any future scaled-up model with longer and more advanced video generation capabilities would also benefit from our technique by generating videos longer than its original capacity.

Furthermore, this expandability might reduce the burden of preparing datasets. With the scalability of FIFO-Diffusion during inference, those who want to generate extremely long videos can rely on much shorter video datasets. Given that sufficiently long videos are far scarcer than short ones, FIFO-Diffusion provides a valuable alternative to collecting datasets of very long videos.

[W2, Q1] Controllability and Random Motion

We believe the results shown in Figure 6 (c) and Appendix E address these concerns. The multi-prompt results demonstrate the controllability and capability of our technique for generating long-term ordered motion using varying text prompts. Our method effectively maintains visual consistency by keeping sufficient forward frames in the queue, enabling seamless motion transitions across prompt changes. This approach adds the desired level of control and order to the generated videos, making them more suitable for practical applications.

Comment

Thank you for your time and effort. After reading your rebuttal and the opinions of other reviewers, and considering the limited novelty and the less satisfactory visual results of the experiments, I have decided to lower the score.

Comment

Dear uUJo,

Thank you for your time and consideration on our paper. We are writing to follow up on the rebuttal we submitted for our paper in response to your initial review. If you have any additional comments or questions regarding our rebuttal, we would be grateful to hear them. Please let us know if there is anything else we can provide to assist in your review.

Warm regards,

Authors

Comment

We appreciate your comments. However, we are concerned about the rationale for lowering the score, which seems to be based on superficial and uninformed criteria: "limited novelty" and "less satisfactory visual results."

In our initial rebuttal, we addressed the concerns and questions regarding scalability and controllability, arguing that our technique is highly relevant to scalability and is controllable with multiple text prompts. However, it appears these points were not considered in the final assessment.

Furthermore, while we understand that different reviewers may have different backgrounds and perspectives, we cannot find any detailed explanations for “limited novelty” and “less satisfactory visual results”. This assessment conflicts with the other reviewers’ evaluations, which include a high score (8, strong accept) from reviewer bkyd, “The quality of the generated videos is good.” from reviewer Hj7n, and “The contributions of this work are innovative” from reviewer Tbwe. We believe that an assessment based on subjective criteria, such as visual quality, does not fully capture our contributions.

We kindly request a reconsideration of the assessment using objective and descriptive criteria.

Review
Rating: 6

The authors introduce FIFO-Diffusion, a method to enable pre-trained video diffusion models to generate infinitely long videos via the proposed diagonal denoising. In diagonal denoising, in contrast to conventional denoising, the noise levels of the frames in a window form a decreasing sequence. This way, at each denoising step, the clean frame at the head of the queue can be dequeued and a new noisy frame can be enqueued, allowing for generating sequences longer than the ones the model was trained on (a rough sketch of this queue mechanism follows the list below). The authors additionally propose two techniques to further improve the quality of the generated videos:

  1. Latent partitioning to increase the size of the queue and hence decrease the training-inference gap of diagonal denoising.
  2. Lookahead denoising to allow the denoising process to benefit from the difference in the noise levels of the denoised latents and let the noisier latents leverage the information in the less noisy latents to reduce the prediction error.
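A rough sketch of the queue mechanism described above (hypothetical `denoiser` interface; queue warm-up from the base model is omitted, so this is only an illustration of the idea, not the released implementation):

```python
# Hedged sketch of diagonal denoising with a FIFO queue (hypothetical interface).
import torch

def fifo_diffusion(denoiser, prompt, num_frames_out, f, timesteps):
    """timesteps: decreasing noise levels [tau_f, ..., tau_1], one per queue slot."""
    # The queue holds f latents whose noise levels decrease from tail (pure noise)
    # to head (almost clean); warm-up of the queue from a short clip generated by
    # the base model is omitted here for brevity.
    queue = [torch.randn(4, 64, 64) for _ in range(f)]
    outputs = []
    while len(outputs) < num_frames_out:
        # One diagonal step: slot i moves from noise level tau_{i+1} to tau_i.
        latents = denoiser(torch.stack(queue), timesteps, prompt)
        queue = list(latents)
        # The head has reached tau_0 = 0: dequeue it as a finished frame ...
        outputs.append(queue.pop(0))
        # ... and enqueue a fresh noisy latent at noise level tau_f.
        queue.append(torch.randn(4, 64, 64))
    return outputs
```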

Strengths

The paper considers a very interesting and challenging problem of enabling video generation models to generate long videos. The proposed method is simple enough to be directly applied to a variety of video generation models. The paper is well-written, all the components are well-motivated and supported by theory and experiments. The quality of the generated videos is good.

Weaknesses

  1. Prior work: Diagonal denoising has been first proposed in [1] for motion generation and in [2] for video generation. In both works, the models are trained with support for different noise levels. A discussion of these works is missing in the paper. I understand that by the submission time those works were only published on arXiv, but a discussion would facilitate a proper contextualization of the proposed method within the field.

  2. Limitations: The main limitation of the method, as discussed by the authors, is the training/inference gap. And the authors state that training with the support of different noise levels could improve the performance. However, if the gap didn't exist (given that there are models that train with different noise levels) the solutions that form the main contribution of the paper would not be needed. Hence, such a statement weakens the contributions.

  3. Evaluation: Quantitative comparisons are limited. The only quantitative comparison (except for human evaluation) is done through FVD and shows that the prior work (FreeNoise) performs better than the proposed method. Moreover, the FVD of FreeNoise is not only lower than the FVD of FIFO-Diffusion (also mentioned by the authors), but is also more stable, while there is a growing trend in the FVD of FIFO-Diffusion with time. This shows that the quality of the generated sequences degrades over time. The authors claim that this is due to the lack of motion in the videos generated with FreeNoise (line 454). Some additional metrics measuring both the temporal consistency and the motion magnitude would better illustrate the tradeoff.

[1] Zhang et al., Tedi: Temporally-entangled diffusion for long-term motion synthesis, arxiv 2023.

[2] Ruhe et al., Rolling Diffusion Models, arxiv 2024.

Questions

  1. Latent partitioning: The motivation to do latent partitioning is to decrease the training/inference gap by making the timestamp differences within the window smaller. However, by increasing the size of the window by n the number of frames a certain latent interacts with throughout the denoising process also increases by the factor of n. This also could positively influence the temporal consistency of the generated videos. Therefore, it is not completely clear, whether the improved quality comes from the decreased training/inference gap or from the increased window size. Could the authors provide more details regarding this?

  2. Lookahead denoising: It is not clear from the paper, how exactly the ground truth score is obtained in eq. 7. Could the authors explain this better?

In the current form I am leaning to rejecting the paper, but I would love to raise my score, if the authors address my concerns (in Weaknesses and in Questions) in the rebuttal.

Limitations

More clarification would be preferred in the limitations section (see Weaknesses). Are there other limitations? E.g. violation of some basic assumptions made in the paper?

Author Response

[W1] Prior work: Relevance to Tedi and Rolling Diffusion Model

Thank you for bringing this to our attention. Since we became aware of Tedi [R5] and Rolling Diffusion Models [R1] only after the ICML conference, which took place subsequent to the submission, we gladly plan to cite [R5] and [R1], acknowledging their relevance to our work.

Both [R5] and [R1] aim to train diffusion models from scratch, while our focus is on achieving this without additional training. To this end, we introduce latent partitioning and lookahead denoising specifically to realize a training-free approach. Particularly in the text-to-video (T2V) generation domain, we believe that training-free approaches are extremely valuable. T2V models require substantial computational resources (GPUs and storage) for preparing text-video pair datasets and large-scale training. By avoiding additional training, our method provides a more accessible and efficient solution for generating high-quality videos.

Additionally, we have demonstrated differences between [R1] and our work in terms of sliding window methodology and the target task in the third and fourth paragraphs of the global rebuttal. We kindly ask you to refer to the global rebuttal for further details.

[W2] Limitations: Beyond Gap Bridging

The benefits of both latent partitioning and lookahead denoising extend beyond addressing the training-inference gap. Lookahead denoising leverages the advantage of diagonal denoising, where noisier frames benefit from clearer frames. This benefit persists regardless of the gap due to the inherent similarity of neighboring video frames. As shown in Table 2, lookahead denoising enhances the baseline model's performance, suggesting its usefulness even without the gap.

While latent partitioning aims to reduce the training-inference gap, it also offers two additional significant advantages: facilitating parallelized inference on multiple GPUs and reducing discretization error, as mentioned in L170-176. Although the first advantage may diminish without the gap, the remaining two benefits, particularly parallelizable inference, are crucial in practice for reducing inference time, as evidenced in Table 1.

[W3] Evaluation: Quantitative Results

Thank you for your valuable comments. We believe that measuring both quantitative motion magnitude and FVD scores supports our claim in L454: “FVD is more favorable to algorithms that generate nearly static videos due to the feature similarity to the reference frames.”

In response, we measure the average optical flow magnitude per video to compare the amount of motion between 512 videos generated by FIFO-Diffusion and FreeNoise. We provide the result plot in the PDF file of our global comment. The plot (Fig. A) illustrates that over 65% of videos generated by FreeNoise fall in the first bin, indicating significantly less motion compared to FIFO-Diffusion. In contrast, our method generates videos with a broader range of motion.
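For clarity, the motion statistic can be computed roughly as follows (a hedged sketch using OpenCV's Farneback optical flow, not necessarily the exact script behind Fig. A):

```python
# Hedged sketch: mean optical-flow magnitude of one generated video (OpenCV Farneback).
import cv2
import numpy as np

def mean_flow_magnitude(frames):
    """frames: list of HxWx3 uint8 RGB frames of a single video."""
    grays = [cv2.cvtColor(f, cv2.COLOR_RGB2GRAY) for f in frames]
    mags = []
    for prev, curr in zip(grays[:-1], grays[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mag, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
        mags.append(mag.mean())
    return float(np.mean(mags))  # one motion score per video, as histogrammed in Fig. A
```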

Additionally, following the recommendation from Reviewer bkyd, we evaluate long-video FVD (i.e., $\text{FVD}_{128}$) and IS on the UCF-101 dataset. We kindly encourage you to refer to response [W1] to Reviewer bkyd for detailed results.

Regarding the growing trend in the FVD, we argue that this does not imply instability in FIFO-Diffusion as we have discussed in L449-451. Since FIFO-Diffusion starts the process from a queue with corrupted latents, generated by the baseline model (the reference set for the FVD scores), it is natural for the FVD to increase rapidly in the first 16 windows.

[Q1] Latent partitioning: Window Size and Performance

We appreciate your insightful comments. We argue that the improved quality primarily stems from addressing the training/inference gap, while it might also be positively influenced by the increased window size. Although latent partitioning increases both the number of partitions ($n$) and the queue length ($nf$), the denoising process for each partition is conducted independently. This means that each frame only attends to temporal consistency within its partition at each step. Instead, increasing model capacity ($f$) would more directly enhance temporal consistency within the queue.

However, it is a valid point that the model might benefit from increasing $n$. Since the first-in, first-out mechanism mixes the partitions at every iteration, all frames in the queue implicitly interact with each other throughout the denoising process. This interaction could indeed contribute to maintaining overall consistency in the queue, as you suggested.
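For reference, a minimal sketch of one diagonal denoising step with latent partitioning; the `denoiser` interface is a simplifying assumption, and the per-partition loop is what allows parallelized inference on multiple GPUs (Table 1):

```python
# Hedged sketch: one diagonal step over a queue of n*f latents, denoised per partition.
import torch

def diagonal_step_with_partitioning(denoiser, queue, timesteps, n, f, prompt):
    """queue, timesteps: lists of length n*f; returns the queue after one step."""
    out = []
    for k in range(n):  # each partition could run on its own GPU
        part = torch.stack(queue[k * f:(k + 1) * f])   # f latents of one partition
        ts = timesteps[k * f:(k + 1) * f]              # their f consecutive noise levels
        out.extend(list(denoiser(part, ts, prompt)))   # one denoising step per partition
    return out  # FIFO shifting then mixes the partitions across iterations
```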

[Q2] Lookahead denoising: About Equation 7

We apologize if Equation 7 was unclear. We use these representations for coherence with the model output, $\epsilon_\theta$. Similar to the training loss of general diffusion models, we directly add noise sampled from a normal distribution to the latents and use this added noise as the ground truth for the score. Thank you for pointing this out; we will revise the equation in the future version.
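In standard DDPM-style notation (an illustration of the explanation above, not a verbatim restatement of Equation 7):

```latex
% Noised latent and the noise used as the ground-truth target for \epsilon_\theta
% (standard diffusion notation; an illustration of the explanation above).
\begin{aligned}
  z_t &= \sqrt{\bar\alpha_t}\, z_0 + \sqrt{1-\bar\alpha_t}\, \epsilon,
        \qquad \epsilon \sim \mathcal{N}(0, I), \\
  \epsilon &\;\text{serves as the ground truth compared against the model output } \epsilon_\theta(z_t, t).
\end{aligned}
```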

[R1] Ruhe et al., Rolling Diffusion Models, ICML, 2024.

[R5] Zhang et al., Tedi: Temporally-entangled diffusion for long-term motion synthesis, arXiv, 2023.

Comment

Dear authors,

The rebuttal has carefully addressed my concerns and I have raised my score accordingly.

Best regards, Reviewer

Comment

Dear Reviewer Hj7n,

We are glad that our rebuttal has addressed your concerns. We appreciate your thoughtful consideration and reviews.

Sincerely, Authors

Review
Rating: 4

Paper Overview:

This paper introduces the FIFO-diffusion pipeline designed for training-free autoregressive video generation.

  • The core method, diagonal denoising, processes latents at varying denoising levels sequentially.
  • To enhance the video generation capabilities with more extensive denoising steps, the authors have implemented latent partitioning, thereby increasing the denoise window size.
  • This modification, however, presents a challenge in maintaining consistency across video partitions, which the authors address through the approach of lookahead denoising, albeit at the cost of doubling the computation.

Personal Reflection:

Initially, I was very intrigued by this paper. However, upon a detailed review, I noticed some areas that could benefit from improvement. While the foundational ideas are quite promising, it's important to note that the paper does not adequately cite previous works related to its core techniques.

Strengths

  1. The paper is articulate and well-structured, making it accessible and informative.
  2. Visual aids like charts and figures effectively illustrate the achieved results and methodologies.

Weaknesses

Methods:

  1. The diagonal denoising technique bears a close resemblance to the prior work Rolling Diffusion Models (released on 12 Feb 2024), which unfortunately is not cited, raising concerns about novelty.
  2. The method section heavily focuses on latent partitioning and lookahead denoising. But I think there should be some simpler solutions to decoupling the denoising steps from the window size.
  3. The concept of lookahead denoising seems similar to strategies employed in MultiDiffusion techniques, specifically in managing noise overlap to maintain partition consistency. However, there is no citation of this work, which might be seen as an oversight in acknowledging related prior work.

Result:

  1. The core mechanism, "diagonal denoising," lacks sufficient experimental explanation, especially regarding its efficacy in training-free contexts.
  2. Table 1 compares efficiency and computational costs across different FIFO settings but fails to include video quality metrics. This omission makes it unclear which setting offers the best balance of performance and quality, raising questions about the advantage of using more GPUs if fewer can achieve faster results.

Questions

I am curious about how the rolling method initiates in the absence of historical video data. Could you explain how the process starts when there is no prior video latent available?

Limitations

The authors have discussed the limitations of this work.

Author Response

[W1] Relevance to Rolling Diffusion Models

Thank you for bringing this to our attention. We gladly plan to cite Rolling Diffusion Models [R1], acknowledging its relevance to our work. Please understand that we became aware of Rolling Diffusion Models [R1] after the ICML conference, which was subsequent to the submission. We kindly ask you to refer to the global rebuttal.

[W2] Latent Partitioning and Lookahead Denoising

Thank you for your reviews, but we found it challenging to fully understand your concerns regarding the second weakness. Could you please provide more details or examples to help us address this effectively?

From our understanding, you may feel our focus on latent partitioning and lookahead denoising is disproportionate to their importance. We would like to clarify that these two methods are crucial for generating high-quality long videos, especially in a training-free manner. While diagonal denoising introduces the concept of long video generation, latent partitioning and lookahead denoising address its limitations and leverage its advantages. The seamless integration of these methods significantly enhances the quality of video generation.

[W3] Relevance to MultiDiffusion

First, please let us clarify that lookahead denoising is not relevant to “maintaining consistency across video partitions” as you summarized in the paper overview and the third weakness. Lookahead denoising aims to make all frames take advantage of earlier, clearer frames, leveraging characteristics of diagonal denoising. Our first-in, first-out strategy naturally mixes up the partitions, thereby inherently maintaining consistency between them.

While lookahead denoising might seem similar to strategies employed in MultiDiffusion [R4] due to the overlap, the two methods are fundamentally different in purpose and implementation. The overlap in [R4] is a strategy to maintain consistency of the independent outputs, resulting in the optimization process described in equation (3) of [R4]. In contrast, our method does not require joint alignment of the outputs. The overlap is simply a resultant phenomenon and does not necessitate any optimization process.

Although the suggested methods are not directly relevant, both papers address similar challenges by proposing methods to handle out-of-range input sizes within their respective domains. With this in mind, we plan to cite [R4] in the future version. Thank you for bringing this to our attention.

[W4] Experimental Explanation of Diagonal Denoising in Training-free Context

We would like to clarify that diagonal denoising is not primarily designed for "training-free" algorithms but rather for generating long videos with fixed memory allocation. The key components that facilitate the training-free context in our techniques are latent partitioning and lookahead denoising, which seamlessly integrate with diagonal denoising.

We have demonstrated the experimental performance of diagonal denoising alone in our ablation study, both quantitatively (Table 2 in Section 5.2) and qualitatively (Figures 20 and 21 in Appendix H). The results indicate that, in training-free contexts, diagonal denoising by itself predicts less accurate noise (Table 2) and generates less favorable videos (Figures 20 and 21) compared to the complete version of FIFO-Diffusion.

We did not focus solely on diagonal denoising in other experiments because we believe that the success of FIFO-Diffusion in a training-free context is due to the integration of latent partitioning and lookahead denoising.

[W5] Table 1 with quality metric

We have provided qualitative and quantitative comparisons between different FIFO settings in our ablation studies (Section 5.4 and Appendix H), particularly in L289-291. We also encourage you to visit our project page (Section F. Ablation Study) to view the generated video samples. Table 1 is intended to demonstrate the fixed memory allocation and parallelizability of our method. For your reference, we report $\text{FVD}_{128}$ and IS on the UCF-101 dataset in Table A, following the recommendation from Reviewer bkyd.

Table A: $\text{FVD}_{128}$ and IS on the UCF-101 dataset. We sample 2048 videos at $256\times 256$ resolution utilizing the model from [R3] pretrained on the UCF-101 dataset, sampled by FIFO-Diffusion with and without lookahead denoising. LD denotes the inclusion of lookahead denoising.

| Method | $\text{FVD}_{128}$ ($\downarrow$) | IS ($\uparrow$) |
| --- | --- | --- |
| FIFO-Diffusion (n=4) | 1497.1 | 70.49±1.86 |
| FIFO-Diffusion (n=4, with LD) | 1476.5 | 72.87±2.97 |

[Q1] Initialization of FIFO-Diffusion

We initiate the FIFO-Diffusion process with corrupted latents generated by the baseline model, as discussed in L129-130 and L166-169 of our paper. To summarize, since latent partitioning requires $nf$ latents and the baseline model generates $f$ initial latents, we repeat the first frame of the initial latents to construct $nf$ latents.

FIFO-Diffusion can be initiated even without a prior video latent. You can start directly with $nf$ random noises and use a varying timestep schedule. Specifically, for $n=1$, the timestep at the first iteration would be $\tau = [\tau_f, \ldots, \tau_f]$, as all latents are random noise. Subsequently, the timestep will be updated at each iteration, such as $[\tau_{f-1}, \ldots, \tau_{f-1}, \tau_f]$, then $[\tau_{f-2}, \ldots, \tau_{f-2}, \tau_{f-1}, \tau_f]$, and so on. You can decode the dequeued latent into a frame after the $f^{\text{th}}$ iteration. This strategy can be similarly expanded to $n>1$ by substituting $f$ with $nf$.
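As an illustration of this warm-up schedule (a sketch with illustrative indices into $\tau_0 < \ldots < \tau_f$, for the $n=1$ case):

```python
# Hedged sketch of the varying warm-up schedule described above (n = 1 case).
def warmup_schedule(f, k):
    """Timestep indices of the f queue slots at warm-up iteration k (0-indexed, k < f)."""
    head = [f - k] * (f - k)              # leading slots share the same, noisier level
    tail = list(range(f - k + 1, f + 1))  # trailing slots already form the diagonal
    return head + tail

# Example with f = 4:
#   k = 0 -> [4, 4, 4, 4]
#   k = 1 -> [3, 3, 3, 4]
#   k = 2 -> [2, 2, 3, 4]
#   k = 3 -> [1, 2, 3, 4]  # after this iteration the head reaches tau_0 and is dequeued
```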

[R1] Ruhe et al., Rolling Diffusion Models, ICML, 2024.

[R3] Xin et al., Latte: Latent Diffusion Transformer for Video Generation, arXiv, 2024.

[R4] Omer et al., MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation, ICML, 2024.

Comment

I appreciate the authors' detailed response, especially regarding the quantitative evaluation and the explanation of the initialization process.

However, the response concerning W2 leaves me with some reservations. As Reviewer Hj7n pointed out, the motivation behind latent partitioning is to reduce the training/inference gap by minimizing the timestamp differences within the window. While increasing the window size, as suggested by Reviewer Hj7n, is one approach, I believe that sliding the denoising window every $n$ denoising steps could also extend the denoising steps from $f$ to $nf$ without requiring additional computational costs or more GPUs, as introduced by Latent Partitioning and Lookahead Denoising. Thus, while these methods are innovative, their necessity may be worth reconsidering.

In conclusion, the contributions of this work (diagonal denoising, which is inspired by Rolling Diffusion, along with the methods of Latent Partitioning and Lookahead Denoising) are innovative, but these latter methods may introduce complexity and potential drawbacks without clear necessity. Nevertheless, I appreciate the authors' efforts in clarifying the initialization process and providing detailed quantitative evaluations, and based on this, I have decided to maintain my final score.

Comment

We appreciate the effort you have dedicated to reviewing and replying to our rebuttal.

However, we respectfully argue that the suggested framework to extend denoising steps does not mitigate the training-inference gap. To describe the reason, we will follow your suggestion step by step, sliding a denoising window every $n \geq 2$ denoising steps. We define the time step schedule as $0<\tau_0<\tau_1<\ldots<\tau_{nf}=T$.

  1. Frames $1 \sim f$ with diffusion time steps $[\tau_{nf-f+1}, \tau_{nf-f+2}, \ldots, \tau_{nf}]$ will be denoised $n$ times without sliding, reaching time steps of $[\tau_{nf-f+1-n}, \tau_{nf-f+2-n}, \ldots, \tau_{nf-n}]$.

  2. Now we should slide the window, but we encounter two challenges. First, the foremost frame cannot be dequeued as it is not fully denoised. Second, adding a new random noise with the noise level of $\tau_{nf}$ results in an additional training-inference gap. Specifically, the queue now contains frames with time steps of $[\tau_{nf-f+2-n}, \tau_{nf-f+3-n}, \ldots, \tau_{nf-n}, \tau_{nf}]$, where the last two frames have a time-step difference of $n$ while the rest have a gap of $1$. Moreover, the total noise level difference increases from $f-1$ to $f+n-2 > f-1$.

  3. Assume that we repeat step 2, ignoring the aforementioned problems. Then, we will eventually have frames with time steps of $[\tau_{f}, \tau_{2f}, \ldots, \tau_{nf}]$ as a model input. The total time-step difference is now $(n-1)f$. In contrast, the time-step difference of latent partitioning is $f-1$, since frames with time steps of $[\tau_{kf+1}, \tau_{kf+2}, \ldots, \tau_{kf+f}]$ are the model input.

Consequently, while your suggestion will increase denoising time steps, it cannot tackle the training-inference gap, which is latent partitioning’s motivation. Moreover, although your suggestion and latent partitioning require the same amount of computation, only our method allows parallelized inference on multiple GPUs.
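To make the contrast concrete, a small numeric illustration of the two total time-step gaps above (arbitrary example values of $f$ and $n$, not the paper's settings):

```python
# Hedged numeric illustration of the gap argument above.
f, n = 16, 4

# Sliding the window every n steps eventually yields inputs [tau_f, tau_2f, ..., tau_nf]:
gap_sliding_every_n = (n - 1) * f      # 48

# Latent partitioning keeps inputs [tau_{kf+1}, ..., tau_{kf+f}] within one partition:
gap_latent_partitioning = f - 1        # 15

print(gap_sliding_every_n, gap_latent_partitioning)
```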

Meanwhile, lookahead denoising is a component independent of latent partitioning and can be applied to diagonal denoising alone. It clearly leverages the advantage of diagonal denoising that noisier frames benefit from clearer frames. We have demonstrated its effectiveness in the ablation study (Section 5.4) both qualitatively and quantitatively. Moreover, we would like to clarify that all components in our paper have their own roles, and their necessity is well supported by theoretical and empirical analysis.

Finally, we want to clarify that we would rather consider our paper concurrent work with [R1]. We also submitted this paper to ICML 2024 with submission number 9782. Moreover, while there are some similarities, the two papers also have many differences, as we described in the global response.

Comment

Dear reviewer Tbwe,

Thank you for your response to the rebuttal. We appreciate the time and effort you have put into your evaluation.

We understand that you had concerns regarding our initial rebuttal, and we have carefully addressed your points in our responses. We believe that this latest rebuttal discusses the issues you raised and provides further clarification on the matters you pointed out. We kindly ask you to review our updated rebuttal and share your thoughts.

Sincerely,

Authors

Review
Rating: 8

Given a “base” diffusion-based video generation model, this paper shows how one can use “diagonal denoising” to generate arbitrarily long videos which have similar visual quality as the base model and obey conditioning. Notably, the proposed method does not require any training, which makes it applicable in principle to any video diffusion model. The idea is to input a sequence of frames with a sequence of different noise scales — the denoiser is then applied to “map” these frames at noise levels [tau_1 … tau_f] -> [tau_0 … tau_{f-1}]. Frames are kept in a queue — they are enqueued at noise scale tau_f, and by applying the denoiser iteratively on this queue they eventually reach noise level tau_0 = 0, at which point they are dequeued (and “done”). Some tricks are proposed for making this basic idea more accurate (such as partitioning the noise scale range and applying the above trick separately for the different partitions).

Strengths

This method is a proof-of-concept that given a base video diffusion model capable of generating a short clip, we can always use that same model to generate arbitrarily long clips without needing to retrain.

It is not the first such paper (two competitors include Freenoise and Gen-L) but this paper is particularly simple compared to the other methods and outperforms Freenoise in user-studies (which outperforms Gen-L). I’m admittedly less familiar with Freenoise and Gen-L, but I suspect that compared to these other papers, this approach may also be more architecture-independent as it seems applicable to any video diffusion model; moreover, compared to Freenoise, this approach does not require attending to all frames, so it runs in bounded memory.

Weaknesses

Not a weakness per-se but a recommendation: One thing that would make this paper stronger would be to include long-video FVD evaluations on datasets like UCF and Kinetics (see, e.g. the recent TECO paper). This would establish a baseline of performance on those datasets showing a “theoretical maximum FVD” that can be achieved by using a model that was not trained explicitly on long videos and would be a valuable contribution to the community.

Another suggestion for readability is to improve the description of the method in the introduction. It is hard to understand why there would be a train-test discrepancy until Section 4, and I kept thinking that the authors were over-claiming that it wasn't necessary to train until I understood the details of the approach.

Questions

There is a recent paper on “Rolling diffusion” by Ruhe et al that seems related to this work. I wonder if authors could comment on the relevance.

Limitations

yes

Author Response

[W1] Long-video FVD Evaluation

We appreciate your invaluable recommendation. Including long-video evaluations on specific datasets would definitely strengthen our paper by evaluating general performance of our techniques for video generation.

In response, we have evaluated the class-conditional long video generation performance on the UCF-101 dataset by computing the Fréchet Video Distance (FVD) and Inception Score (IS). Specifically, we computed $\text{FVD}_{128}$ and IS following the details and implementation outlined in [R2], utilizing the model from [R3] pretrained on the UCF-101 dataset. The results are shown in Table A.

Table A: $\text{FVD}_{128}$ and IS on the UCF-101 dataset. We sample 2048 videos of 128 frames at $256\times 256$ resolution utilizing the model from [R3] pretrained on the UCF-101 dataset, sampled by FIFO-Diffusion. We also compute the IS of [R3] with DDIM 64 steps. Note that we cannot compute $\text{FVD}_{128}$ of [R3], as [R3] generates 16-frame videos.

| Method | $\text{FVD}_{128}$ ($\downarrow$) | IS ($\uparrow$) |
| --- | --- | --- |
| Latte [R3] | - | 69.42±3.90 |
| Latte + FIFO-Diffusion | 1476.5 | 72.87±2.97 |

[W2] Readability

Thank you for your valuable suggestion, and we apologize for any confusion. We will add a brief explanation in the introduction to clarify the training-inference gap with diagonal denoising. This should help readers understand the rationale behind our approach earlier and avoid any misunderstanding. We appreciate your feedback and will make these adjustments to enhance readability.

[Q1] Relevance to Rolling Diffusion Models

Thank you for bringing this to our attention. We gladly plan to cite Rolling Diffusion Models [R1], acknowledging its relevance to our work, as both [R1] and diagonal denoising process latents with different noise levels in a sliding-window manner. Furthermore, we have described the differences between the two approaches in terms of training burden, sliding-window methodology, and the target task in the global rebuttal. We kindly ask you to refer to the official comments for further details.

[R1] Ruhe et al., Rolling Diffusion Models, ICML, 2024.

[R2] Ivan et al., StyleGAN-V: A Continuous Video Generator with the Price, Image Quality and Perks of StyleGAN2, CVPR, 2022.

[R3] Xin et al., Latte: Latent Diffusion Transformer for Video Generation, arXiv, 2024.

Comment

Dear reviewer bkyd,

We want to express our sincere gratitude for your positive feedback on our paper. Your thoughtful comments and insights have been incredibly valuable to us.

As our paper is currently under discussion, your continued support would be greatly appreciated. Your perspective is instrumental in highlighting the strengths of our work and advancing the discussion in a constructive direction.

Thank you once again for your support and consideration.

Sincerely,

Authors

Author Response

Global Rebuttal

1. Relevance to Rolling Diffusion Models

We became aware of Rolling Diffusion Models [R1] after the ICML conference, which was subsequent to the submission. We gladly plan to cite [R1], acknowledging its relevance to our work, as both techniques process latents with different noise levels in a sliding window manner. We will discuss the differences between the two works: burden of training, the sliding window mechanism, and the target task.

First, [R1] and our work aim to realize each technique from different perspectives: training-based and training-free. [R1] trains diffusion models from scratch to simultaneously process frames with different noise levels, while our work focuses on leveraging promising pretrained models. In this sense, we introduce latent partitioning and lookahead denoising that effectively mitigate the training-inference gap and exploit the benefits from diagonal denoising. Especially in text-to-video generation, training-free approaches are much more valuable in terms of saving computational resources. Our method offers a more accessible and efficient solution for high-quality video generation.

Second, there are clear discrepancies regarding the ‘sliding’ implementation. [R1] introduces two independent iteration axes (global and local diffusion time) during inference, corresponding to window shifting and the diffusion timesteps, respectively. Consequently, [R1] adopts a double nested loop for sampling (Algorithm 2 in [R1]) and requires $NT$ iterations to generate $N$ frames. In contrast, our method employs an efficient entangled axis corresponding to both window shifting and diffusion steps, resulting in $N$ iterations to generate $N$ frames. While both works share the key concept of a ‘sliding window’, their implementations of sliding are substantially different.
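As a back-of-the-envelope illustration of the iteration counts claimed above (illustrative numbers only, not from either paper):

```python
# Hedged illustration of the iteration counts implied above.
N, T = 128, 64                     # N frames to generate, T diffusion steps per frame
rolling_diffusion_iters = N * T    # double nested loop over window shifts and timesteps
fifo_diffusion_iters = N           # entangled axis: each iteration shifts and denoises once
print(rolling_diffusion_iters, fifo_diffusion_iters)  # 8192 vs 128
```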

Lastly, the two works address distinct areas within the video generation domain: video prediction and text-to-video. [R1] solely focuses on a video prediction task, extending the existing videos from the datasets. However, our goal is to facilitate text-to-infinite video generation. Our work does not require any existing videos or images from the dataset, relying solely on text for inference.

By clarifying these points, we hope to address your concerns and highlight the unique contributions of our approach.

2. Motion Metric Comparison between FIFO-Diffusion and FreeNoise

To address the third weakness of Reviewer Hj7n, we provide a plot comparing the amount of motion between FIFO-Diffusion and FreeNoise in the attached PDF file.

[R1] Ruhe et al., Rolling Diffusion Models, ICML, 2024.

Final Decision

This paper introduces a novel, training-free approach to generate arbitrarily long videos using a "diagonal denoising" technique applied to pretrained diffusion-based text-to-video generation models. The paper is well-written and clearly structured, making it both accessible and informative.

The reviewers generally appreciated the clarity and organization of the paper, but several concerns were raised during the review process. One common point of discussion was the missing citation of related work, specifically "rolling diffusion." The authors addressed this in their rebuttal, clarifying the differences between their approach and "rolling diffusion." The authors also noted that "rolling diffusion" can be considered concurrent work, as it was not formally published before the NeurIPS submission deadline. I agree with this assessment, and the explanation provided by the authors is satisfactory.

Two reviewers questioned the necessity of using latent partitioning and lookahead denoising to address the training-inference gap in diagonal denoising. After the rebuttal, one reviewer found the authors' response convincing, while the other did not provide further feedback.

The reviewer who recommended rejection primarily challenged the paper on two fronts. First, they questioned the motivation for pursuing training-free long video generation given the current limitations of video generation models. Second, they criticized the quality of the generated videos, noting that they often consist of simple, repetitive actions. However, it's important to recognize that the proposed method is orthogonal to the baseline model's quality. Furthermore, the multi-prompt results shown in Figure 6(c) and Appendix E demonstrate that the proposed method has the capability to control motion and generate long-term ordered sequences using varying text prompts.

While I acknowledge that the generated videos still exhibit flaws, such as issues with object motion and flickering effects, I believe the exploration of training-free methods for long video generation is a valuable contribution that could bring interesting insights to the community. The potential for this approach to inspire future research in this area outweighs the current limitations.

Given the novelty of the approach and its potential impact on the field, I have decided to accept this paper.