PaperHub
Score: 7.8/10 · Spotlight · NeurIPS 2025
4 reviewers · ratings: 5, 5, 4, 5 (min 4, max 5, std dev 0.4) · confidence: 4.3
Novelty: 3.3 · Quality: 3.3 · Clarity: 3.0 · Significance: 3.3

Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

OpenReview · PDF
Submitted: 2025-04-07 · Updated: 2025-10-29
TL;DR

We introduce a new method to train autoregressive video diffusion models by performing autoregressive self-rollout with KV caching during training.

Abstract

Keywords
Video generation · Autoregressive models · Diffusion models · World models

Reviews and Discussion

Review
Rating: 5

This paper introduces Self Forcing, a post-training approach for autoregressive video diffusion models. The key idea is to simulate inference-time behavior during training by letting the model condition on its own generated frames rather than ground truth. This helps close the gap between training and inference, which has long been a challenge due to exposure bias. It combines few-step diffusion and stochastic gradient truncation to efficiently expose models to their own prediction errors and optimize a holistic video-level distribution-matching objective.
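For readers less familiar with the mechanics, here is a minimal PyTorch-style sketch of the training loop summarized above; all object and function names (`generator`, `denoise`, `init_kv_cache`, `dm_loss_fn`) are hypothetical placeholders, not the authors' actual implementation.

```python
import torch

# Minimal sketch of a Self Forcing-style training step (hypothetical names;
# not the authors' code). The model rolls out a video chunk by chunk using
# its own previous outputs as context (via a KV cache), runs only a few
# denoising steps per chunk, backpropagates only through the final denoising
# step, and is supervised with a holistic video-level distribution-matching
# loss over the full rollout.
def self_forcing_step(generator, dm_loss_fn, prompt, num_chunks=7, num_steps=4):
    kv_cache = generator.init_kv_cache()          # causal cache over past chunks
    chunks = []

    for _ in range(num_chunks):
        x = torch.randn(generator.chunk_shape)    # start each chunk from noise
        for s in range(num_steps):
            last = (s == num_steps - 1)
            # Gradient truncation: only the final denoising step tracks gradients.
            with torch.set_grad_enabled(last):
                x = generator.denoise(x, step=s, prompt=prompt, kv_cache=kv_cache)
        chunks.append(x)
        # Detach history so earlier chunks / KV entries receive no gradients.
        kv_cache = generator.update_kv_cache(kv_cache, x.detach())

    video = torch.cat(chunks, dim=0)
    # Video-level distribution matching (e.g., a DMD/SiD/GAN-style critic).
    return dm_loss_fn(video)
```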

Strengths and Weaknesses

Strengths

  1. This paper is the first to introduce autoregressive self-rollout in video diffusion post-training, fully aligning training and inference distributions to eliminate exposure bias.
  2. It uses holistic video-level distribution-matching losses (DMD/SiD/GAN), so the model learns to correct its own residual errors, improving both visual fidelity and temporal consistency.
  3. It is intuitive why this works: by matching the model’s output to the true distribution at the video (holistic) level, the method enforces a global constraint on cumulative errors.
  4. This paper also demonstrates broad success across VBench, human preference studies, and efficiency benchmarks, outperforming both autoregressive and bidirectional diffusion baselines.
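To make point 3 concrete, a hedged sketch of a DMD-style, video-level distribution-matching objective (illustrative; the paper's exact losses may differ in detail): the rollout distribution over full videos is matched to the data distribution, so errors accumulated in any frame are penalized jointly.

```latex
% Sketch of a DMD-style holistic objective over full videos x_{1:T}
% (illustrative; the paper's exact objectives may differ).
\mathcal{L}_{\mathrm{DM}}(\theta)
  = D_{\mathrm{KL}}\!\big(p_\theta(x_{1:T}) \,\|\, p_{\mathrm{data}}(x_{1:T})\big),
\qquad
\nabla_\theta \mathcal{L}_{\mathrm{DM}}
  \approx \mathbb{E}\!\big[\big(s_{\mathrm{fake}}(\tilde{x}_{1:T}) - s_{\mathrm{real}}(\tilde{x}_{1:T})\big)\,
    \tfrac{\partial x_{1:T}}{\partial \theta}\big]
```

Here $\tilde{x}_{1:T}$ denotes a noised copy of the generated video and $s_{\mathrm{fake}}, s_{\mathrm{real}}$ are score estimates for the generator's and the data distribution, as in DMD-style training.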

Weaknesses

  1. Gradient truncation (only backpropagating the final denoising step) and detaching historical KV embeddings save memory but may weaken the model’s ability to capture long-range, cross-frame dependencies.
  2. The comparison of the DMD, SiD, and GAN objectives is limited; the paper does not clearly identify which loss is best for which scenario.
  3. Key hyperparameters (diffusion steps, truncation point, KV-cache size) are highly sensitive, yet the paper offers little guidance on their tuning.

Questions

  1. Did the authors experiment with applying varying noise intensities to individual frames before denoising and then enforcing a holistic distribution-matching constraint? How do different noise levels impact the mitigation of cumulative errors?

Limitations

yes

Justification for Final Rating

The authors have addressed my concerns. I keep my initial rating.

Formatting Issues

None

Author Response

We thank the reviewer for their valuable feedback. We are encouraged by the overall positive assessment, with reviewers agreeing that our work addresses a "critical and long-standing problem" [Vwja] and is "very well motivated" [2kqq]. The method is "clearly presented" [9RYz], "intuitive" [N5mF], and the paper is “well-written and easy to follow” [2kqq]. Our "quantitative results are solid" [9RYz] and were praised as a "thorough empirical section" [Vwja], demonstrating "broad success across VBench, human preference studies, and efficiency benchmarks" [N5mF].

We will now address the concerns and questions raised to further clarify our contributions.

  • Gradient truncation: We implemented more heavily optimized activation checkpointing and additional training parallelism to support Self Forcing training without gradient truncation. However, we find that models trained without gradient truncation do not produce better results than models with truncation. We hypothesize that models already learn representations that capture long-range dependencies during the pretraining stage, so that enabling long-range gradient flow during post-training becomes unnecessary. We will add these discussions in the final version of the paper.

  • Comparison of DMD/SiD/GAN: We find GAN training is generally less stable and requires more hyperparameter tuning efforts than DMD/SiD. DMD is preferred over SiD when using a larger teacher model. Unlike SiD, DMD does not require backpropagating through the teacher and is therefore much more efficient. We will add a discussion in the final version.

  • Hyperparameter sensitivity: Our method is not sensitive to the mentioned hyperparameters. The maximum KV-cache size is determined by the base model, and we always apply truncation to the last denoising step to minimize memory consumption. We do not observe better results with full backpropagation (without truncation), despite larger memory consumption. The number of diffusion steps is a hyperparameter that controls the quality/speed trade-off. Our method does not fail catastrophically when using fewer denoising steps. For example, we trained a 3-step model and it obtains a VBench total score of 84.26, only slightly lower than the 4-step version (84.31).

  • Varying noise intensities: We interpret the question as follows: instead of using clean self-generated context frames as conditioning (as in the current Self-Forcing algorithm), can we add noise to self-generated context frames during training and inference (maybe with progressive intensity levels similar to rolling diffusion), and does it further reduce error accumulation?

    We do not find adding noise to context frames helpful in our case. Intuitively, when there is a train-test distribution mismatch (e.g., in teacher forcing), adding Gaussian noise to both distributions makes them have larger overlap and thereby mitigates cumulative errors. However, since our method does not have a train-test distribution mismatch, adding noise to the context is no longer necessary and only introduces drawbacks as discussed in L45-47.

Comment

Thanks for the authors' rebuttal that clarifies many of my concerns. I keep my initial rating.

Review
Rating: 5

This paper presents Self-Forcing, a diffusion framework for long video generation. Different from prior approaches such as diffusion forcing or teacher forcing, Self-Forcing aims to remove the gap between training and inference by explicitly performing rollout during training. To reduce the computation and memory burden of this training strategy, the paper introduces several tricks, e.g., using few-step diffusion models and only backpropagating through a single random timestep per frame. The paper also adapts the KV caching trick (which has been popular in LLMs) for long video generation. By incorporating all of these components, the paper shows Self-Forcing can generate long videos in real-time with high quality, outperforming prior long video generation methods.

Strengths and Weaknesses

Strengths

  • The paper is generally well-written and easy to follow.
  • The paper is very well motivated, as the length limit of existing video generation models is one of the biggest bottlenecks for them to be extended to real-world videos.
  • Removing the discrepancy between training and inference is important and a reasonable approach.
  • Some components seem technically novel (e.g., how to handle gradients).

Weaknesses

  • There are very limited qualitative comparisons, making it difficult to know how the proposed method generates better long videos compared with other baselines. In particular, I wonder how Self-Forcing can be qualitatively better than diffusion forcing or teacher forcing, but I cannot find qualitative comparisons. Could the authors provide more comparisons in general (e.g., with other baselines such as CausVid) as well as with DF or TF?
  • I think real-time generation seems great, but the provided examples do not seem to be long, as they are just 10-second videos.
  • Minor: [1] also deals with chunk-wise autoregressive long video generation and exposure bias, but the paper misses a discussion of it.
  • Minor: Could the authors provide memory consumption in training and inference?

[1] Yu et al., MALT Diffusion: Memory-Augmented Latent Transformers for Any-Length Video Generation, 2025

Questions

Please see my weaknesses

Limitations

Yes

Justification for Final Rating

I think this paper tackles an important problem with a quite elegant solution. I recommend acceptance for this paper.

Formatting Issues

N/A

Author Response

We thank the reviewer for their valuable feedback. We are encouraged by the overall positive assessment, with reviewers agreeing that our work addresses a "critical and long-standing problem" [Vwja] and is "very well motivated" [2kqq]. The method is "clearly presented" [9RYz], "intuitive" [N5mF], and the paper is “well-written and easy to follow” [2kqq]. Our "quantitative results are solid" [9RYz] and were praised as a "thorough empirical section" [Vwja], demonstrating "broad success across VBench, human preference studies, and efficiency benchmarks" [N5mF].

We will now address the concerns and questions raised to further clarify our contributions.

  • Long video generation: The goal of this paper is not to improve long video generation quality in the extrapolation setting. Our work aims to address exposure bias and to reduce error accumulation for videos generated within the training context length. This is a distinct problem from extrapolation (i.e., generating videos substantially longer than the training context), where we do not claim strong performance, as discussed in our limitations (L318-320). Extrapolation introduces a second distributional mismatch (of video length), in addition to the distribution mismatch that we address (ground-truth vs. generated history). While resolving exposure bias is a necessary step for long video generation, it is not sufficient by itself. To truly enable high-quality long video generation, future work needs to either scale up the video length during training, or investigate orthogonal techniques that enable better extrapolation.

  • Visual comparisons with DF/TF: We will add additional qualitative comparisons with DF/TF in the final version of the paper (we are not allowed to add them in rebuttal due to NeurIPS policy restrictions). Visually, videos generated with DF/TF tend to exhibit significantly more error accumulation and often suffer from over-saturation over time.

  • MALT Diffusion reduces error accumulation by adding noise to context frames during training. This was also used in earlier works including Diffusion Forcing and GameNGen, and we have some discussions in L44-47. We will add citations to MALT Diffusion in the final version.

  • Memory usage: We use mostly bfloat16 precision for both training and inference. The peak allocated memory during training (batch size 1) is 53.2GB. During inference, peak memory is 23.9GB without text encoder offloading and 13.0GB with offloading. We will add these details in the final version.

Comment

Thanks for the response. I think this is a good paper in the video diffusion literature. Please revise the manuscript accordingly (e.g., adding memory consumption) if the paper is accepted. I've raised my score.

Review
Rating: 4

The paper proposes Self-Forcing, a fine-tuning technique to enable autoregressive generation in video diffusion models. The method starts with fine-tuning the base model with causal masking to initialize it with some autoregressive capability. Then it follows with training on autoregressively self-generated data using distribution matching losses to make sure the generated chunks form a plausible temporally consistent video as a whole. KV caching and few-step distillation are used to speed up the training and the inference with the model. Self-forcing achieves better results than the prior work on VBench and human preference.

Strengths and Weaknesses

Strengths

  1. The motivation and the method itself are clearly presented.
  2. Applying KV caching to video models is a meaningful and valuable contribution.
  3. The quantitative results are solid.

Weaknesses

  1. The proposed Self-Forcing method is essentially a distillation or fine-tuning technique, whereas Teacher Forcing (TF) and Diffusion Forcing (DF) were originally designed for training from scratch. Presenting these approaches as direct counterparts can be confusing and makes the overall story harder to follow.
  2. The method is presented as addressing exposure bias and enabling long video generation, yet no quantitative long-generation results are provided (only some qualitative examples in the supplementary material’s limitations section). Moreover, the authors acknowledge that the model struggles with sequences longer than those seen during training. This seems like an inherent limitation of the approach, as the distributional losses do not seem to go beyond the native generation window of the base model.
  3. The visual results, aside from the local attention ablation, do not clearly demonstrate the benefits of the proposed method.

Questions

  1. I recommend moving the details about the different training objectives to the main paper, as they are essential for understanding key aspects of the method, such as how the pseudo ground truth scores are obtained and how the critic is trained.
  2. Could you clarify how many rollouts are used during training? What is the total number of frames? Does this total align with the native number of frames of the base model or the critic?

I think in the current state the presentation weaknesses outweigh the method's strengths, therefore I am leaning towards rejection. However, I am open to revising my score if the rebuttal adequately addresses my concerns, as I do believe the contributions (the achieved efficiency through KV caching and autoregressive distillation) are noteworthy.

Limitations

yes

Justification for Final Rating

The authors did a good job during the rebuttal and addressed most of my concerns in both the rebuttal and the follow-up discussion phase. Enabling AR generation in modern VDMs offers valuable benefits, and the proposed method provides a reasonable solution to this problem. The application of KV-caching for videos is a meaningful contribution.

Formatting Issues

I have noticed no major formatting issues.

Author Response

We thank the reviewer for their valuable feedback. We are encouraged by the overall positive assessment, with reviewers agreeing that our work addresses a "critical and long-standing problem" [Vwja] and is "very well motivated" [2kqq]. The method is "clearly presented" [9RYz], "intuitive" [N5mF], and the paper is “well-written and easy to follow” [2kqq]. Our "quantitative results are solid" [9RYz] and were praised as a "thorough empirical section" [Vwja], demonstrating "broad success across VBench, human preference studies, and efficiency benchmarks" [N5mF].

We will now address the concerns and questions raised to further clarify our contributions.

  • Presentation versus TF/DF: We believe comparing Self-Forcing directly to Teacher-Forcing (TF) and Diffusion-Forcing (DF) is useful for understanding our core contribution, as the fundamental axis of comparison is the source of the conditioning context during training. As illustrated in Figure 1, these methods represent three distinct choices for this context:

    • Teacher Forcing: Conditions on clean, ground-truth frames.
    • Diffusion Forcing: Conditions on noisy, ground-truth frames.
    • Self Forcing (Ours): Conditions on the model's own previously generated frames.

    Although TF/DF were originally introduced for pre-training and Self-Forcing is described as a post-training algorithm for practical considerations, we argue that algorithms used at different stages can still form valid comparisons. As an analogy, while today GANs are often used only for post-training/distillation and diffusion models for pre-training, they are still routinely compared as alternative generative modeling paradigms.

    Moreover, the application of TF/DF is not limited to pre-training. Our key baseline, CausVid, applies DF in a post-training phase with a DMD objective. Our ablation studies (Table 2) apply all three paradigms in the same post-training setting. This ensures an apples-to-apples comparison to isolate the impact of the conditioning strategy—the central variable our paper investigates.

  • Long video generation: The goal of this paper is not to improve long video generation quality in the extrapolation setting. Our work aims to address exposure bias and to reduce error accumulation for videos generated within the training context length. This is a distinct problem from extrapolation (i.e., generating videos substantially longer than the training context), where we do not claim strong performance, as discussed in our limitations (L318-320). Extrapolation introduces a second distributional mismatch (of video length), in addition to the distribution mismatch that we address (ground-truth vs. generated history). While resolving exposure bias is a necessary step for long video generation, it is not sufficient by itself. To truly enable high-quality long video generation, future work needs to either scale up the video length during training, or investigate orthogonal techniques that enable better extrapolation.

  • Visual results not convincing enough: The visual results in Fig. 7 show much better quality than CausVid (less oversaturation). While videos obtained with Wan/SkyReels/MAGI-1 indeed sometimes have visually comparable quality, those methods are 150-400x slower than ours in terms of latency, so even matching their quality is still a significant win for our method.

  • Moving training details: We thank the reviewer for this suggestion and will move the training objective descriptions to the main paper in the final version.

  • Training rollout: We generate 21 latent frames (corresponding to 81 video frames) during training, which is aligned with the setting of the base model. The number of rollouts is 21 divided by the chunk size (which equals 3 or 1 in our experiments). We will add these details in the final version.
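As a small illustration of the rollout arithmetic in the last bullet (values taken from the response; variable names are ours):

```python
# Rollout bookkeeping as described above (illustrative only).
latent_frames = 21      # 21 latent frames correspond to 81 decoded video frames
chunk_size = 3          # 3 (or 1) latent frames generated per autoregressive step
rollout_steps = latent_frames // chunk_size

print(rollout_steps)    # 7 autoregressive rollouts of 3 latent frames each
```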

Comment

Thank you for the rebuttal. I appreciate the clarifications and the effort to address the concerns raised. I have also read the other reviews. However, I still have a few remaining concerns.


  • The paper frames exposure bias (i.e. error accumulation in autoregressive models due to the domain shift in the conditioning signal) as a core motivation, but since the model does not aim to generate beyond the base model’s native frame limit, it is unclear why exposure bias is a real issue in this setting. If everything fits in the base model’s window, parallel generation avoids the problem. This is also why I found it confusing to present this approach versus TF/DF, where exposure bias is more clearly relevant. Could the authors clarify this point?

  • I initially found the KV caching for video models to be one of the most interesting contributions of the work. In the paper (e.g., see the abstract and Figure 3) it is motivated by the need for extrapolation and long video generation. However, the rebuttal states that this is not the focus. This weakens the significance of KV caching as currently presented.

Comment

Thank you for your continued engagement and insightful follow-up questions. We appreciate the opportunity to provide further clarification.

  • You are absolutely correct that fully parallel, non-autoregressive diffusion models (like the base model) do not suffer from exposure bias. However, the central motivation for adopting an autoregressive (AR) framework is to enable applications that are fundamentally impossible with parallel generation, as discussed in our introduction (L18-28). These include:

    • Real-time streaming: Where video frames must be generated and displayed sequentially with minimal latency.
    • Interactive applications: Where user input can alter the course of the video during its generation.

    Once we commit to an AR paradigm to unlock these benefits, exposure bias immediately becomes a critical issue, even when generating videos within the original context length. Previous AR Diffusion models have been trained with TF/DF, which suffer from error accumulation even within training length due to exposure bias. Our Self-Forcing method addresses this fundamental problem for AR models.

  • Clarifications about KV caching:

    • KV caching's standard role is to make AR generation efficient, not specifically about extrapolation. This holds true both in our work and in its standard, default usage in LLMs, where it prevents quadratic complexity during AR generation.
    • While there have been prior works that use KV caching for video models (CausVid/MAGI-1), using KV caching during training is indeed one of our main contributions. This is a departure from standard practice where it is an inference-only optimization.
    • We additionally propose a rolling KV cache algorithm (abstract and Figure 3) specifically for the extrapolation scenarios. This algorithm aims to improve the efficiency of extrapolation (orthogonal to quality) and is also orthogonal to our main contribution (addressing exposure bias by performing rollout with KV caching during training, within the length limit).
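To illustrate the rolling KV cache idea in the last bullet, a minimal sketch follows; the fixed-window eviction policy and data layout here are our assumptions, not the paper's implementation.

```python
from collections import deque
import torch

# Illustrative rolling KV cache for chunk-wise autoregressive video generation.
# Keeps keys/values for at most `max_chunks` past chunks, so attention cost and
# memory stay bounded when generating beyond the training context length.
class RollingKVCache:
    def __init__(self, max_chunks: int):
        self.keys = deque(maxlen=max_chunks)      # oldest chunk evicted automatically
        self.values = deque(maxlen=max_chunks)

    def append(self, k: torch.Tensor, v: torch.Tensor) -> None:
        self.keys.append(k)
        self.values.append(v)

    def context(self):
        # Concatenate retained chunks along the token dimension for causal attention.
        return torch.cat(list(self.keys), dim=1), torch.cat(list(self.values), dim=1)
```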

We sincerely appreciate this discussion and will integrate these clarifications into the final manuscript to make these contributions clearer to all readers. We are happy to answer any remaining questions.

Comment

I would like to thank the authors for their clarifications. I agree that enabling AR generation in modern VDMs offers valuable benefits (such as supporting fine-grained controllable generation). While I may conceptually disagree that this behavior should be distilled from a chunk-wise model and would prefer it to be supported already during pretraining, I acknowledge that, given the availability of strong pretrained chunk-wise models, the proposed method presents a reasonable solution to the problem. I also recognize that the paper clearly presents the approach, and the application of KV-caching for videos is a meaningful contribution, as it not only reduces the computational load of generating the next frames, but also potentially allows for increased context length in extrapolation scenarios. My only suggestion would be to revise the writing to make the motivation more apparent. I will raise my rating accordingly.

Review
Rating: 5

The paper addresses a critical issue in autoregressive video generation setups: error accumulation in the network's predictions, leading to train/test domain gaps. The authors employ an Infinite Nature-style approach, where instead of teacher forcing (training on ground-truth sequences), they use student forcing or self forcing, in which the model also observes its own outputs during training. This is challenging because the model must backpropagate through all diffusion time steps and all previous generation steps. To address this, the authors propose several techniques: 1. using a few-step distribution-matching backbone and training scheme instead of a full ~50-step SDE/ODE model; 2. stochastic gradient truncation, where gradients flow only through the last denoising step of each frame, with earlier steps and prior frames detached. The model works well, achieves 17 FPS (on par with its backbone CausVid), and shows certain quality improvements.

Strengths and Weaknesses

Strengths:

-- The problem of error accumulation is a critical and long-standing problem of AR models, and the paper is well-motivated and proposes a reasonable solution.

-- The training is very efficient and could converge within hours.

-- The authors provide a thorough empirical section: VBench, human study, latency/throughput, ablations on loss functions and training paradigms.

Weaknesses:

-- The idea of student forcing/self forcing is not new. Prior autoregressive work such as Infinite Nature did some thorough experiments on such methods.

-- I personally work on this topic and tried the code as soon as it was released. However, I am not certain how much improvement is actually made by the proposed self-forcing training scheme. Efficiency-wise, the FPS and workload are exactly the same as CausVid's, so efficiency-based improvements cannot be claimed by this paper. Quality-wise, I am also not certain how much improvement is made on top of CausVid: the short 5-sec videos look on par with CausVid, and longer extrapolation still has the drifting issue, which I personally think is arguably more severe than in CausVid, potentially because self-forcing is trying to accommodate and ignore the error, but at the same time it also forgets some portions of the context.

Overall, this is a good paper that should be accepted, as it provides some interesting insights. However, I am not sure how "functional" the method actually is — the error accumulation is still evident, and it is hard to argue how much more robust the model gets compared with the backbone. Therefore, I vote for acceptance, but not a strong acceptance.

Questions

N/A

Limitations

N/A

Justification for Final Rating

I do not have many concerns initially and would keep my positive score.

Formatting Issues

N/A

Author Response

We thank the reviewer for their valuable feedback. We are encouraged by the overall positive assessment, with reviewers agreeing that our work addresses a "critical and long-standing problem" [Vwja] and is "very well motivated" [2kqq]. The method is "clearly presented" [9RYz], "intuitive" [N5mF], and the paper is “well-written and easy to follow” [2kqq]. Our "quantitative results are solid" [9RYz] and were praised as a "thorough empirical section" [Vwja], demonstrating "broad success across VBench, human preference studies, and efficiency benchmarks" [N5mF].

We will now address the concerns and questions raised to further clarify our contributions.

  • Novelty: We agree with the reviewer and acknowledge in our paper (L49, L70-74) that training with self-rollout was explored in early RNN+GAN-based models, including InfiniteNature. However, as Reviewer N5mF kindly noted, our work is the first to successfully adapt this principle to modern Autoregressive Diffusion Transformers. This adaptation is highly non-trivial, requiring careful design to combine few-step diffusion, KV caching, and gradient truncation. Successfully bridging this established concept to the AR-DiT architecture is a core technical contribution of our work, leading to significant gains in generation quality over older paradigms.

  • Improvement over CausVid: We want to clarify that Self-Forcing has significantly better quality than CausVid-Wan-1.3B implemented by the original author under the same-speed setting. This advantage is quantitatively supported by VBench scores (Total Score: 84.31 vs. 81.20), the user preference study (66.1% preference), the qualitative examples in the appendix (Fig. 7), and the HTML website in the supplementary material. We respectfully hypothesize that the reviewer's observation that our results “look on par with CausVid” may refer to a different implementation of CausVid built upon a larger closed-source model that runs significantly slower (9.4 FPS at 360P, compared with our 17 FPS at 480P). Our paper provides an apples-to-apples comparison against the public CausVid implementation to fairly evaluate the specific impact of our training paradigm on the same 1.3B architecture.

Comment

I do not have many concerns initially, and would keep my positive score.

Final Decision

The submission proposes a training approach termed Self Forcing, to solve the problem of error accumulation for autoregressive video diffusion models. Specifically, Self Forcing performs autoregressive self-rollout during training, denoising the next frame based on previously generated frames by itself.

The reviewers acknowledged the authors' rebuttal and all gave positive final ratings: three “accept” and one "borderline accept". After reviewing all the materials, the AC agrees with the ratings of the reviewers and suggests a solid accept. The AC also suggests a Spotlight recommendation considering the significance of autoregressive video generation and the effectiveness of the proposed Self Forcing approach.