PaperHub
Rating: 4.8/10 · Poster · 3 reviewers (scores 2, 3, 3; lowest 2, highest 3, std dev 0.5)
ICML 2025

Diffusion Adversarial Post-Training for One-Step Video Generation

OpenReview · PDF
Submitted: 2025-01-22 · Updated: 2025-07-24
TL;DR

Convert a slow diffusion model into a one-step, real-time, high-resolution video generator through adversarial post-training

Abstract

Keywords

video generation, image generation, diffusion, adversarial training, GAN, real-time, fast, distillation, generative model

Reviews and Discussion

Review (Rating: 2)

This paper proposes an adversarial fine-tuning framework for one-step T2I and T2V generation on a flow-based denoising model. It offers insights into the design of the discriminator and the generator and into how to stabilize training, and it achieves exciting performance compared with other few-step models.

Questions for the Authors

Please see above comments.

Claims and Evidence

Yes.

Methods and Evaluation Criteria

Yes.

Theoretical Claims

Yes, they look good to me.

Experimental Design and Analysis

The experiments seem insufficient: the main focus of this paper is one-step video generation, yet there are few experiments analyzing the video setting; the evaluation instead concentrates on the image models.

Supplementary Material

Yes, the demo with the video examples.

Relation to Prior Literature

The key ideas of this paper are most closely related to UFOGen (https://arxiv.org/abs/2311.09257), which presents adversarial fine-tuning of diffusion models for one-step denoising generation.

Essential References Not Discussed

The references are sufficient.

Other Strengths and Weaknesses

The strengths of this paper can be summarized as: 1. a concrete design of adversarial diffusion fine-tuning for recent diffusion models with the DiT architecture and the flow-matching objective; 2. the simplified R1 alternative for stabilizing model training.

The weakness of this paper is that there is not much experimental comparison for video generation. 720p one-step generation is exciting, but 5 seconds of one-step video generation would be even better.

Other Comments or Suggestions

Please add more experimental comparisons with other video models, at least the base model and the one-step model, on a public benchmark.

Author Response

Video Comparisons

We would like to clarify that existing research on diffusion acceleration has primarily focused on the image domain. To the best of our knowledge, no prior studies have proposed high-resolution, one-step video generation methods, and consequently, no suitable baselines exist for comparison. Therefore, in our submission, we evaluate our method using the image generation task, for which comparable prior works are available.

We acknowledge the reviewer’s perspective. Following the reviewer’s suggestion, we have included a comparison with the consistency distillation baseline and our base model. The evaluation follows the public VBench protocol. Specifically, we generate 5 videos for each of VBench's 946 prompts and report the VBench scores in the table below.

As the table shows, APT significantly outperforms the consistency distillation baseline at 1, 2, and 4 NFE. We observe that the consistency 1NFE baseline generates lower-quality results (e.g., blurry videos), and these issues persist even when increasing to 4 NFE. This result verifies the effectiveness of APT post-training compared to the baseline.

Even when compared to our base model at 50NFE (25 steps + CFG)—despite this being an unfair comparison—our APT 1NFE achieves a comparable total score (82.00 vs. 82.15) and even performs better on some metrics.

We will add the table and the analysis in the revised paper.

| VBench | Total Score | Quality Score | Semantic Score | Multiple Objects | Overall Consistency | Spatial Relationship | Temporal Style | Object Class | Dynamic Degree | Aesthetic Quality | Human Action | Scene | Imaging Quality | Background Consistency | Subject Consistency | Temporal Flickering | Motion Smoothness | Color | Appearance Style |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ours APT 1NFE | 82.00 | 84.21 | 73.15 | 61.89 | 26.11 | 71.78 | 23.20 | 92.41 | 62.78 | 63.04 | 90.20 | 41.19 | 69.05 | 96.38 | 97.26 | 98.57 | 98.54 | 88.98 | 19.31 |
| Ours APT 2NFE | 81.85 | 84.39 | 71.70 | 59.60 | 25.70 | 71.95 | 23.88 | 91.30 | 68.61 | 62.68 | 87.00 | 40.10 | 69.60 | 95.89 | 96.54 | 98.48 | 98.49 | 85.92 | 18.84 |
| Consistency 1NFE | 67.05 | 73.78 | 40.15 | 4.85 | 17.47 | 19.38 | 15.33 | 39.13 | 20.56 | 41.96 | 33.20 | 10.78 | 42.95 | 96.80 | 97.39 | 98.72 | 98.35 | 86.93 | 21.33 |
| Consistency 2NFE | 74.20 | 78.83 | 55.69 | 23.78 | 22.88 | 44.21 | 20.96 | 63.34 | 33.33 | 54.12 | 63.60 | 24.48 | 59.93 | 96.21 | 96.33 | 98.55 | 98.29 | 88.88 | 19.20 |
| Consistency 4NFE | 77.97 | 81.93 | 62.10 | 37.30 | 23.73 | 57.98 | 22.16 | 77.58 | 53.61 | 59.35 | 74.00 | 25.81 | 66.38 | 96.37 | 95.97 | 98.20 | 98.15 | 90.43 | 18.35 |
| Diffusion Base 50NFE | 82.15 | 84.36 | 73.31 | 71.68 | 26.09 | 74.95 | 23.87 | 90.49 | 75.56 | 62.94 | 89.60 | 33.63 | 69.85 | 97.02 | 96.44 | 97.61 | 97.54 | 88.69 | 18.93 |
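For readers who want to reproduce this sampling protocol, the sketch below shows the generation loop in rough form: a fixed number of clips per VBench prompt, saved to disk for offline scoring with the official VBench toolkit. The `generate_fn` and `save_fn` callables are hypothetical placeholders standing in for the one-step sampler and a video writer; they are not part of the paper's released code.

```python
from pathlib import Path

def sample_for_vbench(generate_fn, save_fn, prompts, out_dir, videos_per_prompt=5):
    """Generate `videos_per_prompt` clips per prompt and save them for VBench scoring.

    generate_fn(prompt, seed) -> video and save_fn(video, path) are caller-supplied
    (hypothetical) interfaces; the official VBench tools then score the saved files.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for prompt in prompts:                  # 946 prompts in the VBench suite
        for k in range(videos_per_prompt):  # 5 samples per prompt, as in the rebuttal
            video = generate_fn(prompt, seed=k)
            save_fn(video, out / f"{prompt}-{k}.mp4")  # one file per (prompt, sample)
```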
Reviewer Comment

Thanks for the reply, the one-step results on the VBench look promising.

Author Comment

Thank you for the encouraging assessment. Please let us know if there are any additional questions. We would appreciate it if the reviewer could update their review in light of this response.

The Authors

Review (Rating: 3)

The paper introduces Adversarial Post-Training (APT), a method that accelerates diffusion-based video generation from multiple inference steps to a single step while preserving high-quality visual output. The approach builds on a pre-trained diffusion model and uses direct adversarial training with real data.

Questions for the Authors

None.

Claims and Evidence

The claim from Ln 42 to Ln 45, “It is important to notice the contrast to existing diffusion distillation methods, which use a pre-trained diffusion model as a distillation teacher to generate the target.”, is not accurate. For example, UFOGen and SF-V do not require the pre-trained diffusion model to generate the target.

The claim from Ln 56 to Ln 57, “APT demonstrates the ability to surpass the teacher by a large margin in some evaluation criteria”, seems to conflict with the experiments in Appendix Table 5.

Methods and Evaluation Criteria

While the paper claims superior video generation quality, most quantitative comparisons are conducted against image models in the image domain. A user study and evaluations on benchmarks like VBench, comparing the method with other open-source text-to-video generation models (e.g., OpenSora, HunyuanVideo), would better demonstrate its effectiveness.

Theoretical Claims

Looks good to me.

Experimental Design and Analysis

Comparisons with other distillation methods (e.g., ADD, LADD, UFOGen) would be helpful.

Supplementary Material

Yes.

Relation to Prior Literature

Adversarial post-training to reduce the number of inference steps of diffusion models has been widely studied.

Essential References Not Discussed

Lai, Zhixin, Keqiang Sun, Fu-Yun Wang, Dhritiman Sagar, and Erli Ding. "InstantPortrait: One-Step Portrait Editing via Diffusion Multi-Objective Distillation." In The Thirteenth International Conference on Learning Representations.

Zhang, Zhixing, Yanyu Li, Yushu Wu, Anil Kag, Ivan Skorokhodov, Willi Menapace, Aliaksandr Siarohin et al. "Sf-v: Single forward video generation model." Advances in Neural Information Processing Systems 37 (2024): 103599-103618.

Mao, Xiaofeng, Zhengkai Jiang, Fu-Yun Wang, Wenbing Zhu, Jiangning Zhang, Hao Chen, Mingmin Chi, and Yabiao Wang. "Osv: One step is enough for high-quality image to video generation." arXiv preprint arXiv:2409.11367 (2024).

Wang, Fu-Yun, Zhaoyang Huang, Alexander Bergman, Dazhong Shen, Peng Gao, Michael Lingelbach, Keqiang Sun et al. "Phased Consistency Models." Advances in Neural Information Processing Systems 37 (2024): 83951-84009.

Other Strengths and Weaknesses

None.

Other Comments or Suggestions

None.

Author Response

Related Works

We appreciate the reviewer pointing out the related works. We will add all of them in the revised paper.

  • SF-V and OSV are one-step image-to-video generation methods. We initially did not include them because our work focuses on text-to-video generation. Both of these works are based on the UNet-based Stable Video Diffusion and can generate up to 1024x576 resolution for 14 total frames. In comparison, our work can generate 1280x720 videos of 48 total frames (2 s at 24 fps) from text prompts.

  • PCM is an adversarial consistency distillation approach very similar to Hyper, and our paper has conducted extensive comparisons with Hyper. We note that the PCM paper has also experimented with distillation on AnimateDiff, alongside AnimateLCM and AnimateDiff-Lightning. We will add this paper to our related works.

  • InstantPortrait focuses on 1-step image editing tasks with contributions primarily addressing task-specific problems, which may not be directly relevant to our problem.

Comparison Against Other Distillation Methods

The reviewer suggested a comparison against ADD, LADD, and UFOGen. In fact, both ADD and LADD are already included in our submitted paper. Specifically, the ADD method is referred to as SDXL-Turbo and the LADD method as SD3-Turbo, following the naming conventions used by Stability AI [1,2]. We apologize for any confusion and will clarify this in the revised version. We did not include UFOGen in our comparisons, as it was shown to perform worse than DMD2 in the DMD2 paper. Since our paper includes DMD2 as a baseline, we believe this provides a sufficient point of comparison.

VBench Results

We have provided VBench metrics for the rebuttal. Please refer to the rebuttal response to Reviewer afjC. These results will be included in the revised version of the paper.

Other Clarifications on Our Claims

About Ln42-Ln45: We acknowledged in Ln70-Ln76 that UFOGen is the closest to our work in that it only applies adversarial training on real data. We also elaborated on our differences compared to UFOGen. Specifically, our discriminator design is closer to a traditional GAN, and our APT model can surpass the teacher model on some criteria, which validates the proposed adversarial post-training.

About Ln56-Ln57: We have elaborated in Appendix B on how the traditional COCO FID and PFID metrics in Tables 5 and 6 do not fully capture the model performance, and thus we opt for human evaluation. We leave the exploration of better automated metrics to future work. Our claim is based on Tables 1 and 4, where our APT model shows improved visual fidelity compared to the original diffusion model.

Reviewer Comment

Thanks for the prompt reply by the authors. Most of my concerns were addressed. My apologies for the confusion caused by "Comparisons with other distillation methods (e.g., ADD, LADD, UFOGen) would be helpful." My initial thought was that, since this paper focuses on video generation, including such comparisons in the video space on the same video model might help readers better understand the effectiveness of the proposed method.

Author Comment

We sincerely appreciate the review and the response.

We would like to note that existing methods such as ADD and LADD have several limitations for video tasks. For example, LADD requires pre-generating videos from the teacher model, which is computationally expensive, especially for high-resolution video generation tasks. ADD uses a DINOv2 discriminator in pixel space, which requires decoding and backpropagating through the VAE decoder; this is also infeasible in terms of computation and memory. These limitations prevented the use of these methods for video generation and inspired the design of APT, which is more suitable for the video task.

Please feel free to let us know if there are any more questions.

Review (Rating: 3)

The paper presents a post-training approach to transform a pretrained video diffusion model (based on DiT architecture) into a one-step generation model, unlike traditional diffusion models requiring multiple (or at least a few) steps. Unlike many existing distillation methods that train a separate student model under the supervision of a teacher model, this work directly applies post-training to a pretrained model using adversarial training (GAN-based framework). Notably, the authors introduce techniques to stabilize training, particularly an approximated version of R1 regularization, as the standard R1 loss is unsuitable for large-scale training scenarios.
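As editorial context for the approximated R1 regularization mentioned above: the exact R1 penalty requires the gradient of the discriminator output with respect to its inputs (and thus a second backward pass), which is what makes it awkward at scale. The sketch below contrasts exact R1 with one common approximation that perturbs the real inputs with small Gaussian noise and penalizes the resulting change in discriminator output. It is a minimal illustration under these assumptions, not necessarily the paper's exact formulation; `sigma` and the function names are illustrative.

```python
import torch

def r1_exact(disc, real, cond):
    """Standard R1 penalty: squared norm of the gradient of D w.r.t. real inputs.
    Needs a second backward pass (create_graph=True), which is costly and often
    impractical in large-scale distributed training."""
    real = real.detach().requires_grad_(True)
    logits = disc(real, cond)
    (grad,) = torch.autograd.grad(logits.sum(), real, create_graph=True)
    return grad.pow(2).flatten(1).sum(dim=1).mean()

def r1_approx(disc, real, cond, sigma=0.01):
    """Approximated R1 (illustrative): penalize how much the discriminator output
    changes under a small Gaussian perturbation of the real inputs, using only
    ordinary forward passes."""
    logits_clean = disc(real, cond)
    logits_noisy = disc(real + sigma * torch.randn_like(real), cond)
    return (logits_clean - logits_noisy).pow(2).mean()
```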

Questions for the Authors

  1. Could you provide direct qualitative and quantitative comparisons between your final adversarially post-trained model and the initial consistency-distilled baseline? This is crucial for evaluating the genuine effectiveness of your proposed method.

  2. Given that the adversarial post-training starts from an already strong consistency-distilled model, can you explicitly demonstrate if and how the proposed method improves video quality rather than merely preserving it?

  3. Could the authors elaborate further on the fundamental difference between their post-training approach and existing distillation methods like DMD, which also initialize student models with teacher weights?

Claims and Evidence

The main claims include:

  • Effective conversion from multi-step diffusion to one-step generation through adversarial post-training.
  • Improved training stability due to the introduced approximated R1 loss.

The authors provide qualitative evidence supporting the overall effectiveness of their final model. However, there is a gap: the paper lacks direct comparison between the adversarially post-trained model and its initial consistency-distilled baseline. This omission makes it difficult to clearly attribute performance gains to the proposed adversarial post-training strategy alone.

Methods and Evaluation Criteria

The proposed adversarial post-training method and the approximated R1 regularization are well-chosen and relevant to the stated problem. Evaluation criteria (qualitative visual comparisons with state-of-the-art methods) align with standard practice. However, the evaluations suffer from a crucial shortcoming: absence of direct comparative evaluation against the initial consistency-distilled model, undermining the strength of claims regarding performance improvements.

Theoretical Claims

No explicit theoretical claims or proofs are provided.

Experimental Design and Analysis

While the experimental design convincingly shows the practical engineering value and demonstrates stability gains from the approximated R1 loss, it lacks thorough comparative analyses. Specifically, the paper fails to provide direct visual and quantitative comparisons between the final adversarially post-trained model and the initial consistency-distilled baseline.

Supplementary Material

Yes

Relation to Prior Literature

The paper relates closely to recent literature on accelerated diffusion models and distillation techniques. However, the distinction made by authors between their post-training approach and existing teacher-student approaches (e.g., DMD, CVPR 2024) is somewhat ambiguous since many student models are similarly initialized from teacher weights.

Essential References Not Discussed

No essential missing references identified.

Other Strengths and Weaknesses

Strengths:

  • Practical approach achieving effective one-step video generation.
  • Useful technical innovation in approximated R1 regularization, addressing large-scale training stability issues.

Weaknesses:

  • Inadequate comparative analysis against the baseline consistency-distilled model.
  • Limited novelty regarding model initialization approach relative to existing methods (e.g., DMD).

Other Comments or Suggestions

To significantly strengthen the paper, the authors should provide explicit qualitative and quantitative comparisons between the final adversarially post-trained model and the initial consistency-distilled baseline, clarifying whether performance genuinely improves.

Author Response

Comparison with Consistency Baseline

We respectfully point out that consistency distillation (CD, including methods such as LCM) has been extensively studied in prior works (e.g., DMD, DMD2, Lightning, Hyper-SD, LADD), which consistently show that CD struggles to produce sharp results in a single step. For instance, see Figure 3 in Lightning. This has also been shown in Figure 9 of our submission which illustrates that as training progresses, the initial CD model generates noticeably blurry outputs in one step. Therefore, in our work, we choose to compare directly against state-of-the-art methods rather than the CD baseline.

However, following the reviewer's suggestion, we have included the CD-initialized model as a baseline. We compute FID, PFID, and CLIP scores on both the COCO-10K and COCO-5K benchmarks. As shown in the tables below, the results are consistent with those previously reported in the paper. We will incorporate them in the revised paper.

In addition, we have included a video comparison between the CD and APT models on the VBench benchmark. Please refer to our response to Reviewer afjC for the detailed results.

| COCO-10K | FID↓ | PFID↓ | CLIP↑ |
|---|---|---|---|
| Diffusion 25step | 20.7 | 24.7 | 33.1 |
| Consistency 1step | 114.1 | 161.3 | 22.3 |
| APT 1step | 22.1 | 28.5 | 32.2 |

| COCO-5K | FID↓ | PFID↓ | CLIP↑ |
|---|---|---|---|
| Diffusion 25step | 26.9 | 30.6 | 33.2 |
| Consistency 1step | 119.6 | 164.6 | 22.3 |
| APT 1step | 27.9 | 34.6 | 32.3 |

More visual comparisons between CD and APT are provided at this external link.
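For context only, the snippet below shows how zero-shot COCO FID and CLIP scores of this kind are commonly computed with off-the-shelf tooling (here, torchmetrics); it is not the authors' evaluation code, and the PFID (patch FID) column would additionally require cropping patches before the FID computation, which is not shown.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.multimodal.clip_score import CLIPScore

def coco_scores(real_images, fake_images, captions, device="cuda"):
    """real_images / fake_images: uint8 tensors of shape [N, 3, H, W];
    captions: list of N prompt strings paired with fake_images.
    In practice the updates would be fed in batches rather than in one call."""
    fid = FrechetInceptionDistance(feature=2048).to(device)
    fid.update(real_images.to(device), real=True)
    fid.update(fake_images.to(device), real=False)

    clip = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16").to(device)
    clip.update(fake_images.to(device), captions)

    return {"FID": fid.compute().item(), "CLIP": clip.compute().item()}
```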

Differences to Existing Methods

We would like to elaborate on the differences between our method and existing methods (e.g., DMD, CVPR 2024):

  1. Simplicity: DMD combines variational score distillation (VSD) and rectified flow (RF) objectives, as VSD alone can lead to mode collapse while RF can cause sharpness issues. DMD2 instead uses both VSD and adversarial objectives, but this setup requires training three networks: the student, the VSD negative score model, and the adversarial discriminator. In contrast, our method adopts a simpler approach, using only an adversarial objective with CD initialization. This streamlined design allows the model to adapt quickly, in just 350 iterations (see Figure 9 in the paper).

  2. Post-training vs. distillation quality: Existing methods (such as DMD, DMD2, Lightning, Hyper, ADD, and LADD) distill the teacher results, where the teacher model is the quality upper-bound. Our method is adversarial post-training (APT) against real data, and we demonstrate that it is able to surpass the teacher diffusion model in some criteria. The discussion is provided in Appendix H.

  3. First large-scale video generation with 1NFE: Existing methods like DMD and LADD require precomputing teacher noise-sample pairs, which can be computationally expensive—especially for high-resolution video tasks. In contrast, our method trains directly on real data, avoiding these overheads. This enables us to be the first to demonstrate one-step generation using a large-scale T2V model, achieving 1280×720 resolution at 24 fps for 2-second videos.

  4. Closer to traditional GANs: Existing adversarial distillation methods generally follow the DiffusionGAN approach, which may appear similar to GANs but differs in key ways: 1) noise corruption is added to the discriminator inputs (e.g., LADD, UFOGen, Lightning), and 2) the discriminator backbone is frozen and only the generator is trained (e.g., ADD, LADD). In contrast, our method more closely resembles classical GANs: the discriminator is fully trainable, and no noise corruption is applied to its inputs. Additionally, we introduce an approximate R1 loss, inspired by traditional R1 regularization, which significantly improves training stability; a rough sketch of this setup is given after this list. We hypothesize that this design contributes to the reduction of visual artifacts in our one-step generation compared to existing works (LADD, Lightning, etc.), as illustrated in Figure 3.
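As referenced in point 4, here is a minimal, non-authoritative sketch of such a GAN-style post-training step under the stated design: a one-step generator maps noise plus a text condition to a sample, the discriminator is fully trainable, no noise corruption is added to its inputs, and an approximated R1 term (written here as a small-perturbation penalty) regularizes it. The non-saturating loss form, the coefficients `sigma` and `lambda_r1`, and all function names are assumptions for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def discriminator_step(G, D, opt_d, noise, text, real, sigma=0.01, lambda_r1=100.0):
    # One-step generation: the generator maps pure noise + text to a sample.
    with torch.no_grad():
        fake = G(noise, text)

    # Fully trainable discriminator, no noise corruption on its inputs.
    logits_real = D(real, text)
    logits_fake = D(fake, text)
    loss_adv = F.softplus(-logits_real).mean() + F.softplus(logits_fake).mean()

    # Approximated R1: penalize output change under a small perturbation of real data.
    logits_perturbed = D(real + sigma * torch.randn_like(real), text)
    loss_r1 = (logits_real - logits_perturbed).pow(2).mean()

    loss = loss_adv + lambda_r1 * loss_r1
    opt_d.zero_grad(set_to_none=True)
    loss.backward()
    opt_d.step()
    return loss.detach()

def generator_step(G, D, opt_g, noise, text):
    # Non-saturating generator loss against the current discriminator.
    fake = G(noise, text)
    loss = F.softplus(-D(fake, text)).mean()
    opt_g.zero_grad(set_to_none=True)
    loss.backward()
    opt_g.step()
    return loss.detach()
```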

We hope the above points clarify the differences from related work and substantiate the contributions of our method.

Final Decision

The paper received two Weak Accept scores and one Weak Reject score pre-rebuttal. The scores did not change after the rebuttal. Although Reviewer afjC did not change their Weak Reject score, their response to the rebuttal was positive, and the rebuttal seemed to address their concerns well.

The ACs checked and found the paper's results strong and valuable to the community. Hence, we agreed to accept the paper to ICML. The authors should include the discussions and results in the rebuttal, particularly the additional VBench results, in the camera-ready version.