Simple and Fast Distillation of Diffusion Models
A simple and fast distillation method for diffusion models that accelerates fine-tuning by up to 1000 times while maintaining high-quality image generation.
Abstract
Reviews and Discussion
This paper presents a fast-training, accelerated sampling algorithm based on the distillation paradigm. The algorithm performs trajectory matching over all time points from t = 80 down to t = 0.006. The approach avoids the huge overhead of bi-level optimization through the ``detach()'' operation in PyTorch, and further improves performance by correcting the settings of a series of hyperparameters: the loss function, the step condition, and the analytical first step (AFS).
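To make the ``detach()'' trick concrete, below is a minimal PyTorch-style sketch of unrolled trajectory matching with stop-gradients; all names (`student`, `teacher_step`, `timestamps`) are illustrative assumptions and this is not the paper's actual implementation.

```python
import torch

def trajectory_matching_loss(student, teacher_step, x_T, timestamps):
    """Minimal sketch: match a student's multi-step rollout to a teacher trajectory.

    Detaching the running student state before each step means gradients flow
    only through the current step, so there is no backpropagation through the
    whole unrolled chain (no bi-level optimization). All names are illustrative.
    """
    loss = x_T.new_zeros(())
    x_student = x_teacher = x_T
    for t_cur, t_next in zip(timestamps[:-1], timestamps[1:]):
        with torch.no_grad():                          # teacher targets are fixed
            x_teacher = teacher_step(x_teacher, t_cur, t_next)
        x_student = student(x_student.detach(), t_cur, t_next)  # stop-gradient on the input state
        loss = loss + (x_student - x_teacher).abs().mean()      # L1-style trajectory matching
    return loss
```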
Strengths
- The results of the algorithm presented in this paper are extremely impressive, surpassing all known distillation-based accelerated sampling algorithms.
- The experiments conducted in this paper are thorough and well analyzed. Particularly enlightening is the analysis of the guidance scale on Stable Diffusion.
- The paper is exceptionally well written, offering clarity and ease of understanding, and it is inspiring in various ways. Notable examples include the introduction of the SFD algorithm, the innovative use of CFG=1 for training, and the adaptation of any CFG for inference on SD-v1.5.
Weaknesses
The paper has no significant shortcomings; however, it is worth noting that while the algorithm substantially reduces training costs, it also inevitably lowers the maximum performance threshold. An open question remains: if we extend the training duration, can we achieve results comparable to those of the CTM? It appears that not all time steps contribute equally to model performance. Additionally, introducing randomness through local trajectory matching could potentially enhance generalization.
Questions
- Can SFD accelerate DiT? It would be beneficial to include some preliminary experiments to explore this possibility.
- Many accelerated sampling algorithms for diffusion models employ discriminators for trajectory matching, with LCM being a prominent example. Could this discriminator-based approach outperform the L1 loss in the context of SFD?
- Is there a bound on SFD's performance as training time is extended, given that matching in a global form inherently involves compression by discarding randomness?
Limitations
Yes
Thanks for your positive feedback! Below we address the specific questions.
Q: If we extend the training duration, can we achieve results comparable to those of the CTM?
The remarkable FID results of CTM are mainly attributed to the introduced GAN loss; without it, the FID would significantly increase. In Table 2, the results of our SFD (second stage) are on par with CTM without the GAN loss. However, as we will elucidate later, the goal of discriminator-based approaches is generally in contradiction with the goal of trajectory distillation methods such as progressive distillation and our SFD.
Q: Could a discriminator-based approach outperform the L1 loss?
If we understand correctly, the mentioned discriminator-based approach refers to the GAN loss used in CTM [1], because we do not see a discriminator used in the LCM paper [2]. Generally, the effect of this GAN loss is to match the quality of the denoised sample to a real image, which can be treated as building a ``shortcut'' from $x_t$ to $x_0$ for every sampled $t$. However, this does not align well with the goal of trajectory distillation, where we build a ``shortcut'' from $x_t$ to $x_s$ ($s < t$) for sampled $t$ and $s$, aiming at faithfully matching the teacher trajectory. Only when $s = 0$ do these two objectives align. Therefore, trajectory distillation methods can hardly benefit from this GAN loss.
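Schematically (our notation, not the paper's: $G_\theta(x_t, t \to s)$ is the student's jump from time $t$ to time $s$, $\Phi$ the teacher trajectory, and $D$ a discriminator-style discrepancy):

```latex
% GAN/discriminator objective: pull every denoised sample towards real data,
% i.e., a shortcut from x_t to x_0 for every sampled t.
\mathcal{L}_{\mathrm{GAN}} \approx \mathbb{E}_{t,\,x_t}\!\left[
    D\big(G_\theta(x_t,\, t \to 0),\; x_{\mathrm{data}}\big) \right]

% Trajectory distillation: match the teacher between arbitrary times,
% i.e., a shortcut from x_t to x_s (s < t) for sampled t and s.
\mathcal{L}_{\mathrm{traj}} = \mathbb{E}_{t,\,s,\,x_t}\!\left[
    \big\| G_\theta(x_t,\, t \to s) - \Phi(x_t,\, t \to s) \big\|_1 \right]

% The two objectives coincide only in the special case s = 0.
```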
Q: Can SFD accelerate DiT?
Certainly. One appealing property of diffusion models is their unique encoding, also referred to as reproducibility in [3]. Given a noise distribution and a dataset, diffusion models build a fixed mapping between the noise and the implicit data distribution, regardless of the model architecture (e.g., DiT, U-ViT, or U-Net) and training procedure, as long as sufficient model capacity and data size are guaranteed [3]. Starting from the same noise, the sampling trajectory given by a U-Net could resemble that of a DiT, as both are trained to predict the same score function. Therefore, the effectiveness of SFD should be independent of the model architecture.
We conduct some preliminary experiments using the provided DiT-XL/2 model on ImageNet 256x256. We train our SFD with a DPM-Solver++(2M) teacher and 100K generated trajectories. Since DiT adopts classifier-free guidance, AFS is disabled. We use a guidance scale of 4 and 10,000 images for FID evaluation. The results are shown below.
| Solver | NFE | FID-10K |
|---|---|---|
| SFD | 3 | 14.34 |
| DPM-Solver++(2M) | 9 | 13.58 |
| DDIM | 3 | 56.64 |
Q: Introducing randomness through local trajectory matching could potentially enhance generalization. Is there a bound for SFD's performance as training time is extended?
To show that SFD (global trajectory matching) does not sacrifice generalization, we compute fidelity (measured by precision and density) and diversity (measured by recall and coverage) [4] on the CIFAR-10 dataset following standard practice. Below are the results.
| Solver | NFE | FID | Precision | Recall | Density | Coverage |
|---|---|---|---|---|---|---|
| SFD-v | 2 | 4.28 | 0.77 | 0.70 | 1.06 | 0.93 |
|  | 3 | 3.50 | 0.78 | 0.71 | 1.10 | 0.94 |
|  | 4 | 3.18 | 0.79 | 0.71 | 1.13 | 0.94 |
|  | 5 | 2.95 | 0.79 | 0.71 | 1.15 | 0.95 |
| DPM-Solver++(3M) | 11 | 3.93 | 0.76 | 0.71 | 1.04 | 0.94 |
|  | 15 | 2.64 | 0.76 | 0.73 | 1.03 | 0.95 |
|  | 19 | 2.54 | 0.77 | 0.72 | 1.04 | 0.96 |
|  | 23 | 2.65 | 0.77 | 0.72 | 1.05 | 0.96 |
|  | 50 | 2.01 | 0.78 | 0.72 | 1.11 | 0.96 |
| DDIM | 50 | 2.91 | 0.79 | 0.71 | 1.09 | 0.95 |
| Heun | 50 | 1.96 | 0.79 | 0.72 | 1.10 | 0.96 |
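For reference, these four metrics can be computed with the official implementation accompanying [4] (`pip install prdc`); the sketch below assumes pre-extracted feature vectors for real and generated images, and the file names and `nearest_k` value are illustrative placeholders rather than the exact settings used here.

```python
import numpy as np
from prdc import compute_prdc  # official implementation accompanying [4]

# Assumes feature vectors (e.g., Inception features) were already extracted;
# file names and nearest_k are illustrative placeholders.
real_features = np.load("real_features.npy")   # shape (N_real, D)
fake_features = np.load("fake_features.npy")   # shape (N_fake, D)

metrics = compute_prdc(real_features=real_features,
                       fake_features=fake_features,
                       nearest_k=5)
print(metrics)  # {'precision': ..., 'recall': ..., 'density': ..., 'coverage': ...}
```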
As shown in the table, the diversity of SFD is very close to that of the teachers, meaning that SFD generalizes well. As for the performance bound, we extend the training iterations of the SFD-v shown in Table 2 by up to four times. Below are the FID results for different numbers of trained trajectories and NFEs.
| Trained trajectories | NFE=2 | NFE=3 | NFE=4 | NFE=5 |
|---|---|---|---|---|
| 800K | 4.28 | 3.50 | 3.18 | 2.95 |
| 1600K | 4.28 | 3.47 | 3.11 | 2.90 |
| 2400K | 4.18 | 3.41 | 3.05 | 2.92 |
| 3200K | 4.24 | 3.40 | 3.04 | 2.91 |
Reference:
[1] Kim D, Lai C H, Liao W H, et al. Consistency trajectory models: Learning probability flow ode trajectory of diffusion[J]. arXiv preprint arXiv:2310.02279, 2023.
[2] Luo S, Tan Y, Huang L, et al. Latent consistency models: Synthesizing high-resolution images with few-step inference[J]. arXiv preprint arXiv:2310.04378, 2023.
[3] Zhang H, Zhou J, Lu Y, et al. The emergence of reproducibility and consistency in diffusion models[C]//Forty-first International Conference on Machine Learning. 2023.
[4] Naeem M F, Oh S J, Uh Y, et al. Reliable fidelity and diversity metrics for generative models[C]//International Conference on Machine Learning. PMLR, 2020: 7176-7185.
I thank the authors for their response. I think most of my concerns are addressed. However, I do not agree with the idea that LCM cannot benefit from a discriminator. Although the original LCM paper did not leverage a GAN loss, many papers have demonstrated that this form is effective, for example:
[1] Kong F, Duan J, Sun L, et al. ACT-Diffusion: Efficient Adversarial Consistency Training for One-step Diffusion Models[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: 8890-8899.
[2] Liu H, Xie Q, Deng Z, et al. SCott: Accelerating Diffusion Models with Stochastic Consistency Distillation[J]. arXiv preprint arXiv:2403.01505, 2024.
[3] Zhai Y, Lin K, Yang Z, et al. Motion Consistency Model: Accelerating Video Diffusion with Disentangled Motion-Appearance Distillation[J]. arXiv preprint arXiv:2406.06890, 2024.
Regarding the last one especially: I have recently explored the combination of a discriminator and LCM, and this pipeline has proven to be very effective.
I have scanned the comments of the other reviewers and the authors' responses, and I maintain my view that this work is excellent and informative; I believe that this approach could even be beneficial for dataset distillation on diffusion models. Therefore, I maintain my original score.
We appreciate the reviewer's fast response, and we actually share the same viewpoint. We fully agree that LCM can benefit from a discriminator, as LCM is categorized as a consistency distillation method (mentioned in Appendix A) rather than a trajectory distillation method. The goal of trajectory distillation methods is to faithfully match the whole trajectory (i.e., build shortcuts from $x_t$ to $x_s$ for sampled $t$ and $s$). For consistency distillation methods like LCM, the actual effect is to build shortcuts from $x_t$ to $x_0$ for every sampled $t$, which is the same as that of the GAN loss. Therefore, LCM can indeed benefit from a discriminator.
After thorough consideration, I find this paper highly valuable, and I have decided to upgrade my rating from Accept to Strong Accept. Good luck!
This paper proposes a fast distillation method for diffusion models. This method simplifies the existing knowledge distillation framework and proposes Simple and Fast Distillation from a global perspective to reduce redundant time steps in training. The SFD framework can achieve good experimental results in a very short time compared to existing methods.
Strengths
- The method proposed in the paper is not complicated and explores many available improvement ideas, achieving good results.
- Compared with existing methods, the method proposed in this paper can achieve good results in an extremely short time, with significant improvements in time efficiency.
Weaknesses
- Many of the techniques in the paper lack innovation and are more like technical explorations.
- Some of the figures in the paper are difficult to understand; figures like Fig. 6 should be explained in more detail.
- The selection of some hyperparameters in the paper, such as t_min, appears arbitrary.
- If the performance of other training methods such as CTM were reported under the same training time as SFD, it would further demonstrate the superiority of the proposed method.
Questions
- Add more explanations to some meaningful figures in the paper.
- Conduct more ablation experiments or explain the criteria for selecting some hyperparameters in the method.
- Add more experiments, as mentioned in Weakness 4.
Limitations
Yes, the authors adequately addressed the limitations.
Thanks for your feedback! Below we address the specific questions.
Q: Many of the techniques in the paper lack innovation and are more like technical explorations.
Here we would like to clarify the main technical contributions of this paper.
(1) In Section 3.1, for the first time, we recognize that distillation-based training makes a smooth modification of the gradient field of diffusion models. Motivated by this, we propose to fine-tune on coarse-grained timestamps to reduce the large training overhead typically required by distillation-based methods while achieving decent performance.
(2) In Section 3.2, we view distillation from a global perspective (SFD), which is well motivated by the defects of local fine-tuning analyzed in the main text. To unlock the potential of our method, we investigate several technical choices, e.g., efficient solvers, timestamps, and loss metrics. As verified by our main experiments, the obtained settings (displayed in Table 6) are robust across different datasets.
(3) Section 3.3 involves another technical contribution, where we propose the improved SFD-v with a step condition (a schematic sketch is given after this list); it consistently outperforms SFD under the same averaged training iterations, providing further evidence for our finding in Section 3.1.
(4) In Section 3.4, we offer a novel strategy for text-to-image distillation motivated by the shape of latent trajectories, which is also shown to be effective.
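As a rough illustration of the step condition in (3), one plausible mechanism is an extra embedding of the sampling-step configuration added to the usual time embedding. The sketch below is a speculative illustration, not the paper's implementation; `denoiser`, `emb_dim`, `max_steps`, and the additive injection are all assumptions.

```python
import torch.nn as nn

class StepConditionedDenoiser(nn.Module):
    """Speculative sketch: condition an existing denoiser on the number of
    sampling steps (NFE) in addition to the usual time embedding."""

    def __init__(self, denoiser, emb_dim, max_steps=8):
        super().__init__()
        self.denoiser = denoiser                      # existing time-conditioned U-Net / DiT
        self.step_emb = nn.Embedding(max_steps + 1, emb_dim)

    def forward(self, x, t_emb, num_steps):
        cond = t_emb + self.step_emb(num_steps)       # inject the step condition additively
        return self.denoiser(x, cond)
```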
Q: Explanation of Figure 6.
The intention of Figure 6 is to provide further evidence for the smooth modification mentioned in Section 3.1. Here is the explanation.
Take ``SFD, NFE4'' as an example, where the SFD is trained only to sample with an NFE of 4. Its performance is marked by a star. Interestingly, when using this SFD to sample with untrained NFEs (i.e., 2, 3, 5, 6), even though these timestamps have never been trained, the performance is still decent and outperforms DDIM by a large margin (DDIM with an NFE of 6 gives a FID of 35.62).
For ``SFD-v, NFE2-5'', the SFD-v is trained to sample with NFEs of 2, 3, 4, and 5 using a single model. Its performance is marked by 4 stars. Similarly, good results are observed for untrained NFEs (DDIM with an NFE of 10 gives a FID of 15.69).
This extrapolation ability further verifies our finding that there does exist a mutual enhancement when modifying the gradient field at different timestamps. Thank you for noting this! We will make it clearer in the revised version. Please let us know if there are any remaining ambiguities in this paper.
Q: The criteria for selecting hyperparameters.
The final selected hyperparameters are displayed in Table 6. Here we illustrate the criteria for the first 4 items (i.e., teacher solver, $K$, $t_{\min}$, and AFS). As validated in Section 3.2, they are set the same for the first 3 datasets. Therefore, we focus on illustrating why they are different for Stable Diffusion.
As for the teacher solver, it is well known that for Stable Diffusion, a higher-order solver may suffer from instability, especially when the guidance scale is large. Therefore, in the official implementation, DPM-Solver++(2M) is the default setting for DPM-Solvers (mentioned in line 285). We simply follow this setting.
$K$ denotes the number of teacher sampling steps taken from one fine-tuning timestamp to the next, and we provide an ablation study of $K$ for CIFAR-10 in Table 8. We find that the selected value achieves a good trade-off between performance and fine-tuning time and thus adopt it for the other two datasets. As Stable Diffusion requires significantly larger resources, we re-evaluate this setting and choose a different value of $K$ in this case.
We use a different $t_{\min}$ for Stable Diffusion because of its different default time schedule: while $t_{\min} = 0.002$ is used in EDM by default, Stable Diffusion adopts a different minimum time. Following the same criteria shown in Figure 5, we find our choice of $t_{\min}$ to be a robust setting for the teacher solver, giving consistently better results than the default setting.
Finally, it is empirically observed that the approximation error made by AFS is not negligible for Stable Diffusion when the guidance scale is large. This is intuitive given the complex trajectories shown in Figure 9c. Since a large guidance scale is typically used in practice, we disable AFS for Stable Diffusion.
Q: The performance of other training methods such as CTM under the same training time as SFD.
As mentioned in Appendix B.1, we follow the setting used for CTM [1] faithfully to estimate the training time. Since the pre-trained CIFAR-10 models used for both SFD and CTM are from EDM [2], a direct comparison can be made given the FID-iteration curve shown in [1] (Figure 15 of [1]).
Our SFD-v shown in Table 2 is trained for 4.26 A100 hours, which is equivalent to around 7,000 CTM training iterations. According to the referred FID-Iteration curve, a FID of around 3.4 is achieved by CTM with 18 NFEs while 3.18 is obtained by SFD-v with only 4 NFEs.
Our SFD (second-stage) is trained for 4.88 A100 hours, which is equivalent to around 8000 CTM iterations. A FID of around 7.5 is achieved by one-step CTM while 5.83 is obtained by ours. To reach a FID around 6, 40,000 iterations are required for CTM without GAN loss, which costs around 24 A100 hours.
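(As a sanity check on this conversion, the two figures imply a consistent rate of roughly $7{,}000 / 4.26 \approx 8{,}000 / 4.88 \approx 1{,}640$ CTM iterations per A100 hour; at this rate, 40,000 iterations indeed amount to about 24 A100 hours.)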
Reference:
[1] Kim D, Lai C H, Liao W H, et al. Consistency trajectory models: Learning probability flow ode trajectory of diffusion[J]. arXiv preprint arXiv:2310.02279, 2023.
[2] Karras T, Aittala M, Aila T, et al. Elucidating the design space of diffusion-based generative models[J]. Advances in neural information processing systems, 2022, 35: 26565-26577.
Thanks for your response. After reading the authors' rebuttal and the other reviewers' comments, my concerns are addressed, and I raise my score to weak accept. In the next version, I hope this paper can include a discussion of the concurrent work Relational Diffusion Distillation (RDD).
Reference:
Feng W, Yang C, An Z, et al. Relational Diffusion Distillation For Efficient Image Generation[C]//ACM Multimedia 2024.
Dear reviewer RDCV,
We appreciate your feedback and your note on the concurrent work! We will discuss it in the next version of this paper. If we have addressed your concerns, please consider updating the score in the reviewer's console.
Best regards,
Authors
In this work, the authors propose a novel diffusion distillation method that unrolls the student model to match the trajectory generated by a pre-trained diffusion (teacher) model. The authors demonstrate the effectiveness and compute efficiency of the approach on Stable Diffusion.
Strengths
The paper is easy to read and follow. It shows effectiveness on Stable Diffusion and is much faster to train for large-scale Stable Diffusion too. It is interesting to see that fixing the model at some part improves the model over all other parts.
Weaknesses
What is the effect of SFD on diversity w.r.t. the distilled model? It might be easy to obtain high quality but much lower diversity. The proposed unrolling of the student model (global trajectory optimization) is effectively multi-step training, as in structured prediction and imitation learning, which has been shown to be prone to mode collapse.
The justification of the method is still unclear, and it is also not clear how sensitive SFD is to the training-time noise schedule/time-step weighting, i.e., the forward diffusion process and its impact on the induced trajectory. E.g., if we assume flow-matching-style linear trajectories, they could be easier to distill within the proposed framework.
Unrolling trajectories for distillation is also considered in previous works to account for accumulation of error, e.g., BOOT, Imagine Flash, etc. So that in isolation cannot be considered a major contribution.
Also, though the proposed method is more effective for Stable Diffusion, is there a significant performance drop on LSUN, ImageNet, etc. w.r.t. progressive distillation or consistency distillation? This further raises questions and the need for additional experiments to understand under what settings this method is effective.
Questions
Though an 'N-model' is used for fine-tuning, it is interesting to see that fixing the model at some part improves the model over all other parts. Could this indicate a lower-curvature part of the trajectory and more of a directional feedback that SFD might be exploiting? Can the authors comment further on this?
W.r.t. trajectory distillation techniques, Consistency Distillation can potentially take shortcuts, but PD retains the gradient field. So I am not sure if such a characterization is correct.
Limitations
As discussed in the weaknesses and questions, generalization to other models is unclear, as is the effect of the forward process on SFD. This work also lacks conclusive reasoning about the settings under which it is useful and does not demonstrate the impact on diversity, etc.
Thanks for your thoughtful feedback. Below we address the specific questions.
Q: What is the effect of SFD on diversity?
We appreciate the reviewer’s suggestion to evaluate the diversity of SFD. Following standard practice, we computed fidelity (measured by precision and density) and diversity (measured by recall and coverage) [1] on the CIFAR-10 dataset. The same random seed is used for different solvers in the following for a fair comparison.
| Solver | NFE | FID | Precision | Recall | Density | Coverage |
|---|---|---|---|---|---|---|
| SFD-v | 2 | 4.28 | 0.77 | 0.70 | 1.06 | 0.93 |
|  | 3 | 3.50 | 0.78 | 0.71 | 1.10 | 0.94 |
|  | 4 | 3.18 | 0.79 | 0.71 | 1.13 | 0.94 |
|  | 5 | 2.95 | 0.79 | 0.71 | 1.15 | 0.95 |
| DPM-Solver++(3M) | 11 | 3.93 | 0.76 | 0.71 | 1.04 | 0.94 |
|  | 15 | 2.64 | 0.76 | 0.73 | 1.03 | 0.95 |
|  | 19 | 2.54 | 0.77 | 0.72 | 1.04 | 0.96 |
|  | 23 | 2.65 | 0.77 | 0.72 | 1.05 | 0.96 |
|  | 50 | 2.01 | 0.78 | 0.72 | 1.11 | 0.96 |
| DDIM | 50 | 2.91 | 0.79 | 0.71 | 1.09 | 0.95 |
| Heun | 50 | 1.96 | 0.79 | 0.72 | 1.10 | 0.96 |
The diversity of SFD (NFE = 2, 3, 4, 5) closely matches that of the teachers (DPM-Solver++(3M), NFE = 11, 15, 19, 23), and no mode-collapse issue is observed. Moreover, SFD can even surpass the teacher in terms of fidelity.
We will add these results in the revised version of this paper.
Q: Justification of the method and the sensitivity to time schedule.
Our method shares its rationale with recent approaches such as progressive distillation (PD) [2]. Both PD and our SFD aim to establish shortcuts between various timestamps along the sampling trajectory. The primary advantage of our method over PD, in terms of accelerated distillation, derives from focusing optimization solely on the timestamps that are actually used for sampling. As a result, choosing a good time schedule is critical to the robustness of SFD.
During our experiments, we adhere to the time schedules utilized in previous studies (for instance, a polynomial schedule with $\rho = 7$ for EDM [3] and a linear schedule for LDM [4]) and find these schedules effective. In the following, we show an ablation study based on CIFAR-10, with a 2-NFE SFD trained under different polynomial coefficients $\rho$.
| $\rho$ | Time schedule | FID |
|---|---|---|
| 5 | [80.00, 15.11, 1.22, 0.006] | 5.50 |
| 6 | [80.00, 12.63, 0.86, 0.006] | 4.61 |
| 7 | [80.00, 10.93, 0.67, 0.006] | 4.53 |
| 8 | [80.00, 9.72, 0.55, 0.006] | 4.54 |
| 9 | [80.00, 8.82, 0.47, 0.006] | 4.60 |
| 10 | [80.00, 8.13, 0.42, 0.006] | 4.81 |
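For completeness, the timestamps in the table follow the standard EDM-style polynomial discretization; the snippet below (assuming $t_{\max}=80$, $t_{\min}=0.006$, and 4 timestamps, matching the table) reproduces the listed schedules.

```python
import numpy as np

def polynomial_schedule(t_max=80.0, t_min=0.006, num=4, rho=7.0):
    """EDM-style polynomial timestamp schedule with coefficient rho."""
    i = np.arange(num)
    return (t_max ** (1 / rho)
            + i / (num - 1) * (t_min ** (1 / rho) - t_max ** (1 / rho))) ** rho

print(polynomial_schedule(rho=7.0))  # ~ [80.0, 10.93, 0.67, 0.006], matching the rho = 7 row
```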
Q: If we assume flow-matching-style linear trajectories, could they be easier to distill within the proposed framework?
Flow matching [5] does not generate a linear sampling trajectory: although the forward process of OT-based flow matching is linear, the backward sampling process is not. Our SFD may benefit from smaller curvature; however, as far as we know, there is currently no direct evidence suggesting that OT-based flow matching indeed reduces the curvature. We will explore this direction in future work.
Q: Unrolling trajectories for distillation is also considered in previous works to account for accumulation of error like BOOT, Imagine Flash, etc.
We respectfully disagree with this.
First, BOOT [6] does not unroll the trajectory. It is designed to be data-free but not to address error accumulation. As mentioned in the original paper, ``(Error accumulation) becomes more pronounced in our case due to the possibility of out-of-distribution inputs for the teacher model...''. To mitigate this, the authors propose to uniformly sample the timestamp, and use a high-order Heun's method.
Second, although the backward distillation proposed in Imagine Flash [7] does unroll the trajectory, our method differs from backward distillation, where only the final prediction of the student model is supervised by the teacher instead of the whole trajectory. Moreover, Imagine Flash (released on May 8, 2024) is better considered a concurrent work to ours (NeurIPS abstract submission deadline: May 15, 2024).
Q: Performance drop on LSUN and ImageNet w.r.t PD or CD.
Our goal is to achieve image quality similar to that of progressive distillation (PD) with a largely reduced training time. Given the roughly 100-200x faster training compared to PD, the performance drop is not significant, e.g., 4.28 (SFD-v) compared to 4.51 (+0.23, PD) on CIFAR-10, 9.47 (SFD-v) compared to 8.95 (-0.52, PD) on ImageNet, and 9.25 (SFD-v) compared to 8.47 (-0.78, PD) on LSUN. Moreover, our SFD (second stage) consistently outperforms both PD and Guided PD for one-step sampling by a large margin.
Admittedly, SFD's performance is worse than that of consistency distillation (CD). However, CD requires around 1,000x more training time, and the performance drop can be compensated for by 2-3 more NFEs of SFD.
Q: Further comments on fixing model at some part improves model over all other parts.
We attribute this property to the regularity of diffusion-model trajectories revealed in a recent work [8], which shows that the sampling trajectories generated by diffusion models tend to share a simple boomerang shape (Fig. 4 in [8]). The effect of our method is to modify the gradient (originally pointing along the tangent direction) along the trajectory to build ``shortcuts''. Seeing the model as a nearly continuous function, it is intuitive that when the gradient at time $t$ is modified in a certain direction, the gradient fields at other timestamps are influenced in a similar way, since the model parameters are shared. Given the simple boomerang shape, we hypothesize that the gradient fields at most timestamps are modified closer to the directions of the desired ``shortcuts''.
This observation is primarily supported by our simple experiment in Section 3.1. The extrapolation ability shown in Figure 6 also provides evidence: SFD outperforms DDIM by a large margin even at untrained NFEs. Moreover, our main results further verify it, showing that SFD-v consistently outperforms SFD under the same averaged training iterations.
I appreciate the authors' clarifications and would raise my rating to weak accept.
It is encouraging to see diversity preserved on CIFAR-10, but it would be useful to report it on SDv1.5 variants of checkpoints. If precision-recall is expensive to compute/set up, something like LPIPS diversity with, say, ~30 prompts and 20-30 seeds would be informative for the community to understand how this method performs at scale.
Agreed on the importance of unrolling w.r.t. distillation, and I like the smoothing interpretation of this work.
W.r.t. the performance drop on LSUN/ImageNet, the question was not why it is not SOTA but more about how the details of the forward diffusion, weighting, etc., and their implications on the resultant trajectories, affect the proposed method.
The proposed method's robustness w.r.t. polynomial coefficients is encouraging, though looking at different classes of schedules and the resultant trajectories would be interesting, e.g., a flow matching schedule vs. scaled linear [1]. I would encourage the authors to consider that for a later revision of the paper.
The method has comparable performance on SDv1.5 at 4 steps but a significant gap at 2 steps compared to progressive distillation, albeit at much less training cost.
Given the method's potential to be more effective when the resultant trajectories have less curvature, I would consider this work useful and potentially effective for more recent flow-matching-style large models.
I would give it a weak accept currently, as the reasons for when it is effective are a bit unclear for the broader community to adopt it across settings.
[1] Kingma et al. Understanding Diffusion Objectives as the ELBO with Simple Data Augmentation
[1] Naeem M F, Oh S J, Uh Y, et al. Reliable fidelity and diversity metrics for generative models[C]//International Conference on Machine Learning. PMLR, 2020: 7176-7185.
[2] Karras T, Aittala M, Aila T, et al. Elucidating the design space of diffusion-based generative models[J]. Advances in neural information processing systems, 2022, 35: 26565-26577.
[3] Rombach R, Blattmann A, Lorenz D, et al. High-resolution image synthesis with latent diffusion models[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022: 10684-10695.
[4] Lipman Y, Chen R T Q, Ben-Hamu H, et al. Flow matching for generative modeling[J]. arXiv preprint arXiv:2210.02747, 2022.
[5] Liu X, Gong C, Liu Q. Flow straight and fast: Learning to generate and transfer data with rectified flow[J]. arXiv preprint arXiv:2209.03003, 2022.
[6] Gu J, Zhai S, Zhang Y, et al. BOOT: Data-free Distillation of Denoising Diffusion Models with Bootstrapping[J]. arXiv preprint arXiv:2306.05544, 2023.
[7] Kohler J, Pumarola A, Schönfeld E, et al. Imagine flash: Accelerating emu diffusion models with backward distillation[J]. arXiv preprint arXiv:2405.05224, 2024.
[8] Chen D, Zhou Z, Wang C, et al. On the Trajectory Regularity of ODE-based Diffusion Sampling[J]. arXiv preprint arXiv:2405.11326, 2024.
Thank you for your suggestions and increasing your rating to weak accept!
We appreciate the reviewer's valuable suggestions, and further clarify several points below.
- Our method also preserves diversity well on SDv1.5. We evaluate the diversity of SFD-v, measured by recall and coverage, using 5,000 generated images with random prompts and the MS-COCO 2017 validation set.
| Solver | Steps | FID | Recall | Coverage |
|---|---|---|---|---|
| SFD-v | 4 | 24.2 | 0.44 | 0.36 |
| SFD-v | 5 | 23.5 | 0.44 | 0.37 |
| DPM-Solver++(3M) | 8 | 25.1 | 0.42 | 0.39 |
| DPM-Solver++(3M) | 10 | 24.6 | 0.43 | 0.39 |
- Regarding the details of the forward diffusion: the LSUN-Bedroom model adopts a variance-preserving (VP) SDE framework in latent space [3], while the ImageNet model adopts the EDM framework [2], which resembles a variance-exploding (VE) SDE framework. Trajectories generated under these two frameworks can be transformed into each other through a simple coefficient (a schematic relation is given at the end of this comment). Trajectories generated by the ImageNet model resemble those of the CIFAR-10 model [8], while the LSUN-Bedroom model generates trajectories resembling Figure 9a in the main text.
- We will provide more detailed experimental results analyzing the robustness of our method w.r.t. noise schedules (including the flow matching schedule) in a later revision of this paper.
- We will keep improving the effectiveness of this method, and we hope it will benefit the broader community in the future.
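As referenced in the first point above, the ``simple coefficient'' relating VP and VE trajectories can be written schematically as follows (notation follows the EDM formulation [2]; this is our paraphrase rather than a formula from the paper):

```latex
% With a shared noise level \sigma(t), a VP trajectory is a rescaled VE trajectory:
x^{\mathrm{VP}}(t) \;=\; s(t)\, x^{\mathrm{VE}}(t),
\qquad s(t) \;=\; \frac{1}{\sqrt{1 + \sigma(t)^{2}}}.
```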
Dear reviewer bYyR,
We appreciate the time and effort you put into reviewing this paper, as well as your increasing your rating to weak accept! The discussion period will end in two days. This is a kind reminder to consider updating the score in the reviewer's console.
Best regards,
Authors
The paper introduces a new diffusion distillation approach that significantly reduces training and fine-tuning times by letting a student learn a full teacher trajectory (rather than learning to mimic a single step), reducing the overall distillation error.
The reviewers appreciated the strong empirical performance (reduced training costs) of the algorithm and the careful analysis carried out in the experiments part. Among weaknesses, reviewers pointed out that the paper could have done a better job at discussing the tradeoffs between runtime and sample diversity as well as the performance limits of the distillation approach. This concern was successfully resolved in the discussion phase, where the authors presented additional evidence supporting their claims. The reviewers recommended acceptance at the end of the discussion phase, and the AC agreed with this outcome.
Initial criticisms included the trade-off between faster generation and sample diversity. However, the authors successfully addressed this concern in the rebuttal period. As an aside, the authors should incorporate discussions of more recent work on solver-based methods (without running experiments) [1,2].
[1] Gonzalez, M., Fernandez Pinto, N., Tran, T., Hajri, H., & Masmoudi, N. Seeds: Exponential SDE solvers for fast high-quality sampling from diffusion models. Advances in Neural Information Processing Systems, 2023.
[2] Pandey, K., Rudolph, M., & Mandt, S. Efficient Integrators for Diffusion Generative Models. In International Conference on Learning Representations, 2024.