Training-Free Adaptive Diffusion with Bounded Difference Approximation Strategy
We explore training-free diffusion acceleration that dynamically selects the denoising path according to the given prompt. We design a third-order estimator to indicate computational redundancy.
Reviews and Discussion
The paper presents a novel approach titled "AdaptiveDiffusion" for accelerating diffusion models used in high-quality image and video synthesis. The core issue addressed is the high computational cost and latency associated with existing denoising techniques in diffusion models, which are typically based on step-by-step noise predictions.
Contributions of the Paper:
- Adaptive Diffusion Process: The paper introduces AdaptiveDiffusion, which adaptively reduces the number of noise prediction steps during the denoising process. This is achieved by skipping steps where the potential redundancy is high, guided by the third-order latent difference that indicates stability between timesteps.
- Plug-and-Play Criterion: A new criterion is proposed to decide whether to infer new noise predictions or reuse previous results based on the third-order difference distribution. This allows for an adaptive acceleration paradigm that is prompt-dependent.
- Extensive Experiments: The method's effectiveness is demonstrated through extensive experiments on both image and video diffusion models. The results show significant speedups of 2 to 5 times on average in the denoising process without quality degradation.
- Error Analysis: The paper provides a theoretical analysis of the upper bound of the error induced by the step-skipping strategy, ensuring that the quality of the final output is maintained.
- Adaptive Acceleration: The approach is designed to be adaptive to different input prompts, offering a practical solution to the high computational costs of sequential denoising techniques.
- Generalization Capability: AdaptiveDiffusion shows a strong generalization capability, being able to adapt to different models and tasks, including text-to-image, image-to-video, and text-to-video generation.
In summary, the paper offers a substantial advancement in efficient diffusion model acceleration, with the potential to enable real-time and interactive applications of diffusion models.
Strengths
Strengths Assessment of the Paper
Originality
The paper demonstrates a high degree of originality through the introduction of the AdaptiveDiffusion method, which offers a novel perspective on accelerating diffusion models. The approach creatively addresses the computational inefficiency inherent in traditional denoising techniques by adaptively reducing noise prediction steps. This innovation is not just a technical tweak but a strategic rethinking of the denoising process itself. The use of the third-order latent difference as a criterion for deciding when to skip steps is an ingenious way to balance efficiency and quality, which has not been explored in prior works.
Quality
The quality of the paper is reflected in its rigorous theoretical foundation and comprehensive empirical validation. The authors have provided a detailed error analysis to support their method's robustness, ensuring that the acceleration does not compromise the output quality. The experiments are thorough, covering a range of models and tasks, which substantiates the method's effectiveness and generalizability. The paper also discusses the relationship between different orders of latent differences and the optimal skipping path, which adds depth to the understanding of the proposed technique.
Clarity
The paper is well-structured, with a clear progression from the introduction of the problem to the explanation of the proposed solution, followed by a detailed methodology and extensive experimental results. The figures and tables are used effectively to illustrate the method and results, enhancing the readability and comprehension of the paper. The theoretical proofs and algorithm descriptions are presented in a manner that is accessible to readers with a background in the field.
Significance
The significance of this paper lies in its potential to transform the applicability of diffusion models. By significantly reducing the computational cost and latency, AdaptiveDiffusion opens up new possibilities for real-time and interactive applications of diffusion models, which are currently limited by their resource-intensive nature. The ability to tailor the denoising process to different prompts is also significant, as it allows for more flexible and responsive generative models that can cater to diverse content creation needs.
In summary, the paper is a substantial contribution to the field of generative modeling, offering a creative, high-quality, and clearly articulated solution to a pressing problem. Its significance extends beyond technical advancement, promising to enable new applications and use cases for diffusion models.
Weaknesses
While the paper presents a compelling approach to accelerating diffusion models, there are areas where it could be further strengthened:
Theoretical Depth
- Assumption Limitations: The paper relies on certain assumptions for its theoretical analysis, such as the Lipschitz continuity of the noise prediction model. It would be beneficial to discuss how violations of these assumptions might impact the method's effectiveness and under what conditions these assumptions hold true.
Experimental Scope
- Diversity of Models: Although the method is tested on various tasks, the paper could benefit from testing on a broader range of diffusion models to further establish the generalizability of AdaptiveDiffusion.
- Real-World Applications: Demonstrating the method's effectiveness in real-world applications or use cases would provide additional context and significance to the work.
Hyperparameter Sensitivity
- Threshold δ and Cmax: The paper discusses the impact of these hyperparameters on performance but could provide more guidance on how to select these values in practice, especially given their critical role in balancing speed and quality.
Computational Complexity
- Memory Usage: While the method aims to reduce computational cost, it would be insightful to have a discussion on memory usage, especially since diffusion models can be memory-intensive.
Long-Term Viability
- Adaptability to Model Updates: The paper could address how well AdaptiveDiffusion might adapt to future updates in diffusion model architectures or training regimes.
Societal Impact Consideration
- Ethical Considerations: Although the paper does not explicitly discuss societal impacts, it would be beneficial to include a brief discussion on potential ethical considerations, especially given the generative capabilities of the models involved.
Reproducibility
- Code and Data Availability: Ensuring that the code and data used for experiments are publicly available would greatly enhance the reproducibility of the results.
Documentation
- Algorithm Pseudocode: Providing pseudocode or flowcharts for the algorithms could help readers better understand the step-skipping strategy and its integration into the overall process.
Future Work
- Extensions and Limitations: While the paper outlines future directions, a more detailed discussion on the limitations and potential extensions of the current work would be valuable.
By addressing these points, the paper could provide a more comprehensive understanding of AdaptiveDiffusion's capabilities and limitations, setting the stage for further research and development in this area.
Questions
- Assumption Validity: Could the authors elaborate on the conditions under which the Lipschitz continuity assumption for the noise prediction model holds? How do they ensure this assumption is valid across different models and datasets?
- Generalization Across Models: The paper demonstrates results on a few models. What steps have been taken to ensure that AdaptiveDiffusion can generalize across a wider variety of diffusion models, especially those that may not conform to the same architectural patterns?
- Hyperparameter Selection: The paper mentions the importance of hyperparameters δ and Cmax. Can the authors provide more detailed guidelines or methods for selecting these hyperparameters in different contexts or suggest any automated tuning processes?
- Memory Usage Discussion: Given that diffusion models can be memory-intensive, could the authors discuss the memory usage implications of AdaptiveDiffusion, especially when scaling up to larger models or datasets?
- Ethical Considerations: Although the paper focuses on a technical advancement, could the authors comment on any potential ethical implications of the work, particularly related to the generative capabilities of the models?
- Reproducibility Assurance: To ensure the reproducibility of the results, will the authors commit to making their code and datasets publicly available, and if so, when?
- Algorithm Visualization: For better understanding, especially for readers who may be less familiar with the proposed methods, can the authors provide pseudocode or flowcharts illustrating the step-skipping strategy?
- Long-Term Viability: How does the authors' method accommodate or adapt to potential future changes in the architecture or training of diffusion models?
- Limitation Discussion: The paper outlines future work but could benefit from a more explicit discussion of current limitations. Are there specific scenarios or model types where AdaptiveDiffusion might underperform?
- Statistical Significance: The paper reports χ² statistics and p-values for the correlation between estimated and optimal paths. Could the authors provide more details on the statistical tests used and the rationale behind choosing these tests?
- Real-World Application: While the method shows promise in controlled experiments, are there any real-world scenarios or use cases where AdaptiveDiffusion has been tested or is planned to be tested?
- Societal Impact: The paper could be strengthened by a brief discussion on the potential societal impacts, both positive and negative, of the technology. This includes considerations of how the method might be used or misused.
- Comparison with State-of-the-Art: How does AdaptiveDiffusion compare with the state-of-the-art in terms of computational efficiency and quality of results? Are there any specific advantages or disadvantages in particular scenarios?
- Documentation and API: For practical adoption, what level of documentation and API support is available or planned for AdaptiveDiffusion to facilitate its integration into existing systems?
Limitations
Based on the information provided and the typical guidelines of the NeurIPS Paper Checklist, it appears that the authors have made an effort to address limitations and societal impacts. However, I offer general advice on how authors can improve their discussion of these topics:
- Clear Acknowledgment: Authors should explicitly acknowledge the limitations of their work in the main text of the paper. This includes potential constraints on the generalizability of their findings, assumptions made, and any conditions under which the method may not perform as expected.
- Depth of Discussion: While acknowledging limitations, authors should provide a thorough explanation of how these limitations might affect the results and the applicability of their method. This could include a discussion of how the method behaves under different conditions or with different types of data.
- Societal Impact Analysis: Authors should consider the broader societal impacts of their work, including both positive and negative outcomes. This discussion should be grounded in the context of the work and consider potential misuse, ethical concerns, privacy issues, and fairness.
- Mitigation Strategies: If there are potential negative societal impacts, authors should discuss possible mitigation strategies. This could involve suggesting guidelines for the responsible use of the technology, potential regulatory frameworks, or technical safeguards.
- Ethical Considerations: The paper should include a section on ethical considerations, especially if the work involves generative models that could be used to create misleading or harmful content.
- Transparency: Authors should be transparent about any potential conflicts of interest, funding sources, or affiliations that might influence the research or its interpretation.
- Openness to Feedback: Authors should demonstrate a willingness to engage with the community for feedback on the societal impacts of their work and be open to adjusting their approach based on this feedback.
- Long-Term Vision: While discussing limitations, authors could also provide a long-term vision for how they anticipate overcoming these limitations in future work.
Q1: Method Explanation
- Assumption Validity: We follow the assumptions proposed in DPM-Solver [1], which are commonly adopted for high-order approximation in ODE solvers.
- Algorithm Pseudocode: We have provided the pseudocode of AdaptiveDiffusion and the greedy search algorithm in Appendix A.2.3; an illustrative sketch of the step-skipping loop is also given below.
Q2: Experimental Discussion
- Diversity of Models: We further explore the application of our method to unconditional image generation tasks. Specifically, following Deepcache, we perform unconditional image generation on CIFAR10 and LSUN-Bedrooms. As shown in the table below, our method still achieves a larger speedup ratio and higher image quality than Deepcache on both benchmarks.
| Dataset | Method | FID | Speedup ratio |
|---|---|---|---|
| CIFAR10 | Deepcache | 10.17 | 2.07x |
| CIFAR10 | Ours | 7.97 | 2.09x |
| LSUN | Deepcache | 9.13 | 1.48x |
| LSUN | Ours | 7.96 | 2.35x |
- Hyperparameter Sensitivity: We have provided the sensitivity analysis of hyperparameters in Table 4.
- Memory Usage: We have listed the memory usage of different models in Tables 1, 2, and 3.
Q3: Limitation and Future Work
Currently, our work mainly focuses on the acceleration of diffusion with ODE solvers. In future work, we plan to explore the effectiveness of our method on more kinds of solvers and models. For example, the acceleration of SDE solvers should consider the interference of randomly generated noise in the high-order estimation of the skipping strategy. For consistency models, the acceleration should consider the impact of distillation on the trajectory of image generation, which would change the continuity of noise prediction.
Reference:
[1] DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps. NeurIPS 2022.
A reviewer has advised that we consider lowering the rating of the manuscript due to its lack of self-containment. A self-contained paper should include all necessary details to ensure that readers can fully understand its content. Specifically, the current manuscript fails to provide an explanation for the derivation of Equation 1, which leaves readers without a clear understanding of its origin and the methodology used to derive it.
Dear Reviewer,
We respectfully disagree with the comment that our work lacks self-containment. We would like to clarify that Eq. (1) is the general formulation of the ODE solver, corresponding to Eq. (3.7) of DPM-Solver [1] and Algorithm 2 of the Euler sampler [2]. Since we would like to formulate a unified and general derivation of ODE solvers, the coefficients of the previous latent and of the noise prediction are written as general symbols. According to the formulations in DPM-Solver and the Euler sampler, the properties of these coefficients mentioned in the rebuttal can be easily obtained.
We would like to emphasize that Eq. (1) is not a novel contribution of our work but rather a summary of existing formulations of ODE solvers. We have provided the necessary citations in the original manuscript to support this (See Line 105 of the manuscript). If there are any remaining misunderstandings, we are open to further discussion and would greatly appreciate the opportunity to clarify them.
Reference:
[1] DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps. NeurIPS 2022.
[2] Elucidating the Design Space of Diffusion-Based Generative Models. NeurIPS 2022.
Summary
- This paper proposes a greedy approach that accelerates Probability Flow ODE solvers for text-to-image diffusion models. Empirical results on SD 1.5 and SDXL with multiple solvers (DDIM, DPM, Euler) demonstrate the advantage of their approach over previous acceleration techniques.
Strengths
- The idea of prompt-adaptive acceleration is interesting and promising. Since previous solvers designed for the more general PF-ODE are not adaptive to different prompts, it is quite natural for a prompt-adaptive approach to obtain better results.
- The empirical advantage over previous acceleration methods is clear, especially in terms of image-quality metrics such as PSNR and LPIPS.
Weaknesses
- A new trend in image-generation diffusion models is the adoption of flow-matching optimal transport (FM-OT) / rectified flow (RF) (see Learning to Generate and Transfer Data with Rectified Flow). This line of work adopts a forward SDE that is neither VP nor VE; its special SDE yields a constant velocity in the corresponding PF-ODE. Rectified flow can even achieve single-step sampling with this PF-ODE. Its PF-ODE path is very close to a straight line and can be solved with fewer steps than the VP/VE SDE. The latest Stable Diffusion 3 already adopts this type of diffusion. These efforts in the diffusion community also speed up sampling, not through solvers but through different models. This line of work should be discussed, as it shares the same goal as this paper.
Questions
- The results are currently reported on three different ODE solvers. Can the proposed approach also be applied to SDE solvers? Sometimes the SDE has an advantage over the ODE in terms of sample quality (see Closing the ODE-SDE gap in score-based diffusion models through the Fokker-Planck equation).
Limitations
Yes
Q1: Discussion of Single-step Sampling Works
We thank the reviewer for the valuable suggestion. We will provide a detailed discussion in the revised manuscript. Here is a brief discussion of single-step sampling works.
In addition to the acceleration paradigms mentioned in Sec. 2.2, a recent training-based acceleration paradigm is increasingly receiving attention from the community. Different from previous acceleration works that reduce sampling steps or model size at inference time, this paradigm adopts the idea of flow-matching optimal transport to directly achieve single-step sampling during training, whose trajectory is approximately a straight line [1, 2]. Compared with this new trend, our method highlights the training-free acceleration of multi-step sampling models, with no need to train a new efficient diffusion model.
Q2: Effectiveness on SDE solvers
Compared with the ODE solver, the SDE solver includes an additional noise term in the latent update, which cannot be predicted from previous randomly generated noises. When the magnitude of the random noise is not negligible, the third-order derivative of the neighboring latents cannot accurately evaluate the difference between the neighboring noise predictions. Therefore, to apply our method to SDE solvers, we need an additional indicator that decides whether the randomly generated noise is minor enough, or changes stably enough, to trigger the third-order estimator. To this end, we design an additional third-order estimator for the scaled randomly generated noise. When the third-order derivatives of both the latent and the scaled randomly generated noise are under their respective thresholds, the noise prediction can be skipped by reusing the cached model output.
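As a rough sketch of this dual criterion (using the same illustrative third-order finite difference as before; the function name `should_skip_sde` and the threshold arguments are hypothetical, not the authors' exact implementation):

```python
import torch

def should_skip_sde(latents, scaled_noises, delta_x, delta_n):
    """Illustrative dual skipping criterion for SDE solvers (not the authors' exact code).

    latents:       list of the four most recent latents, oldest first.
    scaled_noises: list of the four most recent scaled random-noise terms
                   injected by the SDE solver, oldest first.
    delta_x, delta_n: thresholds for the latent and noise estimators.
    """
    def third_order(seq):
        a, b, c, d = seq[-4], seq[-3], seq[-2], seq[-1]
        return (d - 3 * c + 3 * b - a).abs().mean()

    # Reuse the cached noise prediction only if BOTH trajectories are stable.
    return bool(third_order(latents) < delta_x) and bool(third_order(scaled_noises) < delta_n)
```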
To validate the effectiveness of our improved method, we conduct experiments for SDXL with the SDE-DPM solver on COCO2017. The results are shown in the following table. Compared with Deepcache, our method can achieve higher image quality with a comparable speedup ratio, indicating the effectiveness of AdaptiveDiffusion on SDE solvers.
| Method | PSNR | LPIPS | FID | Latency (s) | Speedup Ratio |
|---|---|---|---|---|---|
| Deepcache | 16.44 | 0.346 | 8.15 | 9.2 | 1.63x |
| Ours | 18.80 | 0.232 | 6.03 | 9.8 | 1.53x |
Reference:
[1] Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow. ICLR 2023.
[2] Flow matching for generative modeling. ICLR 2023.
Thanks for the rebuttal; I still recommend accepting this paper.
We thank the reviewer for the dedicated time and effort in reviewing our submission. Your valuable and positive feedback is greatly appreciated.
To enhance the sampling speed of diffusion models, this paper introduces the AdaptiveDiffusion framework, which utilizes a skipping strategy. Specifically, this strategy is guided by the third-order latent difference, which assesses the stability between timesteps throughout the denoising process. Experimental results on image and video diffusion models demonstrate the superiority of the proposed adaptive sampling framework.
Strengths
- The motivation is reasonable: to accelerate the sampling speed by reducing redundant time steps.
- The contribution is helpful to the diffusion community.
- The figures are pretty and the presentation is readable.
- The proposed AdaptiveDiffusion is effective on multiple tasks.
Weaknesses
- In my humble opinion, some improvements are marginal, especially on ImageNet 256×256.
- I am not sure Theorem 1 is guaranteed even with a larger sampling step size.
- Since many methods investigate reducing sampling steps by employing higher-order solutions, this approach lacks novelty.
Questions
- Can you derive Theorem 1 when the step size is large? A large step size will magnify the upper bound on the error, and fast sampling methods always amount to large-step-size sampling.
- Since the proposed method aims to accelerate sampling, can it reduce the NFEs?
- Is the proposed method effective on pure image generation?
Limitations
Please see in Weaknesses and Questions. If all of my concerns are addressed, I will definitely improve my score.
Q1: Analysis of Improvements.
We describe the advantages of our method in two aspects.
- Novel Method Design: Endorsed by three other reviewers, AdaptiveDiffusion is the pioneering framework that accelerates the diffusion process adaptively for diverse prompts. Unlike the SOTA method Deepcache, which caches features of multiple blocks uniformly across all stages and imposes a static caching rule, our approach introduces adaptive acceleration with theoretical underpinnings and memory efficiency via a single cache unit for the predicted latent.
- Comprehensive Performance Improvements: As noted by the other reviewers, AdaptiveDiffusion exhibits strong generalization across various models and tasks. Compared with SOTA methods like DPM-solver and Deepcache, our method achieves a comparable or higher speedup ratio while maintaining superior image quality in both static image and video generation tasks. As shown in Tab. 2, due to the large sampling step number and limited prompt diversity, the conditional image generation task on ImageNet 256x256 is relatively easy to accelerate, with both methods achieving roughly 6x speedup with negligible quality loss (~0.09 LPIPS). Notably, for specific categories, e.g., the 607th and 854th, AdaptiveDiffusion reduces the NFEs from 250 to approximately 35, yielding ~7x speedup. When the generation diversity and complexity increase, e.g., in video generation, the superiority of AdaptiveDiffusion is clearly demonstrated in Tab. 3.
Q2: Theorem 1 with A Large Step Size.
We would like to clarify that the step size is not a condition of Theorem 1. If we understand correctly, the large step size mentioned by the reviewer may refer to a large skip step of noise prediction. In this case, as the one- and two-step skipping schemes have been explored in Appendix A.2.1 and A.2.2 respectively, we further derive the error estimation for an arbitrary skipping scheme.
Consider starting from a given timestep and skipping the noise prediction for an arbitrary number of consecutive steps. Unrolling the update formulation of Eq. (1) over the skipped steps, and bounding the differences between the skipped and the exact noise predictions by the corresponding latent differences according to the Lipschitz continuity of the noise prediction model, we obtain that the error of multi-step skipping is upper-bounded by the accumulation of the preceding latent differences. Thus, if the skipping step of noise prediction is large, the upper bound of the error naturally increases, as is also empirically demonstrated by Fig. 5(b).
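As a concrete one-step illustration of this argument, consider a generic solver update $x_{t-1} = a_t x_t + b_t\,\epsilon_\theta(x_t, t)$ with a Lipschitz constant $L$ for the noise prediction with respect to the latent across neighboring timesteps; these symbols are our illustrative notation, not necessarily the paper's exact derivation. Reusing the cached prediction from the previous step then gives:

```latex
% Skipped update reuses \epsilon_\theta(x_{t+1}, t+1) instead of recomputing \epsilon_\theta(x_t, t).
\begin{align*}
\tilde{x}_{t-1} &= a_t x_t + b_t\,\epsilon_\theta(x_{t+1}, t+1), \\
\|x_{t-1} - \tilde{x}_{t-1}\|
  &= |b_t|\,\bigl\|\epsilon_\theta(x_t, t) - \epsilon_\theta(x_{t+1}, t+1)\bigr\| \\
  &\le |b_t|\,L\,\|x_t - x_{t+1}\| .
\end{align*}
```

Chaining this inequality over several consecutive skipped steps accumulates the corresponding latent-difference terms, consistent with the bound described above.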
Q3: Novelty of AdaptiveDiffusion.
We first clarify our acceleration mechanism. Then, we will clarify our method's novelty from two aspects.
1. Mechanism:
Generally, the diffusion process comprises two stages at each step: noise prediction and latent update. Given the preset number of inference steps, the main purpose of AdaptiveDiffusion is to reduce the number of function evaluations (NFEs) while keeping the latent update number (sampling steps) unchanged. By reducing NFEs, we can significantly accelerate the diffusion process.
2. Novelty:
- Motivation-level Novelty: Prior studies like EDM and DPM-solver leveraged high-order approximations between sampling steps to enhance image quality. In contrast, AdaptiveDiffusion employs these approximations to adaptively reduce NFEs. That is, while earlier high-order methods aimed at refining generation quality given a fixed sampling step number, our work prioritizes adaptive efficiency without compromising image quality.
- Algorithm-level Novelty: Our 3rd-order estimator is distinct in its adaptive acceleration tailored to various prompts. Unlike other high-order methods that flexibly choose solver orders for subsequent high-quality generation, our estimator is both empirically and theoretically constrained to 3rd-order approximations to decide whether to skip noise prediction, as shown in Sec. 3.3 and the global response. To our knowledge, this insight is the first in the field of diffusion model acceleration.
Briefly, AdaptiveDiffusion targets a different motivation and is a novel approach to the acceleration community.
Q4: Experiments on Pure Image Generation
Following Deepcache, we perform image generation using DDPMs on CIFAR10 and LSUN-Bedrooms and build our method upon 100-step DDIM for DDPM. As shown in the table below, our method still achieves a larger speedup ratio and higher image quality than Deepcache on both benchmarks.
| Dataset | Method | FID | Speedup ratio |
|---|---|---|---|
| CIFAR10 | Deepcache | 10.17 | 2.07x |
| CIFAR10 | Ours | 7.97 | 2.09x |
| LSUN | Deepcache | 9.13 | 1.48x |
| LSUN | Ours | 7.96 | 2.35x |
Reference:
[1] Elucidating the Design Space of Diffusion-Based Generative Models. NeurIPS 2022.
Thanks for your great efforts!
The rebuttal addresses most of my concerns. I will increase my score.
Dear Reviewer XP4x:
Thank you for your precious time on the review.
As the deadline for the discussion period is approaching (Author-reviewer discussion will be closed on Aug 13 11:59pm AoE), we sincerely hope that our response can address your concerns and we are looking forward to further discussion on any other issues regarding our manuscript.
Best regards,
Authors of Paper 230
Dear Reviewer,
Thank you for your valuable time to review our work and for your recognition! Your valuable and positive feedback is greatly appreciated. We noticed that the score has not been updated. There seems to be no final rating box this time, so the score may have to be adjusted in the original rating box. We would greatly appreciate it if you could make adjustments.
Best regards,
The Authors
This paper proposes a strategy to speed up image and video diffusion generative models. The speed-up is achieved by skipping denoising steps. The authors suggest implementing an adaptive skipping schedule, where the decision of which steps to skip depends on the processed image or video. Specifically, the proposed algorithm calculates the norm of the third-order derivative in the latent space. It then skips denoising steps when the norm of the third-order derivative is below a predefined threshold while limiting the number of consecutive skips to a set maximum.
The authors evaluate the proposed skipping strategy through various experiments with image and video diffusion models, comparing the speed-up and quality of the generated content with DeepCache [1] and Adaptive DPM-Solver [2]. In most experiments, the proposed skipping strategy achieves higher speed-up and better image and video quality than competitors.
[1] Xinyin Ma, Gongfan Fang, and Xinchao Wang. Deepcache: Accelerating diffusion models for free. In Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
[2] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. arXiv preprint arXiv:2206.00927, 2022.
Strengths
The proposed skipping strategy leverages the generated content to speed up image and video diffusion generative models through an adaptive skipping scheme. In experiments with image and video generation, the authors demonstrate that this adaptive skipping scheme can achieve higher speed and better quality than competitors. In my opinion, the concept of using an adaptive skipping scheme to speed up diffusion generative models is potentially valuable and could be of interest to the research community.
Weaknesses
The choice of employing the norm of the third-order derivative as a criterion for skipping denoising steps was made empirically. Theorem 1 (Equation 3) explains why it makes sense to consider the first-order derivative as a criterion for skipping. However, the authors empirically demonstrate that the first-order and second-order derivatives barely correlate with an optimal skipping schedule found by a greedy search. There is a lack of mathematical justification for choosing the third-order derivative.
Questions
It would be helpful if the authors could provide any mathematical justification or intuition (in addition to the empirical experiments in the paper) for their choice to employ third-order derivatives as the criterion for skipping denoising steps.
Limitations
Yes.
Q1: Theoretical Analysis of the Relationship between the Third-order Estimator and the Skipping Strategy.
To explore the theoretical relationship between the third-order estimator and the skipping strategy, we need to formulate the difference between neighboring noise predictions. Starting from Eq. (1), we write the first-order difference equations of the latent at two consecutive timesteps. Subtracting them and simplifying yields a second-order difference equation, which shows that the difference between neighboring noise predictions depends on the first- and second-order differences of the latent as well as on the noise prediction itself; it is therefore difficult to estimate this difference without evaluating the noise prediction. Taking one further difference gives the third-order relation: the difference between neighboring noise predictions is explicitly related to the third- and second-order differences of the latent and to the second-order difference of the noise prediction. Since this last term is bounded under the Lipschitz continuity assumption on the noise prediction model, we conclude that the third-order latent difference serves as a reliable indicator of the difference between neighboring noise predictions, which justifies using it as the skipping criterion.
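For concreteness, the quantity that the estimator thresholds can be written as a third-order finite difference of four consecutive latents; the symbols $\Delta^{3} x_t$ and $\delta$ below are our illustrative notation and may differ from the paper's exact formulation:

```latex
% Denoising runs from large t to small t, so x_{t+3}, x_{t+2}, x_{t+1}, x_t are already available.
\begin{equation*}
\Delta^{3} x_t \;=\; x_t - 3x_{t+1} + 3x_{t+2} - x_{t+3},
\qquad
\text{reuse the cached noise prediction if } \|\Delta^{3} x_t\| < \delta .
\end{equation*}
```

Under the argument above, a small $\|\Delta^{3} x_t\|$ indicates that the neighboring noise predictions are close, so the cached prediction can be reused.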
Dear AC and reviewers,
We are deeply appreciative of the reviewers for their valuable time and thoughtful comments. Their feedback has reinforced our confidence in the paper's clear presentation and organization (Reviewer XP4x, kqHE, rWvo, hdh3), the innovative approach of AdaptiveDiffusion in addressing computational inefficiencies (Reviewer kqHE, rWvo, hdh3), and the comprehensive experimental validation across various tasks and models (Reviewer XP4x, kqHE, rWvo, hdh3) with notable improvements in speed and quality (Reviewer kqHE, hdh3).
We have diligently addressed each reviewer's critical feedback. Our goal is to resolve all issues and improve our work through this collaborative process. We will integrate these valuable comments into our revision, confident they will significantly elevate our work's quality and benefit the field.
Here is a summary of what we have done in the rebuttal phase.
- We conduct more experiments to address the reviewers' concerns:
  - More experiments on pure image generation using DDPMs on CIFAR10 and LSUN.
  - Experiments using SDE solvers.
- We provide the derivation of Theorem 1 with a large step size.
- We provide the theoretical analysis of the relationship between the third-order estimator and the skipping strategy.
- We provide a further discussion on the novelty of our work.
- We provide a discussion on single-step sampling works.
Below is the global response that might be commonly mentioned in the responses to several reviewers' comments.
Theoretical Analysis of the Relationship between the Third-order Estimator and the Skipping Strategy.
We supplement the theoretical relationship between the third-order estimator and the skipping strategy by formulating the difference between neighboring noise predictions. Starting from Eq. (1), we write the first-order difference equations of the latent at two consecutive timesteps. Subtracting them and simplifying yields a second-order difference equation, which shows that the difference between neighboring noise predictions depends on the first- and second-order differences of the latent as well as on the noise prediction itself; it is therefore difficult to estimate this difference without evaluating the noise prediction. Taking one further difference gives the third-order relation: the difference between neighboring noise predictions is explicitly related to the third- and second-order differences of the latent and to the second-order difference of the noise prediction. Since this last term is bounded under the Lipschitz continuity assumption on the noise prediction model, we conclude that the third-order latent difference serves as a reliable indicator of the difference between neighboring noise predictions, which justifies using it as the skipping criterion.
Best Regards,
Authors of Paper230
There are four reviews for this submission. Both Reviewer kqHE and Reviewer rWvo provided positive scores. While Reviewer XP4x initially rated the submission as a borderline reject, he/she acknowledged during the discussion that most of the concerns were addressed and agreed to raise the score. The comments from Reviewer hdh3 appear to be generated by an LLM and were less convincing, and were thus excluded from the final decision. The area chair sees no grounds to overturn the consensus and recommends acceptance. The decision has been discussed with the SAC.