PaperHub
Average rating: 6.3 / 10 · Poster · 3 reviewers (min 3, max 4, std 0.5) · individual ratings: 3, 3, 4
ICML 2025

Differentiable Solver Search for Fast Diffusion Sampling

OpenReview · PDF
Submitted: 2025-01-12 · Updated: 2025-07-24
TL;DR

Differentiable Solver Search for Fast Diffusion Sampling

Abstract

Diffusion models have demonstrated remarkable generation quality but at the cost of numerous function evaluations. Recently, advanced ODE-based solvers have been developed to mitigate the substantial computational demands of reverse-diffusion solving under limited sampling steps. However, these solvers, heavily inspired by Adams-like multistep methods, rely solely on t-related Lagrange interpolation. We show that t-related Lagrange interpolation is suboptimal for diffusion models and reveal a compact search space comprised of time steps and solver coefficients. Building on our analysis, we propose a novel differentiable solver search algorithm to identify a more optimal solver. Equipped with the searched solver, rectified-flow models, e.g., SiT-XL/2 and FlowDCN-XL/2, achieve FID scores of 2.40 and 2.35, respectively, on ImageNet-$256\times256$ with only 10 steps. Meanwhile, the DDPM model DiT-XL/2 reaches a FID score of 2.33 with only 10 steps. Notably, our searched solver outperforms traditional solvers by a significant margin. Moreover, our searched solver demonstrates generality across various model architectures, resolutions, and model sizes.
Keywords

solver, diffusion sampling

Reviews and Discussion

Official Review (Rating: 3)

This paper proposes a novel solver search algorithm for fast sampling of diffusion models, which optimizes both timesteps and solver coefficients. The key idea is to treat the solver design as a learning problem, optimizing solver parameters to minimize the numerical error and improve image quality. Experiments on rectified-flow models (SiT-XL/2, FlowDCN-XL/2) and DDPM (DiT-XL/2) demonstrate that the searched solvers can achieve improved FID with 5-10 steps. The learned time steps and coefficients can generalize to different model architectures and resolutions empirically.

Update after rebuttal

  • Most initial concerns have been addressed or clarified during the rebuttal. I'm supportive of acceptance as it's effective and well-supported by empirical evidence.
  • That said, I would not advocate for a higher rating due to the limited broader impact and significance of the contribution.

Questions to the Authors

  • How does the proposed method compare to the prior work [1] which optimizes the time steps?
  • What is the total computation cost to train the time steps and coefficients? How does it compare to [1]? Could you elaborate on the computational cost and efficiency of the solver search process in more detail, particularly in relation to the performance gains achieved?
  • Are there any techniques or optimization strategies that could be explored to reduce the computational burden of the search process?
  • Since the discretization schemes of the reference trajectory (L steps) and learned trajectory (N steps) are different, how do you compute the MSE loss between these two trajectories?

[1]: Xue, Shuchen, et al. "Accelerating diffusion sampling with optimized time steps." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

Claims and Evidence

Yes, most claims are well-supported.

Methods and Evaluation Criteria

Yes, the approach and evaluation criteria make sense.

Theoretical Claims

Yes.

Experimental Design and Analysis

Yes, the experimental designs make sense.

Supplementary Material

Yes, I reviewed Appendices A–I.

Relation to Prior Literature

This paper is most related to [1] where only time steps are learned. This paper extends the search space to both time steps and coefficients and employs a different optimization objective and strategy.

[1]: Xue, Shuchen, et al. "Accelerating diffusion sampling with optimized time steps." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

Missing Important References

A few prior works ([1,2]) have also explored the idea of accelerating the solver via learning. They should also be cited and discussed in the paper.

[1]: Watson, Daniel, et al. "Learning fast samplers for diffusion models by differentiating through sample quality." International Conference on Learning Representations. 2021.

[2]: Dockhorn, Tim, Arash Vahdat, and Karsten Kreis. "Genie: Higher-order denoising diffusion solvers." Advances in Neural Information Processing Systems 35 (2022): 30150-30166.

Other Strengths and Weaknesses

Strengths

  • The proposed method consistently achieves improved FID scores in the few-step regime compared to previous methods like DPM-Solver++ and UniPC.
  • The authors provide theoretical justification for the approach, including error-bound analysis and supporting theorems.
  • The learned time steps and coefficients can generalize to different model architectures and resolutions empirically.

Weaknesses

  • Clarity: The paper currently contains numerous typographical errors, inconsistencies, and formatting issues, which affect readability and clarity. See details in “Other Comments or Suggestions”.
  • The proposed approach requires generating tens of thousands of reference trajectories to learn the time steps and coefficients, which could be computationally expensive both in terms of time and space.
  • The learned solver might become suboptimal for different guidance scales.

Other Comments or Suggestions

Below are examples of specific errors noted during my review. There are more in other parts of the paper. I strongly recommend that the authors thoroughly proofread and revise the manuscript to address these issues comprehensively.

  • Line 191: “with prerained model” should be “with pretrained model”
  • Line 229: “of Our solver” → “of our solver”
  • Line 230: the equation runs off the page; b_i^j is used without definition.
  • Line 260: remove comma from “{1-\sum_{j=0}^{i-1}c_i^j,}_{i=0}^{N-1}” and “{c_i^k, }”.
  • Line 314: “reconstruction error(in Appendix)” → “reconstruction error[need a space](in Appendix~[need to add reference])”
  • Line 315: “Euler-250 steps” → “250-step-Euler”?
  • Line 409: “Of Solver Parameters” → “of Solver Parameters”
  • Line 381: “Comparison with Distillation methods” → “Comparison with distillation methods”
  • The use of capitalization and periods in section headings and table captions is inconsistent, confusing, and distracting:
    • Some sections only capitalize the first word like Section 4 “Optimal search space for a solver” while the other sections capitalize all the words like Section 2 “Related Works”
    • Section 4.2: “Focus on Solver coefficients instead of the interpolation function” capitalize the first character of “focus” and “solver”, which is even more confusing.
    • Section 5: why is there an additional period?
    • Table 1: “Comparsion with Distillation methods” why is the first character of “Distillation” capitalized? Also, “Comparsion” → “Comparison”
Author Response

We would like to express our heartfelt gratitude for the valuable feedback you've provided on our manuscript. Your in-depth analysis and suggestions are of great significance to us, and we are committed to using them to enhance the quality of our work.

Q.1 Writing typos and inconsistent presentations

Thank you for pointing out the detailed writing typos and inconsistent presentations. We sincerely apologize for these inadvertent errors. We will meticulously review and revise every detail to enhance the readability of the text.

Q.2 Comparison with DM-Nonuniform[1]

DM-Nonuniform[1] primarily centers on the theoretical optimal timesteps, yet it fails to take into account the solver coefficients and model statistics. In contrast, our method conducts a statistical search for both the coefficients and timesteps concurrently. Through theoretical analysis, we have demonstrated that our method has a smaller error bound compared to those that neglect coefficients. This shows the superiority of our approach in more comprehensively handling the relevant factors in this context.
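To make the concurrent search concrete, the update rule a searched solver applies can be pictured as a weighted combination of all cached velocity predictions, with one learned coefficient row per step. The sketch below is purely illustrative: the function name, the coefficient layout, and the sum-based combination are our own assumptions, not the paper's exact parameterization.

```python
import numpy as np

def searched_solver_sample(x0, velocity_fn, timesteps, coeffs):
    """Multistep sampling with per-step learned coefficients.

    timesteps: searched time discretization of length N+1.
    coeffs: list of length N; coeffs[i] holds i+1 weights over all
            cached velocity predictions v_0..v_i (a lower-triangular
            table), replacing the fixed Adams/Lagrange weights of
            classical multistep solvers.
    """
    x = x0
    cached_v = []
    for i in range(len(timesteps) - 1):
        # evaluate and cache the model's velocity prediction at t_i
        cached_v.append(velocity_fn(x, timesteps[i]))
        dt = timesteps[i + 1] - timesteps[i]
        # learned weighted combination of every past prediction
        v_hat = sum(c * v for c, v in zip(coeffs[i], cached_v))
        x = x + dt * v_hat
    return x
```

In a real search, both `timesteps` and `coeffs` would be treated as differentiable parameters and optimized against a reference trajectory; here they are plain inputs.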

We compared the performance with DM-Nonuniform [1] in Tab. 2 and Tab. 3. We copy the results here.

| Methods \ NFEs | 5 | 6 | 7 | 8 | 9 | 10 |
| --- | --- | --- | --- | --- | --- | --- |
| DPM-Solver++ with uniform-λ-opt [1] | 12.53 | 5.44 | 3.58 | 7.54 | 5.97 | 4.12 |
| DPM-Solver++ with uniform-t-opt [1] | 12.53 | 5.44 | 3.89 | 3.81 | 3.13 | 2.79 |
| DPM-Solver++ with EDM-opt [1] | 12.53 | 5.44 | 3.95 | 3.79 | 3.30 | 3.14 |
| UniPC with uniform-λ-opt [1] | 8.66 | 4.46 | 3.57 | 3.72 | 3.40 | 3.01 |
| UniPC with uniform-t-opt [1] | 8.66 | 4.46 | 3.74 | 3.29 | 3.01 | 2.74 |
| UniPC with EDM-opt [1] | 8.66 | 4.46 | 3.78 | 3.34 | 3.14 | 3.22 |
| Searched-Solver | 7.40 | 3.94 | 2.79 | 2.51 | 2.37 | 2.33 |

For DiT-XL/2-R512

| Methods \ NFEs | 5 | 6 | 7 | 8 | 9 | 10 |
| --- | --- | --- | --- | --- | --- | --- |
| UniPC with uniform-λ-opt [1] | 11.40 | 5.95 | 4.82 | 4.68 | 6.93 | 6.01 |
| UniPC with uniform-t-opt [1] | 11.40 | 5.95 | 4.64 | 4.36 | 4.05 | 3.81 |
| Searched-Solver (searched on DiT-XL/2-R256) | 10.28 | 6.02 | 4.31 | 3.74 | 3.54 | 3.64 |

Q.3 Reduce the search burden

First, a significant amount of computational resources is wasted on constructing the target trajectory. Since this target trajectory can be reused for each step in the search for solvers, we can cache it to prevent redundant recomputation.

Furthermore, we have observed that the solver optimized on base-sized or even small-sized models exhibits a high degree of generalization when applied to XL-sized models. Thus, using a small model as a proxy is a viable and practical choice. This approach not only reduces computational overhead but also provides a more efficient way to achieve good performance across different model scales.
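The caching idea described above might look like the following minimal sketch, where a fine-grained Euler reference trajectory is integrated once per sample and memoized so later search iterations reuse it. The function name and the module-level cache are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

_ref_cache = {}

def reference_trajectory(key, x0, velocity_fn, n_steps=100):
    """Euler-integrate a dense reference trajectory once and cache it.

    The expensive model evaluations are paid only on the first call
    for a given key; subsequent search iterations hit the cache.
    """
    if key in _ref_cache:
        return _ref_cache[key]
    ts = np.linspace(0.0, 1.0, n_steps + 1)
    xs = [np.asarray(x0, dtype=float)]
    for i in range(n_steps):
        # one fine Euler step along the model's velocity field
        xs.append(xs[-1] + (ts[i + 1] - ts[i]) * velocity_fn(xs[-1], ts[i]))
    _ref_cache[key] = (ts, np.stack(xs))
    return _ref_cache[key]
```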

Q.4 Alignment between two trajectories

Since the learned trajectory of length N has its corresponding timesteps, we select a subset of length N from the reference trajectory based on the timesteps of the learned trajectory.

Q.5 Total burden of searching

Searching one solver step with 50,000 samples using FlowDCN-B/2 requires approximately 30 minutes on 8 × H20 computation cards.

[1]: Xue, Shuchen, et al. "Accelerating diffusion sampling with optimized time steps." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

Reviewer Comment

Thanks for your response. One follow-up question on the trajectory loss: The reference trajectory may not necessarily contain values at the learned time steps, right? Are you using interpolation to obtain values at the learned timesteps from the reference trajectory?

Author Comment

Yes, the reference trajectory may not necessarily contain values at the learned time steps. However, the reference trajectory has many more points (100 reference steps in the default setting), so for each x_s in the source trajectory, we can directly select the closest point from the reference trajectory based on the nearest timestep, which is equivalent to nearest-neighbor interpolation.
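This nearest-neighbor alignment can be sketched in a few lines; the function and argument names below are our own, and the code is only an illustration of the selection rule described above.

```python
import numpy as np

def align_reference(ref_ts, ref_xs, learned_ts):
    """For each learned timestep, pick the nearest point on the
    dense reference trajectory (nearest-neighbor interpolation)."""
    learned = np.asarray(learned_ts)
    # pairwise |t_ref - t_learned|, then argmin over the reference axis
    idx = np.abs(ref_ts[None, :] - learned[:, None]).argmin(axis=1)
    return ref_xs[idx]
```

The selected subset can then be compared against the learned trajectory with a plain MSE loss.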

Official Review (Rating: 3)

The paper aims to accelerate reverse diffusion by integrating a novel differentiable solver search algorithm for better diffusion solvers. The paper demonstrates that a data-driven approach in the post-training scenario can also enable fast sampling. Using a compact search space related to the timesteps and solver coefficients, the proposed method can find the optimal solver parameter for each diffusion model. The experiment shows the effectiveness of the proposed method on multiple models compared to the current solver-based fast sampling method.

Update after rebuttal

The extended visualization and evaluation show the improvement of the solver search. Although the proposed method cannot be directly generalized to the multi-resolution scenario, it still offers a good solution for optimal timestep determination. Thus, my recommendation for this paper is weak accept.

Questions to the Authors

  • What are the CLIP score, aesthetic score, and GenEval score for PixArt-α? It would be helpful if the method could be evaluated on these metrics on large diffusion models, such as SD3.
  • What is the overhead to derive the optimal coefficients?

Claims and Evidence

The paper claims that the error caused by the non-ideal velocity estimation model can be estimated by a function related to the timesteps and coefficients. The claim is verified in the appendix.

Methods and Evaluation Criteria

The proposed method is evaluated on text-to-image generation using multiple metrics. However, these metrics are limited; for example, CLIP score, GenEval, and aesthetic score are not included.

Theoretical Claims

I checked the correctness of Theorem 4.4.

Experimental Design and Analysis

  • The experiments only provide a quantitative comparison for DDPM/VP text-to-image models but not for rectified-flow models; a similar evaluation should also be conducted.
  • Solver-based methods are also included in the comparison with distillation methods in Table 1.
  • The comparison between the proposed method and FlowTurbo is limited; more results should be shown, such as on different models and with metrics other than FID and IS.

Supplementary Material

I reviewed Appendices A–H.

Relation to Prior Literature

The proposed method might help reveal the error of each timestep and identify the importance of each diffusion timestep.

Missing Important References

n/a

Other Strengths and Weaknesses

  • The quality comparison is limited. Only Figure 2 provides a few examples.
  • More quality results would be helpful to demonstrate the effectiveness of the proposed method across different prompts and diffusion models.
  • More comparisons should focus on FlowTurbo, since both are parameterized velocity refiners.

Other Comments or Suggestions

  • Use “×” (\times) instead of “x” in Table 1.
  • There are duplicated parts in the supplementary materials, Sections G and L.
Author Response

We sincerely appreciate your valuable feedback on our manuscript. Your insights are extremely helpful and have provided us with clear directions for improvement.

Q.1 Quality comparison

We plan to expand the quality comparison by including more models, such as SD3, PixArt-α-R512, and PixArt-α-1024. This will provide a more comprehensive evaluation of performance and quality across a wider range of relevant models, enhancing the depth and validity of our analysis.

The anonymous visualization link: https://anonymous.4open.science/r/NeuralSolver-ICML25/README.md

Q.2 More Comparison with FlowTurbo

We presented the performance comparison in Tab. 4. Additionally, we have summarized the sampling and searching complexity relative to FlowTurbo in the table below. Note that the value of n will not exceed 15 steps.

| Method | Steps | NFE | NFE-CFG | Cache Pred | Order | Search samples | Params |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Adam2 | n | n | 2n | – | 2 | – | 2/n |
| Adam4 | n | n | 2n | – | 4 | – | 4/n |
| Heun | n | 2n | 4n | – | 2 | – | 2/n |
| FlowTurbo | n | >>n | >>2n | 2 | 2 | 540,000 (real) | 2.9×10⁷ |
| Ours | n | n | 2n | n | n | 50,000 (generated) | n + 0.5·n·(n−1) |
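For reference, the Params entry of the searched solver in the table above (n searched timesteps plus a lower-triangular coefficient table) can be computed with a small helper. This is our own illustrative reading of the formula, assuming one coefficient per step is fixed by a sum-to-one constraint so only n(n−1)/2 coefficient entries are free.

```python
def searched_solver_params(n):
    """Total searched parameters for an n-step solver:
    n timesteps + 0.5 * n * (n - 1) free coefficients."""
    return n + n * (n - 1) // 2
```

With the stated cap of n ≤ 15, the search space stays tiny, at most `searched_solver_params(15)` parameters, compared with FlowTurbo's 2.9×10⁷-parameter refiner.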

Q.3 PixArt-α on GenEval

We provide results for the solver (searched on DiT-XL/2-R256) applied to PixArt-α on the GenEval benchmark.

Resolution 512 for PixArt-α on the GenEval benchmark (cfg = 1.5)

| Method | Steps | cfg | colors | counting | color_attr | two_object | single_object | position | all |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| dpm++ | 5 | 1.5 | 72.07 | 27.19 | 5.75 | 26.26 | 91.25 | 3.00 | 0.37587 |
| dpm++ | 8 | 1.5 | 77.66 | 32.19 | 6.75 | 36.36 | 94.06 | 4.50 | 0.41921 |
| unipc | 5 | 1.5 | 73.14 | 25.94 | 6.25 | 26.26 | 90.94 | 3.00 | 0.37588 |
| unipc | 8 | 1.5 | 78.72 | 32.50 | 6.50 | 40.15 | 93.75 | 5.50 | 0.42875 |
| ours | 5 | 1.5 | 72.87 | 31.56 | 6.00 | 33.08 | 91.88 | 5.00 | 0.40065 |
| ours | 8 | 1.5 | 76.86 | 33.44 | 7.00 | 40.40 | 94.06 | 5.50 | 0.42878 |

Resolution 512 for PixArt-α on the GenEval benchmark (cfg = 2.0)

| Method | Steps | cfg | colors | counting | color_attr | two_object | single_object | position | all |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| dpm++ | 5 | 2.0 | 76.60 | 30.94 | 6.50 | 33.08 | 91.25 | 4.75 | 0.40519 |
| dpm++ | 8 | 2.0 | 76.86 | 37.19 | 5.25 | 39.65 | 93.75 | 5.75 | 0.43074 |
| unipc | 5 | 2.0 | 77.66 | 31.87 | 6.50 | 34.85 | 92.19 | 5.25 | 0.41387 |
| unipc | 8 | 2.0 | 79.52 | 36.56 | 6.72 | 40.66 | 95.31 | 6.00 | 0.44134 |
| ours | 5 | 2.0 | 77.62 | 33.75 | 5.25 | 37.37 | 92.81 | 4.75 | 0.41933 |
| ours | 8 | 2.0 | 79.52 | 38.44 | 7.25 | 42.68 | 95.00 | 7.50 | 0.45064 |

Q.4 Total burden of searching

Searching one solver step with 50,000 samples using FlowDCN-B/2 requires approximately 30 minutes on 8 × H20 computation cards.

Reviewer Comment

Thanks for the authors' reply. The extended visualization and evaluation show the improvement of the solver search. I have a follow-up question regarding the method: current flow-matching models apply a timestep shift when the sampling resolution changes. Does the method still work for varied resolutions?

Author Comment

Due to the strong coupling between our coefficients and time steps, we can no longer apply a timestep shift on top of them. However, we found that directly transferring the search results still yields satisfactory performance. To pursue the best performance, one would need to conduct a search specifically tailored to the target resolution.

Official Review (Rating: 4)

This paper proposes a differentiable solver search algorithm to find an optimal ODE solver for reverse-diffusion solving of pre-trained diffusion models. The authors use gradient-based optimization to identify solver parameters that lead to improved sample quality with very few function evaluations. The approach is evaluated on both rectified flow models and DDPM/VP frameworks, showing improvements in FID scores on ImageNet benchmarks under 10 sampling steps.

Questions to the Authors

No

Claims and Evidence

The authors claim that their differentiable search method significantly reduces discretization error compared to traditional solvers. This claim is supported by extensive experiments, including comparisons to state-of-the-art methods such as DPM-Solver++ and UniPC, as well as ablation studies examining the impact of search sample size and solver parameterization. The theoretical analysis, detailed in the appendix, provides error bounds that reinforce the empirical findings.​

Methods and Evaluation Criteria

The methodology addresses limitations of t-related Lagrange interpolation in existing fast sampling solvers by reparameterizing solver coefficients and timesteps into a differentiable framework. The evaluation is comprehensive, utilizing FID and other metrics across multiple model architectures and resolutions. The choice of benchmarks, including ImageNet-256 and ImageNet-512, and the inclusion of both rectified flow and DDPM-based models, provide a strong basis for assessing the method’s generality and effectiveness.​

Theoretical Claims

The paper provides theoretical support for its solver-search method by deriving explicit bounds on discretization error. Key results include Theorem 4.4, showing that solver error depends explicitly on solver coefficients and timesteps, and Theorem 4.2, establishing the optimality of expectation-based solver coefficients over traditional Adams-like interpolation. Theorem 4.5 further argues analytically that the proposed solver achieves tighter error bounds than conventional multistep methods. These results justify the approach theoretically; I also checked the proofs of these claims, though not very carefully.

Experimental Design and Analysis

The experimental evaluation seems to be quite comprehensive, with detailed comparisons to recent solver-based methods. The ablation studies are informative, demonstrating how the performance of the searched solver varies with different numbers of search samples and parameter settings.

Supplementary Material

The supplementary material includes extended experimental results, additional metrics (sFID, IS, Precision, Recall), and detailed proofs of the theoretical claims.

Relation to Prior Literature

The authors situate their work within the context of recent advances in fast diffusion sampling and solver-based methods. The paper builds on insights from prior works on DDPM/VP solvers and rectified flow models, providing relevant comparisons to state-of-the-art techniques like DPM-Solver++ and UniPC. This discussion clarifies how the proposed approach advances the current understanding of efficient diffusion sampling.

Missing Important References

The authors reference most relevant works.

Other Strengths and Weaknesses

Strengths:

  1. The paper is well-written and well-structured.
  2. The experiments are quite comprehensive for evaluating the method.

Weakness:

  1. The improvements, while consistent, are incremental compared to existing solvers.
  2. The paper would benefit from a more detailed discussion on the computational overhead of the search process.​

Other Comments or Suggestions

No

Author Response

Thanks for your valuable feedback on our manuscript.

Q.1 Total burden of searching

Searching one solver step with 50,000 samples using FlowDCN-B/2 requires approximately 30 minutes on 8 × H20 computation cards.

Q.2 More Quality comparison

We plan to expand the quality comparison by including more models, such as SD3, PixArt-α-R512, and PixArt-α-1024. This will provide a more comprehensive evaluation of performance and quality across a wider range of relevant models, enhancing the depth and validity of our analysis.

The anonymous visualization link: https://anonymous.4open.science/r/NeuralSolver-ICML25/README.md

Final Decision

All reviewers recommend acceptance (1 accept, 2 weak accepts). After reading the paper, reviews, and discussion, I agree with the reviewers and recommend acceptance.