PaperHub
Rating: 6.0/10 · Poster · 3 reviewers
Scores: 6, 6, 6 (min 6, max 6, std 0.0)
Confidence: 3.7
Correctness: 3.0 · Contribution: 3.0 · Presentation: 2.7
ICLR 2025

PFDiff: Training-Free Acceleration of Diffusion Models Combining Past and Future Scores

OpenReview · PDF
Submitted: 2024-09-26 · Updated: 2025-03-23
TL;DR

We propose a new training-free fast sampler for accelerated sampling of diffusion models, which is orthogonal to existing fast solvers.

Abstract

Keywords
diffusion models · accelerated sampling · training-free sampler · orthogonal sampling method

Reviews and Discussion

Review (Rating: 6)

This paper introduces a sampling method to accelerate first-order ODE solvers by utilizing the past gradient ($\epsilon$) and the future gradient, leading to FID improvements in the NFE regime between 4 and 20. The authors also accelerate higher-order ODE solvers by using past gradients. Empirical results show that the sampling method works well for continuous and discrete diffusion models under unconditional and conditional settings.

Strengths

  1. Unlike other generic ODE solvers (e.g., Heun 2nd-order solver, DPM-solver), this paper proposes the reuse of the network output from previous time steps to accelerate the sampling based on the observation of the output similarity between two consecutive time steps.
  2. Experiments on various diffusion models (continuous, discrete, conditional, unconditional) verify that PFDiff-1 outperforms other ODE solvers in the low NFE regime (4-20).

Weaknesses

  1. The proposed method lacks theoretical support and is mainly motivated by the output similarity observation shown in Figure 2a. How reliable is this similarity? How much does this observation vary in different diffusion frameworks? Please provide more details of the experiments in Figures 2a and 2b; the details can go into the appendix. I suggest the authors explore the connection between the sampling trajectory and the proposed sampling method. I think the curvature of the trajectory can explain the reuse of the gradient and your methods. Refer to [1] and [2] for details of trajectory shape.

  2. The overall writing is problematic and significantly affects the readability of this paper. I list some issues below:

  • The definition of Q is not clear: in line 222, plugging in n=0 does not give $x_{t_{i-1}}$. Please rethink the expression of Q since it is used throughout the paper.
  • If the proposal of using future gradients is based on Proposition 3.1, why not put the proposition at the beginning of Section 3.3?
  • The function $s(\cdot)$ is not defined in Eq. 7 and Eq. 8.
  • The function $h(\cdot)$ is not defined in Eq. 9.
  • Line 348: the notations $l$ and $h$ are undefined.
  • In Figure 2b, treating the samples derived from 1000 NFE as the ground truth is not rigorous.
  3. The authors claim that PFDiff is effective and orthogonal to existing ODE solvers; please provide the FID results of PFDiff in the regime NFE > 20 to support the claim.

  4. In Figure 4, some FID results of PFDiff are missing (NFE=4 and NFE>12). In Figure 5, some FID results of PFDiff are missing (NFE>10).

  5. In Figure 4b, why is PFDiff worse than the baseline Analytic-DDIM when NFE=6? A similar outlier appears in Figure 4a.

  6. In Figure 5, the results of DPM-Solver+PFDiff are missing.

  7. I encourage the authors to also compare the FID of PFDiff with [2].

Others:

  1. Line 52: the last two papers were published in 2023, not 2024; please cite papers correctly.
  2. I suggest the authors move Figure 1 to the appendix to leave space for the main content.

[1] Sabour, Amirmojtaba, Sanja Fidler, and Karsten Kreis. "Align your steps: Optimizing sampling schedules in diffusion models." ICML 2024.

[2] Zhou, Zhenyu, et al. "Fast ODE-based sampling for diffusion models in around 5 steps." CVPR 2024.

Questions

Have the authors also tried adding the future gradient step to higher-order ODE solvers?

Comment

W7: The comparison of FID between PFDiff and AMED-Solver [R1].

A: In response to the reviewer's suggestion, we have included additional FID comparison results between PFDiff and AMED-Solver [R1]. Notably, [R1] is a TRAINING-BASED method, while PFDiff is a TRAINING-FREE method. Using the same pre-trained EDM model [R2], we calculated the FID on the ImageNet 64×64 dataset, as shown in the following table:

ImageNet 64×64, FID↓

| Method \ NFE | 4 | 6 | 8 | 10 |
| --- | --- | --- | --- | --- |
| DDIM | 58.43 | 34.03 | 22.59 | 16.72 |
| DDIM+AMED-Plugin [R1] | 41.80 | 32.49 | 22.04 | 16.60 |
| AMED-Solver [R1] | 30.03 | 19.24 | 9.58 | 6.60 |
| DDIM+PFDiff | 55.63 | 17.82 | 8.00 | 5.38 |

The results show that the proposed PFDiff, which incurs no training cost, can even outperform AMED-Solver [R1], one of the currently most competitive training-based methods, under most NFE settings. This further demonstrates the effectiveness of the proposed PFDiff.

Others: Correct the reference years and move Figure 1 to the appendix.

A: Thank you for the reviewer's comment. We have moved Figure 1 to Appendix D.9 as suggested and have corrected the reference year in line 52.

Q1: Have the authors also tried adding the future gradient step to higher-order ODE solvers?

A: Yes, we have. However, adding the future gradient to higher-order ODE solvers is highly sensitive to the choices of hyperparameters $k$ and $h$, leading to unstable results. We believe this instability may arise from the fact that higher-order solvers do not converge with a few NFE (<10) [R3].

[R1] Zhou, Zhenyu, et al. "Fast ode-based sampling for diffusion models in around 5 steps." CVPR 2024.

[R2] Tero Karras et al., "Elucidating the design space of diffusion-based generative models." NeurIPS 2022.

[R3] Cheng Lu et al., "DPM-solver: a fast ODE solver for diffusion probabilistic model sampling in around 10 steps." NeurIPS 2022.

Comment

I appreciate the authors' effort in addressing my concerns; I have increased my rating since some of my doubts are resolved.

I like the idea of using past and future scores. However, the proposed ODE solver PFDiff-k_h still has the issue of empirically searching for the optimal k and h in practice. Table 7 in the appendix shows that the sample quality (FID) is sensitive to the choice of k and h across different datasets and NFE.

While the authors acknowledge the correlation between the trajectory shape of the diffusion process and the performance of PFDiff-k_h, I suggest that further investigation into a systematic method for determining k and h would be very beneficial, as it would elucidate the underlying mechanics of the proposed method more clearly. It could also enhance the robustness and efficiency of the solver.

Comment

We sincerely appreciate the reviewer for carefully reviewing our work and providing valuable comments. Below are our responses to all questions. We would greatly appreciate it if you could consider increasing the score if you are satisfied with our response.

W1.1: The proposed method lacks theoretical support and is mainly motivated by the output similarity observation shown in Figure 2a. How reliable is the similarity observed in Figure 2a, and how much does it vary across different diffusion frameworks?

A: It is important to emphasize that our method consists of two core components: the use of future gradients and the use of past gradients. In Equation (15) and Proposition 3.1, we employ a Taylor expansion and further derivations to theoretically demonstrate that the use of future gradients can reduce discretization errors. Furthermore, the use of past gradients, based on the similarity observation in Figure 1a (original Figure 2a), is highly reliable. We have supplemented additional experimental results in Figure 5a and further analysis in Appendix D.8. The results demonstrate that this similarity exhibits strong consistency across three different diffusion frameworks: DDPM, ScoreSDE [R1], and EDM [R2], which fully validates the reliability of the similarity observation.

W1.2: Please provide more details of the experiments in Figures 2a and 2b.

A: We have provided more details of the experiments regarding Figures 1a (original Figure 2a) and 1b (original Figure 2b) in Appendix D.8. Additionally, we added Figure 5 in Appendix D.8 to further expand the experimental content of this part. Figure 5a demonstrates that the similarity observation is consistent across different diffusion frameworks, while Figure 5b further validates that the future gradient is more reliable than the "springboard".

W1.3: Please explore the connection between the proposed method and the curvature of the sampling trajectory.

A: Thank you for your valuable suggestions. We have carefully reviewed the literature on sampling trajectories [R3], [R4], [R5] and have added the relevant analysis to Section 3.5. In summary, [R4] indicates that diffusion model sampling trajectories lie in a low-dimensional subspace and are close to a straight line. This supports the rationale behind the use of past gradients. [R5] notes that the trajectories exhibit a "boomerang" shape, meaning the curvature of the sampling trajectory starts small, then increases, and finally decreases. Based on this observation, we further analyze that PFDiff utilizes future gradients to predict the future update direction in the large-curvature regions of the trajectory, thereby correcting discretization errors introduced by sampling along the tangent direction. Additionally, the PFDiff correction process shown in Figure 1c (original Figure 2c) is consistent with this analysis.

W2: Regarding writing issues.

A: We sincerely thank the reviewer for highlighting the writing issues. While we respectfully acknowledge the concerns raised, we have a differing perspective on some points and will address each one individually.

  • The definition of $Q$ is not clear. We corrected the definition of $Q$ on lines 183 to 184. Specifically, when $n=0$, $\hat{t}_0 = t_{i-1}$.
  • Why not put Proposition 3.1 at the beginning of Section 3.3? We made this arrangement to ensure that the logical description of the PFDiff update process in Sections 3.2 and 3.3 is more coherent.
  • The function $s(\cdot)$ is not defined in Eq. (7) and Eq. (8). In fact, the function $s(\cdot)$ is already defined in Eq. (7). Specifically, $s(\epsilon_\theta(x_t,t),x_t,t) := f(t)x_t+\frac{g^2(t)}{2\sigma_t}\epsilon_\theta(x_t,t)$.
  • The function $h(\cdot)$ is not defined in Eq. (9). In fact, the function $h(\cdot)$ is already defined in lines 185 to 189.
  • Line 348: the notations $l$ and $h$ are undefined. We added the definition of $h$ in lines 327 to 328. Notably, the "1" is the digit "1" (not the letter $l$).
  • Using 1000 NFE as the ground truth is not rigorous. We removed the expression "ground truth" in Figure 1b (original Figure 2b).

[R1] Yang Song et al., "Score-based generative modeling through stochastic differential equations." ICLR 2021.

[R2] Tero Karras et al., "Elucidating the design space of diffusion-based generative models." NeurIPS 2022.

[R3] Sabour, Amirmojtaba, Sanja Fidler, and Karsten Kreis. "Align your steps: Optimizing sampling schedules in diffusion models." ICML 2024.

[R4] Zhou, Zhenyu, et al. "Fast ode-based sampling for diffusion models in around 5 steps." CVPR 2024.

[R5] Defang Chen, et al. "On the Trajectory Regularity of ODE-based Diffusion Sampling." ICML 2024.

Comment

W3: Additional FID results of PFDiff for NFE > 20.

A: We supplemented the FID experimental results for NFE > 20 on the CIFAR10 and CelebA 64×64 datasets, as shown in the tables below:

CIFAR10, FID↓

| Method \ NFE | 20 | 50 | 100 | 200 | 1000 |
| --- | --- | --- | --- | --- | --- |
| DDIM | 7.04 | 4.81 | 4.30 | 4.00 | 4.04 |
| DDIM+PFDiff | 3.68 | 3.48 | 3.56 | 3.57 | – |

CelebA 64×64, FID↓

| Method \ NFE | 20 | 50 | 100 | 200 | 1000 |
| --- | --- | --- | --- | --- | --- |
| DDIM | 14.76 | 9.22 | 6.57 | 4.77 | 3.51 |
| DDIM+PFDiff | 5.66 | 3.99 | 2.45 | 2.58 | – |

PFDiff is designed for diffusion model acceleration, which is why we focus on fewer NFE. Still, as can be seen from the tables, even for NFE > 20, PFDiff consistently improves the sampling quality of DDIM and achieves faster convergence.

W4: Some FID results of PFDiff are missing.

A: In fact, the missing FID results for PFDiff were originally provided in Tables 2-6 of Appendix D. These results were deliberately omitted to make the comparisons in the figures clearer. For example, when NFE=12 in Figure 3 (original Figure 4) and NFE=10 in Figure 4 (original Figure 5), PFDiff already outperforms the comparison methods with 20 NFE. Therefore, omitting the subsequent PFDiff results helps to highlight the effectiveness of our method.

W5: Why is PFDiff worse than the baseline Analytic-DDIM when NFE=6?

A: This phenomenon may be due to the introduction of new errors in the estimation of the optimal variance by Analytic-DDIM [R1], which are amplified by PFDiff when NFE is small, leading to some unpredictable performance degradation. Specifically, although Analytic-DDIM derives the optimal variance for DDIM sampling from a theoretical perspective, the optimal variance contains an unknown term, $\mathbb{E}_{q_n(\boldsymbol{x}_n)} \frac{\| \nabla_{\boldsymbol{x}_n} \log q_n(\boldsymbol{x}_n)\|^2}{d}$, which is approximated on different datasets [R1]. This introduces new errors.

W6: In Figure 5, the results of DPM-solver+PFDiff are missing.

A: We did not include the experimental results for the following reasons: First, DDIM and DPM-Solver-1 are equivalent [R2]. Second, with fewer NFE, higher-order DPM-Solvers perform worse than DDIM on Stable-Diffusion [R3]. Finally, DDIM+PFDiff has outperformed other previously best-performing solvers. Additionally, we also provide experimental results applying PFDiff to DPM-Solver in Stable Diffusion with a guidance scale of 7.5, as shown in the table below:

Stable Diffusion, FID↓

| Method \ NFE | 4 | 6 | 8 | 10 | 15 | 20 |
| --- | --- | --- | --- | --- | --- | --- |
| DPM-Solver-1 [R2] | 35.48 | 20.33 | 17.46 | 16.78 | 16.08 | 15.95 |
| DPM-Solver-1+PFDiff | 29.02 | 15.47 | 13.26 | 13.06 | 13.57 | 13.97 |
| DPM-Solver-2 [R2] | 184.21 | 157.95 | 148.67 | 135.81 | 92.62 | 40.47 |
| DPM-Solver-2+PFDiff | 147.20 | 106.24 | 57.07 | 31.66 | 17.87 | 14.13 |

As shown in the table above, PFDiff further improves the sampling quality of DPM-Solver-1 and DPM-Solver-2. Notably, the results in the table come from the original DPM-Solver implementation, while the DPM-Solver results in the paper utilized additional tricks from [R3] to achieve better performance.

[R1] Fan Bao, et al. "Analytic-dpm: an analytic estimate of the optimal reverse variance in diffusion probabilistic models.", ICLR 2022.

[R2] Cheng Lu et al., “DPM-solver: a fast ODE solver for diffusion probabilistic model sampling in around 10 steps.” NeurIPS 2022.

[R3] Cheng Lu et al., "Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models." arXiv:2211.01095.

Comment

Thank you for your response and for raising the rating! We are so glad to hear that some of your doubts have been resolved!

Regarding the new issue about the search for parameters $k$ and $h$, we emphasize that the sample quality is only slightly affected by the choice of $k$ and $h$. By fixing $k=2$ and $h=1$, PFDiff achieves fairly satisfactory results across different NFE and datasets, as shown in Table 7. Furthermore, in Appendix D.6.1, we have proposed that by sampling only 5k samples to compute the FID, the optimal values of $k$ and $h$ can be easily determined. Notably, the search can also be easily generalized to other evaluation metrics. As suggested by the reviewer, we further analyzed the trajectory shape of the diffusion process and propose a new automatic search strategy, based on truncation errors and incurring almost no additional cost, for selecting $k$ and $h$. The details are as follows:

For the trajectory-shape analysis, as discussed in Section 3.5, PFDiff uses future scores to correct the discretization errors of the baseline solver in regions of large curvature along the trajectory. This indicates that PFDiff accumulates less truncation error than the baseline solver. To validate this conclusion, we added Figure 5 in Appendix D.6.1, which compares the truncation errors of DDIM with and without PFDiff. Figure 5 clearly shows that PFDiff significantly reduces truncation errors.

Based on the correlation between PFDiff's improvement in sample quality and its reduction in truncation errors, we propose the automatic search strategy: we "warm up" only 256 samples for each combination of $k$ and $h$, compute the average truncation error (MSE) for each, and use this to determine the values of $k$ and $h$. To validate the effectiveness of this strategy, we conducted experiments on CIFAR10, varying the values of NFE, $k$, and $h$. For each combination, we sampled 256 samples and computed the average truncation error relative to DDIM with 1000 NFE. The results are shown in Table A1 below (for clearer comparison, we also provide the FID results with 50k samples in Table A2, i.e., Table 7 in the paper):

Table A1: Truncation error (MSE↓), only 256 samples

| Method \ NFE | 4 | 6 | 8 | 10 | 12 | 15 | 20 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DDIM | 0.1009 | 0.0608 | 0.0414 | 0.0314 | 0.0255 | 0.0199 | 0.0152 |
| +PFDiff-1 | 0.0542 | 0.0217 | 0.0131 | 0.0100 | 0.0089 | 0.0082 | 0.0081 |
| +PFDiff-2_1 | 0.0277 | 0.0137 | 0.0110 | 0.0104 | 0.0098 | 0.0093 | 0.0088 |
| +PFDiff-2_2 | 0.1001 | 0.0468 | 0.0277 | 0.0184 | 0.0145 | 0.0122 | 0.0105 |
| +PFDiff-3_1 | 0.0218 | 0.0167 | 0.0146 | 0.0130 | 0.0120 | 0.0107 | 0.0098 |
| +PFDiff-3_2 | 0.0614 | 0.0228 | 0.0133 | 0.0101 | 0.0089 | 0.0083 | 0.0082 |
| +PFDiff-3_3 | 0.1790 | 0.0820 | 0.0444 | 0.0299 | 0.0224 | 0.0165 | 0.0126 |

Table A2: FID↓, 50k samples

| Method \ NFE | 4 | 6 | 8 | 10 | 12 | 15 | 20 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DDIM | 65.70 | 29.68 | 18.45 | 13.66 | 11.01 | 8.80 | 7.04 |
| +PFDiff-1 | 124.73 | 19.45 | 5.78 | 4.95 | 4.63 | 4.25 | 4.14 |
| +PFDiff-2_1 | 59.61 | 9.84 | 7.01 | 6.31 | 5.58 | 5.18 | 4.78 |
| +PFDiff-2_2 | 167.12 | 53.22 | 8.43 | 4.95 | 4.41 | 4.10 | 3.78 |
| +PFDiff-3_1 | 22.38 | 13.40 | 9.40 | 7.70 | 6.73 | 6.03 | 5.05 |
| +PFDiff-3_2 | 129.18 | 19.35 | 5.64 | 4.57 | 4.39 | 4.19 | 4.08 |
| +PFDiff-3_3 | 205.87 | 76.62 | 20.84 | 5.71 | 4.73 | 4.41 | 3.68 |

Based on Table A1, we selected the $k$ and $h$ corresponding to the minimal truncation error, and further obtained the corresponding FID values {22.38, 9.84, 7.01, 4.95, 4.63/4.39, 4.25, 4.14} with NFE ∈ {4, 6, 8, 10, 12, 15, 20} in Table A2. This set of FID values is comparable to the optimal FID, indicating that determining $k$ and $h$ based on truncation error is reasonable. Notably, in Table A1, we "warmed up" only 256 samples, so the additional cost introduced by the automatic search strategy is negligible. The 256 samples are sufficient to capture the dataset's statistical properties because the shapes of the sampling trajectories are highly similar [R1]. The above-mentioned results and analyses have been added to lines 477-479 of Section 4.3 and Appendix D.6.1.
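The selection rule of this automatic search can be sketched in a few lines of Python. This is an illustrative sketch of our own, not the paper's code: `select_config` and the dictionary layout are hypothetical, and the MSE values are transcribed from Table A1 (NFE 4-10 only).

```python
def select_config(trunc_err, nfe):
    """Return the PFDiff configuration (i.e., the (k, h) choice) whose
    256-sample truncation error is minimal at the given NFE."""
    return min(trunc_err, key=lambda cfg: trunc_err[cfg][nfe])

# Truncation errors (MSE) transcribed from Table A1 (CIFAR10, NFE 4-10).
table_a1 = {
    "+PFDiff-1":   {4: 0.0542, 6: 0.0217, 8: 0.0131, 10: 0.0100},
    "+PFDiff-2_1": {4: 0.0277, 6: 0.0137, 8: 0.0110, 10: 0.0104},
    "+PFDiff-2_2": {4: 0.1001, 6: 0.0468, 8: 0.0277, 10: 0.0184},
    "+PFDiff-3_1": {4: 0.0218, 6: 0.0167, 8: 0.0146, 10: 0.0130},
    "+PFDiff-3_2": {4: 0.0614, 6: 0.0228, 8: 0.0133, 10: 0.0101},
    "+PFDiff-3_3": {4: 0.1790, 6: 0.0820, 8: 0.0444, 10: 0.0299},
}

for nfe in (4, 6, 8, 10):
    print(nfe, select_config(table_a1, nfe))
# NFE=4 -> +PFDiff-3_1 (FID 22.38), NFE=6 -> +PFDiff-2_1 (FID 9.84),
# NFE=8 -> +PFDiff-2_1 (FID 7.01), NFE=10 -> +PFDiff-1 (FID 4.95)
```

The configurations picked by the minimal-MSE rule match the FID selections quoted above {22.38, 9.84, 7.01, 4.95, ...} from Table A2.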

[R1] Defang Chen, et al. "On the Trajectory Regularity of ODE-based Diffusion Sampling." ICML 2024.

Comment

Thank you again for the detailed response and great effort.

Could you extend the automatic strategy experiments to additional datasets beyond CIFAR-10 to demonstrate its robustness?

I will undoubtedly raise my rating if the automatic search shows good generalization.

Comment

Thank you again for your response and insightful suggestions!

Per the reviewer's suggestion, we have extended the automatic search strategy to additional datasets, including CelebA 64×64, LSUN Church 256×256, and LSUN Bedroom 256×256. Specifically, we "warm up" only 256 samples for each combination of $k$, $h$, and NFE, and computed the average truncation error (MSE) relative to DDIM with 1000 NFE. The results are presented in Tables A1, A3, and A5 below. For clearer comparison, we also provide the FID results in Tables A2, A4, and A6. Based on Tables A1, A3, and A5, we selected the values of $k$ and $h$ corresponding to the minimal truncation error, and further obtained the corresponding FID values {13.29, 8.38, 5.88, 5.41, 5.24, 5.18, 5.19}, {37.90, 18.30, 14.34, 13.41, 12.90, 12.61, 11.76}, and {78.57, 18.72, 12.35, 9.55, 8.82, 8.34, 7.70} from Tables A2, A4, and A6, with NFE ∈ {4/5, 6, 8, 10, 12, 15, 20}. Consistent with CIFAR-10, this set of FID values is comparable to the optimal FID, indicating that the automatic search strategy is robust and generalizable.

Table A1: CelebA 64×64, Truncation error (MSE↓), only 256 samples

| Method \ NFE | 4 | 6 | 8 | 10 | 12 | 15 | 20 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DDIM | 0.0814 | 0.0614 | 0.0494 | 0.0426 | 0.0388 | 0.0357 | 0.0336 |
| +PFDiff-1 | 0.0689 | 0.0427 | 0.0354 | 0.0328 | 0.0322 | 0.0323 | 0.0329 |
| +PFDiff-2_1 | 0.0470 | 0.0339 | 0.0312 | 0.0310 | 0.0311 | 0.0316 | 0.0321 |
| +PFDiff-2_2 | 0.1184 | 0.0578 | 0.0424 | 0.0378 | 0.0362 | 0.0357 | 0.0354 |
| +PFDiff-3_1 | 0.0389 | 0.0321 | 0.0315 | 0.0315 | 0.0316 | 0.0318 | 0.0320 |
| +PFDiff-3_2 | 0.0903 | 0.0402 | 0.0338 | 0.0327 | 0.0324 | 0.0329 | 0.0331 |
| +PFDiff-3_3 | 0.1859 | 0.0808 | 0.0521 | 0.0440 | 0.0406 | 0.0389 | 0.0375 |

Table A2: CelebA 64×64, FID↓, 50k samples

| Method \ NFE | 4 | 6 | 8 | 10 | 12 | 15 | 20 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DDIM | 37.76 | 20.99 | 14.10 | 10.86 | 9.01 | 7.67 | 6.50 |
| +PFDiff-1 | 90.60 | 24.94 | 5.49 | 4.88 | 4.71 | 4.70 | 4.72 |
| +PFDiff-2_1 | 48.66 | 7.53 | 5.88 | 5.41 | 5.24 | 5.18 | 4.97 |
| +PFDiff-2_2 | 221.52 | 53.55 | 10.71 | 5.41 | 4.90 | 4.79 | 4.79 |
| +PFDiff-3_1 | 13.29 | 8.38 | 7.03 | 6.33 | 5.92 | 5.73 | 5.19 |
| +PFDiff-3_2 | 131.02 | 22.63 | 5.06 | 4.71 | 4.60 | 4.70 | 4.68 |
| +PFDiff-3_3 | 286.12 | 100.7 | 22.95 | 6.55 | 5.61 | 5.21 | 5.08 |

Comment

Table A3: LSUN Church 256×256, Truncation error (MSE↓), only 256 samples

| Method \ NFE | 5 | 6 | 8 | 10 | 12 | 15 | 20 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DDIM | 0.0772 | 0.0621 | 0.0444 | 0.0337 | 0.0269 | 0.0206 | 0.0146 |
| +PFDiff-1 | 0.5523 | 0.3391 | 0.0598 | 0.0345 | 0.0232 | 0.0136 | 0.0068 |
| +PFDiff-2_1 | 0.0256 | 0.0195 | 0.0141 | 0.0111 | 0.0091 | 0.0071 | 0.0050 |
| +PFDiff-2_2 | 0.8345 | 0.6097 | 0.2709 | 0.0662 | 0.0424 | 0.0251 | 0.0131 |
| +PFDiff-3_1 | 0.0323 | 0.0268 | 0.0192 | 0.0151 | 0.0122 | 0.0095 | 0.0067 |
| +PFDiff-3_2 | 0.2108 | 0.0456 | 0.0140 | 0.0103 | 0.0078 | 0.0055 | 0.0033 |
| +PFDiff-3_3 | 0.9681 | 0.7627 | 0.4034 | 0.1033 | 0.0584 | 0.0350 | 0.0183 |

Table A4: LSUN Church 256×256, FID↓, 50k samples

| Method \ NFE | 5 | 6 | 8 | 10 | 12 | 15 | 20 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DDIM | 73.43 | 50.02 | 30.04 | 22.04 | 17.66 | 14.58 | 12.49 |
| +PFDiff-1 | 214.89 | 143.60 | 55.00 | 38.01 | 28.95 | 21.80 | 16.48 |
| +PFDiff-2_1 | 37.90 | 18.30 | 14.35 | 13.27 | 12.47 | 12.25 | 11.64 |
| +PFDiff-2_2 | 277.36 | 231.01 | 113.18 | 47.71 | 35.47 | 26.00 | 18.64 |
| +PFDiff-3_1 | 26.69 | 21.19 | 15.48 | 13.36 | 12.05 | 11.77 | 11.12 |
| +PFDiff-3_2 | 169.02 | 54.35 | 14.34 | 13.41 | 12.90 | 12.61 | 11.76 |
| +PFDiff-3_3 | 313.39 | 274.65 | 116.95 | 56.95 | 39.80 | 28.90 | 19.82 |

Table A5: LSUN Bedroom 256×256, Truncation error (MSE↓), only 256 samples

| Method \ NFE | 5 | 6 | 8 | 10 | 12 | 15 | 20 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DDIM | 0.0675 | 0.0530 | 0.0370 | 0.0276 | 0.0216 | 0.0158 | 0.0108 |
| +PFDiff-1 | 0.5397 | 0.3172 | 0.0334 | 0.0167 | 0.0111 | 0.0066 | 0.0035 |
| +PFDiff-2_1 | 0.0187 | 0.0131 | 0.0090 | 0.0065 | 0.0054 | 0.0040 | 0.0028 |
| +PFDiff-2_2 | 0.8693 | 0.7190 | 0.2373 | 0.0380 | 0.0232 | 0.0140 | 0.0072 |
| +PFDiff-3_1 | 0.0237 | 0.0197 | 0.0137 | 0.0110 | 0.0085 | 0.0062 | 0.0041 |
| +PFDiff-3_2 | 0.2081 | 0.0399 | 0.0085 | 0.0052 | 0.0037 | 0.0024 | 0.0015 |
| +PFDiff-3_3 | 0.9905 | 0.9136 | 0.4091 | 0.0718 | 0.0347 | 0.0208 | 0.0110 |

Table A6: LSUN Bedroom 256×256, FID↓, 50k samples

| Method \ NFE | 5 | 6 | 8 | 10 | 12 | 15 | 20 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DDIM | 71.02 | 47.40 | 26.73 | 19.26 | 15.23 | 11.68 | 9.26 |
| +PFDiff-1 | 242.74 | 172.33 | 94.23 | 67.27 | 52.72 | 38.62 | 26.25 |
| +PFDiff-2_1 | 78.57 | 18.72 | 11.50 | 9.28 | 8.36 | 7.76 | 7.14 |
| +PFDiff-2_2 | 287.01 | 244.58 | 131.94 | 74.24 | 57.70 | 43.93 | 31.28 |
| +PFDiff-3_1 | 25.32 | 19.07 | 13.59 | 10.67 | 9.18 | 8.03 | 7.14 |
| +PFDiff-3_2 | 221.82 | 114.91 | 12.35 | 9.55 | 8.82 | 8.34 | 7.70 |
| +PFDiff-3_3 | 317.00 | 279.26 | 142.31 | 82.88 | 60.68 | 46.47 | 33.17 |

Comment

Although the measured truncation error does not perfectly align with the optimal FID, the automatic search strategy still yields sub-optimal FID, so I increased my rating.

Comment

Thank you for recognizing the automatic search strategy and for increasing the rating!

We also greatly appreciate the reviewer’s valuable and insightful feedback during the review process, which has greatly contributed to the improvement of our manuscript!

Review (Rating: 6)

The PFDiff paper introduces a novel, training-free, and orthogonal timestep-skipping mechanism to improve existing ODE solvers used in Diffusion Probabilistic Models. The proposed approach helps reach solutions with fewer NFE, with the aid of a springboard along with foresight updates. This addresses a significant challenge in reducing computational cost while keeping high sample quality. Furthermore, PFDiff improves the efficiency and quality of diffusion model sampling.

Strengths

1). The paper is well written and easy to understand.

2). The illustrations and the provided algorithms further help in understanding the paper.

3). The paper identifies a limitation of DPMs, namely their low sampling efficiency, as they often require a large number of denoising steps. Existing methods tend to amplify discretization errors when NFE is below 10, often leading to convergence issues. The proposed approach, named PFDiff, is a training-free and orthogonal timestep-skipping algorithm that helps mitigate these errors while operating with fewer NFE.

4). PFDiff exploits the potential for improvement in existing training-free accelerated methods, and the sequence of observations that led to its development is remarkable.

5). The proposed sampler can be integrated with ODE solvers of any order, regardless of type.

Weaknesses

I believe this is a good paper as it provides valuable insights with solid reasoning, but I have some questions regarding the scalability of the proposed method, as well as about the k and h values.

1). I would like to know how PFDiff will maintain quality across types of diffusion models other than those mentioned in the paper.

2). As the algorithm's construction is based on gradients, I would like to know what happens if the gradients show dispersion. How is this kind of scenario handled? Also, is there a possibility of accumulating errors in the proposed approach?

3). More ablation on the parameters k and h would further enhance the paper. For instance, is it possible to further increase the value of k? In that case, how would PFDiff perform?

Questions

Please see the weaknesses section.

Comment

We sincerely appreciate the reviewer's valuable review of the manuscript and the recognition of the work presented in the paper. Below are our responses to all questions. We kindly hope you could consider increasing the score if you are satisfied.

W1: How will PFDiff maintain quality across different types of diffusion models?

A: All diffusion models can be unified under the perspective of SDE/ODE modeling [R1], and the sampling process of PFDiff is built upon the theoretical foundation of ODE. Therefore, theoretically, PFDiff can be applied to all types of diffusion models. Additionally, beyond the DDPM and ScoreSDE [R1] framework models mentioned in the paper, we have also included experimental results from a more advanced diffusion model within the EDM [R2] framework, as shown in the table below:

ImageNet 64×64, FID↓

| Method \ NFE | 4 | 6 | 8 | 10 | 12 | 15 | 20 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DDIM | 58.43 | 34.03 | 22.59 | 16.72 | 13.14 | 10.04 | 7.42 |
| DDIM+PFDiff | 55.63 | 17.82 | 8.00 | 5.38 | 4.36 | 3.72 | 3.35 |

As can be seen from the table, PFDiff also effectively improves the sampling quality of DDIM within the EDM framework, further validating the broad adaptability of PFDiff across different types of diffusion models.

W2.1: How does the algorithm handle scenarios where gradients exhibit dispersion?

A: In fact, the primary design goal of PFDiff is to address the gradient dispersion issue observed in baseline fast solvers (e.g., DDIM). Gradients represent the direction of sampling updates, and when the NFE is small, dispersion arises from misalignment between the intermediate states that need to be updated and the current direction of the sampling update (i.e., neural network output). PFDiff mitigates this issue by introducing the foresight update mechanism of Nesterov momentum, which predicts the future sampling update direction. This effectively alleviates the misalignment (i.e., gradient dispersion) between the gradients and the intermediate states that need to be updated in baseline fast solvers, as theoretically guaranteed in Proposition 3.1.
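As an illustrative sketch (not the paper's exact algorithm or API), the foresight update can be written abstractly: a past score is reused to reach a cheap "springboard" state with no extra network call, the network is evaluated once at that springboard, and the resulting future score drives the larger skip-step. `foresight_update`, `solver_step`, `model`, and the toy Euler example below are our own placeholders.

```python
def foresight_update(x_t, eps_past, model, solver_step, t, t_mid, t_next):
    """Toy sketch of a Nesterov-style foresight update.

    Reuses the past score to build a springboard (no network call),
    evaluates the network once at the springboard, and applies the
    resulting "future" score over the larger skip-step t -> t_next.
    """
    x_spring = solver_step(x_t, eps_past, t, t_mid)   # springboard via past score
    eps_future = model(x_spring, t_mid)               # one NFE: future score
    return solver_step(x_t, eps_future, t, t_next)    # skip-step with foresight

# Toy usage with an Euler-style step and a stand-in "model" (score function).
euler = lambda x, eps, t0, t1: x + (t1 - t0) * eps
model = lambda x, t: -x  # placeholder, not a trained network
x_next = foresight_update(1.0, -1.0, model, euler, t=1.0, t_mid=0.5, t_next=0.0)
```

The point is only the call structure: per skip-step there is a single network evaluation, taken at a state better aligned with the interval being skipped than the stale gradient at $x_t$.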

W2.2: Is there a possibility of accumulating errors in the proposed approach?

A: The error accumulation in PFDiff arises from the baseline fast solvers (e.g., DDIM). Notably, PFDiff accelerates sampling by reducing the discretization error of the existing fast solvers. Therefore, the error accumulation in PFDiff will be smaller than that in the baseline fast solvers.

W3: Can the value of $k$ be further increased, and how would PFDiff perform in that case?

A: Thank you for the reviewer's suggestion. We increased the value of the parameter $k$ to 4 on the CIFAR10 dataset and conducted further performance analysis, as shown in the following table:

CIFAR10, FID↓

| Method \ NFE | 4 | 6 | 8 | 10 | 12 | 15 | 20 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DDIM | 65.70 | 29.68 | 18.45 | 13.66 | 11.01 | 8.80 | 7.04 |
| +PFDiff ($k \le 3$) | 22.38 | 9.84 | 5.64 | 4.57 | 4.39 | 4.10 | 3.68 |
| +PFDiff ($k=4$) | 27.20 | 8.65 | 6.53 | 4.56 | 4.17 | 3.98 | 3.72 |

From the table, it can be seen that when the value of $k$ is increased to 4, PFDiff improves the sampling quality of DDIM at certain NFE. However, increasing $k$ leads to the problem of over-searching. Therefore, $k \le 3$ might be a more suitable choice, as it provides a better balance between sampling quality and the cost of parameter search.

[R1] Yang Song et al., "Score-based generative modeling through stochastic differential equations." ICLR 2021.

[R2] Tero Karras et al., "Elucidating the design space of diffusion-based generative models." NeurIPS 2022.

Comment

Dear Reviewer BQ25,

Thank you once again for your valuable review of the manuscript and for your recognition of our work! As the discussion period draws to a close, we would appreciate it if you could let us know whether your concerns have been addressed. We would be very pleased to continue the discussion if you have any further questions!

Best, Authors

Comment

Thank you very much for your detailed response, and my concerns have been addressed. I keep my score unchanged.

Comment

We are so glad to hear that your concerns have been addressed! Thank you once again for recognizing our work!

Review (Rating: 6)

This paper introduces a novel methodology to accelerate diffusion model sampling. The core concept involves reusing past score predictions to generate a preliminary estimate (springboard) for the next step. Then, future score prediction is obtained from this springboard. By leveraging this future score prediction, the method enables step skipping, directly calculating the point two steps ahead from the current position. This approach offers practical advantages as it is orthogonal to existing advanced samplers and does not require additional training. Extensive experiments demonstrate its effectiveness in significantly accelerating diffusion model sampling when integrated with various state-of-the-art samplers.

Strengths

  1. Extensive experiments conducted with diverse models and baselines underscore both the superiority and generality of the proposed methodology. Comprehensive results reveal a substantial improvement in the efficiency of diffusion sampling.
  2. By approaching diffusion model acceleration through time-skipping, the authors introduce a technique that is orthogonal to existing advanced samplers. This characteristic, coupled with its training-free nature, enhances its practical applicability.
  3. Despite its simplicity and ease of implementation, the methodology presented in the paper yields significant benefits.

Weaknesses

  1. The multi-step solver's exclusion of future gradients, a core component of the proposed methodology, undermines the claimed orthogonality. Additionally, the absence of experimental results (Stable Diffusion) integrating the method with the DPM-Solver series raises doubts about its performance-enhancement potential and the extent of its orthogonality when applied to multi-step solvers.
  2. While the methodology is presented as an orthogonal wrapper for arbitrary ODE solvers, its classification as a standalone ODE solver is also plausible, depending on the perspective.
  3. The use of "gradient guidance" in the title and text is potentially misleading. In the context of diffusion models, this term is typically associated with guiding the sampling process using external model gradients (e.g. classifier guidance). For better clarity, using terms like "score" or "predicted noise" would be more appropriate.
  4. The direct comparison between the future gradient and the springboard in Figure 2(b) is questionable. Given their different scales, a direct MSE comparison might not be the most accurate approach to assess their relative reliability.

Questions

  1. In the Stable Diffusion experiment, why was the proposed methodology not applied to DPM-Solver? If the results were presented in the paper, please provide a reference.
  2. In Equation 14, is it correct to plug the $n$ points obtained from the $\Delta t$-interval ODE solver into the $2\Delta t$-interval ODE solver? Do I understand this correctly?
  3. Are the MSE scales of the future gradient and the springboard directly comparable? Do the author(s) think that using the MSE of the image updated with the future gradient, instead of the future gradient itself, in Figure 2(b) would provide a more meaningful comparison?
  4. Is the mention of Nesterov momentum solely due to the similarity in form between the proposed springboard prediction method and Nesterov momentum? Have any properties of Nesterov momentum, such as improved convergence speed, been leveraged in the theoretical analysis or practical implementation of the proposed method?
评论

We sincerely appreciate the reviewer's meticulous review and insightful comments, which have helped improve our paper. Below are our responses to all the questions. Please consider increasing the score if you are satisfied.

W1.1: The multi-step solver undermines the orthogonality.

A: It is important to emphasize that, in theory, higher-order (not multi-step) solvers and future gradients are orthogonal, as shown in Equation (14). However, in the practical experiments, we found that introducing future gradients into higher-order solvers is highly sensitive to the choices of hyperparameters kk and hh, leading to unstable results. Removing the future gradients and only adding past gradients (which is also the core component of our method) to the higher-order solver yields more stable performance, which significantly mitigates the convergence issues, as shown in Figure 3(c) (originally Figure 4(c)).

W1.2 and Q1: Why are there no experimental results applying PFDiff to DPM-Solver in Stable Diffusion?

A: We did not include the experimental results for the following reasons: First, DDIM and DPM-Solver-1 are equivalent [R1]. Second, at low NFE, higher-order DPM-Solvers perform worse than DDIM on Stable Diffusion [R2]. Finally, DDIM+PFDiff has outperformed other previously best-performing solvers. Additionally, we also provide experimental results applying PFDiff to DPM-Solver in Stable Diffusion with a guidance scale of 7.5, as shown in the table below:

Stable Diffusion, FID\downarrow

| Method \ NFE | 4 | 6 | 8 | 10 | 15 | 20 |
|---|---|---|---|---|---|---|
| DPM-Solver-1 [R1] | 35.48 | 20.33 | 17.46 | 16.78 | 16.08 | 15.95 |
| DPM-Solver-1+PFDiff | 29.02 | 15.47 | 13.26 | 13.06 | 13.57 | 13.97 |
| DPM-Solver-2 [R1] | 184.21 | 157.95 | 148.67 | 135.81 | 92.62 | 40.47 |
| DPM-Solver-2+PFDiff | 147.20 | 106.24 | 57.07 | 31.66 | 17.87 | 14.13 |

As shown in the table above, PFDiff further improves the sampling quality of DPM-Solver-1 and DPM-Solver-2. Notably, the results in the table come from the original DPM-Solver implementation, while the DPM-Solver results in the paper utilized additional tricks from [R2] to achieve better performance.

W2: The methodology can be classified as a standalone ODE solver.

A: We strongly agree with the insightful classification: the methodology can be classified as a standalone ODE solver, depending on the perspective. We have incorporated this classification in Section 3.4 (line 337 to line 339).

W3: The use of "gradient guidance" in the title and text is potentially misleading.

A: We appreciate the reviewer for pointing out the potential misunderstanding in the title. As revised, the title now reads: "PFDiff: Training-Free Acceleration of Diffusion Models Combining Past and Future Scores." Additionally, we have made corresponding changes in the paper, replacing "gradient" with "score" and removing the expression of "gradient guidance".

W4 and Q3: Are the MSE scales of the future gradient and the springboard directly comparable?

A: Thank you for raising this very valuable question. Based on the suggestion in Q3, we have added a fairer comparison in the paper, showing the MSE of the states updated separately using the "springboard" and the future gradient. The corresponding results are presented in Figure 6(b) in Appendix D.8. Notably, the trends in Figure 6(b) and Figure 1(b) (original Figure 2(b)) are consistent, which validates the effectiveness of using the future gradient.

Q2: In Equation 14, is it correct to plug the nn points obtained from the Δt\Delta t interval ODE solver into the 2Δt2\Delta t interval ODE solver?

A: Yes, you are nearly correct. Equation 14 represents plugging the pp (not nn) points obtained from the Δt\Delta t interval ODE solver into the 2Δt2\Delta t interval ODE solver. However, for first-order ODE solvers (e.g., DDIM), p=1p=1. This means that we use the score corresponding to the endpoint ti+1t_{i+1} of the interval Δt\Delta t as a replacement for the score corresponding to the endpoint tit_{i} of the interval 2Δt2\Delta t. This replacement effectively reduces the discretization error of the first-order ODE solver, and a detailed theoretical analysis can be found in Proposition 3.1.
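To make the replacement concrete, here is a minimal toy sketch of a PFDiff-1-style loop on top of a deterministic DDIM update. The helper names (`ddim_step`, `pfdiff_1`, `eps_model`) and the NFE accounting are our illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def ddim_step(x, eps, a_s, a_t):
    """One deterministic DDIM update from alpha_bar a_s to alpha_bar a_t."""
    x0 = (x - np.sqrt(1.0 - a_s) * eps) / np.sqrt(a_s)  # predicted clean sample
    return np.sqrt(a_t) * x0 + np.sqrt(1.0 - a_t) * eps

def pfdiff_1(eps_model, x, alphas):
    """Toy PFDiff-1-style loop: each 2*dt update spends a single NFE.

    alphas: alpha_bar values on a fine time grid t_0, t_1, ..., t_N.
    Returns the final state and the number of model evaluations (NFE).
    """
    nfe = 0
    eps_past = eps_model(x, 0)  # initial score (1 NFE)
    nfe += 1
    i = 0
    while i + 2 < len(alphas):
        # springboard: reuse the past score to advance by dt -- no extra NFE
        x_spring = ddim_step(x, eps_past, alphas[i], alphas[i + 1])
        # "future" score evaluated at the springboard state (1 NFE)
        eps_future = eps_model(x_spring, i + 1)
        nfe += 1
        # foresight update: take the full 2*dt step with the future score
        x = ddim_step(x, eps_future, alphas[i], alphas[i + 2])
        eps_past = eps_future
        i += 2
    return x, nfe
```

With a fine grid of 8 sub-intervals (9 alpha values), this loop spends 5 NFE, whereas a plain DDIM pass over the same grid would spend 8, which is the source of the acceleration the rebuttal describes.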

[R1] Cheng Lu et al., "DPM-solver: a fast ODE solver for diffusion probabilistic model sampling in around 10 steps." NeurIPS 2022.

[R2] Cheng Lu et al., "Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models." arXiv:2211.01095.

评论

Q4: Is there a theoretical analysis of Nesterov momentum on convergence speed?

A: In this work, we only discuss Nesterov momentum as the motivation for using future gradients. Nevertheless, in Equation (15) and Proposition 3.1, we employ a Taylor expansion and further derivations to theoretically demonstrate that the use of future gradients can reduce discretization errors, thereby accelerating the convergence of the sampling process. This theoretical result indirectly suggests that the foresight update mechanism of Nesterov momentum can effectively accelerate convergence.
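To make the structural analogy explicit, the two lookahead updates can be written side by side. Here $\Phi$ is our shorthand for one application of the underlying first-order solver over the given interval, $\tilde{x}$ denotes the springboard state, and $\epsilon^{\text{past}}$ the reused past score; this notation is illustrative, not the paper's:

```latex
% Nesterov momentum: gradient evaluated at a lookahead point
v_{k+1} = \mu v_k - \eta\, \nabla f(\theta_k + \mu v_k), \qquad
\theta_{k+1} = \theta_k + v_{k+1}

% Foresight update: score evaluated at the springboard state
\tilde{x}_{t_{i+1}} = \Phi\!\left(x_{t_i},\, \epsilon^{\text{past}},\, \Delta t\right), \qquad
x_{t_{i+2}} = \Phi\!\left(x_{t_i},\, \epsilon_\theta(\tilde{x}_{t_{i+1}}, t_{i+1}),\, 2\Delta t\right)
```

In both cases the update direction is computed at an extrapolated point rather than at the current iterate; this lookahead structure is the only property of Nesterov momentum the method borrows.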

评论

Dear Reviewer sz2Y,

Thank you once again for your insightful feedback! These suggestions have greatly enhanced the quality of our manuscript. As the discussion period draws to a close, we would appreciate it if you could let us know whether your concerns have been addressed. We would be very pleased to continue the discussion if you have any further questions!

Best, Authors

评论

Thanks to the authors for addressing most of my concerns. I've increased the final score accordingly.

To further assert orthogonality, I suggest adding a new table for W1.2 to the main text. Demonstrating PFDiff's improvement across multiple solvers would be ideal.

For the final version, I'd prefer Figure 6(b) over Figure 1(b) as the comparison within the same image domain is more intuitive.

Thanks again for the authors' efforts.

评论

Thank you very much for your response and for increasing the rating! We are so glad to hear that most of your concerns have been addressed!

In response to the reviewer’s suggestions, we have revised the relevant content and uploaded the revised paper. Specifically, we added Table 1 (i.e., the W1.2 table) to Section 4.2 to demonstrate the orthogonality of PFDiff. Meanwhile, we replaced Figure 1(b) with Figure 6(b) to provide a more intuitive comparison within the same image domain. Furthermore, we also added and modified the corresponding descriptions in Section 4.2 and Appendix D.8.

Thank you once again for your valuable and insightful feedback, which has greatly improved the quality of our manuscript.

评论

We sincerely appreciate the effort of all the reviewers for their detailed review and insightful suggestions. We would like to present the following modifications to the paper (the revised version has been uploaded, and the changes are highlighted in blue within the document):

  • We have revised the title and aligned the relevant expressions in the paper, including replacing "gradient" with "score" and removing the expression of "gradient guidance", to eliminate potential misinterpretations (Reviewer sz2Y, W3).
  • We have added Appendix D.8, which provides additional experimental details and results for Figures 1(a) and (b) (original Figures 2(a) and (b)). (Reviewer sz2Y, W4/Q3; Reviewer 7KLd, W1)
  • Section 3.5 has been added to explain the effectiveness of PFDiff through the geometric shape of the trajectories. (Reviewer 7KLd, W1)
  • An insightful comment regarding PFDiff has been added to lines 337–339 of Section 3.4. (Reviewer sz2Y, W2)
  • Several writing expressions have been revised, specifically in lines 68-69, 183–184, and 327–328. (Reviewer 7KLd, W2)
  • We have moved Figure 1 to Appendix D.9 (original Appendix D.8) and corrected the citation years for some references. (Reviewer 7KLd, Others)
  • We added Table 1 in Section 4.2 to validate the orthogonality of PFDiff in Stable Diffusion and also revised the relevant descriptions in Section 4.2 and Appendix D.5. (Reviewer sz2Y, W1.2/Q1; Reviewer 7KLd, W6)
  • We added a description and experimental results regarding the automatic search strategy for selecting kk and hh in lines 493-495 of Section 4.3 and in Appendix D.6.1. (Reviewer 7KLd, New question)
AC 元评审

The work proposed PFDiff, a training-free, orthogonal timestep-skipping mechanism that improves existing ODE solvers for diffusion probabilistic models. By leveraging springboard and foresight updates, the method reaches solutions with fewer function evaluations (NFE), significantly reducing computational cost while preserving high sample quality. This approach effectively improves both the efficiency and the quality of diffusion model sampling, addressing a key challenge in the field.

All reviewers find the idea interesting, with strong empirical results.

审稿人讨论附加意见

During the rebuttal:

  • The authors have revised the title and aligned the relevant expressions in the paper to eliminate potential misinterpretations of the results.

  • The authors provided additional experimental details and results, including comparisons with different solvers (e.g., the DPM-Solver series) and Stable Diffusion, for a more comprehensive evaluation.

最终决定

Accept (Poster)