PaperHub
4.9 / 10
Poster · 4 reviewers
Scores: 3, 3, 1, 4 (min 1, max 4, std 1.1)
ICML 2025

Improved Discretization Complexity Analysis of Consistency Models: Variance Exploding Forward Process and Decay Discretization Scheme

OpenReview · PDF
Submitted: 2025-01-23 · Updated: 2025-07-24
TL;DR

The state-of-the-art discretization complexity for consistency models.

Abstract

Consistency models, a new class of one-step generative models, have shown competitive performance with multi-step diffusion models. The most challenging part of consistency models is the training process, which discretizes the continuous diffusion process into $K$ steps and trains a one-step mapping function on these discretized timepoints. Despite the empirical success, only a few works focus on the discretization complexity $K$, and their setting is far from that of empirical works. More specifically, the current theoretical works analyze the variance preserving (VP) diffusion process with a uniform stepsize, while empirical works adopt a variance exploding (VE) process with a decay discretization stepsize. As a result, these works suffer from large discretization complexity and fail to explain the empirical success of consistency models. To close the gap between theory and application, we analyze consistency models with (1) VE process and (2) decay stepsize and prove the state-of-the-art discretization complexity for consistency models. This result is competitive with the results of diffusion models and shows the potential of consistency models. To balance the computation and performance, previous empirical work further proposes a $2$-step consistency algorithm. In this work, we also analyze the role of $2$-step sampling and show that it improves the discretization complexity compared with one-step generation.
Keywords
Consistency Models, Discretization Complexity

Reviews and Discussion

Review
Rating: 3

The paper proposes a novel discretization complexity analysis of consistency models by incorporating the variance-exploding kernel and a non-uniform step size. The results are closer to those for diffusion models than previous works, providing a better analysis of consistency models.

Update after rebuttal

The reviews and author responses clarified some parts of the paper, and I will maintain my previous score.

Questions for Authors

1- From your results, is it correct that the complexity decreases as $a$ approaches $\infty$? Would that mean that, in practice, schedules with a large $a$ should be preferred?

Claims and Evidence

The claims of achieving a better complexity analysis are supported by the proofs. The framework represents more closely what is commonly done in practice, which yields complexity results closer to those of diffusion models and could help explain the strong empirical performance of consistency models.

Methods and Evaluation Criteria

Besides Appendix F, there are no evaluations, as the claims are theoretical and supported by proofs.

Theoretical Claims

The assumptions (4.1 to 4.4) are reasonable and in line with the related literature. The results of Theorem 4.7 and Corollaries 4.8 and 4.12, with proofs in Appendix B, seem correct.

Experimental Design and Analysis

As the paper is mostly theoretical, there is no real experimental section, besides some simulations in Appendix F.

Supplementary Material

In addition to the sections discussed above, I went through Section D to get a better understanding of what was done in previous work, as well as Section F to verify the Lipschitz assumption.

Relation to Prior Work

Consistency models are a novel generative modeling framework which can achieve performance similar to diffusion models with significantly fewer sampling steps. Deepening our theoretical understanding of these models is relevant to further improving their performance.

Missing Essential References

N/A

Other Strengths and Weaknesses

Given the analysis from the paper, I wonder if one can derive practical considerations to design better consistency models. Having a discussion about this could make the paper more relevant for applied research.

Other Comments or Suggestions

It would be useful to name corollaries and lemmas consistently between the main text and the appendix.

Author Response

Thank you for your valuable comments and suggestions. We provide our response to each question below.

Weakness 1: Guidance on the design of better consistency models.

This paper takes the first step toward elucidating the design space of consistency models under different diffusion processes and reveals their respective advantages and disadvantages, which heavily influence the discretization complexity and are fundamental to designing consistency models. For VP-based consistency models, the early-stopping parameter $\delta$ has order $\epsilon_{W_2}^2$, which is worse than for VE-based consistency models ($\delta$ has order $\epsilon_{W_2}$) and is the source of the large discretization complexity. However, VE-based consistency models also have a disadvantage: the polynomial diffusion time $T$, which is much larger than the $T=\log(1/\epsilon)$ of VP-based consistency models and introduces an additional $\epsilon_{W_2}$ dependence.

Hence, from the discretization perspective, a better consistency model should enjoy a logarithmic $T$ and a $\delta$ of order $\epsilon_{W_2}$, which would lead to better complexity results. We note that rectified-flow-based one-step models have this potential. We will add the above discussion to our next version and view the design of better consistency models as important future work.

Question 1: The choice of $a$.

Our results show that with a larger $a$, the discretization complexity is better than with the uniform discretization scheme ($a=1$). This phenomenon is also observed in the empirical work EDM [2], which found that for $1\leq a\leq 7$, a larger $a$ helps diffusion models achieve better performance (Figure 13(c) of [2]). When $a$ is larger than $7$, the improvement is not significant. Consistency models follow the choice of $a$ in EDM. In our Theorem 4.7, we also show that with $a=7$, the discretization complexity has order $1/\epsilon_{W_2}^{23/7}$, which is close to the $1/\epsilon_{W_2}^{3}$ of the exponential-decay stepsize.
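For concreteness, here is a minimal Python sketch of the decay discretization grid in the EDM parameterization, where the exponent $a$ plays the role of EDM's $\rho$; the noise-level range and number of steps are illustrative assumptions, not values from the paper:

```python
import numpy as np

def edm_grid(sigma_min=0.002, sigma_max=80.0, K=18, a=7.0):
    """Decay discretization: interpolate uniformly between sigma_max^(1/a)
    and sigma_min^(1/a), then raise back to the a-th power.
    a=1 recovers the uniform grid; larger a concentrates steps near sigma_min."""
    i = np.arange(K)
    return (sigma_max ** (1 / a)
            + i / (K - 1) * (sigma_min ** (1 / a) - sigma_max ** (1 / a))) ** a

for a in [1.0, 3.0, 7.0]:
    g = edm_grid(a=a)
    print(f"a={a}: first steps {np.round(-np.diff(g)[:3], 2)}, "
          f"last steps {np.round(-np.diff(g)[-3:], 4)}")
```

Running this shows how increasing $a$ trades a few very large early steps for many fine steps near the data, which is the qualitative behavior the theory favors.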

[1] Lyu, Junlong, Zhitang Chen, and Shoubo Feng. "Sampling is as easy as keeping the consistency: convergence guarantee for Consistency Models." In Forty-first International Conference on Machine Learning. 2024.

[2] Karras, Tero, Miika Aittala, Timo Aila, and Samuli Laine. "Elucidating the design space of diffusion-based generative models." Advances in neural information processing systems 35 (2022): 26565-26577.

Review
Rating: 3

The paper analyzes consistency models under the VE process with a decaying step size and proves a discretization complexity bound for them.

Questions for Authors

Please see the weaknesses. Can the authors clarify whether their conclusions can be extended to consistency training and continuous-time consistency models?

Claims and Evidence

The paper bridges the gap between the theory and application of consistency models by analyzing the discretization complexity through mathematical derivation, and the derivations support the main claims.

Methods and Evaluation Criteria

This paper is a theoretical work and does not contain any empirical results.

Theoretical Claims

As I am not an expert in diffusion model theory, it is difficult for me to follow some parts of the paper, so I did not check the correctness of all the theorems.

Experimental Design and Analysis

This paper does not include experiments. It provides a mathematical analysis of the discretization complexity of consistency models.

Supplementary Material

I reviewed the discussion of previous work in the supplementary material.

Relation to Prior Work

In my opinion, this work is the first to close the gap between the discretization complexity analysis of consistency models and the practical setting, and it overcomes some limitations of prior work such as [1][2].

[1] Zehao Dou, Minshuo Chen, Mengdi Wang, and Zhuoran Yang. Theory of consistency diffusion models: Distribution estimation meets fast sampling. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024.

[2] Junlong Lyu, Zhitang Chen, and Shoubo Feng. Sampling is as easy as keeping the consistency: convergence guarantee for consistency models. In Forty-first International Conference on Machine Learning, 2024.

Missing Essential References

The authors have thoroughly discussed the relevant literature.

Other Strengths and Weaknesses

Strengths: The authors also provide a 2-step sampling analysis, which is widely used in practice.

Weaknesses: The conclusions in the paper are important for consistency distillation. However, as is well known, consistency training can train consistency models independently, and an analysis of it is missing from the paper.

Other Comments or Suggestions

I find no typos currently.

Author Response

Thank you for your valuable comments and suggestions. We provide our response to each question below.

Weakness & Suggestion: The analysis of consistency training and continuous-time consistency models.

As the reviewer notes, consistency training and continuous-time consistency models are both important parts of the consistency model literature. In this part, we discuss possible methods to obtain the discretization complexity for them.

Consistency Training. If we cannot obtain a pre-trained score function, we can construct an empirical score using $n$ samples $\{X_{0,i}\}_{i=1}^n$ from the target data distribution:

$$s_{\mathrm{emp}}(X_t ; t)=-\frac{1}{\sigma_t^2}\left[X_t-\frac{\sum_{i=1}^n \mathcal{N}\left(X_t ; X_{0,i}, \sigma_t^2 I\right) X_{0,i}}{\sum_{i=1}^n \mathcal{N}\left(X_t ; X_{0,i}, \sigma_t^2 I\right)}\right],$$

which has an explicit formulation, needs no additional training ([1] also uses this formula), and converges to the ground-truth score function at rate $n^{-1/d}$. Hence, we can replace the pretrained score function $s_{\phi}$ in eq. (4) with $s_{\mathrm{emp}}$. Then $\epsilon_{\text{score}}$ becomes $n^{-1/d}$, and we obtain a guarantee for consistency models without a pre-trained score function under the VE process and EDM stepsize.
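As a quick illustration, here is a minimal Python sketch of this empirical score; the log-sum-exp stabilization and the toy data are our additions, not part of the paper:

```python
import numpy as np

def s_emp(x_t, sigma_t, data):
    """Empirical VE score at query point x_t with noise level sigma_t.
    data: (n, d) array of samples X_{0,i}; x_t: (d,) array.
    The Gaussian weights reduce to a softmax over -||x_t - X_{0,i}||^2 / (2 sigma_t^2),
    since the normalizing constants of the Gaussians cancel."""
    sq_dist = np.sum((x_t - data) ** 2, axis=1)   # ||x_t - X_{0,i}||^2, shape (n,)
    log_w = -sq_dist / (2 * sigma_t ** 2)
    w = np.exp(log_w - log_w.max())               # stabilized softmax weights
    w /= w.sum()
    posterior_mean = w @ data                     # weighted average of the X_{0,i}
    return -(x_t - posterior_mean) / sigma_t ** 2

# toy usage: empirical score from 500 Gaussian samples in 2-D
rng = np.random.default_rng(0)
X0 = rng.normal(size=(500, 2))
print(s_emp(np.array([1.0, -1.0]), sigma_t=0.5, data=X0))
```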

We note that although this result does not rely on a pretrained score function, it uses the reverse PFODE of the diffusion process. In contrast, the consistency training (CT) paradigm only uses the forward diffusion process. Hence, the above result is not the discretization complexity of the CT paradigm. To achieve that goal, one possible way is to use a method similar to [1], which uses $s_{\mathrm{emp}}$ to construct a baseline consistency function (a bridge between the target data distribution and the consistency function learned by the CT paradigm) instead of using it directly in the training objective. However, as shown in our Remark 4.10, the construction of the baseline consistency function runs an $M$-step PFODE instead of the one-step PFODE used in applications, which leads to a large discretization complexity of $1/\epsilon_{W_1}^{10}$. Since this is significantly larger than our $1/\epsilon_{W_2}^3$ result, we leave a discretization complexity analysis of the CT paradigm (comparable with the CD paradigm) under the practical setting as future work.

Continuous-time consistency models. Since continuous-time models use $\frac{\mathrm{d} \boldsymbol{f}_{\theta^{-}}(X_t, t)}{\mathrm{d} t}$ instead of $\boldsymbol{f}_{\theta^{-}}(X_{t-\Delta t}, t-\Delta t)$ (where $\Delta t$ is $h_{k+1}-h_k$ in our work; here we use the uniform stepsize for convenience), there is no well-defined discretization complexity $K=T/\Delta t$ for continuous-time models. However, due to the absence of $\Delta t$, the training of continuous-time models is less stable than that of discrete-time consistency models, which is the core problem for continuous-time models. Recently, [2] made a great effort to stabilize the training of continuous-time models.

Thanks again for the comments on the broader area of consistency models; we will add the above discussion in our next version.

[1] Dou, Zehao, Minshuo Chen, Mengdi Wang, and Zhuoran Yang. "Theory of consistency diffusion models: Distribution estimation meets fast sampling." In Forty-first International Conference on Machine Learning. 2024.

[2] Lu, Cheng, and Yang Song. "Simplifying, stabilizing and scaling continuous-time consistency models." arXiv preprint arXiv:2410.11081 (2024).

Review
Rating: 1

This paper examines the convergence of the consistency model under the VE process with a decaying step size. It focuses on consistency distillation and establishes convergence results based on the Wasserstein distance between the generated and target distributions. Additionally, it demonstrates that 2-step sampling enhances discretization efficiency.

Questions for Authors

None

Claims and Evidence

The main result, Theorem 4.7, heavily depends on Assumption 4.4, which lacks supporting evidence. See below for details.

Methods and Evaluation Criteria

Theory paper, not applicable.

Theoretical Claims

I find Theorem 4.7 to be not very informative. According to Appendix B, the first term in the error decomposition is $L_{f,0} R$, which does not asymptotically converge to zero. Therefore, the condition on $L_{f,0}$ must be strict: even if $L_{f,0} = O(1)$, the error bound in Theorem 4.7 remains $O(R)$. Since $R$ represents the diameter of the target distribution's support, an error bound of $O(R)$ is trivial. This paper only establishes $L_{f,0} = R/T$ for the Gaussian distribution, which is quite limited. To derive a more meaningful result, the paper should rigorously demonstrate that $L_{f,0} = R/T$ holds for a broader class of distributions rather than merely assuming it (Assumption 4.4).

Experimental Design and Analysis

In Appendix F, the calculation of the Lipschitz constant is not clearly explained.

Supplementary Material

I reviewed Appendices B and F.

Relation to Prior Work

This paper investigates the convergence of the consistency model under the variance-exploding process, whereas prior work primarily focuses on the variance-preserving process.

Missing Essential References

No.

Other Strengths and Weaknesses

This paper examines the Lipschitz coefficient of the consistency function, a crucial step in understanding consistency models.

Other Comments or Suggestions

The authors could analyze the Lipschitz constant of the consistency function at $(x,t) = (0,T)$ for the bimodal Gaussian mixture model $0.5\,\mathcal{N}(-1,\sigma^2) + 0.5\,\mathcal{N}(1,\sigma^2)$ with a small $\sigma$.

Author Response

Thank you for your valuable comments and suggestions. We provide our response to each question below.

Q1: Theoretical claims: the discussion of the $L_{f,0}$ assumption and how to remove it.

Following the reviewer's suggestion, we consider $L_{f,0}$ for the 2-mode GMM in Suggestion 1 below and show that $L_{f,0}$ has order $1/T$, which is necessary for a $W_2$ guarantee. In this part, we mainly discuss how to remove this assumption. We prove that, when considering a weaker $W_1$ guarantee, we can remove the $L_{f,0}=O(R/T)$ assumption and achieve an $L_f^{1+1/a}/\epsilon_{W_1}$ result:

When considering the $W_1$ guarantee, the first term of line 615 (Appendix B) becomes $L_f W_1(\mathcal{N}(0,T^2 I_d), q_T)$ (using the uniform $L_f$ instead of $R/T$). Different from the $W_2$ distance, the $W_1$ distance can be bounded by a weighted TV distance (Case 6.16 of [1]):

$$W_1(\mathcal{N}(0,T^2 I_d), q_T)\leq R\,\mathrm{TV}(\mathcal{N}(0,T^2 I_d), q_T)\leq R^2/T,$$

where the second inequality follows from a result of [2]. Hence, we do not require $L_{f,0}=O(R/T)$. The rest of the proof is exactly the same as for the $W_2$ distance. To guarantee that $L_f W_1(\mathcal{N}(0,T^2 I_d), q_T)$ is smaller than $\epsilon_{W_1}$, we require $T\ge L_f R^2/\epsilon_{W_1}$, which is the source of the additional $L_f^{1/a}$ factor. We will add the above result in the next version.

Q2: Experimental analysis: the calculation of $L_{f,0}$ in the simulation experiments.

Since $L_{f,0}$ can be obtained from the Frobenius norm of $\nabla_{Y_0}\boldsymbol{f}(Y_0,0)$, we compute the following quantity to approximate it:

$$\left|\frac{\boldsymbol{f}^{\boldsymbol{v}}\left(Y_{t'}, t'\right)-\boldsymbol{f}^{\boldsymbol{v}}\left(Y_{t'}+\Delta Y, t'\right)}{\Delta Y}\right|,$$

where $Y_{t'}\sim q_{T-t'}$ (sampled $1000$ times and averaged) and $\Delta Y = 0.01$.
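A minimal sketch of this Monte-Carlo finite-difference estimator; the callable arguments are hypothetical placeholders for the learned consistency function and a sampler for $q_{T-t'}$:

```python
import numpy as np

def lipschitz_fd(f, sample_y, t_prime, delta_y=0.01, n_mc=1000, seed=0):
    """Estimate E_{Y ~ q_{T-t'}} [ |f(Y, t') - f(Y + dY, t')| / dY ] by Monte Carlo.
    f: callable (y, t) -> float, a scalar slice of the consistency function;
    sample_y: callable (rng) -> float, drawing one sample of Y_{t'}."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_mc):
        y = sample_y(rng)
        total += abs(f(y, t_prime) - f(y + delta_y, t_prime)) / delta_y
    return total / n_mc
```

Note that, as written, this estimates an expectation of the local slope rather than a supremum, which is exactly the mismatch the reviewer raises in the follow-up comment below.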

Suggestion 1: The Lipschitz constant (mixture of Gaussians).

We sincerely thank the reviewer again for this comment. We consider the 2-mode GMM $X_0 \sim \frac{1}{2}\mathcal{N}(\mu, \sigma^2 I_d)+\frac{1}{2}\mathcal{N}(-\mu, \sigma^2 I_d)$. The score has the following form (Appendix A.2 of [3]; we transform it from VP to VE):

$$\nabla \log q_t(X_t)=\tanh\!\left(\frac{\mu^{\top} X_t}{\sigma_t^2+\sigma^2}\right) \frac{\mu}{\sigma_t^2+\sigma^2}-\frac{X_t}{\sigma_t^2+\sigma^2}.$$

Since $f^{\mathrm{ex}}(Y_0,0)$ is the associated backward mapping of the following PFODE (in the following, we omit the prime on $t'$):

$$\mathrm{d}Y_t=\left[\tanh\!\left(\frac{\mu^{\top} Y_t}{(T-t)^2+\sigma^2}\right) \frac{\mu(T-t)}{(T-t)^2+\sigma^2}-\frac{Y_t(T-t)}{(T-t)^2+\sigma^2}\right]\mathrm{d}t,$$

we need to solve it to obtain $f^{\mathrm{ex}}(Y_0,0)$. Since the score is highly nonlinear, it is hard to obtain a closed-form solution. There are two ways to overcome this difficulty. The first is to run simulation experiments that numerically approximate the solution (Appendix F); a toy sketch follows.
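A 1-D toy sketch of that first choice, integrating the PFODE above with an explicit Euler scheme; the parameter values ($\mu$, $\sigma^2$, $T$, step counts) are illustrative assumptions, not the paper's:

```python
import numpy as np

def score_2gmm(x, sig2_t, mu=0.1, sigma2=1.0):
    """1-D VE score of 0.5*N(mu, sigma^2) + 0.5*N(-mu, sigma^2) at noise variance sig2_t."""
    s2 = sig2_t + sigma2
    return np.tanh(mu * x / s2) * mu / s2 - x / s2

def f_ex(y0, T=40.0, n_steps=4000, mu=0.1, sigma2=1.0):
    """Backward map f^ex(Y_0, 0): Euler integration of
    dY_t = (T - t) * score(Y_t, (T - t)^2) dt from t = 0 to t = T."""
    dt = T / n_steps
    y = y0
    for k in range(n_steps):
        tt = T - k * dt                 # remaining noise level sigma = T - t
        y = y + dt * tt * score_2gmm(y, tt ** 2, mu, sigma2)
    return y

# finite-difference slope at Y_0 = 0; it should decay roughly like 1/T
for T in [10.0, 20.0, 40.0]:
    slope = (f_ex(0.01, T=T) - f_ex(-0.01, T=T)) / 0.02
    print(f"T={T:5.1f}:  dY_T/dY_0 ≈ {slope:.4f},  T * slope ≈ {T * slope:.3f}")
```

The printed $T \cdot \text{slope}$ staying roughly constant is the numerical signature of the claimed $L_{f,0}=O(1/T)$ behavior.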

The second is to add assumptions on the target data to simplify the above ODE. We assume $\mu$ is small enough that $\tanh\!\left(\frac{\mu^{\top} Y_t}{(T-t)^2+\sigma^2}\right)$ can be approximated by $\frac{\mu^{\top} Y_t}{(T-t)^2+\sigma^2}$, which simplifies the PFODE to a linear ODE (in fact, the distribution gradually approaches a Gaussian):

$$\mathrm{d} Y_t=\left(\frac{\mu^{\top} \mu\, Y_t(T-t)}{\left((T-t)^2+\sigma^2\right)^2}-\frac{Y_t(T-t)}{(T-t)^2+\sigma^2}\right) \mathrm{d} t,$$

which has the following solution:

$$Y_t=Y_0\underbrace{\left(\sqrt{\frac{\sigma^2+(T-t)^2}{\sigma^2+T^2}} \cdot \exp\!\left(\frac{\mu^2}{2}\left(\frac{1}{\sigma^2+(T-t)^2}-\frac{1}{\sigma^2+T^2}\right)\right)\right)}_{C(t)}.$$

The above result implies

$$Y_T=Y_0\left(\sqrt{\frac{\sigma^2}{\sigma^2+T^2}} \cdot \exp\!\left(\frac{\mu^2}{2}\left(\frac{1}{\sigma^2}-\frac{1}{\sigma^2+T^2}\right)\right)\right).$$

Taking the derivative with respect to $Y_0$, we see that $L_{f,0}$ has order $1/T$. This also matches our intuition that $Y_0$ has a large variance (order $T^2$), so we need to multiply by $1/T$ to offset the influence of the large variance (lines 282-287).
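Spelling out the large-$T$ asymptotics of $C(T)$ makes the $1/T$ order, and the $\sigma$-dependence of the constant, explicit:

$$C(T) = \sqrt{\frac{\sigma^2}{\sigma^2+T^2}}\,\exp\!\left(\frac{\mu^2}{2}\left(\frac{1}{\sigma^2}-\frac{1}{\sigma^2+T^2}\right)\right)\;\xrightarrow{\,T\gg\sigma\,}\;\frac{\sigma}{T}\,e^{\mu^2/(2\sigma^2)}\left(1+O\!\left(\frac{\sigma^2}{T^2}\right)\right)=O\!\left(\frac{1}{T}\right).$$

Note the prefactor $e^{\mu^2/(2\sigma^2)}$: this is the exponential growth as $\sigma \to 0$ that the reviewer raises in point 3 of the follow-up comment below.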

Further discussion on the error of the linear approximation. The above makes a linear approximation to simplify the ODE when $\mu$ is close to $0$, which introduces a small error. Assume $Y_0\sim q_T= \frac{1}{2}\mathcal{N}(\mu, (T^2+\sigma^2)I)+\frac{1}{2}\mathcal{N}(-\mu, (T^2+\sigma^2)I)$. For the variance, $Y_T=Y_0 C(T)$ recovers $\sigma^2$. For the mean, the $\mu$ recovered by the above consistency function is approximately $\mu\sqrt{\sigma^2/(\sigma^2+T^2)}$, which is smaller than $\mu$. However, since $\mu$ is assumed to be close to $0$, this error term is small and is plausibly introduced by the neglected nonlinear term.

We will add the above discussion in the next version.

[1] Villani, Cédric. Optimal transport: old and new. Vol. 338. Berlin: springer, 2008.

[2] Yang et al. "Leveraging Drift to Improve Sample Complexity of Variance Exploding Diffusion Models." NeurIPS 2024.

[3] Shah et al. "Learning Mixtures of Gaussians Using the DDPM Objective." NeurIPS 2023.

Reviewer Comment

Thank you for your response. Please see my comments below:

  1. Discussion on the Lipschitz condition:
    I find both inequalities in question to be problematic.

    • The first inequality relies on the fact that $W_1(P_1, P_2) \le R \cdot \mathrm{TV}(P_1, P_2)$, where $R$ is the diameter of the support of both distributions. However, in your setting, both $\mathcal{N}(0, T^2 I_d)$ and $P_T$ are clearly unbounded, making this inequality inapplicable.
    • The second inequality uses convergence results from a paper that operates under a different setup. Specifically, [2] analyzes a forward SDE with a drift term, while the forward SDE in your paper does not include a drift. Therefore, the results from [2] do not apply here.
  2. Empirical evaluation of the Lipschitz constant and Assumption 4.4:
    Assumption 4.4 posits that $\sup_y \|\nabla_y f(y, 0)\| \le L_{f,0}$. However, according to the rebuttal, the experimental evaluation computes $E_{y \sim p_T}[\|\nabla_y f(y, 0)\|]$. These are not equivalent; in fact, $\sup_y \|\nabla_y f(y, 0)\| \ge E_{y \sim p_T}[\|\nabla_y f(y, 0)\|]$. So, the empirical evaluation does not support the assumption.

  3. The 2-mode GMM example:

    • Simulation of the Lipschitz constant: As noted above, there is a mismatch between the theoretical assumption and the empirical estimation of the Lipschitz constant.
    • Error from the linear approximation: I have several concerns here:
      1. Grönwall's inequality suggests that the approximation errors can have exponential effects on the solution. This raises doubts about the validity of a linear approximation.
      2. It is unclear whether the term $\frac{\mu^\top Y_t}{(T - t)^2 + \sigma^2}$ can be treated as small, even if $\mu$ is small:
        • $Y_t$ may be unbounded;
        • $(T - t)^2 \to 0$ as $t \to T$;
        • $\sigma$ could also be small.
      3. Even if we accept the linear approximation, the resulting Lipschitz constant grows exponentially as $\sigma \to 0$, leading to a vacuous theoretical bound.

Given these issues, I will maintain my evaluation.

Author Comment

We sincerely thank the professional and helpful reviewer for further feedback and comments. We provide our response to each question below.

Q1: The discussion of the $W_1$ result.

As the reviewer points out, the distributions $\mathcal{N}(0,T^2 I)$ and $q_T$ are unbounded. Hence, we cannot remove the second part of Assumption 4.4 (the $R/T$ assumption) even when considering the $W_1$ distance (we will therefore not add this discussion to our paper; thanks again!). To verify our assumption, we ran additional simulation experiments with uniformly sampled $Y$ (instead of sampling $Y_{t'}$ according to $q_{T-t'}$) to approximate $\sup_y \|\nabla_y f(y, 0)\|$ instead of $E_{y \sim q_T}[\|\nabla_y f(y, 0)\|]$, and show that over a large range of $Y$ ($Y\in \{1,2,3,\ldots,40\}$), $L_{f,0}$ has order $1/T$.

Q2: Further simulation experiments using uniform sampling instead of sampling according to $q_t$.

As mentioned in Q1, in this part we run simulation experiments on 3 GMMs with different $Y$ (and $\Delta Y=0.01$) to verify that the Lipschitz constant has order $1/T$ over a large range of $Y$ ($Y\in \{1,2,3,\ldots,40\}$). The reviewer can see the simulation experiments at the following link; a sketch of the grid-based estimator appears after the link.

Simulation Experiment Link: https://anonymous.4open.science/r/ICML_Consistency_Simulation-8AF6/Rebuttal_Simulation_Consistency.pdf
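A sketch of this grid-based variant (again with a hypothetical placeholder for the consistency function), which takes the maximum over a deterministic range of $Y$ rather than an average over samples:

```python
import numpy as np

def sup_lipschitz_fd(f, y_grid, t_prime, delta_y=0.01):
    """Grid surrogate for sup_y |f(y, t') - f(y + dY, t')| / dY:
    take the max (not the mean) of finite differences over y_grid."""
    return max(abs(f(y, t_prime) - f(y + delta_y, t_prime)) / delta_y
               for y in y_grid)

# e.g. Y in {1, 2, ..., 40}, as in the rebuttal:
# sup_lipschitz_fd(f, np.arange(1.0, 41.0), t_prime=0.0)
```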

Q3: The linear approximation.

(a) Since a closed-form solution of the PFODE with a nonlinear score function is hard to obtain, we make a linear approximation of the nonlinear score of the 2-GMM to clearly discuss the order of the Lipschitz constant (this linear approximation has been used in previous theoretical works on diffusion models with GMM distributions, due to the difficult nonlinear terms; see Lemma 8 of [1]). As the reviewer notes, the linear approximation introduces some approximation error. For this error, at the end of our rebuttal, we show that the obtained consistency function ($C(T)$) approximately recovers the target 2-GMM.

(b) The influence of $\sigma$.

For the variance term of the GMM, since current image datasets are usually normalized, $\sigma^2$ is not close to $0$ in applications (so we can view it as a constant, such as $1$). Hence, it will not introduce an additional exponential term.

(c) The choice of $\mu$.

We know that, with high probability, $Y_t$ falls in the range $[-3\sqrt{(T-t)^2+\sigma^2},\, 3\sqrt{(T-t)^2+\sigma^2}]$ (since $\mu$ is close to $0$, the 2-GMM is close to a Gaussian); we also define a truncation operator onto this interval for $Y_t$. Then, we choose $\mu$ small enough to guarantee that $\mu^\top Y_t$ is small for the truncated $Y_t$ (for $Y_t$ outside this interval, intuitively, the contribution can be controlled by the Gaussian tail bound and introduces an additional truncation error). Hence, the linear approximation is feasible and will not introduce an exponential term (with a constant $\sigma$, as in (b)).

Nonlinear scores are hard to deal with in the theory of diffusion models, and we sincerely hope the above discussion addresses the reviewer's concerns. We also hope the reviewer will re-evaluate this work based on our discussion.

Best,

Authors

[1] Shah et al,. Learning mixtures of gaussians using the DDPM objective. NeurIPS 2023.

Review
Rating: 4

This paper aims to provide a theoretical explanation for the strong empirical performance of consistency models, specifically focusing on how many discretization steps $K$ are needed during training to guarantee high-quality one-step sampling at test time. Prior theoretical analyses of consistency models typically used variance-preserving (VP) forward processes with uniform steps, leading to large and possibly unrealistic complexity bounds. This work, instead, targets the variance-exploding (VE) forward process and the EDM (decay) time-step scheduling. Under these more practical assumptions (matching real applications in, e.g., Karras et al., 2022 or Song et al., 2023), the authors derive improved discretization complexity bounds, polynomial in $1/\varepsilon$ with exponents significantly better than previous results. They also show that 2-step sampling (a widely used trick in consistency models) can further reduce the required number of steps to achieve a given Wasserstein-2 error.

Questions for Authors

Please refer to the "Weaknesses" section.

Claims and Evidence

In this paper, the authors claim that analyzing the VE SDE with EDM steps yields a polynomial discretization bound for consistency models that is significantly smaller than in previous theoretical studies, and that this complexity is close to that of the best known diffusion results. In addition, 2-step sampling further reduces the exponent in $\varepsilon$.

To support these claims, the authors provide rigorous proofs in the main text and appendix. They compare the final complexity expressions to older results, showing strict improvement. Besides, simulation experiments on multi-modal Gaussian distributions illustrate that their key assumption on Lipschitz constants is plausible.

Methods and Evaluation Criteria

As this is a theoretical paper, no benchmarks or datasets are needed.

Theoretical Claims

The paper states each assumption explicitly and references prior standard assumptions (like bounded support for the data or Lipschitz continuity of the consistency function). The proofs revolve around standard SDE manipulations, approximate PDE expansions, and the idea that "time-dependent" bounding of the score drift is more precise.

Experimental Design and Analysis

No experiments are needed in this theoretical paper.

Supplementary Material

Yes, I reviewed all the supplementary material. It consists mostly of extra proofs of lemmas and theorems.

Relation to Prior Work

This is the first analysis that specifically uses the VE forward SDE plus a decaying step approach for consistency models. It corrects prior mismatches between theoretical assumptions and real usage. The authors connect the final complexity to that of diffusion models, bridging a gap that older works left open. They cite relevant works on diffusion complexity (Song et al., Gao & Zhu, Chen et al.) and on prior consistency theory (Dou et al. 2024, Li et al. 2024, Lyu et al. 2024). They also mention Karras et al. for the EDM steps. Therefore, I would say the references are quite comprehensive.

Missing Essential References

Nothing crucial seems missing. The standard relevant theoretical diffusion or consistency references appear.

Other Strengths and Weaknesses

Strengths: The focus on VE and EDM steps is precisely the realistic setting used in modern SOTA consistency models, addressing earlier criticisms of theoretical limitations. It is good progress to achieve $\tilde{O}(1/\varepsilon^{3+2/a})$ with 2-step sampling compared with the previous $O(1/\varepsilon^7)$. The paper is well-organized, with the main results clearly stated and strictly proved.

Weaknesses:

  1. The entire analysis relies on an assumption that the score approximation is sufficiently accurate. Is it possible for the authors to remove this assumption and handle the end-to-end training complexity?
  2. The multi-step analysis is restricted to 2 steps; though that's the main empirical scenario, I still want to ask about scenarios with more steps or other sampling schedules. Have you considered a more general $N$-step approach for consistency? Could that yield further improvements, or do you expect diminishing returns after 2 steps?
  3. Could your "time-dependent lemma" approach be extended to other step-size patterns beyond EDM (like a piecewise approach)? Are there potential further gains from more sophisticated scheduling than a single exponent $a$?

Other Comments or Suggestions

Please refer to the "Weaknesses" section.

Ethics Review Concerns

No ethical concerns since it is a completely theoretical work.

Author Response

Thank you for your valuable comments and suggestions. We provide our response to each question below.

Weakness 1: The approximated score and consistency function errors (end-to-end analysis).

In this work, we assume the pretrained score and consistency functions are accurate enough to achieve the final discretization complexity. Though these are standard assumptions in the complexity analysis literature, as the reviewer mentions, the end-to-end analysis is also important, and we can use current estimation-error results to achieve this goal. More specifically, for the approximated score, we use the results of [1] and replace $\epsilon_{\text{score}}$ with $n_{\text{score}}^{-2/d}$ (where $n_{\text{score}}$ is the number of samples used to train the score function). For the approximated consistency function, we use the result of [2] and replace $\epsilon_{cm}$ with $n_{cm}^{-1/2(d+5)}$. Then, we obtain an end-to-end complexity analysis.

Weakness 2: The results for the multi-step sampling algorithm.

In fact, our analysis can be extended to the $N$-step sampling algorithm and achieves nearly $L_f/\epsilon_{W_2}^{3+1/a}$ (which is better than Thm. 4.7 and Coro. 4.12) under the EDM stepsize. We use the 3-step sampling algorithm as an example ($\tau_1=T$, $\tau_2=3T/4$, $\tau_3=T/2$). Under this setting, the result becomes (here we ignore $\epsilon_{\text{score}}$, $\epsilon_{cm}$, $R$, $d$ and focus on the dominant terms)

$$\delta+1/T^3+L_f (T/\delta)^{\frac{1}{a}} /\left(K \delta^2\right).$$

To guarantee that the above term is smaller than $\epsilon_{W_2}$, we require $\delta=\epsilon_{W_2}$ and $K\ge \frac{L_f T^{1/a}}{\delta^{2+1/a}\epsilon_{W_2}}=\frac{L_f T^{1/a}}{\epsilon_{W_2}^{3+1/a}}$, the same as for the one-step and two-step sampling algorithms. However, the 3-step algorithm only requires $T\ge 1/\epsilon_{W_2}^{1/3}$, which is better than the $1/\epsilon_{W_2}$ of 1-step and the $1/\epsilon_{W_2}^{1/2}$ of 2-step. Hence, the discretization complexity of the 3-step sampling algorithm is $L_f/\epsilon_{W_2}^{3+4/(3a)}$, which is better than that of the 2-step algorithm. The same argument extends to $N$ steps: the influence of $T$ decreases with $N$, and in the limit $T$ no longer affects the discretization complexity, leading to an $L_f/\epsilon_{W_2}^{3+1/a}$ result. We will add the above discussion in our next version.
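The pattern across the 1-, 2-, and 3-step cases can be summarized compactly (the general-$N$ exponent here is our extrapolation from those three cases, not a proved statement):

$$K \gtrsim \frac{L_f\, T^{1/a}}{\epsilon_{W_2}^{3+1/a}}, \qquad T \ge \epsilon_{W_2}^{-1/N} \;\Longrightarrow\; K \gtrsim \frac{L_f}{\epsilon_{W_2}^{\,3+\frac{1}{a}\left(1+\frac{1}{N}\right)}} \;\xrightarrow{\,N\to\infty\,}\; \frac{L_f}{\epsilon_{W_2}^{3+1/a}},$$

which recovers the exponents $3+2/a$ ($N=1$), $3+3/(2a)$ ($N=2$), and $3+4/(3a)$ ($N=3$) quoted above.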

Weakness 3: The discretization complexity of a piecewise discretization scheme (beyond EDM with a single $a$).

Before providing the complexity result for the piecewise discretization scheme, we first discuss the performance of the EDM scheme in applications (consistency models follow the choice of $a$ in EDM). EDM [3] shows that when $1\leq a\leq 7$, a larger $a$ helps diffusion models achieve better performance (Figure 13(c) of [3]), which also matches our theoretical results. However, when $a$ is larger than $7$, the improvement is not significant and performance can even become worse [3]. As a result, the exponential-decay stepsize is theoretically friendly but not widely used in applications. One possible explanation is that at the end of the reverse process, diffusion models generate image details and require a small stepsize (whereas the exponential-decay stepsize is too large there).

Hence, we can design a two-stage discretization scheme: (a) when $t'\in [0, T-1]$, we use the exponential-decay stepsize; (b) when $t'\in (T-1, T-\delta]$, we use the EDM stepsize. With this scheme, the discretization complexity becomes $L_f/\epsilon_{W_2}^{3+1/a}$, which is better than Thm. 4.7 with EDM (a single $a$). This shows the improvement of the two-stage discretization scheme from a theoretical perspective, and we leave the empirical application of this scheme as interesting future work. We will add the above discussion in our next version.
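A minimal sketch of such a two-stage grid in terms of noise levels $\sigma = T - t'$ (so stage (a) covers $\sigma \in [1, T]$ and stage (b) covers $\sigma \in [\delta, 1]$); all parameter values are illustrative assumptions:

```python
import numpy as np

def two_stage_grid(T=80.0, delta=0.01, K1=10, K2=30, a=7.0):
    """Two-stage discretization, returned as decreasing noise levels sigma = T - t'.
    Stage (a): exponentially decaying levels from T down to 1 (K1 points).
    Stage (b): EDM-style levels from 1 down to delta (K2 points)."""
    stage_a = np.geomspace(T, 1.0, K1, endpoint=False)              # exponential decay
    i = np.arange(K2)
    stage_b = (1.0 + i / (K2 - 1) * (delta ** (1 / a) - 1.0)) ** a  # EDM grid on [delta, 1]
    return np.concatenate([stage_a, stage_b])

grid = two_stage_grid()
print(grid[:3], "...", grid[-3:])   # starts near T, ends at delta
```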

[1] Oko, Kazusato, Shunta Akiyama, and Taiji Suzuki. "Diffusion models are minimax optimal distribution estimators." In International Conference on Machine Learning, pp. 26517-26582. PMLR, 2023.

[2] Dou, Zehao, Minshuo Chen, Mengdi Wang, and Zhuoran Yang. "Theory of consistency diffusion models: Distribution estimation meets fast sampling." In Forty-first International Conference on Machine Learning. 2024.

[3] Karras, Tero, Miika Aittala, Timo Aila, and Samuli Laine. "Elucidating the design space of diffusion-based generative models." Advances in Neural Information Processing Systems 35 (2022): 26565-26577.

Final Decision

This paper studies the discretization complexity of consistency diffusion models. Compared to existing results, the setting in the paper closely aligns with practice: 1) the forward process is a variance-exploding SDE, and 2) the time discretization follows EDM. Under assumptions, the established complexity bounds improve over existing ones.

The theoretical contributions are well recognized by the reviewers, including the problem setup that reflects real applications, the novel analysis addressing the challenges, and the clear presentation of the main results. These strengths indeed justify an acceptance of the paper.

However, there are outstanding weaknesses preventing a firm acceptance. The concern centers around the assumptions and the scope of Theorem 4.7. In particular, the authors assume that the clean data distribution is compactly supported and that the consistency function $f$ is Lipschitz continuous in $Y$ at $t = 0$. These two assumptions on their own do not directly contradict each other. Nonetheless, later examples (Example 4.5 and the simulation in Appendix F) violate the compact-data assumption in order to argue the Lipschitz continuity. This leaves the study much weaker, even though the theoretical results remain valid. The additional simulation results do not help to reinforce the theoretical part.

In order to resolve these issues, my suggestion is two-fold: 1) relax Assumption 4.1 to light-tailed distributions; in that case, a truncation argument restricts attention to a bounded domain, and the discussion of Gaussian and Gaussian mixture models is then supported under the assumption; or 2) remove the discussion of Gaussian and Gaussian mixture models, but try to show that clean-data regularity leads to Assumption 4.5 (at least provide a reasonable example, say the clean data distribution is Hölder continuous on a compact domain). That being said, either fix requires another round of review. Therefore, I am recommending a weak acceptance.