PaperHub
Overall score: 5.5/10 (Poster; 4 reviewers; min 4, max 7, std 1.1)
Ratings: 5, 7, 6, 4
Confidence: 3.3
Correctness: 3.0 · Contribution: 2.8 · Presentation: 3.0
NeurIPS 2024

Adapting to Unknown Low-Dimensional Structures in Score-Based Diffusion Models

OpenReview · PDF
Submitted: 2024-05-13 · Updated: 2025-01-13

Abstract

This paper investigates score-based diffusion models when the underlying target distribution is concentrated on or near low-dimensional manifolds within the higher-dimensional space in which they formally reside, a common characteristic of natural image distributions. Despite previous efforts to understand the data generation process of diffusion models, existing theoretical support remains highly suboptimal in the presence of low-dimensional structure, which we strengthen in this paper. For the popular Denoising Diffusion Probabilistic Model (DDPM), we find that the dependency of the error incurred within each denoising step on the ambient dimension $d$ is in general unavoidable. We further identify a unique design of coefficients that yields a convergence rate at the order of $O(k^{2}/\sqrt{T})$ (up to log factors), where $k$ is the intrinsic dimension of the target distribution and $T$ is the number of steps. This represents the first theoretical demonstration that the DDPM sampler can adapt to unknown low-dimensional structures in the target distribution, highlighting the critical importance of coefficient design. All of this is achieved by a novel set of analysis tools that characterize the algorithmic dynamics in a more deterministic manner.
Keywords

diffusion models, score-based generative models, low-dimensional structure, coefficient design

Reviews and Discussion

Review (Rating: 5)

The paper considers the convergence of the probability distribution learned by diffusion models to the target data distribution when the data distribution has some low-dimensional structure. Previous work either provides a convergence rate bound in Wasserstein distance for low-dimensional distributions or provides a bound in total variation distance with polynomial dependence on the ambient dimension. This paper improves the convergence rate to depend logarithmically on the ambient dimension and polynomially on the intrinsic dimension.

Strengths

  • It is interesting that the authors not only consider the case of data on a k-dimensional manifold but also a general version of low-dimensional structure defined via the covering number of the support.
  • The paper is overall well-written and easy to follow.

Weaknesses

  • The result in the paper is proved under the assumption of bounded support of the data distribution (i.e., $\|x\| \leq R$ where $R = \mathrm{poly}(T)$). The results in the previous works are under the assumption that the data has a bounded second moment; therefore, the bounded support assumption is more restrictive than the assumptions considered in the previous works.
  • The argument about the uniqueness of the coefficients that lead to dimension-independent discretization error in Section 3.2 is not convincing, because it proves a lower bound on the error incurred in one denoising step, which is only an upper bound on the total variation distance (the authors also mention this).

Questions

  • I understand that showing the lower bound on the total variation error in Section 3.2 might be difficult, but can you experimentally show that the claimed coefficients in Eq. (2.4) lead to better convergence compared to other coefficients (e.g., coefficients used in practice) on some synthetic low-dimensional distribution?
  • The final bound in Theorem 1 seems to depend polylogarithmically on the dimension of the distribution (instead of polynomially, as in previous works). Is the polylogarithmic dependence on the dimension necessary?

Minor comment: it might be worth including a discussion in the paper on why earlier approaches get the dimension-dependent error.

Limitations

Yes, the authors discuss the limitations.

Author Response

Thank you for your review. We address your comments below.

Bounded assumption on the target distribution. Thank you for raising this point!

  • We agree that the bounded support assumption is stronger than, for example, the bounded second moment assumption, and it excludes Gaussian distributions. However, we argue that this condition is satisfied in most practical scenarios. Arguably, the most important applications of diffusion models are to generate new images from a target distribution, where image data is naturally bounded. For instance, the CIFAR and ImageNet datasets consist of images with pixel values ranging from 0 to 255 or $[0,1]$, and it is common to normalize them to the range $[-1,1]$. Then the $\ell_2$ norm of an image from, e.g., the CIFAR dataset, is typically below $60$ (indeed, a normalized CIFAR image has $3\times 32\times 32 = 3072$ coordinates, so its $\ell_2$ norm is at most $\sqrt{3072}\approx 55.4$). Moreover, for DDPM in image generation tasks, $T$ is typically around $100$. Hence, we believe it is reasonable to assume in theoretical analysis that the radius of the data distribution is bounded by $T^{c_R}$, where $c_R$ is any given fixed constant that can be very large.
  • We acknowledge that it is unfortunate not to cover common distributions like the Gaussian. Although we believe the bounded support assumption is not essential, it greatly facilitates our analysis, especially since we use a novel set of analytical tools that characterize the dynamics of DDPM in a more deterministic manner to establish convergence guarantees in the presence of low-dimensional structure. While we can relax this assumption in some specific cases (e.g., for Gaussian distribution, as illustrated in the next paragraph), we find it challenging to achieve a clean and concise result that holds generally for any distribution with a finite second moment. We leave this to future investigation.

We will incorporate these discussions in our next revision.

Regarding the gap between our lower bound and the TV error. Thank you for your suggestion! We conduct an experiment to examine whether our coefficient design (2.4) is indeed the unique coefficient design that leads to dimension-independent TV error $\mathsf{TV}(q_1,p_1)$, described as follows.

Using the degenerate Gaussian distribution considered in Theorem 2 as a tractable example, we ran the DDPM sampler with exact score functions (so that the error only comes from discretization) and plotted the error curves of $\mathsf{TV}(q_1,p_1)$ versus the ambient dimension $d$ under several different choices of $T$. We implemented both our coefficient design (2.4) and another widely-used design $\eta_t = \sigma_t = 1-\alpha_t$. The results demonstrate that the TV error $\mathsf{TV}(q_1,p_1)$ is independent of the ambient dimension $d$ under our coefficient design (2.4), while it grows with $d$ when using the other design. Please refer to the figures in the attached PDF in the response to all reviewers. This supports our key message that (2.4) represents a unique coefficient design for DDPM in achieving dimension-independent error.
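To make the setup concrete, here is a minimal, self-contained sketch of this kind of check (not the authors' code). Because the target is a degenerate Gaussian and the exact score is linear, the law of the sampler iterates stays Gaussian with diagonal covariance, so $\mathsf{KL}(q_1\,\|\,p_1)$ can be propagated in closed form rather than estimated from samples. The noise schedule, the sampler update rule, and the helper name `kl_from_sampler` below are illustrative assumptions; for a faithful rerun, substitute the paper's parametrization (2.3) and coefficient design (2.4).

```python
import numpy as np

def kl_from_sampler(d, k, T, eta_fn, sigma2_fn):
    """KL(q_1 || p_1) for a N(0, I_k)-in-R^d target under a linear DDPM sampler.

    eta_fn / sigma2_fn map (alpha_t, abar_t) -> coefficients. Hypothetical helper;
    the schedule and update rule below are assumptions, not the paper's (2.3)/(2.4).
    """
    beta = np.full(T, min(10.0 * np.log(T) / T, 0.999))  # assumed constant schedule
    alpha = 1.0 - beta
    abar = np.cumprod(alpha)

    v_on = np.ones(T)    # forward variance per on-support coordinate (always 1)
    v_off = 1.0 - abar   # forward variance per off-support coordinate

    # Assumed sampler form: Y_{t-1} = (Y_t + eta_t * s_t(Y_t)) / sqrt(alpha_t) + sigma_t * z,
    # with the exact score s_t(x) = -x / var_t acting coordinate-wise.
    var_on, var_off = 1.0, 1.0                 # Y_T ~ N(0, I_d)
    for t in range(T - 1, 0, -1):              # 0-indexed; stops once Y_1 is reached
        e, s2 = eta_fn(alpha[t], abar[t]), sigma2_fn(alpha[t], abar[t])
        var_on = ((1.0 - e / v_on[t]) / np.sqrt(alpha[t])) ** 2 * var_on + s2
        var_off = ((1.0 - e / v_off[t]) / np.sqrt(alpha[t])) ** 2 * var_off + s2

    def kl1(vq, vp):  # KL(N(0, vq) || N(0, vp))
        return 0.5 * (vq / vp - 1.0 - np.log(vq / vp))

    return k * kl1(v_on[0], var_on) + (d - k) * kl1(v_off[0], var_off)

baseline = lambda a, ab: 1.0 - a  # eta_t = sigma_t^2 = 1 - alpha_t (the baseline above)
for d in [10, 100, 1000]:
    print(d, kl_from_sampler(d, k=8, T=200, eta_fn=baseline, sigma2_fn=baseline))
```

The point of the coordinate-wise bookkeeping is that the $d-k$ off-support coordinates each contribute a per-coordinate KL gap that depends only on $T$ and the coefficient choice; under a mismatched design this gap is bounded away from zero, so the total error grows linearly in $d$, which is the qualitative effect the figures in the attached PDF illustrate.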

Dependency on $\log d$. Thank you for raising this point! The polylogarithmic dependency w.r.t. the ambient dimension $d$ arises from the analysis tools that allow us to characterize the algorithmic dynamics in a more deterministic manner. This analysis framework is crucial for tackling the convergence of DDPM in the presence of low-dimensional structure.

In the proof, we identified a "typical" high-probability set where we are able to characterize the evolution of the conditional density precisely (see our discussion at the beginning of Section 4). However, the algorithmic dynamics outside this set are difficult to track, and we have to bound some quantities related to the dynamics outside the typical set at the order of $d$. As a result, we have to set the radius of the "typical" high-probability set to be large enough (larger than some logarithmic factor of $d$), such that the exceptional probability is small (e.g., smaller than $O(1/\mathsf{poly}(d))$) in order to offset the impact of these terms. We think it might be possible to improve the polylogarithmic dependency on $d$ via more refined analysis, and we leave it to future investigation.

Discussion on coefficient design. Thank you for the suggestion! We have extended Theorem 2 to a general lower bound that works for arbitrary low-dimensional data distributions. Please see our response to all reviewers for this general lower bound, as well as an outline of the proof. We believe that this general result can help us understand the role of our coefficient design (2.4), as well as why earlier designs incur the dimension-dependent error. To facilitate discussion, we copy the lower bound here:

$$\mathbb{E}_{x_t \sim q_t}\big[\mathsf{KL}\big(p_{X_{t-1}|X_t}(\cdot\,|\,x_t) \,\big\|\, p_{Y_{t-1}|Y_t}(\cdot\,|\,x_t)\big)\big] \geq \bigg(\frac{\sigma_t^{\star 2}}{\sigma_t^2} + 2\log\frac{\sigma_t}{\sigma_t^\star} - 1\bigg)\frac{d}{2} + \frac{c_0(\eta_t-\eta_t^\star)^2 d}{2\sigma_t^2(1-\overline{\alpha}_t)} - C_5^2\,\frac{k^4\log^6 T}{T^2}\bigg(3+\frac{\sigma_t^{\star 2}}{\sigma_t^2}\bigg) - C_5\,\frac{k^2\log^3 T}{T}\bigg|\frac{\sigma_t^{\star 2}}{\sigma_t^2} - 1\bigg|\sqrt{d} - \exp(-c_1 k\log T)$$

This bound is achieved by a novel set of analysis tools that characterize the algorithmic dynamics in a more deterministic manner, which together with our analysis in Theorem 1 provides a sharp characterization of the impact of the coefficient design (i.e., $\eta_t$ and $\sigma_t$) on the error induced in each denoising step. It can be seen that, unless we take $\eta_t=\eta_t^\star$ and $\sigma_t=\sigma_t^\star$ as suggested in (2.4), there will be an inevitable error scaling at least linearly in $d$ incurred in each denoising step.

Comment

I thank the authors for providing the detailed rebuttal and also providing the additional experiment for the uniqueness of the coefficient.

  • I understand some of the explanations provided by the authors, but as the authors mentioned, each pixel is normalized between 0 and 1 or -1 and 1. If we think of each pixel as a coordinate of the input, then the norm of the input is typically $O(\sqrt{d})$, but the result is proven under the assumption that the norm of the input is $\mathrm{poly}(T) = \mathrm{poly}(k, \log d)$.
  • The experiments on the synthetic data are interesting. Can you provide more details on the baseline choices of $\eta_t$ and $\sigma_t$? The original DDPM paper [1] uses different $\alpha_t$ and $\sigma_t^2 = \sqrt{\alpha}_t \beta_t$ (note that the definition of $\sigma_t$ differs between this work and the original work). The follow-up work [2] also seems to use different choices of $\eta_t$ and $\sigma_t$ than the ones used in the above experiment, so can you provide some references that use the choice $\eta_t = \sigma_t = 1-\alpha_t$?

[1] Denoising Diffusion Probabilistic Models

[2] Improved Denoising Diffusion Probabilistic Models

Comment

Thank you for your reply!

On the $d$-dependence. Thanks for the insightful comment. We agree that the norm of the input image $X_0$ is typically $\sqrt{d}$. However, this won't affect our main message, and the reason is as follows:

  • Our paper requires that, for some fixed (arbitrary) constant $c_R$, the norm of $X_0$ is bounded by $T^{c_R}$. This means that we need $T \geq d^{1/(2c_R)}$. Since $c_R$ can be chosen arbitrarily, this is weaker than any polynomial dependence on $d$.
  • In fact, the $d$-dependence in this requirement can be removed at the price of an additional $\log d$ factor in the final error bound. Although we treat $c_R$ as a universal constant in the current paper and hide its dependence in Theorem 1, the final upper bound in Theorem 1 actually depends polynomially on $c_R$ as follows: $\mathsf{TV}(q_1,p_1) \lesssim c_R^2\,\frac{(k+\log d)^2\log^3 T}{\sqrt{T}} + \varepsilon_{\mathsf{score}}\log T$. We will make this dependence explicit in the next revision. If we take $c_R \asymp \log d$, then our error bound becomes larger by a factor of $\log^2 d$, which is still nearly as good as the current one. Meanwhile, since $d^{1/\Omega(\log d)} = O(1)$ (indeed, $d^{1/(2\log d)} = e^{1/2}$), the requirement $T \geq d^{1/(2c_R)} = O(1)$ becomes independent of $d$.

On the choices of $\eta_t$ and $\sigma_t$. Thanks for raising this point. We use the same $\eta_t$ as [1] and [2], and we agree that the choices of $\sigma_t$ in [1] and [2] differ from our choice in the experiment. Here, [1] chose $\sigma_t^2 = \alpha_t\beta_t$, while our choice is $\sigma_t^2 = \beta_t$. We use this type of $\sigma_t^2$ as our baseline, since it is commonly adopted in the theoretical literature (e.g., [3]) and has been shown to achieve $\mathsf{poly}(d)/T$ convergence. In addition, we have also implemented the choice $\sigma_t^2 = \alpha_t\beta_t$ from [1] in the numerical experiments, and the performance is very similar to using $\sigma_t^2 = \beta_t$. We will include this result in our next revision. Our lower bound based on the degenerate Gaussian distribution also shows that the choice $\sigma_t^2 = \alpha_t\beta_t$ will incur some error that is linear in $d$.

Note that $\sigma_t^2$ in [2] is learned from the dataset in the hope of achieving optimal performance, which takes a different approach from our current work, where we intend to identify the coefficients based on the learning rates of the forward process (i.e., the $\alpha_t$'s) in a non-data-driven way. We will discuss this point more clearly in the next revision.
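Continuing the illustrative sketch given after the experiment discussion above (with the same caveats: `kl_from_sampler`, its schedule, and its update rule are hypothetical stand-ins rather than the paper's exact parametrization), this alternative baseline would be a one-line swap:

```python
# sigma_t^2 = alpha_t * beta_t as in [1], keeping eta_t = 1 - alpha_t (assumed mapping)
ddpm_sigma2 = lambda a, ab: a * (1.0 - a)
kl = kl_from_sampler(d=100, k=8, T=200,
                     eta_fn=lambda a, ab: 1.0 - a, sigma2_fn=ddpm_sigma2)
```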

[3] Towards Non-Asymptotic Convergence for Diffusion-Based Generative Models, G. Li, Y. Wei, Y. Chen and Y. Chi, ICLR 2024.

Comment

Thanks again for your efforts in reviewing our paper and for your helpful comments! We have carefully considered your questions and addressed them in our response. The discussion phase is due to conclude in less than 20 hours, and we would like to know whether our response has appropriately addressed your questions and concerns about our paper. If we have addressed your concerns, we would appreciate it if you consider increasing your score for our paper. Please let us know if you have further comments or concerns about our paper. Thank you!

Review (Rating: 7)

In DDPM-style diffusion models, we need to discretize a continuous process. Benton et al [3] showed that ~ d/eps^2 steps suffice over d dimensions. But what if (as is typical) the data lie in a space of lower intrinsic dimension (e.g. k-sparse or a k-dimensional manifold)? This paper shows how to replace the d in the iteration complexity by k^4.

Strengths

This paper shows that DDPM can be nearly dimension independent, up to log factors only depending on the intrinsic dimension. This holds for a very general definition of intrinsic dimension---just the log covering number. I was pretty surprised to learn that such a result is possible.

Doing so requires particular choices of step size, which they show are necessary, at least for the typical Girsanov-style proof approach.

Weaknesses

Obviously it would be nice to see k rather than k^4. We don't typically expect intrinsic dimension << d^{1/4}, so this isn't giving a real quantitative improvement.

There's very little intuition for what's going on. Why are these the right step sizes?

Questions

How does the schedule (2.4) compare to the suggestion from [3]?

What happens if X isn't exactly low dimensional, but is close? Say, there is eps TV mass outside the cover.

Limitations

None.

Author Response

Thank you for your review. We address your comments below.

Intuition for our coefficient design. Thanks for raising this point. We have extended Theorem 2 to a general lower bound that works for arbitrary low-dimensional data distributions. Please see our response to all reviewers for this general lower bound, as well as an outline of the proof. We believe that this general result can help us understand the role of our coefficient design (2.4). To facilitate discussion, we copy the lower bound here:

$$\mathbb{E}_{x_t \sim q_t}\big[\mathsf{KL}\big(p_{X_{t-1}|X_t}(\cdot\,|\,x_t) \,\big\|\, p_{Y_{t-1}|Y_t}(\cdot\,|\,x_t)\big)\big] \geq \bigg(\frac{\sigma_t^{\star 2}}{\sigma_t^2} + 2\log\frac{\sigma_t}{\sigma_t^\star} - 1\bigg)\frac{d}{2} + \frac{c_0(\eta_t-\eta_t^\star)^2 d}{2\sigma_t^2(1-\overline{\alpha}_t)} - C_5^2\,\frac{k^4\log^6 T}{T^2}\bigg(3+\frac{\sigma_t^{\star 2}}{\sigma_t^2}\bigg) - C_5\,\frac{k^2\log^3 T}{T}\bigg|\frac{\sigma_t^{\star 2}}{\sigma_t^2} - 1\bigg|\sqrt{d} - \exp(-c_1 k\log T)$$

This bound is achieved by a novel set of analysis tools that characterize the algorithmic dynamics in a more deterministic manner, which together with our analysis in Theorem 1 provides a sharp characterization of the impact of the coefficient design (i.e., $\eta_t$ and $\sigma_t$) on the error induced in each denoising step. It can be seen that, unless we take $\eta_t=\eta_t^\star$ and $\sigma_t=\sigma_t^\star$ as suggested in (2.4), there will be an inevitable error scaling at least linearly in $d$ incurred in each denoising step.

Comparison with other coefficient designs. Thank you for raising this point. In the prior work [3], the marginal distribution of the forward process is $X_{t_k} \sim e^{-t_k}X_0 + \sqrt{1-e^{-2t_k}}\,\overline{W}$, where $\overline{W}\sim\mathcal{N}(0,I_d)$. This means that $e^{-2t_k}$ plays the role of $\overline{\alpha}_t$ in our paper. By examining their discretization rule (see Eq. (4) therein), we find that they use the following update rule:

$$Y_{t-1} = (1+\Delta_t)Y_t + 2\Delta_t s_t(Y_t) + \sqrt{\Delta_t}\,Z_t,$$

where $\Delta_t = \frac{1}{2}\log\frac{1}{\alpha_t}$. By applying a similar calculation as in Theorem 2, we can show that:

$$\mathbb{E}_{x_t\sim q_t}\big[\mathsf{KL}\big(p_{X_{t-1}|X_t}(\cdot\,|\,x_t)\,\big\|\,p_{Y_{t-1}|Y_t}(\cdot\,|\,x_t)\big)\big] \gtrsim \frac{d}{T^2}.$$

Hence, the coefficient design in [3] will also incur dimension-dependent error in each denoising step. We will incorporate these discussions in our next revision.

Approximately low-dimensional structure. Thank you for raising this point. We first would like to clarify that we are not assuming that the data distribution is exactly low-dimensional. Our characterization of low-dimensionality is based on the covering number of the support, as defined in Section 2, which accommodates approximately low-dimensional structure. For example, our setup includes the case that $\mathsf{supp}(p_{\mathsf{data}}) \subseteq \cup_{i \le N_{\varepsilon}} \mathbb{B}(c_i, \varepsilon)$, a union of balls centered at the $c_i$ with radius $\varepsilon$, whose dimension is $d$ instead of $k$. Therefore, even if the support of the data distribution is not exactly low-dimensional, our results can still be applied. (We apologize if we misunderstood your question.)

It is an interesting question whether our result is stable against adding an $\varepsilon$-mass outside the cover. This setting is beyond our current result, since this mass can be spread out in the full space (thus having a large covering number that depends on $d$). We conjecture that our result will be stable to this perturbation, namely that it is possible to show an error bound similar to Theorem 1 with an additional term proportional to $\varepsilon$ (or $\varepsilon d$). Currently, we don't know how to prove this result, and we leave it for future investigation.

Improving the quartic dependency on $k$. Thank you for raising this point. Currently, we don't know how to improve this dependency, which we believe requires new analysis frameworks and tools. We leave this for future investigation.

Comment

Thank you for your response.

Review (Rating: 6)

This paper investigates score-based diffusion models when the underlying distribution is near low-dimensional manifolds in a higher-dimensional space. It addresses a gap in the theoretical understanding of diffusion models, whose guarantees are suboptimal in the presence of low-dimensional structures. For the DDPM, the error dependency on the ambient dimension $d$ is generally unavoidable during each denoising step in the existing literature. However, the authors identify a unique set of coefficients that yields a convergence rate of $O(k^2/\sqrt{T})$, where $k$ is the intrinsic dimension and $T$ is the number of steps. The analysis employs novel tools that characterize the algorithmic dynamics in a deterministic manner. Additionally, the paper establishes that the DDPM sampler's error, influenced by time discretization and score estimation, is nearly dimension-free, with the ambient dimension $d$ appearing only in logarithmic terms.

Strengths

1. Theoretical Advancement in Understanding Diffusion Models. The paper makes a good contribution to the theoretical understanding of score-based diffusion models in the context of low-dimensional manifolds. By identifying a unique set of coefficients that yields a convergence rate dependent on the intrinsic dimension $k$ rather than the ambient dimension $d$, the authors provide a refined theoretical framework that addresses previously suboptimal theoretical support.

2. New Analytical Tools and Methodology. The paper introduces a new set of analytical tools to characterize the algorithmic dynamics of the DDPM sampler in a deterministic manner. This innovative approach allows for a more precise analysis of the error sources—time discretization and score estimation errors.

3. In general, the paper is well-written and enjoyable to read.

Weaknesses

1. Strict Assumption on the Data Distribution. In Line 126, the authors assume that the support set of the data distribution is bounded. This assumption is overly restrictive and may not hold for many real-world data distributions, such as Gaussian distributions, which have unbounded support. The authors should consider relaxing this assumption or providing justification for its necessity, as well as discussing the implications of this restriction on the generalizability and robustness of their findings.

2. Lack of Empirical Demonstration. Although this paper presents a solid and sharp theoretical analysis of the convergence rate of DDPM, the authors don't provide any experimental results to support their theory. Without experimental evidence, it is difficult to assess the real-world performance and robustness of the proposed coefficient design and convergence rate improvements.

Questions

Q1. Is it widely used in the literature to employ $\epsilon$-nets and covering numbers to measure the intrinsic dimension of a data distribution? The authors should provide a more thorough discussion on this point.

Q2. In Line 126, the authors assume that the support of the data distribution is bounded. However, the support of the Gaussian distribution in Theorem 2 is not bounded. It is believed that this is not a good example to demonstrate the uniqueness of coefficient design.

Limitations

Not applicable

Author Response

Thank you for your review. We address your comments below.

Bounded assumption on the target distribution.

  • We agree that the bounded support assumption is stronger than, for example, the bounded second moment assumption, and it excludes Gaussian distributions. However, we argue that this condition is satisfied in most practical scenarios. Arguably, the most important applications of diffusion models are to generate new images from a target distribution, where image data is naturally bounded. For instance, the CIFAR and ImageNet datasets consist of images with pixel values ranging from 0 to 255 or $[0,1]$, and it is common to normalize them to the range $[-1,1]$. Then the $\ell_2$ norm of an image from, e.g., the CIFAR dataset, is typically below $60$ (indeed, a normalized CIFAR image has $3\times 32\times 32 = 3072$ coordinates, so its $\ell_2$ norm is at most $\sqrt{3072}\approx 55.4$). Moreover, for DDPM in image generation tasks, $T$ is typically around $100$. Hence, we believe it is reasonable to assume in theoretical analysis that the radius of the data distribution is bounded by $T^{c_R}$, where $c_R$ is any given fixed constant that can be very large.
  • We acknowledge that it is unfortunate not to cover common distributions like the Gaussian. Although we believe the bounded support assumption is not essential, it greatly facilitates our analysis, especially since we use a novel set of analytical tools that characterize the dynamics of DDPM in a more deterministic manner to establish convergence guarantees in the presence of low-dimensional structure. While we can relax this assumption in some specific cases (e.g., for Gaussian distribution, as illustrated in the next paragraph), we find it challenging to achieve a clean and concise result that holds generally for any distribution with a finite second moment. We leave this to future investigation.

We will incorporate these discussions in our next revision.

Truncated Gaussian distribution in the lower bound. We agree that the degenerate Gaussian distribution considered in Theorem 2 is not a good example because it is unbounded. As we mentioned in footnote 2 on page 5, this lower bound can be extended to the truncated Gaussian distribution. We state the result for the truncated Gaussian in the following.

Before presenting the details, we also want to highlight that we have extended Theorem 2 to a general lower bound that works for arbitrary low-dimensional data distributions, which we believe also helps address this comment. Please see our response to all reviewers for this general lower bound, as well as an outline of the proof. We now continue with the results for the truncated Gaussian.

Let $X_0 \sim \mathcal{N}(0,I_k)$, and define its truncated counterpart $\widetilde{X}_0$ as

$$p_{\widetilde{X}_0}(x_0) \propto p_{X_0}(x_0)\,\mathbb{1}\big(\|x_0\|_{\infty} \le T\big).$$

Define $\widetilde{X}_t = \sqrt{\alpha_t}\widetilde{X}_{t-1} + \sqrt{1-\alpha_t}Z_t$, and construct the reverse process $\widetilde{Y}_t$ with score estimation of $\widetilde{X}_t$. Then we can establish exactly the same lower bounds for $\widetilde{X}_t$ and $\widetilde{Y}_t$. Notice that $\widetilde{X}_t$ and $\widetilde{Y}_t$ have independent entries; hence it suffices to study each entry separately (we use $\widetilde{X}_{t,i}$ and $\widetilde{Y}_{t,i}$ to denote their $i$-th entries), i.e.,

$$\mathsf{KL}\big(p_{\widetilde{X}_{t-1}|\widetilde{X}_t}(\cdot\,|\,x_t) \,\big\|\, p_{\widetilde{Y}_{t-1}|\widetilde{Y}_t}(\cdot\,|\,x_t)\big) \ge \sum_{i>k} \mathsf{KL}\big(p_{\widetilde{X}_{t-1,i}|\widetilde{X}_{t,i}}(\cdot\,|\,x_{t,i}) \,\big\|\, p_{\widetilde{Y}_{t-1,i}|\widetilde{Y}_{t,i}}(\cdot\,|\,x_{t,i})\big).$$

The right hand side of the above inequality obeys

$$\text{RHS} = \sum_{i>k} \mathsf{KL}\big(p_{X_{t-1,i}|X_{t,i}}(\cdot\,|\,x_{t,i}) \,\big\|\, p_{Y_{t-1,i}|Y_{t,i}}(\cdot\,|\,x_{t,i})\big) \ge \frac{d}{4}\big(\eta_t-\eta_t^\star\big)^2 + \frac{d}{40}\bigg(\frac{\sigma_t^{\star 2}}{\sigma_t^2}-1\bigg)^2,$$

where the first relation holds since the truncation does not affect the entries with zero variance, and the second relation is proved in Lines 550-552 in Appendix B.
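For reference, the per-coordinate computations above reduce to the standard formula for the KL divergence between two scalar Gaussians,

$$\mathsf{KL}\big(\mathcal{N}(\mu_1,\sigma_1^2)\,\big\|\,\mathcal{N}(\mu_2,\sigma_2^2)\big) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2+(\mu_1-\mu_2)^2}{2\sigma_2^2} - \frac{1}{2},$$

applied with the conditional means and variances of the true and sampled reverse transitions.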

Empirical demonstration. Thank you for raising this point. We have conducted a numerical experiment using the data distribution considered in Theorem 2. The results demonstrate that the TV error and KL divergence are independent of the ambient dimension $d$ under our coefficient design (2.4), while they grow with $d$ under other widely-used designs. Please refer to the figures in the attached PDF in the response to all reviewers, where we plot the error curves of $\mathsf{KL}(q_1\,\|\,p_1)$ and $\mathsf{TV}(q_1,p_1)$ versus $d$, with all other setups fixed. This supports the conjecture that (2.4) represents a unique coefficient design for achieving dimension-independent error.

While the lack of GPUs prevents us from examining this design in more complex, large-scale tasks, we believe that our theoretical findings and empirical observations are already meaningful and interesting.

Our use of $\varepsilon$-covering. Thanks for raising this point! While we believe that it is a good idea to use the covering number to characterize low-dimensional structure in our problem, we are not aware of other existing literature that does the same. We choose to use the $\varepsilon$-net and covering number to define approximate low-dimensional structure because we believe this is less stringent and more general than assuming an exact low-dimensional structure (e.g., by assuming that the support of the data distribution lives in a low-dimensional subspace). As a sanity check, we showed in Section 2 that when the support of the data distribution lives in an $r$-dimensional subspace, our intrinsic dimension $k$ defined through the covering number is of order $r$, confirming that our definition is indeed more general. We will incorporate these discussions in our next revision.

Comment

Thanks for the authors' rebuttal. I have two further comments: (i) For the bounded support assumption, can this proof be applied to a uniform distribution over a sphere? (ii) Can you provide some literature on using $\epsilon$-nets to characterize intrinsic dimension?

Comment

Thank you for your response!

Uniform distribution over a sphere. Yes, our proof and result can be applied to a uniform distribution over a sphere, since the only assumption we impose on the data distribution $p_{\mathsf{data}}$ is boundedness. However, the intrinsic dimension $k$ of, e.g., the unit sphere $\mathbb{S}^{d-1}$ in $\mathbb{R}^d$ is $d-1$, which is not a typical low-dimensional structure. Although our theory holds for general $k$, the most interesting regime is $k\ll d$, where our results significantly improve upon convergence rates that have polynomial dependence on $d$.

Using covering number to characterize intrinsic dimension. Thank you for asking this. Here we provide some related literature and discussion on this issue.

  • In fact, our definition of the intrinsic dimension $k$ is actually the metric entropy of $\mathcal{X}$, the support of $p_{\mathsf{data}}$. Metric entropy is defined via the covering number, and is widely used in statistics and learning theory to characterize the complexity of a set/class in a metric space, which is useful in proving sample complexity and generalization bounds for algorithms; see, e.g., Sections 5 and 14 in [1]. Low-dimensionality is also a notion of complexity, so we believe it is very natural to use the covering number, or metric entropy, to characterize the intrinsic dimension.
  • The prior work [2], which studied diffusion models on low-dimensional data, assumes that the data is supported on a low-dimensional linear subspace. More generally, another work [3] assumes that the distribution is supported on a union of low-dimensional linear subspaces. As we discussed in Section 2 and in the rebuttal, our intrinsic dimension $k$ defined through the covering number is of order $k$ for a $k$-dimensional linear subspace, and it is easily seen to be of order $\sum_{i=1}^{m} k_i$ for the union of $m$ linear subspaces (each with dimension $k_i$); see the standard covering bound sketched below. Therefore, using the covering number to characterize the intrinsic dimension admits the setups in these prior works as special examples, and is more general and robust.
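To make the sanity check concrete, here is the standard volumetric covering bound behind these claims (a textbook fact, see e.g. [1], not a new result): for a ball of radius $R$ inside a $k$-dimensional subspace,

$$N_\varepsilon \le \Big(1+\frac{2R}{\varepsilon}\Big)^{k}, \qquad \text{so} \qquad \log N_\varepsilon \le k\log\frac{3R}{\varepsilon} \quad \text{for } \varepsilon \le R,$$

and since covering numbers of a union add, a union of $m$ such sets has metric entropy at most $\log m + \max_i k_i \log(3R/\varepsilon) \le \big(\sum_{i=1}^m k_i\big)\log(3mR/\varepsilon)$, consistent with the $\sum_{i=1}^m k_i$ scaling mentioned above.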

The discussion phase is due to conclude in 20 hours, and we would like to know whether our response has appropriately addressed your questions and concerns about our paper. If we have addressed your concerns, we would appreciate it if you consider increasing your score for our paper. Please let us know if you have further comments or concerns about our paper. Thank you!

[1] High-Dimensional Statistics: A Non-Asymptotics Viewpoint, M. J. Wainwright, Cambridge University Press, 2019.

[2] Score Approximation, Estimation and Distribution Recovery of Diffusion Models on Low-Dimensional Data, M. Chen, K. Huang, T. Zhao, M. Wang, ICML 2023.

[3] Robust Subspace Clustering, M. Soltanolkotabi, E. Elhamifar, E. J. Candes, Annals of Statistics, 2014.

Comment

Thanks for the authors' response, which addresses most of my concerns. I have increased my score.

Review (Rating: 4)

This paper discusses the DDPM sampler's capability to adapt to unknown low-dimensional structures in the target distribution, and studies how this informs the coefficient design.

Strengths

The paper is written clearly and has a nice structure in general. It contributes to an important topic in diffusion models about adapting to lower-dimensional structure. In the first part of the paper, the authors show that, with a particular coefficient design (2.4), the TV error of the DDPM sampler has an upper bound that depends on the intrinsic dimension. In the second part of the paper, the authors demonstrate the uniqueness of this choice of coefficients.

Weaknesses

The main weakness of the paper is its generality. The adaptivity to the low-dimensional structure is based on a particular coefficient design; however, this design is not shown to be unique for the TV error, or for general target data. In this reviewer's opinion, it would be better to write the contribution section more precisely in terms of the setup and the scope of the results. The concerns are specified in the questions section.

Questions

  1. How does this paper's result compare with [1], which does not require knowing/estimating the manifold? Can the result in the paper lead to a tighter bound when a subspace is given a priori? For example, how does it compare to [2]?

[1] Rong Tang and Yun Yang. Adaptivity of diffusion models to manifold structures. In International Conference on Artificial Intelligence and Statistics, 2024.

[2] Kazusato Oko, Shunta Akiyama, and Taiji Suzuki. Diffusion models are minimax optimal distribution estimators. arXiv preprint arXiv:2303.01861, 2023.

  2. Section 3.2 about the unique coefficient design is very interesting; however, the results lack generality.
    • Theorem 2's result is derived in the case where the target data distribution is a standard Gaussian distribution. What can you get when considering a general data distribution?
    • As the authors noted at the end of the section, the uniqueness shown is with respect to the upper bound of the TV error. How does one make use of the uniqueness with respect to the upper bound in practice? What are the possible ways to address the coefficient design for the actual TV error, or to tighten the gap?

Limitations

The authors have discussed the limitations.

Author Response

Thank you for your review. We address your comments below.

Comparison with prior works. Thank you for the references. These two works and the current paper approach diffusion models from different perspectives, making a direct comparison challenging. Specifically, the two prior works establish error bounds for estimating the density under the assumption that the true target density meets certain smoothness conditions. Their results rely on a score-matching procedure designed based on the level of smoothness, situating their findings within the context of nonparametric statistics.

In contrast, our paper approaches the problem from an applied mathematics perspective, decoupling the error of DDPM into discretization error and score matching error, and characterizing them separately, akin to prior works [3, 4, 6, 13]. Our results assume minimal conditions on the target distribution, avoiding smoothness assumptions, and accommodating arbitrary score-matching procedures. Therefore, our results cannot be directly compared with these prior works.

Our paper does not require knowledge or estimation of the manifold, and our error bound is expressed as the sum of discretization error and score matching error:

$$\mathsf{TV}(q_1, p_1) \leq \underbrace{C\,\frac{(k+\log d)^2\log^3 T}{\sqrt{T}}}_{\text{discretization error}} + \underbrace{C\,\varepsilon_{\mathsf{score}}\log T}_{\text{score matching error}}.$$

When a subspace is given a priori, we believe it will not help in improving the discretization error bound, as the bound already adapts to the low-dimensional structure automatically. While there might still be room for improvement (e.g., reducing the polynomial dependency on $k$), we believe this requires new analytical tools rather than assuming access to the low-dimensional structure. However, knowing the subspace can aid in improving the score-matching error, as it is possible to exploit the subspace information to design an efficient score-matching procedure that achieves a smaller $\varepsilon_{\mathsf{score}}$. This, however, is not the main focus of our paper, as our result accommodates any score-matching procedure, which we believe is more general. We leave this for future investigation.

We will incorporate and discuss these references in our next revision.

Extending the lower bound to arbitrary data distributions. Thank you for raising this point. As far as we know, it is quite standard to use the worst-case error (or risk) to characterize the inherent difficulty of a problem in areas like statistics (e.g., minimax lower bounds) and optimization (e.g., algorithmic lower bounds). We believe that establishing a lower bound for the (degenerate) standard Gaussian distribution effectively illustrates our point: if the algorithm cannot perform well without the proposed coefficient design in probably the simplest case, we can hardly expect it to work well in more complicated examples. However, the good news is that for this problem, we can actually show a similar lower bound for arbitrary low-dimensional data distributions. Please see our response to all reviewers for this general lower bound, as well as an outline of the proof. We will include this result in our next revision.

Regarding the gap between our lower bound and the TV error. Thank you for raising this point. First, we would like to point out that we actually have an upper bound for the KL divergence between $q_1$ and $p_1$. In the analysis, we control $\mathsf{TV}(q_1, p_1)$ by bounding $\mathsf{KL}(q_1\,\|\,p_1)$ (see Eq. (3.2)). Hence the established KL lower bound in Theorem 2 is meaningful. We will add the KL divergence bound to our main result.

In addition, the lower bound for the KL divergence can also be extended to the TV distance. According to the calculation in Appendix B (see Lines 545 and 547), we know $p_{X_{t-1}|X_t}$ and $p_{Y_{t-1}|Y_t}$ are two Gaussian distributions. Then with basic calculations, it can also be shown that

$$\mathsf{TV}\big(p_{X_{t-1}|X_t},\, p_{Y_{t-1}|Y_t}\big) \gtrsim \min\bigg\{1,\; d\big(\eta_t-\eta_t^\star\big)^2 + d\bigg(\frac{\sigma_t^{\star 2}}{\sigma_t^2}-1\bigg)^2\bigg\}.$$

We will also include this result in the next revision.

Finally, due to the lack of GPUs, we conducted a numerical experiment using the data distribution considered in Theorem 2. The results demonstrate that the TV error and KL divergence are independent of the ambient dimension $d$ under our coefficient design (2.4), while they grow with $d$ under other widely-used designs. Please refer to the figures in the attached PDF in the response to all reviewers, where we plot the error curves of $\mathsf{KL}(q_1\,\|\,p_1)$ and $\mathsf{TV}(q_1,p_1)$ versus $d$, with all other setups fixed. This supports the conjecture that (2.4) represents a unique coefficient design for achieving dimension-independent error.

Comment

I thank the authors for the extensive rebuttal addressing the comments, and for the additional proof and numerical experiments. This comment again asks about the requirements on the target data and the generality of the results. The generalization to more general target data is not trivial, as the probability measure of the target data is not absolutely continuous with respect to the measure of the prior distribution. So, depending on the target data, the error in the denoising steps could explode. The choice of the time schedule would also be relevant here.

Comment

Thank you for the reply! We are not sure whether we understand the "prior distribution" in your comment correctly. Please let us know if we misunderstood anything.

  • First, we would like to clarify that the only assumption we impose on the target distribution $p_{\mathsf{data}}$ is that it has bounded support. We do not need any further assumptions, e.g., absolute continuity, in order to establish our results. In what follows, we will explain why we don't need to worry about the explosion of error, both in the upper bound (Theorem 1) and the lower bound (Theorem 2).
  • Regarding the upper bound in Theorem 1, our error metric is the TV distance between the distributions of $X_1$ and $Y_1$ (i.e., $q_1$ and $p_1$), which are both absolutely continuous w.r.t. the Lebesgue measure (i.e., they both have densities). This circumvents any potential issue when the data distribution $p_{\mathsf{data}}$ of $X_0$ is not continuous. Since $X_0$ and $X_1$ are exceedingly close ($\beta_1=T^{-c_0}$ is vanishingly small), this error metric also reflects the closeness of the generated distribution and the target distribution.
  • The generalized lower bound stated in the rebuttal is established for the error incurred in each denoising step, which is defined as the expected KL divergence between the two conditional distributions $p_{X_{t-1}|X_t}$ and $p_{Y_{t-1}|Y_t}$, for $2\leq t \leq T$ (again, it does not concern $t=1$, circumventing the case where the data distribution is not absolutely continuous). These two distributions are both absolutely continuous w.r.t. the Lebesgue measure (i.e., they both have densities), regardless of whether the target distribution $p_{\mathsf{data}}$ of $X_0$ is absolutely continuous or not.
  • To further support our claim above, we would like to mention that prior works [3,4,6,13] showed that even without using our coefficient design (2.4), other reasonable coefficient designs also lead to convergence rates of $\mathsf{poly}(d)/T^2$ for the error incurred in each denoising step (which sum up to an overall error of $\mathsf{poly}(d)/T$). This also suggests that the error in the denoising steps will not explode even if $p_{\mathsf{data}}$ is not continuous -- it just suffers from dependence on the ambient dimension $d$.
  • While our upper bound is established under the specific time schedule (i.e., the $\beta_t$'s) in Section 2, the lower bound, including the one for the degenerate Gaussian as well as the generalized one, holds for a reasonably large class of time schedules (with the exception of some corner cases). We will specify this in our next revision.

We are happy to discuss more details with you in case you have concerns about the generalized lower bound. Nevertheless, as we discussed in the rebuttal, we believe that our original lower bound established for a simple example (the degenerate Gaussian) already effectively illustrates our point. We present the generalized lower bound in the rebuttal because we think it is a beautiful, concise, yet powerful result that can make the paper better.

Comment

Thanks again for your efforts in reviewing our paper and for your helpful comments! We have carefully considered your questions and addressed them in our response. The discussion phase is due to conclude in less than 20 hours, and we would like to know whether our response has appropriately addressed your questions and concerns about our paper. If we have addressed your concerns, we would appreciate it if you consider increasing your score for our paper. Please let us know if you have further comments or concerns about our paper. Thank you!

Author Response

We thank all reviewers for their feedback. Here we address some common comments and questions.

Extending the lower bound to arbitrary data distributions. We agree with several reviewers' comments that the lower bound in Theorem 2 only covers the Gaussian distribution, which can be restrictive. Here we generalize Theorem 2 to establish a similar lower bound for general low-dimensional distributions:

Theorem. Consider an arbitrary data distribution $p_{\mathsf{data}}$ satisfying the assumptions in Section 2. For the DDPM sampler (2.3) with perfect score estimation and arbitrary coefficients $\eta_t$ and $\sigma_t$, we have

$$\mathbb{E}_{x_t \sim q_t}\big[\mathsf{KL}\big(p_{X_{t-1}|X_t}(\cdot\,|\,x_t) \,\big\|\, p_{Y_{t-1}|Y_t}(\cdot\,|\,x_t)\big)\big] \geq \bigg(\frac{\sigma_t^{\star 2}}{\sigma_t^2} + 2\log\frac{\sigma_t}{\sigma_t^\star} - 1\bigg)\frac{d}{2} + \frac{c_0(\eta_t-\eta_t^\star)^2 d}{2\sigma_t^2(1-\overline{\alpha}_t)} - C_5^2\,\frac{k^4\log^6 T}{T^2}\bigg(3+\frac{\sigma_t^{\star 2}}{\sigma_t^2}\bigg) - C_5\,\frac{k^2\log^3 T}{T}\bigg|\frac{\sigma_t^{\star 2}}{\sigma_t^2} - 1\bigg|\sqrt{d} - \exp(-c_1 k\log T)$$

for each $2\leq t\leq T$, where $c_0,c_1,C_5>0$ are some universal constants.

Notice the fact that $x^2 - 2\log x - 1 \geq 0$ for any $x>0$, with equality only if $x=1$ (see the short derivation below). Therefore, the above result suggests that, unless $\eta_t=\eta_t^\star$ and $\sigma_t=\sigma_t^\star$, the corresponding denoising step will incur an error that is linear in $d$ when $d$ is sufficiently large. We will include this result in our next revision.
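For completeness, the elementary fact invoked here: with $f(x) = x^2 - 2\log x - 1$,

$$f'(x) = 2x - \frac{2}{x} = \frac{2(x^2-1)}{x},$$

so $f$ decreases on $(0,1)$ and increases on $(1,\infty)$, with unique minimum $f(1)=0$. Applied with $x = \sigma_t^\star/\sigma_t$, the first term of the lower bound is nonnegative and vanishes only when $\sigma_t = \sigma_t^\star$.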

Proof sketch. Following the equation around Line 245 in Section 4.4, we start with the following bound:

$$\int_{x_{t-1}, x_t} p_{X_{t-1}|X_t}(x_{t-1}\,|\,x_t)\,\log\bigg(\frac{p_{Y_{t-1}^\star|Y_t}(x_{t-1}\,|\,x_t)}{p_{Y_{t-1}|Y_t}(x_{t-1}\,|\,x_t)}\bigg)\,p_{X_t}(x_t)\,\mathrm{d}x_{t-1}\mathrm{d}x_t =: \mathcal{I}_t,$$

where we recall the definition of $Y_{t-1}^\star$ in (4.1). It boils down to understanding $\mathcal{I}_t$:

$$\mathcal{I}_t = d\log\frac{\sigma_t}{\sigma_t^\star} + \int_{x_t, x_{t-1}} p_{X_t}(x_t)\,p_{X_{t-1}|X_t}(x_{t-1}\,|\,x_t)\bigg(\frac{\|\sqrt{\alpha_t}z_t-(\eta_t-\eta_t^\star)s_t^\star(x_t)\|_2^2}{2\sigma_t^2} - \frac{\|\sqrt{\alpha_t}z_t\|_2^2}{2\sigma_t^{\star 2}}\bigg)\mathrm{d}x_{t-1}\mathrm{d}x_t,$$

where $\sqrt{\alpha_t}z_t := \sqrt{\alpha_t}x_{t-1}-x_t-\eta_t^\star s_t^\star(x_t)$. Using a similar analysis as in Lemma 6, the above integral on $\mathcal{A}_t^{\mathrm{c}}$ (outside the typical set) can be controlled to the order of $\exp(-\Omega(k\log T))$. According to Lemma 3, for $(x_t,x_{t-1})\in \mathcal{A}_t$ we have

$$\bigg|\frac{p_{X_{t-1}|X_t}(x_{t-1}\,|\,x_t)}{p_{Y_{t-1}^\star|Y_t}(x_{t-1}\,|\,x_t)}-1\bigg| \leq C_5\,\frac{k^2\log^3 T}{T}.$$

Therefore,

$$\mathcal{I}_t \ge d\log\frac{\sigma_t}{\sigma_t^\star} + \int_{x_t}\bigg(\frac{\|(\eta_t-\eta_t^\star)s_t^\star(x_t)\|_2^2}{2\sigma_t^2} - C_5\frac{k^2\log^3 T}{T}\,\frac{\sigma_t^\star\|(\eta_t-\eta_t^\star)s_t^\star(x_t)\|_2}{\sigma_t^2}\bigg)p_{X_t}(x_t)\,\mathrm{d}x_t + \bigg(\frac{\sigma_t^{\star 2}}{\sigma_t^2} - 1\bigg)\frac{d}{2} - C_5\frac{k^2\log^3 T}{T}\bigg|\frac{\sigma_t^{\star 2}}{\sigma_t^2} - 1\bigg|\sqrt{d} - \exp(-c_1 k\log T).$$

Here, we make use of the observation that for $0 < \delta < 1$,

$$\mathbb{P}\bigg(\bigg|\frac{\|\sqrt{\alpha_t}z_t\|_2^2}{\sigma_t^{\star 2}} - d\bigg|^2 \ge 2d\log\frac{1}{\delta}\bigg) \le \delta.$$

Then the desired lower bound follows from the following facts:

$$\int_{x_t} C_5\frac{k^2\log^3 T}{T}\,\frac{\sigma_t^\star\|(\eta_t-\eta_t^\star)s_t^\star(x_t)\|_2}{\sigma_t^2}\,p_{X_t}(x_t)\,\mathrm{d}x_t \le \int_{x_t}\frac{\|(\eta_t-\eta_t^\star)s_t^\star(x_t)\|_2^2}{4\sigma_t^2}\,p_{X_t}(x_t)\,\mathrm{d}x_t + C_5^2\frac{k^4\log^6 T}{T^2}\,\frac{\sigma_t^{\star 2}}{\sigma_t^2},$$

as well as

$$\int_{x_t} \|s_t^\star(x_t)\|_2^2\,p_{X_t}(x_t)\,\mathrm{d}x_t \asymp \frac{d}{1-\overline{\alpha}_t}.$$

We will add the complete proof to our next revision.

Setup of the numerical experiment. We conduct some experiments to examine whether (2.4) is indeed the unique coefficient design that leads to dimension-independent error. We provide the setup for this experiment here. We use the degenerate Gaussian distribution $p_{\mathsf{data}}=\mathcal{N}(0,I_k)$ from Theorem 2 as a tractable example, and run the DDPM sampler with exact score functions (so that the error only comes from discretization). We fix the intrinsic dimension $k=8$, and let the ambient dimension $d$ grow from $10$ to $10^3$. We run the experiment for different numbers of steps $T \in \{100,200,500,1000\}$. We implemented both our coefficient design (2.4) and another widely-used design $\eta_t = \sigma_t = 1-\alpha_t$. The error curves of $\mathsf{KL}(q_1\,\|\,p_1)$ and $\mathsf{TV}(q_1,p_1)$ versus $d$ can be found in the attached PDF. This supports our key message that (2.4) represents a unique coefficient design for DDPM in achieving dimension-independent error.
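A remark on how these error curves can be computed exactly in this setup (a standard observation, not taken from the paper): with exact scores, both $q_1$ and $p_1$ are zero-mean Gaussians with diagonal covariances, say with per-coordinate variances $u_i$ and $v_i$, so

$$\mathsf{KL}(q_1\,\|\,p_1) = \frac{1}{2}\sum_{i=1}^{d}\bigg(\frac{u_i}{v_i} - 1 - \log\frac{u_i}{v_i}\bigg), \qquad \mathsf{TV}(q_1,p_1) \le \sqrt{\tfrac{1}{2}\,\mathsf{KL}(q_1\,\|\,p_1)},$$

where the second relation is Pinsker's inequality; the exact TV value can be estimated by Monte Carlo via $\mathsf{TV}(q_1,p_1) = \mathbb{E}_{x\sim q_1}\big[\big(1 - p_1(x)/q_1(x)\big)_+\big]$.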

Comment

Dear authors, dear reviewers,

the discussion period has begun as the authors have provided their rebuttals. I encourage the reviewers to read all the reviews and the corresponding rebuttals: the current period might be an opportunity for further clarification on the paper results and in general to engage in an open and constructive exchange.

Many thanks for your work. The AC

Final Decision

The goal of the paper is the analysis of denoising diffusion probabilistic models under the assumption that the underlying target distribution is concentrated on a manifold that is low-dimensional with respect to the ambient dimension $d$. The Authors characterize the convergence of the model to the target data distribution, whose rate is shown to depend on the dimensionality $k$ of the aforementioned low-dimensional manifold. The paper has been appreciated for its clarity and for its contribution to the study of the performance of diffusion models in the presence of an underlying low-dimensional data structure, also showing how a proper design of the coefficients in the model allows one to reach a convergence rate depending on $k$ and not on $d$. A limitation stressed by the Reviewers is related to the underlying hypothesis on the target distribution, which is assumed to have bounded support.

In conclusion, the evaluation of the submission is positive, and I recommend acceptance. As an additional note, numerical experiments were performed during the rebuttal phase to further support the results. I recommend that the Authors incorporate the results presented in the rebuttal phase into the manuscript and clearly discuss the limitations of the results due to the technical assumptions.