Conditional Diffusion Models are Minimax-Optimal and Manifold-Adaptive for Conditional Distribution Estimation
Abstract
Reviews and Discussion
This paper uses distribution regression to estimate the conditional distribution induced by conditional forward-backward diffusion models. The authors utilize neural networks to estimate the key quantities and also allow for high-dimensional X and Y with low-dimensional structures. Solid theoretical results are provided.
Strengths
-
The authors introduce the SDE framework in a simple way and make it easy to connect SDEs to distribution regression.
-
The paper extends the unconditional distribution estimator to the conditional one and provides theoretical results. The theory is presented in an elegant way that clearly explains the connections to the theory of the unconditional distribution estimator, mean regression problems, and nonparametric regression under $\ell_2$ error. The theoretical results are interesting and inspiring.
Weaknesses
The paper mentions that it provides practical guidance for designing the neural network, including the network size, smoothness level, and so on. Though it is a theoretical work, some empirical results would significantly demonstrate the usefulness of the paper. I would recommend the authors run some simulations to show how the theoretical results guide the design of neural networks. For example, under what scenarios are the errors optimally controlled?
The theoretical results are interesting. Can the authors highlight the difficulties and summarize any new proof techniques, if there are any? This could be a huge contribution to the statistical learning community. From my end, I did not see the challenges, and everything seems standard.
Questions
What is $p_{\text{data}}$ on Line 122?
The authors need to check all the parenthetical and narrative citations throughout the whole paper. Many are misused. For example, Line 316 should use a parenthetical citation.
The definition of the neural network class is already given in Tang and Yang, 2024 (AISTATS). The authors need to reference it or directly invoke it from that citation.
In Remark 1, can the authors discuss the case when $\alpha_X > 1$? What does this mean for the minimax rate?
In Remark 2, can the authors clarify why $D_Y$ needs to satisfy this condition for the first term to dominate the second one?
Rong Tang and Yun Yang. Adaptivity of diffusion models to manifold structures. In Proceedings of the 27th International Conference on Artificial Intelligence and Statistics (AISTATS), 2024.
Point 1: simulation and practical guidance.
Reply: Thank you for your valuable suggestions. We have included a simulation in our paper, which can be found in Appendix A of the supplementary material. Please see our global comment for more details.
Point 2: Technical difficulties.
Reply: One main technical challenge in extending the analysis from the unconditional case to the conditional case is developing new analytical approaches for bounding the approximation error of the conditional score function by neural networks, due to the inclusion of the extra covariate variable $x$. In particular, as we assume the support of the conditional distribution of $Y$ given $X = x$ to be a general, unknown nonlinear submanifold depending on the covariate value $x$, handling the unknown supports of the conditional distributions requires us to deal with infinitely many unknown, nonlinear submanifolds indexed by $x$. This poses a significant technical challenge that we tackle in our work. In particular, we need to partition the joint space of $(x, y)$ into small pieces with varying resolution levels in $x$ and $y$, tailored to the diffusion time $t$: the resolution level in $y$ should decrease with increasing $t$, reflecting the diminished importance of finer details in $y$ due to noise injection over time, while the resolution level in $x$ should increase with $t$, emphasizing the growing significance of the global dependence on $x$ as finer details in $y$ are smoothed out.
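Schematically, this time-varying partition can be written as follows (an illustrative paraphrase; the resolution functions $\delta_x(t)$, $\delta_y(t)$ and the notation below are not taken from the paper):

```latex
% Cover M_X x M_Y at diffusion time t by product blocks whose side lengths vary with t:
%   delta_y(t) increasing in t  (coarser resolution in y as noise accumulates),
%   delta_x(t) decreasing in t  (finer resolution in x as the global dependence
%                                on the covariate becomes the dominant effect).
\[
  \mathcal{P}_t \;=\; \bigl\{\, B_x \times B_y \;:\;
      \operatorname{diam}(B_x) \asymp \delta_x(t),\ \
      \operatorname{diam}(B_y) \asymp \delta_y(t) \,\bigr\},
  \qquad \delta_x(t)\ \text{decreasing}, \quad \delta_y(t)\ \text{increasing}.
\]
```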
Point 3: What is $p_{\text{data}}$ on Line 122?
Reply: Apologies for the confusion; there was a typo. The term $p_{\text{data}}$ should refer to the target distribution, and we will use the correct notation for it in the revision.
Point 4: In Remark 1, can the authors discuss the case when $\alpha_X > 1$? What does this mean for the minimax rate?
Reply: The rate is strictly decreasing in $\alpha_X$, so allowing for a larger $\alpha_X$ can lead to a faster convergence rate. We expect a similar phenomenon to occur in the minimax rate as well.
Point 5: In Remark 2, can the authors clarify why $D_Y$ needs to satisfy this condition for the first term to dominate the second one?
Reply: Comparing the two terms yields a transition boundary in $D_Y$: the first term dominates when $D_Y$ lies above the boundary, and the second term dominates otherwise. The underlying reason for this transition is that a larger $D_Y$ increases the complexity of estimating the conditional density of $Y$ given $X$, making the task of conditional distribution estimation (distribution regression) in Wasserstein distance more challenging than nonparametric mean regression.
Other comments.
Thank you for pointing out these important issues. We will address them in the revised version of our main text.
-
The paper presents a derivation of the estimation error bound for diffusion models with sampling guided by a covariate regressor. The authors closely build upon the methodology of Oko et al. 2023 (https://proceedings.mlr.press/v202/oko23a/oko23a.pdf), who explored the statistical learning theory for diffusion models via score-matching estimators and proved nearly minimax-optimal estimation rates in total variation and the Wasserstein metric of order 1 for fully connected neural networks (including suggestions on generalization to U-Nets and ConvNets).
-
The presented paper aims to introduce statistical learning theory for conditional diffusion models using distribution regression on the covariate structure guiding the sampling. The authors investigate the properties of the estimators through a finite-sample analysis of the statistical error bounds under various metrics, the role of latent lower-dimensional data structures, and the smoothness properties of the data.
-
The main contributions are:
- Given the assumptions on the size of the architecture, the estimation error bound of the conditional diffusion estimator is minimax optimal under total variation and the Wasserstein distance of order 1; the authors include a discussion of how the unconditional case is recovered as a special case;
- if there is a low-dimensional latent structure within the high-dimensional data (say, within the feature space or the joint feature and label space), the resulting conditional distribution estimator adapts to the lower-dimensional latent space.
Strengths
- The paper is mathematically well written. The notation is consistent and precise. Definitions and proofs are provided in the extensive Appendix. These can be useful for deriving and investigating other novel properties and algorithms.
- The problem is relevant and well chosen, and the previous work is well mapped.
- There are interesting potential applications for measuring the quality of conditional diffusion estimators and guided sampling. The paper also provides background for reasoning about the selection of the model architecture.
- An interesting contribution is the analysis of latent lower-dimensional representations of the data and the adaptivity of the higher-dimensional estimator.
- The results provide insight into the impact of the smoothness of the underlying data manifold on the performance of the estimator.
Weaknesses
-
The paper is missing motivational examples and practical experiments demonstrating how the presented theory can be used on real-life problems. The results are theoretically interesting, and there is no issue with the soundness of the arguments. Nevertheless, without a practical demonstration of how to use the derived results, the main contribution of the paper may be harder for the wider machine learning community to access.
-
While it is good to see the methodology underlying the proofs as part of the main paper, it would also be valuable to see particular direct applications in the main body of the paper.
-
The usual approach to estimating the gradient of log p_t(y_t | x) (see line 200) is to use the gradient of an unconditional model and add the gradient of the log-likelihood of a 'reverse predictor' that detects the covariate values x from features constructed from the noised values y_t. The authors instead decided to use a marginal forward diffusion, which results in empirical estimation of the joint density between the covariate X (the guiding covariates) and Y (the unconditionally diffused response). A discussion of the computational efficiency of this approach in comparison to the 'reverse predictor' approach is missing.
-
The proven theorems assume constraints on the size of the architecture, such as the depth, width, norm of the parameters, etc. These constraints are set as functions of the order of magnitude of the sample size $n$. It is not well discussed what happens if these constraints are not satisfied.
Questions
-
Are the covariates X also located in the Besov space support?
-
The marginal supports M_X and M_Y are assumed to be the $[-1,1]^{D_X}$ and $[-1,1]^{D_Y}$ cubes, respectively. How does scaling the data to the $[-1, 1]$ cube impact other intrinsic features, e.g., latent translations? What type of scaling transformations would you recommend?
-
Can you provide guidelines on how you estimate these quantities (e.g., the intrinsic dimension and smoothness level), or do you assume these values?
-
How do the presented results in Theorem 2 relate to latent diffusion models?
-
p.7, Remark 1: Does this remark imply that the kernel estimator can closely match the performance of the conditional-diffusion-based estimator?
-
It would be great if the authors could include experiments with different embedded lower-dimensional structures and demonstrate how the presented theory supports and explains the experimentally obtained results. Even simple synthetic geometric objects would suffice.
-
Can you provide more discussion on the implications of the different regimes discussed in Remark 3? In particular, how would you select the smoothness parameters to investigate the transition phases?
-
How does Lemma 2 help to assess the generalization error?
Point 1: Practical experiments.
Thank you for your valuable suggestions. We have included a simulation in our paper, which can be found in Appendix A of the updated supplementary material. Please see our global comments for more details.
Point 2: The discussion on "reverse predictor" approach.
Given that the training objective we consider differs from that of the unconditional diffusion model only in the input of the score network (through the addition of the covariates), it should not increase the computational cost by much. In contrast, the classifier-guidance method requires extra effort to learn the external classifier. Furthermore, from a theoretical perspective, our goal is not to learn the score function of the joint density of $X$ and $Y$, but rather the conditional density of $Y$ given $X$. Therefore, we do not need to impose any smoothness assumptions on the marginal density of $X$; we only need to assume that the conditional distribution of $Y$ given $X = x$ depends smoothly on $x$. However, for learning the classifier in the classifier-guidance method, we might need to assume that the marginal density of $X$ is also smooth, an assumption that can be avoided in the classifier-free method we are considering.
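As a concrete illustration of why the covariate adds essentially no overhead, here is a minimal PyTorch-style sketch of a classifier-free conditional denoising score-matching objective; the network architecture, the VP (Ornstein-Uhlenbeck) noise schedule, and all names are illustrative assumptions rather than the paper's actual implementation.

```python
import torch
import torch.nn as nn

class ScoreNet(nn.Module):
    """Illustrative conditional score network s(y_t, x, t); not the paper's architecture."""
    def __init__(self, d_y, d_x, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_y + d_x + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, d_y),
        )

    def forward(self, y_t, x, t):
        # The only difference from the unconditional case: the covariate x enters the input.
        return self.net(torch.cat([y_t, x, t], dim=-1))

def conditional_dsm_loss(model, y0, x, eps=1e-3, T=1.0):
    """Denoising score-matching loss under a VP (Ornstein-Uhlenbeck) forward SDE."""
    n = y0.shape[0]
    t = eps + (T - eps) * torch.rand(n, 1)        # random diffusion times
    mean_coef = torch.exp(-0.5 * t)               # E[y_t | y_0] = e^{-t/2} * y_0
    std = torch.sqrt(1.0 - torch.exp(-t))         # sd of y_t given y_0
    z = torch.randn_like(y0)
    y_t = mean_coef * y0 + std * z
    target = -z / std                             # score of the Gaussian transition kernel
    pred = model(y_t, x, t)
    return (std ** 2 * (pred - target) ** 2).sum(dim=-1).mean()
```

Relative to the unconditional objective, only the network input changes, which is why the training cost is essentially unchanged; a classifier-guidance implementation would additionally require training a separate time-dependent classifier on the noised samples.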
Point 3: Constraints on the NN and implications of Lemma 2.
Reply: Thank you for raising this important point. The architecture is designed so that the estimation error and the approximation error are well balanced. If we choose a NN larger than the specified size, we will achieve a smaller approximation error (the corresponding approximation term in Lemma 2), but this will also lead to a larger estimation error, or excess risk (the corresponding estimation term in Lemma 2). Consequently, this results in a larger total generalization error when these terms are summed. Similarly, selecting a NN smaller than the specified size will lead to a smaller excess risk but a larger approximation error due to the reduced capacity of the NN. This trade-off also culminates in a larger total generalization error. In summary, if these constraints are not satisfied, we may end up with an estimator that exhibits a larger error in terms of the expected Wasserstein distance compared with Theorem 2.
Point 4: Assumption on data space and smoothness.
Sorry for the confusion; we only need the covariate space to have a bounded upper Minkowski dimension, and it is not necessary to impose any smoothness conditions on it (e.g., being a smooth manifold). The assumption in Theorem 1 that the supports are $[-1,1]$ cubes is only for technical simplicity; a similar technique may also apply to, for example, the unit ball or, more generally, a compact set with smooth boundary. We assume that the intrinsic dimension and smoothness level are known in order to set the architecture of the NN. Theoretically, it may also be possible to construct estimators that adapt to the smoothness level using, for example, the Lepski method.
Point 5: Relate to latent diffusion models.
In Theorem 2, we consider the diffusion model that operates directly within the ambient space. Compared with latent diffusion, it is perhaps less intuitive why this model can effectively adapt to the manifold structure of the data, which serves as a key motivation for the current research described in Theorem 2. For the latent diffusion model, the analysis may be divided into two parts: one is the analysis of the encoder-decoder learning, and the other is the analysis of score matching in the encoded space, which may be treated as the Euclidean case we consider in Theorem 1.
Point 6: KDE estimator
The KDE is demonstrated to achieve the same rate as our equation (10) only in the Euclidean setting, where both $X$ and $Y$ lack any manifold structure, whereas our Theorem 2 allows both $X$ and $Y$ to have low-dimensional structures. Therefore, compared to KDE, the conditional diffusion model can adapt to the underlying manifold structures, leading to better performance in such scenarios.
Point 7: transition phases.
For the transition phase in terms of the smoothness parameters: the transition boundary is independent of the smoothness parameters, so they cannot be used to move across the transition. In terms of $D_Y$, the first term dominates when $D_Y$ is large enough, and the second term dominates otherwise. The reason is that when $D_Y$ is small, the bottleneck is estimating the global dependence of the conditional distribution on the covariate, so the dominant term is the minimax rate of nonparametric regression.
I would like to thank the authors for their response. I will keep my score unchanged, as I would still appreciate an experimental section. However, I believe the paper provides a good theoretical background for conditional diffusion models; its contribution is novel, potentially impactful, and deserves to be presented to the wider machine learning community at the ICLR conference.
Previous work has studied the minimax optimality of score-based diffusion models in estimating the density of an unknown distribution, and in adapting to the intrinsic dimension of data lying on a submanifold of the sample space.
This work extends these results to the conditional modelling setting, where a conditioning variate changes the distribution being modeled. The distribution is assumed to vary smoothly with the conditioning variate.
These results show that rates of convergence of the total variation and Wasserstein distances between the true and estimated data distributions hold that are very similar to those in the unconditional modeling setting, with additional terms depending on the dimensionality of the conditioning variate. In the limit of the conditioning variate having dimension zero, the same rates as in previous works on the unconditional setting are obtained.
Strengths
- The contributions, while possibly not surprising, are meaningful and interesting results extending the theoretical analysis of diffusion models.
- The paper is generally very well written and easy to follow, given the technical nature of the content. In particular, the remarks on the theorems make their implications more accessible.
Weaknesses
- The practical implications of the theoretical work presented are not completely clear to me. The authors mention that the results suggest how to design networks, but from my understanding of Theorem 2, this simply suggests that larger networks produce better bounds?
- The implications of Assumption D are unclear to me. The variable appearing there is undefined. Is this condition meant to hold for any value of that variable? If so, then this implies a need for the decoding to be smooth globally, which I believe would limit the topological structure of the low-dimensional data manifold. If it is a local condition, then is this not a consequence of the smooth manifold structure of the data support?
Questions
Please see the weaknesses section.
Point 1: practical implications of the theoretical work
Reply: One piece of guidance for designing the neural network is to utilize a smaller-sized neural network for approximating the score functions corresponding to larger values of the diffusion time $t$. This is motivated by the smoothing effect of the Gaussian noise injected during the forward diffusion. So, we may consider a neural network class whose size diminishes over time, such as the piecewise-constant-complexity neural network class described in equation (1), in which the time horizon is partitioned into intervals and, within each interval, the conditional score is modeled by a ReLU neural network whose size decreases as $t$ increases. A simulation study included in Appendix A examines the performance of this piecewise NN structure in toy examples. Please see our global comments for details. The size of each ReLU neural network should be carefully chosen to achieve a balance between approximation error and estimation error: larger sizes increase the capacity of the NN, thus reducing the approximation error, but they also introduce larger random fluctuations in the training objective and increase the estimation error. The optimal size of each ReLU neural network balances these two errors and grows with the sample size $n$. Explicit expressions for the optimal sizes are provided in Theorem 2.
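As a rough illustration of such a piecewise (in time) score family, the following sketch uses one ReLU sub-network per time interval, with widths shrinking for later intervals; the interval boundaries, widths, and class names are illustrative assumptions, not the architecture specified in equation (1) or Theorem 2.

```python
import torch
import torch.nn as nn

class PiecewiseScoreNet(nn.Module):
    """One ReLU sub-network per time interval; later intervals use smaller widths,
    reflecting the smoothing effect of the injected Gaussian noise for large t."""
    def __init__(self, d_y, d_x, t_knots=(0.0, 0.1, 0.3, 1.0), widths=(256, 128, 64)):
        super().__init__()
        assert len(widths) == len(t_knots) - 1
        self.t_knots = t_knots
        self.subnets = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_y + d_x + 1, w), nn.ReLU(),
                nn.Linear(w, w), nn.ReLU(),
                nn.Linear(w, d_y),
            )
            for w in widths
        ])

    def forward(self, y_t, x, t):
        inp = torch.cat([y_t, x, t], dim=-1)
        out = torch.zeros_like(y_t)
        for i, net in enumerate(self.subnets):
            lo = self.t_knots[i]
            hi = self.t_knots[i + 1] if i < len(self.subnets) - 1 else float("inf")
            in_interval = ((t >= lo) & (t < hi)).float()   # (n, 1) indicator of the interval
            out = out + in_interval * net(inp)             # only the active sub-network contributes
        return out
```

For clarity this sketch evaluates every sub-network on every sample and masks the result; an actual implementation would route samples by their time interval so that the per-sample cost indeed shrinks for large $t$.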
Point 2: Implications of assumption D.
Reply: Sorry for the confusion. The assumption means that there exists a positive constant such that, around any point, an encoder-decoder structure can be locally defined over a neighborhood of that size. The terminologies "encoder" and "decoder" are incorporated primarily to enhance clarity: a manifold is formally defined through local charts/parametrizations, which can be interpreted as an encoder-decoder pair and enable us to define the manifold smoothness with respect to both the response and the covariate ($y$ and $x$).
Examples to illustrate the contribution. We have included a simulation in our paper, which can be found in Appendix A of the supplementary material. In this toy example, we consider a response generated from a covariate-dependent process and aim at estimating the resulting conditional distributions. We employ two neural network classes to approximate the conditional score functions: a standard ReLU NN and a piecewise ReLU NN. For the piecewise ReLU NN, the total time horizon is partitioned into intervals, with a separate ReLU NN used to model the conditional score within each interval (cf. Section 2.3), which aligns with the structure suggested and employed in our theoretical analysis. The widths of the hidden layers are selected so that the numbers of trainable parameters are comparable between the two types of conditional score families.
Through simulations, we observe that for different covariate values, the generated responses from the fitted conditional diffusion model concentrate around different ellipses (or tilted ellipses). This indicates that the conditional diffusion model effectively captures both the covariate information and the underlying manifold structure. Additionally, we assess the Maximum Mean Discrepancy (MMD) between the generated data and the true conditional distribution across various covariate values, as well as the average across all covariate values. Consistent with our theoretical findings, incorporating the piecewise structure into the neural network results in a noticeably reduced MMD. Below, we provide the MMD results for your convenience.
| Condition | Piecewise NN | NN | Condition | Piecewise NN | NN |
|---|---|---|---|---|---|
|  | 0.0023 | 0.0032 |  | 0.0002 | 0.0034 |
|  | 0.0013 | 0.0014 |  | 0.0009 | 0.0013 |
|  | 0.0007 | 0.0017 |  | 0.0014 | 0.0028 |
| Average | 0.0021 | 0.0031 | Average | 0.0008 | 0.0015 |
Table: The table presents the MMD between the conditional diffusion model estimator and the target conditional distribution. The first three rows show the MMD for the conditional distributions at three fixed covariate values under each of the two settings. The last row displays the expected conditional MMD, where, in each setting, the expectation is taken with respect to the covariate sampled from a uniform distribution over its range.
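For completeness, MMD values of this kind can be computed with a standard unbiased Gaussian-kernel estimator along the following lines (a generic sketch; the bandwidth and function name are our illustrative choices, not the exact settings of Appendix A).

```python
import numpy as np

def mmd2(gen, ref, bandwidth=1.0):
    """Unbiased estimate of squared MMD between two sample sets with a Gaussian (RBF) kernel."""
    def gram(a, b):
        sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq_dists / (2.0 * bandwidth ** 2))
    kxx, kyy, kxy = gram(gen, gen), gram(ref, ref), gram(gen, ref)
    n, m = len(gen), len(ref)
    term_x = (kxx.sum() - np.trace(kxx)) / (n * (n - 1))   # off-diagonal averages
    term_y = (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
    return term_x + term_y - 2.0 * kxy.mean()

# e.g., mmd2(samples_from_diffusion_model, samples_from_true_conditional)
```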
We will also add a discussion in our main text to emphasize this simulation:
We have also conducted a simulation study (see Appendix A) to demonstrate the effectiveness of this theoretically guided neural network architecture compared to a standard single ReLU neural network (across both space and time). In these experiments, we consider cases where, given the covariate, the response is supported on different (tilted) ellipses depending on the covariate value. Consistent with our theoretical findings, the simulations show that incorporating the piecewise structure into the neural network results in a more accurate estimate of the conditional distribution.
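To make this setup concrete, data on a covariate-dependent (tilted) ellipse could be simulated along the following lines; the specific dependence of the semi-axes and tilt angle on the covariate x is an illustrative assumption, not the exact generative process of Appendix A.

```python
import numpy as np

def sample_conditional_ellipse(x, n=1000, noise=0.01, seed=None):
    """Sample points near a tilted ellipse whose axes and tilt depend on a scalar covariate x."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
    a, b = 1.0 + 0.5 * x, 0.5 + 0.25 * x           # semi-axes depend on the covariate
    phi = 0.5 * np.pi * x                           # tilt angle depends on the covariate
    pts = np.stack([a * np.cos(theta), b * np.sin(theta)], axis=-1)
    rot = np.array([[np.cos(phi), -np.sin(phi)],
                    [np.sin(phi),  np.cos(phi)]])
    return pts @ rot.T + noise * rng.standard_normal((n, 2))

# e.g., y_samples = sample_conditional_ellipse(x=0.3, n=2000)
```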
This paper studies conditional diffusion models for generative modeling, focusing on generating data given a covariate. Using a statistical framework of distribution regression, the authors analyze the large-sample properties of conditional distribution estimators under the assumption that the conditional distribution changes smoothly with the covariate. They derive a minimax-optimal convergence rate under the total variation metric, consistent with existing literature. The analysis is further extended to scenarios where both data and covariates lie on low-dimensional manifolds, showing that the model adapts to these structures. The resulting error bounds under the Wasserstein metric depend only on the intrinsic dimensionalities of the data and covariates. This paper has received unanimous support from the reviewers. Therefore, I recommend acceptance.
Additional Comments from Reviewer Discussion
Even before the rebuttal, this paper received unanimous support from the reviewers.
Accept (Poster)