PaperHub
Rating: 5.3 / 10 · Poster · 4 reviewers (min 5, max 6, std 0.4)
Individual ratings: 5, 5, 5, 6
Confidence: 3.8
Correctness: 2.8 · Contribution: 2.8 · Presentation: 2.5
NeurIPS 2024

Constrained Diffusion Models via Dual Training

OpenReview · PDF
Submitted: 2024-05-16 · Updated: 2024-11-06

Abstract

Keywords
Constrained diffusion model, constrained optimization, Lagrangian method, dual algorithm

Reviews and Discussion

Review
Rating: 5

In this paper, the authors introduce a novel approach termed Dual Training for training constrained diffusion models, particularly focusing on scenarios involving biased data generation. Initially, the authors adeptly derive the learning objectives for diffusion models within a constrained optimization framework. Subsequently, they develop corresponding learning methodologies based on the principles of Lagrangian duality and propose an innovative training procedure. The effectiveness of their approach is rigorously validated through experiments on two distinct generative tasks.

Strengths

  1. Robust Derivation Procedure: The authors present a thorough and well-founded derivation procedure throughout the manuscript, which stands out as particularly compelling. This rigorous approach significantly strengthens the theoretical foundation of their work.

  2. Relevant Application Scenario: The authors explore a critical yet underrepresented field in their research. This focus not only highlights the relevance of their study but also underscores its potential impact in areas that have previously received limited attention.

Weaknesses

Major Issues:

  1. The authors state, "Compared with the loss re-weighting method [9], our constrained formulation provides an optimal trade-off between matching given data distribution and following reference distribution." It would be helpful if the authors could clarify whether the re-weighting-based approach can be considered a special case of the proposed method in this manuscript.

  2. The discussion on diffusion models appears to focus predominantly on the DDPM model. Could the authors explore whether this proposed approach can be extended to other diffusion processes, such as the Variance Exploding Stochastic Diffusion?

  3. There seems to be an unclear transition from (U-KL) to (U-LOSS). Can the authors directly justify this conversion? It would be beneficial to include a derivation or a more detailed explanation, especially how it relates to equations (6) and (U-MIX).

  4. The manuscript lacks comparative results with baseline models. Given that the authors suggest this approach provides a more general framework for generative models, could comparisons be made with other methods mentioned in reference [1]?

  5. Could the authors discuss whether this approach can be interpreted from the perspective of partial optimal transport? The constraints discussed appear to share similarities with those in unbalanced optimal transport approaches, such as those outlined in equation 3 of reference [2].

  6. The paper discusses a dual training procedure involving primal-dual optimization. In the context of the augmented Lagrangian / alternating-direction multiplier methods used in directed acyclic graph learning, batch size is known to significantly influence model efficacy. Could the authors provide a sensitivity analysis regarding batch size effects?

Minor Issue:

  1. Under Assumption 1, should the range of $\zeta$ be $(0, b_i)$, considering that KL divergence cannot be negative?

  2. Regarding the term (U-LOSS), should the subject be revised to $\nabla \log q^i(x_t)$? Additionally, it would be helpful if the authors could provide a detailed explanation or derivation of how the constraint term is converted into the (U-LOSS) term.


References
[1] Choi, Kristy, et al. "Fair generative modeling via weak supervision." International Conference on Machine Learning. PMLR, 2020.
[2] Duque, Andrés F., Guy Wolf, and Kevin R. Moon. "Diffusion transport alignment." International Symposium on Intelligent Data Analysis. Cham: Springer Nature Switzerland, 2023.

Questions

Please see weaknesses.

Limitations

  1. Organization: The manuscript lacks a dedicated section for related work, which is essential for contextualizing the study within the existing literature. The absence of this section diminishes the foundational motivation of the manuscript, potentially limiting the reader's understanding of its contributions relative to prior research.

  2. Baseline Comparison: The authors did not include experiments comparing their methods against established baselines. Such comparisons are crucial for demonstrating the efficacy and advancements of the proposed approach over existing techniques.

  3. Hyper-parameter Sensitivity Analysis: The manuscript does not address hyper-parameter sensitivity. Including such analysis is important to evaluate the robustness and reliability of the model across various parameter settings.

Author Response

We thank the reviewer for the time and the valuable feedback. We believe that we have fully addressed your concerns and will incorporate the points mentioned below into the final version. We would be happy to address any further questions you might have.


Major Issue 1 ... the loss re-weighting method [9] ... can be considered a special case of the proposed method ...

Response: First of all, [9] and our method both utilize a biased dataset to train a generative model with fairness. However, we have different fairness-promoting objectives: (i) [9] aims to remove the bias of a generative model by adding an importance sampling weight based on a fair reference dataset; (ii) our diffusion model aims to eliminate the bias of a diffusion model against certain minorities by using minority datasets to constrain the model. Hence, it seems unfair to claim that one is a special case of the other. However, it is useful to compare their target distributions. For instance, if we choose our original and constrained data distributions as $q = p_{\text{bias}}$ and $q^1 = p_{\text{ref}}$ from [9], then our optimal target distribution is the mixture $q_{\text{mix}}^{(\lambda^\star)} = (q + \lambda^\star q^1)/(1+\lambda^\star) = \frac{1}{1+\lambda^\star}\,(p_{\text{bias}} + \lambda^\star p_{\text{ref}})$, where the optimal dual $\lambda^\star$ provides a trade-off between $p_{\text{ref}}$ and $p_{\text{bias}}$. In comparison, Algorithm 1 of [9] applies the importance sampling weight $w(x) = \frac{p_{\text{ref}}(x)}{p_{\text{bias}}(x)}$ to $p_{\text{bias}}$ in order to completely eliminate $p_{\text{bias}}$, which can be viewed as the extreme case of $q_{\text{mix}}^{(\lambda^\star)}$ as $\lambda^\star \to \infty$. In this sense, our constrained diffusion model generalizes the re-weighting approach of [9] into a soft re-weighting mechanism.
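As a small self-contained illustration of this soft re-weighting (our own sketch with hypothetical 1-D Gaussians, not code from the paper), the mass that $q_{\text{mix}}^{(\lambda)}$ places near the reference mode grows with $\lambda$ and approaches the hard re-weighting limit of [9]:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 1-D stand-ins for the biased data distribution and the fair
# reference distribution; purely illustrative placeholders.
p_bias = norm(loc=0.0, scale=1.0).pdf
p_ref = norm(loc=3.0, scale=1.0).pdf

def q_mix(x, lam):
    """Mixture q_mix^(lambda) = (p_bias + lambda * p_ref) / (1 + lambda)."""
    return (p_bias(x) + lam * p_ref(x)) / (1.0 + lam)

x = np.linspace(-4.0, 7.0, 2000)
dx = x[1] - x[0]
near_ref = x > 1.5  # region around the reference mode
for lam in [0.0, 1.0, 10.0, 1e6]:
    # Probability mass q_mix places near the reference mode; as lambda grows it
    # approaches the mass p_ref places there, i.e. the hard re-weighting limit.
    mass = np.sum(q_mix(x[near_ref], lam)) * dx
    print(f"lambda = {lam:g}: mass near reference mode ~ {mass:.3f}")
```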


Major Issue 2 ... extended to other diffusion processes ... Variance Exploding Stochastic Diffusion

Response: Our constrained distribution optimization formulation does not rely on a particular diffusion process, and we use DDPM mainly for clean exposition. We note that variance-exploding diffusion and DDPM (variance-preserving) share the same variational structure (e.g., the ELBO loss), with the only difference being the noise-scheduling parameters [R1]. Hence, we can apply our constrained diffusion model to variance-exploding diffusion [R1] and establish non-asymptotic convergence [R2]. In the final version, we will discuss the inclusion of other diffusion processes [R3].
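For concreteness, and as a standard fact rather than a result from the paper, the two forward transition kernels differ only in their noise schedule:

$$q_{\text{VP}}(x_t \mid x_0) = \mathcal{N}\!\left(\sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t)\, I\right), \qquad q_{\text{VE}}(x_t \mid x_0) = \mathcal{N}\!\left(x_0,\ \sigma_t^2\, I\right),$$

so the same ELBO/denoising-matching loss, and hence our KL-divergence constraints, can be written for either process by swapping the schedule.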

Reference

[R1] Variational Diffusion Models

[R2] The Convergence of Variance Exploding Diffusion Models under the Manifold Hypothesis

[R3] Score-based Diffusion Models via Stochastic Differential Equations--a Technical Tutorial


Major Issue 3 ... unclear transition from (U-KL) to (U-LOSS) ... how it relates to equations (6) and (U-MIX).

Response: The reformulation of Problem (U-KL) as Problem (U-LOSS) has two key steps: (i) apply the ELBO representation of each KL divergence in Equation (3); (ii) represent the ELBO in the form of denoising matching (e.g., score matching) as in Appendix B. This derivation does not rely on the Lagrangian (6) or the optimal constrained model (U-MIX).
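As a schematic sketch in our own notation (the precise time-dependent weights are given in Appendix B of the paper), steps (i)-(ii) replace each KL divergence by its denoising-matching surrogate, so (U-KL) becomes a loss-constrained problem of the form

$$\min_{\theta}\ \mathbb{E}_{q(x_0)}\big[\ell_{\mathrm{den}}(\theta; x_0)\big] \quad \text{s.t.} \quad \mathbb{E}_{q^i(x_0)}\big[\ell_{\mathrm{den}}(\theta; x_0)\big] \le b_i, \quad i = 1,\ldots,m,$$

where $\ell_{\mathrm{den}}(\theta; x_0) = \mathbb{E}_{t,\, x_t \mid x_0}\,\big\| s_\theta(x_t, t) - \nabla \log q(x_t \mid x_0)\big\|^2$ (up to time-dependent weights) denotes the standard denoising score-matching loss and $x_t$ is generated by the shared forward process.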


Major Issue 4 ... comparisons be made with other methods mentioned in reference [1]?

[1] Choi, Kristy, et al. "Fair generative modeling via weak supervision." 2020.

Response: The work [1] above is cited as [9] in our paper. The related works on fairness and generative modeling discussed in [1] study classical VAEs and GANs, which are not directly comparable to diffusion models in methodology.


Major Issue 5 ... interpreted from the perspective of partial optimal transport? ... in equation 3 of reference [2].

[2] Duque, Andrés F., Guy Wolf, and Kevin R. Moon. "Diffusion transport alignment." 2023.

Response: [2] aims to find a coupling between data samples from two domains, under constraints on total/individual masses. Analogously, our constrained diffusion model can be viewed as transporting white noise to data samples via a reverse diffusion process. However, our constrained distribution optimization problem is a nonlinear optimization, while constrained optimal transport is a linear optimization (see (3) in [2]). Therefore, our method cannot be viewed as a special case of partial optimal transport [2], or vice versa.


Major Issue 6 ... a sensitivity analysis regarding batch size effects?

Response: We have included the results of training a constrained model on an unbalanced subset of MNIST, using different primal and dual batch sizes (see the attached PDF in the global rebuttal). A larger ratio between the primal and dual batch sizes leads to better performance, as indicated by lower FID scores and more evenly distributed samples. This finding aligns with the heuristic used in the experiments in the paper, where we selected batch sizes so that the ratio of primal to dual batch sizes approximates the ratio of the entire dataset to the constrained datasets. We will include this empirical sensitivity analysis, along with similar analyses for other hyperparameters, in the final version.
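For reference, the batch-size heuristic mentioned above can be written in a couple of lines (a hypothetical helper for illustration only, not part of the paper's code):

```python
def dual_batch_sizes(primal_batch_size, dataset_size, constraint_sizes):
    """Choose a batch size per constrained dataset so that the primal-to-dual
    batch-size ratio roughly matches the dataset-to-constraint size ratio."""
    return [
        max(1, round(primal_batch_size * n_i / dataset_size))
        for n_i in constraint_sizes
    ]

# Hypothetical example: a 60k-sample dataset with minority subsets of 1k and 2k.
print(dual_batch_sizes(128, 60_000, [1_000, 2_000]))  # -> [2, 4]
```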


Minor Issue 1 ... range of $\zeta$ ...

Response: It is correct.


Minor Issue 2 ... the term (U-LOSS) ... $\nabla \log q^i(x_t)$ ...

Response: It is worth noting that the forward processes for the original and constrained datasets are the same, except for the initial data points. Thus, we use the notation $\nabla \log q(x_t)$ in (U-LOSS) and distinguish the initial data points through the two expectations $\mathbb{E}_{q(x_0)}$ and $\mathbb{E}_{q^i(x_0)}$.


Limitations Existing Literature ... Baseline ... Sensitivity ...

Response: See our global rebuttal on Related work. For the sensitivity analysis, please refer to our response to Question 3 of Reviewer c1z8.

Comment

Dear Authors,

Thank you for your detailed responses to my questions. I am satisfied with the clarifications provided and will accordingly increase my evaluation score. For your revised manuscript, I would appreciate the inclusion of an expanded section on related works and a more comprehensive derivation of the key concepts regarding major issue 2.

Sincerely,

Reviewer Zh95

Comment

We sincerely thank the reviewer for reading our rebuttal and reevaluating our paper. As per their suggestion, we will expand the related work section in our introduction and add derivations emphasizing our clarification on major issue 2 in the main paper. We would also be happy to address any remaining concerns they might have.

Review
Rating: 5

This paper proposes Dual Training, a method designed to constrain the distributions that denoising diffusion models can learn. The authors propose an extension to the standard diffusion model training objective that minimizes the Kullback-Leibler (KL) divergence of the learned distribution with respect to two components: (1) the original data distribution and (2) a set of auxiliary distributions representing relevant constraints. The authors derive a tractable algorithmic approximation of their approach and apply it to two scenarios: the fair generation of underrepresented image classes and the fine-tuning of pre-trained models on new data while preserving image classes from the pre-training dataset.

Strengths

  • The paper proposes an original approach to addressing two critical challenges in diffusion-based image generation: fairness and the ability to integrate new data with pre-trained diffusion models.
  • The method is well-motivated and the theoretical analysis of both the constrained optimization problem and the proposed solution embeds it in a rigorous mathematical framework.
  • Experimental results demonstrate the efficacy of the proposed approach on two relevant tasks (fair generation and fine-tuning), showcasing quantitative and qualitative improvements over unconstrained and pre-trained baselines.

Weaknesses

  • The experimental evaluation in the main text is limited, focusing primarily on the MNIST and CelebA datasets. While some results on CIFAR-10 are provided in the appendix, they are not referenced in the main text. A more comprehensive empirical evaluation using more challenging datasets would strengthen the experimental section and better demonstrate the method's utility.
  • The empirical evaluation lacks relevant baselines. For instance, the fine-tuning experiments do not provide quantitative results for models fine-tuned without constraints. More generally, it would be beneficial to see how the model compares with standard conditioning approaches capable of enforcing similar constraints, for example, training a model with classifier-free guidance on datasets with underrepresented classes and providing balanced conditioning information at inference time.
  • The empirical results do not include uncertainty estimates or measures of statistical significance, which would enhance the robustness of the findings.

Questions

  • Can you provide more details on the computational overhead of dual training compared to standard diffusion model training?
  • How does the method compare to other approaches that explicitly aim to fit mixture distributions, e.g. [1]?
  • How sensitive are the results to the choice of hyperparameters (number of dual iterations, primal/dual batch sizes, primal/dual learning rates, etc.)? Is there a principled way to set these values?

[1] Du, Yilun, et al. "Reduce, reuse, recycle: Compositional generation with energy-based diffusion models and MCMC." International conference on machine learning. PMLR, 2023.

Limitations

The authors discuss some limitations, including the need to experiment with more datasets/attributes and to explore converting other types of constraints into the KL divergence formulation. It would be good to also discuss and compare the computational overhead of the proposed formulation, as this would be beneficial for evaluating the method's practical implications and trade-offs.

Author Response

We thank the reviewer for the time and the valuable feedback. We believe that we have fully addressed your concerns and will incorporate the points mentioned below into the final version. We would be happy to address any further questions you might have.


Weakness 1 ... more challenging datasets ...

Response: We have expanded the scope of our experiments to include constrained latent diffusion models in order to tackle more challenging datasets. Initial results for the ImageNet dataset using a latent-space diffusion model are included in the attached PDF in the global rebuttal. The constrained model samples more often from the minority classes, demonstrating the utility and effectiveness of our method even when applied to a more modern diffusion paradigm and a much more challenging dataset. We note that ImageNet is significantly more challenging than MNIST and CelebA, both because of the much higher resolution (resized to 256×256 vs. 32×32) and because the classes are much more diverse.


Weakness 2 ... quantitative results for models fine-tuned without constraints ... baselines ...

Response: We refer to Figure (2b) in the paper, which in fact provides a quantitative baseline for models fine-tuned without constraints, showing both generated samples and the corresponding FID score. Regarding the second point, we believe there is no meaningful way to compare a constrained unconditional model to existing conditional models, since a conditional model can sample from any class if it is conditioned to do so. The suggested approach of using a conditional model with balanced conditioning information at inference time is interesting. However, we were unable to find any existing baselines using the suggested approach. Hence, we believe implementing such a novel approach as a baseline is beyond the scope of this paper.


Weakness 3 ... statistical significance ...

Response: We thank the reviewer for this suggestion. It is straightforward to get uncertainty estimates and error bars for the results/plots in the paper as they are reproducible from our shared code. We will include them in the final version.


Question 1 ... computational overhead of dual training ...

Response: See our response to Question 1 of Reviewer EijC.


Question 2 ... compare to other approaches that explicitly aim to fit mixture distributions, e.g. [1]?

[1] Du, Yilun, et al. "Reduce, reuse, recycle: Compositional generation with energy-based diffusion models and MCMC." 2023.

Response: Thank you for pointing out this interesting reference [1]. We summarize three main differences below.

  • (Task) One of the compositional generation tasks in [1] is to learn a mixture of several distributions (or experts) using the energy-based model method. However, they assume that the weights of different distributions comprising the mixture are the same. In contrast, our approach aims to find weights such that the mixture satisfies certain constraints.

  • (Loss) In the energy-based model method [1], the training loss is the Fisher divergence between the model and a smoothed version of the data distribution. When the data distribution is a mixture of several experts, an energy-based model must be trained individually for each expert, and the experts are finally mixed through a specific sampling method. In contrast, our constrained diffusion model method optimizes the standard diffusion ELBO loss (e.g., score matching) over a dynamically mixed distribution, where the mixing weight is determined by the dual update. Sampling from our trained model during inference is the same as for any standard diffusion model: we iteratively refine white noise through our trained reverse diffusion process.

  • (Theory) Although an energy-based model method is outlined in [1], the performance of a trained energy-based model is not analyzed theoretically. In contrast, we show that our trained constrained diffusion model converges to a mixed data distribution.


Question 3 ... sensitive ... to the choice of hyperparameters ...

Response: We thank the reviewer for bringing up this sensitivity question. We will include a more thorough discussion of the sensitivity to different hyperparameters in the supplementary material. A brief discussion of each hyperparameter is given below:

  • Number of dual iterations: In our implementation this shows up as the number of primal GD steps per dual update (#primal_per_dual). Experimentally, we have observed that as long as #primal_per_dual is greater than 1, the results are not sensitive to this value. Also, as discussed in our response to Question 1, the dual updates add a negligible computational overhead. Hence, updating the dual nearly as many times as we update model parameters doesn't reduce training efficiency.

  • Primal/Dual batch sizes: We thank the reviewer for bringing this to our attention. We have included (see the attached PDF in the global rebuttal) results of training a constrained model on an unbalanced subset of MNIST, using different primal/dual batch sizes. The results suggest that when the ratio between the primal and dual batch sizes is larger, the model performs better (lower FID and more evenly distributed samples). This is in line with the heuristic we used in the experiments included in the paper, where we chose the batch sizes such that the ratio of primal to dual batch size is close to the size ratio of the entire dataset to the constrained datasets (which are much smaller).

  • Primal/Dual learning rate: For the primal learning rate, we followed the best practice used to train standard diffusion models. For the dual learning rate $\eta$, we refer to Theorem 8 in the paper, which shows a smaller error bound for smaller $\eta$ at the cost of slower convergence. In practice, as long as $\eta \leq 0.1$, we observed that the model reliably converges to similar results.

Comment

I thank the authors for the detailed response. I have raised my score in response to the additional empirical results and clarifications and will outline my remaining concerns below.


Re Weakness 1: ... more challenging datasets ...

Thank you for providing these additional results. I believe that they are helpful and strengthen the empirical evaluation presented in the paper.

Re Weakness 2: ... quantitative results for models fine-tuned without constraints ... baselines ...

We refer to Figure (2b) in the paper which in fact provides a quantitative baseline for models fine-tuned without constraints, both as generated samples and its FID score.

Thank you for pointing out the FID scores in the figure captions. They do indeed provide the quantitative results I was referring to. Could you briefly touch on why they are so much higher than the results reported in other works (e.g. 3.17 in [1] for unconditional image generation on CIFAR-10)?

Regarding the second point, we believe there is no meaningful metric of comparing a constrained unconditional model to existing conditional models since the conditional model can sample from any class if it is conditioned to do so. The suggested approach of using a conditional model with balanced conditioning information at inference time is interesting. However, we were unable to find any existing baselines using the suggested approach. Hence, we believe implementing such a novel approach to use as a baseline is beyond the scope of this paper.

I am not sure I understand this point. Conditioning diffusion models via classifier-free guidance [2] is a standard approach in image diffusion models, including the latent diffusion model [3] that I assume was used for the additional ImageNet results. Since this approach enables the specification of arbitrary class labels at inference time, it should be straightforward to sample them uniformly before sample generation.

Since all class labels need to be known at training time to specify the KL divergence-constrained optimization problem (U-KL), it seems reasonable to compare the method to other approaches that use class labels to train conditional diffusion models that allow us to flexibly constrain the sample generation process. The reasoning behind this question is to better understand the potential practical impact the proposed approach, compared to simply adapting existing diffusion model techniques to the experimental settings investigated in the manuscript.

Re Questions 1-3:

I appreciate the detailed response and the primal/dual batch size sensitivity study. In addition to the conceptual considerations provided in response to Question 1 of Reviewer EijC, I think it would be helpful to add an actual empirical comparison of the computational cost of standard and dual training approaches in the paper.


References

[1] Ho, Jonathan, Ajay Jain, and Pieter Abbeel. "Denoising diffusion probabilistic models." Advances in neural information processing systems 33 (2020): 6840-6851.

[2] Ho, Jonathan, and Tim Salimans. "Classifier-Free Diffusion Guidance." NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications. 2021.

[3] Rombach, Robin, et al. "High-resolution image synthesis with latent diffusion models." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022.

Comment

We sincerely thank the reviewer for their detailed response. We address their remaining concerns below.

Could you briefly touch on why they are so much higher than the results reported in other works (e.g. 3.17 in [1] for unconditional image generation on CIFAR-10)?

Response: Our FID scores are higher because we train our models on a biased subset of the dataset and compute the FID scores by comparing to the entire balanced dataset. Please refer to our response to weakness 3 of reviewer Rjho for more detail.

In addition to the conceptual considerations provided in response to Question 1 of Reviewer EijC, I think it would be helpful to add an actual empirical comparison of the computational cost of standard and dual training approaches in the paper.

Response: We will include both the conceptual considerations and an empirical comparison of the computational costs of standard and dual training in the appendix.

Since this approach enables the specification of arbitrary class labels at inference time, it should be straightforward to sample them uniformly before sample generation

Response: We thank the reviewer for clarifying the question. Regarding the suggested approach in conditional diffusion models with guidance, tuning the guidance parameter gives us a trade-off between sample diversity and fairness to all classes. In theory, our approach promotes sample diversity subject to the constraints (see our response to weakness 1 of reviewer Rjho). Therefore, it would be informative to compare these two approaches in terms of sample diversity and we will include this comparison in the final version.

... The reasoning behind this question is to better understand the potential practical impact the proposed approach, compared to simply adapting existing diffusion model techniques to the experimental settings investigated in the manuscript.

Response: While extending our proposed framework to conditional models is beyond the scope of the current work, we believe our approach would be practically relevant in this setting, as we explain next. A model with classifier-free guidance learns both a conditional and an unconditional model and weights them according to the guidance parameter during sampling [a]. However, when the training data, and consequently the unconditional model, are biased, using the same guidance parameter for different conditioning information becomes problematic; e.g., for underrepresented classes a larger guidance parameter would be needed to ensure sampling from that class. Our approach could alleviate this by using constrained training to ensure the learned unconditional model is unbiased.
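For reference, in the usual classifier-free guidance formulation (a standard formula, not notation from our paper), the guided noise estimate is

$$\tilde{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing) + w\,\big(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\big),$$

with guidance weight $w$ (parameterizations vary across papers). The unconditional term $\epsilon_\theta(x_t, \varnothing)$ enters every guided sample, which is why a biased unconditional model can make a single $w$ behave differently across classes, and why constraining the unconditional model addresses the issue at its source.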

We hope our response encourages you to reevaluate our paper. We are happy to address any further concerns or questions you might have.

Reference

[a] Luo, Calvin. "Understanding diffusion models: A unified perspective." arXiv preprint arXiv:2208.11970 (2022).

Review
Rating: 5

This paper studies the constrained diffusion models, motivated by customizing the generation by specific tasks. The idea is to formulate KL divergence-constrained optimization problem (U-KL), which is shown to have zero duality gap. The convergence (rate) of the sampling process is given by assuming a mixture initial distribution. Several experiments have been conducted, e.g., on MNIST and on the fairness.

Strengths

The paper provides a thorough study on the KL-constrained diffusion models: both optimization and sampling aspects are considered. The paper is overall interesting (methodologically). I also checked most proofs, and they are correct.

Weaknesses

There are several weaknesses (and questions):

(1) The problem (U-KL) is still abstract. It is not clear how to choose $b_i$ to represent desired properties, or what the right interpretations of the KL constraints are (e.g., Section 3.2 on fairness).

(2) Theorem 7: the authors may emphasize that Theorem 7 is valid under the assumption that the initial distribution is a mixture. Also I think DDPM is used. The authors may also compare Theorem 7 with the results in [28].

(3) Experiments: the authors consider MNIST, CIFAR 10 and Celeb-A. However, the FID obtained seems to be large compared to the existing results. For instance, for MNIST the FID is typically less than 1 while the paper reports ~20-40; for CIFAR 10 the FID is less than 3 (the best is around 1.8) while the paper reports ~50. The authors may explain why this happens, otherwise the experiment results are not so convincing.

(4) Literature and comments: There has also been a line of work on SDE-based diffusion models (see ref. a), which shows good empirical results. Related to (3), ref. b reports an FID of 1.8 for CIFAR-10, and ref. c reports an FID of 0.8 for MNIST (both without constraints). The authors may refer to these results on continuous-time (SDE-based) models and explain why there is such a large difference in the empirical results. Another suggestion is that the authors may consider fine-tuning in the continuous framework, as in [45] and ref. d.

a. Score-Based Generative Modeling through Stochastic Differential Equations, Song et al., arXiv:2011.13456.

b. Elucidating the Design Space of Diffusion-Based Generative Models, Karras, Aittala, Aila and Laine, arXiv:2206.00364.

c. Contractive Diffusion Probabilistic Models, Tang and Zhao, arXiv:2401.13115.

d. Fine-tuning of diffusion models via stochastic control: entropy regularization and beyond, Tang, arXiv:2403.06279.

I would be happy to raise the score if some (or all) of the above concerns are addressed.

Questions

See weakness

Limitations

NA

Author Response

We thank the reviewer for the positive evaluation and the valuable feedback. We believe that we have fully addressed your concerns/questions and will incorporate all points mentioned below in the final version. If you have any further questions, please feel free to post them, and we would be glad to address them.


Weakness (1) ... how to choose $b_i$ ... interpretations of the KL constraints ...

Response: Our KL constraint thresholds $(b_i,\ i=1,\ldots,m)$ function as a set of balancing weights over the constrained distributions $\{q^i,\ i=1,\ldots,m\}$. A smaller $b_i$ (a tighter constraint) causes the model to put more weight on the constrained distribution $q^i$ (i.e., to sample more often from $q^i$). This ties into our desired properties for each setting in different ways:

  • In the minority class setting, a smaller $b_i$ leads to sampling more often from the minority classes, which is our desired property.

  • In the fine-tuning setting, a smaller $b_i$ means sampling more often from the pre-trained model, ensuring the new model does not forget the pre-trained model.

The distribution-balancing role of the constraint thresholds $(b_i,\ i=1,\ldots,m)$ can be shown by analysing the optimal dual variable $\lambda^\star$. Recall from Section 3.2 that the dual function is $g(\lambda) = h(q_{\text{mix}}^{(\lambda)}) + \sum_{i=1}^m \lambda_i b_i$, where $h(q_{\text{mix}}^{(\lambda)})$ is the differential entropy of the mixture distribution $q_{\text{mix}}^{(\lambda)}$. The maximizer $\lambda^\star$ of the dual function determines the optimal weights in the final learned mixture $q_{\text{mix}}^{(\lambda^\star)}$, i.e., how often the trained model samples from each distribution $q^i$. Setting the gradient of the dual function to zero leads to

$$\frac{\lambda^\star_i}{1 + (\lambda^\star)^T \mathbf{1}} = e^{h_i - b_i} \quad \text{for } i = 1,\ldots, m,$$

where $\frac{\lambda^\star_i}{1 + (\lambda^\star)^T \mathbf{1}}$ is the weight of $q^i$. A smaller $b_i$ leads to a larger weight for $q^i$ and vice versa. Another interesting implication of $\lambda^\star$ is its dependence on the entropy $h_i$ of each individual distribution: when the constraint thresholds are equal, the model learns to sample more often from a distribution $q^i$ that has high entropy, meaning that a constrained model can generate more 'diverse' samples than an unconstrained model. In practice, we choose the constraint thresholds $(b_i,\ i=1,\ldots,m)$ by starting with a value close to the minimum loss achieved by an unconstrained model. Based on the insight above, we increase or decrease $(b_i,\ i=1,\ldots,m)$ depending on whether we are sampling too rarely or too often from the constrained distributions. Another tuning method that we found useful in practice is resilient constrained learning, in which the constraint thresholds are updated adaptively during training; see Appendix E.2 and Algorithm 3.


Weakness (2) Theorem 7 ... the initial distribution is a mixture ...

Response: Theorem 7 shows that the trained diffusion model from our dual-based training (Algorithm 1) converges to a distribution that is close to a mixture of data distributions weighted by an optimal dual. Importantly, Theorem 7 does not assume an initial mixture distribution; rather, Theorem 3 asserts that the optimal solution to the constrained problem is a mixture distribution, and we use this in Theorem 7. Compared to [28], we have advanced unconstrained diffusion models to address constrained problems, and we additionally characterize the effect of constraints on the diffusion model through duality analysis in constrained optimization.


Weakness (3) ... FID obtained seems to be large ...

Response: We note that our FID scores being larger than existing results is not a weakness of our approach, but rather a consequence of our experimental setup. We train both the unconstrained and constrained models on a *biased* subset of the dataset, wherein some classes have significantly fewer samples than the rest. We then compute the FID scores for these models against the actual dataset itself, which is *unbiased* (i.e., every class has the same number of samples). These FID scores approximate how close the learned distribution of the model trained on biased data is to the underlying unbiased distribution. This setup contrasts with existing results in the literature, where the FID is computed with respect to unbiased data and the models are also trained on unbiased data. Therefore, it is expected that such models will achieve better FID scores than constrained or unconstrained models trained with biased data. Our purpose in reporting the FIDs was not to compare them to existing results (as such a comparison would be uninformative) but to demonstrate that, when trained on biased data, the constrained model achieves better FID scores than the unconstrained model.


Weakness (4) ... SDE-based diffusion models ... good empirical results ... fine-tuning in the continuous framework, as in [45] and ref d.

d. Fine-tuning of diffusion models via stochastic control: entropy regularization and beyond, Tang, arXiv:2403.06279.

Response: For the difference in empirical results, please refer to our previous response. We note that [45] and ref. d study how to fine-tune pre-trained diffusion models to respect certain desired generation properties. In contrast, we focus on training new diffusion models to respect desired generation properties by imposing KL-divergence constraints. Despite having different problem setups, our constrained formulation can be used in fine-tuning problems. For instance, if a high-quality dataset that satisfies our desired properties is available, we can impose the KL divergence between the fine-tuning model and the underlying distribution of the high-quality dataset as a constraint in [45]. We will discuss this direction as future work in the final version.
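To make the threshold discussion in our response to Weakness (1) above concrete, the following small numerical sketch (our own illustration of the stationarity condition, not code accompanying the paper) maps hypothetical thresholds $b_i$ and entropies $h_i$ to mixture weights and recovers the corresponding optimal duals:

```python
import numpy as np

def mixture_weights_and_duals(h, b):
    """Use the stationarity condition lambda*_i / (1 + sum(lambda*)) = exp(h_i - b_i)
    to compute the weight of each constrained distribution q^i and recover lambda*."""
    h, b = np.asarray(h, dtype=float), np.asarray(b, dtype=float)
    w = np.exp(h - b)                  # weight of each q^i in the learned mixture
    assert w.sum() < 1.0, "thresholds too tight: some weight must remain on q"
    lam = w / (1.0 - w.sum())          # optimal dual variables lambda*_i
    w0 = 1.0 - w.sum()                 # remaining weight on the original data q
    return w0, w, lam

# Hypothetical numbers: two constrained datasets with equal entropies h_i.
w0, w, lam = mixture_weights_and_duals(h=[1.0, 1.0], b=[1.5, 2.0])
print(w0, w, lam)  # the tighter threshold b_1 = 1.5 gives q^1 the larger weight
```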
Comment

I thank the authors for the response and will keep my score unchanged.

Comment

We sincerely thank the reviewer for taking the time to read our rebuttal. We believe we have adequately addressed all the concerns pointed out by the reviewer (especially weaknesses 1,2, and 3). We would thus kindly ask the reviewer to let us know which parts of our rebuttal they found unclear or unconvincing so that we could hopefully clarify them further before the end of the discussion period on Aug 13th.

Review
Rating: 6

This paper aims to address the issue of generating biased data based on the training dataset for diffusion models. The authors introduce a constrained diffusion model by imposing the diffusion constraints based on desired diffusions that are informed by requirements/constraints. They propose a dual training algorithm to train the model, and characterize the convergence of the trained model. Two constrained generation tasks are explored: ensuring fairness to underrepresented classes and adapting a pretrained model to new data.

优点

  • The paper is well written and easy to follow.
  • Its key idea is using the quadratic loss formulation of score matching, which is equivalent to the ELBO of diffusion models. The authors employ Lagrangian duality to update both the parameters of the diffusion model and the dual variables of the constraints. While Lagrangian duality is commonly used for constrained optimization, particularly for constrained generative models (see [1, 2, 3]), this paper is the first to introduce this approach for diffusion models.
  • Thanks to the quadratic loss minimization of diffusion models, the authors can establish the convergence of the proposed constrained models. I think this is the most novel part of the paper.

[1] Liu et al., Sampling with trustworthy constraints: a variational gradient framework , NeurIPS 2021.

[2] Danilo et al., Generalized ELBO with Constrained Optimization, GECO, Bayesian Deep Learning (NeurIPS 2018).

[3] Ferdinando et al., Lagrangian Duality for Constrained Deep Learning, ECMLPKDD 2020.

Weaknesses

  • This paper shares a close connection to [1] regarding its motivation: formulating fairness as a constrained distributional optimization. Both employ a methodology based on primal-dual optimization derived from Lagrangian duality to address the constraints. While [1] focuses on constrained SVGD, this paper focuses on constrained diffusion models.
  • Some minor corrections: (i) Algorithm 1, line 4: $x_{\theta}(h)$ should be $s_{\theta}(h)$; (ii) Section 5, line 314: $q(x_{0:T})$ should be $q_{i}(x_{0:T})$ (constraint part).

Questions

  • How efficient is the model as the number of constraints increases? Each time the model updates the dual variables, it needs to train a diffusion model, which is slow.
  • How does the proposed method differ from existing constrained generative models, for example [1]?

Limitations

The paper can be improved by:

  • a discussion of the efficiency of the proposed constrained diffusion models should be included.
  • a discussion of how the proposed method differs from existing constrained generative models should be added.

Author Response

We thank the reviewer for the positive evaluation and the valuable feedback. We believe that we have fully addressed your concerns/questions, and will incorporate all points mentioned below in the final version. We would be happy to address any further questions you might have.


Weakness 1... connection to [1] ...

[1] Liu et al., Sampling with trustworthy constraints: a variational gradient framework , NeurIPS 2021.

Question 2... differs from existing constrained generative models, for example [1] ...

Limitation 2 a discussion of how the proposed method differs from existing constrained generative models should be added.

[2] Danilo et al., Generalized ELBO with Constrained Optimization, GECO, Bayesian Deep Learning (NeurIPS 2018).

[3] Ferdinando et al., Lagrangian Duality for Constrained Deep Learning, ECMLPKDD 2020.

Response: We completely agree that [1] and our paper share the motivation of using distribution constraints. It is worth mentioning that our constrained diffusion model differs from the constrained sampling in [1] in four key aspects.

  • In methodology, we model an unknown data distribution, while [1] works on sampling from a known distribution.

  • We have different optimization problems: [1] optimizes a reverse KL divergence subject to a moment constraint, while we use the forward KL divergence to form both objective and constraints.

  • We have different algorithms: we develop a training algorithm (see Algorithm 1) in a dual space, while [1] proposes a sampling method that works in a primal-dual fashion.

  • We have different theory: regarding distributions, our theory only requires the mild assumption that samples are bounded, while [1] assumes analytical properties of the target distribution, e.g., log-Sobolev conditions.

Thank you for pointing out the additional references [2, 3]. Reference [2] studies the classical VAE under moment constraints, and [3] studies constrained deep learning; neither is directly applicable to diffusion models. Both references develop primal-dual training algorithms empirically, without guarantees. Therefore, our constrained diffusion model distinguishes itself in terms of problem, algorithm, and theory.


Weakness 2 Some minor corrections ...

Response: Thank you for pointing out the typos. We will correct them upon revision and double-check the paper's writing in the final version.


Question 1 How efficient the model is as the number of constraints increases? ...

Limitation 1 a discussion of the efficiency of the proposed constrained diffusion models should be included.

Response: Thank you for bringing our attention to efficiency. First, we note that the complexity of sampling from our constrained diffusion model does not increase with the number of constraints, as our trained diffusion model generates samples just like a standard diffusion model. Importantly, we remark that training our constrained diffusion model has efficiency comparable to training standard diffusion models, as detailed next.

The additional computational cost of our dual-based training (Algorithm 1) arises from: (i) updating the dual variables; (ii) updating the diffusion model in the primal update.

  • (Cost of updating the dual variables) We note that our dual-based training has the same number of dual variables as the number of constraints. Thus, the cost for the dual update is linear in the number of constraints. To update each dual variable, we can directly use the ELBO loss over the batches sampled from each constrained dataset (already computed for the Lagrangian). Therefore, the cost of updating dual variables is negligible.

  • (Cost of updating the diffusion model in the primal update) We note that the primal update trains a standard diffusion model based on the Lagrangian with updated dual variables. In our experiments, this primal training often requires as few as 2-3 updates per dual update. Thus, when training our constrained model, we can train for the same number of epochs as an unconstrained model but update the dual variables after every few primal steps. As a result, training our constrained diffusion model is almost as efficient as training standard unconstrained models.
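To make the two update costs above concrete, here is a minimal PyTorch-style sketch of one outer iteration of the dual-based training (our simplified illustration, not the released implementation; `model.elbo_loss`, `primal_batches`, and `constraint_iters` are hypothetical handles):

```python
import torch

def dual_training_step(model, optimizer, lam, eta, b,
                       primal_batches, constraint_iters, primal_per_dual=3):
    """One outer iteration of the sketched primal-dual scheme: a few primal
    gradient steps on the Lagrangian, then one dual ascent step. `lam` and `b`
    are length-m tensors; `model.elbo_loss(batch)` returns the standard
    denoising (ELBO) loss of the diffusion model on a batch."""
    for _ in range(primal_per_dual):
        batch = next(primal_batches)
        # Constraint losses over one batch from each constrained dataset; the
        # last values are reused below for the dual update at no extra cost.
        c_losses = torch.stack(
            [model.elbo_loss(next(it)) for it in constraint_iters])
        lagrangian = model.elbo_loss(batch) + torch.dot(lam, c_losses)
        optimizer.zero_grad()
        lagrangian.backward()
        optimizer.step()
    # Dual ascent on the estimated constraint slacks, projected onto lam >= 0.
    lam = torch.clamp(lam + eta * (c_losses.detach() - b), min=0.0)
    return lam
```

As noted above, the constraint losses already computed for the Lagrangian are reused in the dual step, so the dual update adds essentially no compute beyond sampling the extra constraint batches.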

The only concern we encountered regarding efficiency is that batches need to be sampled from every constrained dataset at each step to estimate the Lagrangian. This introduces a small GPU memory overhead that increases with additional constraints. However, this is somewhat mitigated by the fact that constrained datasets are often much smaller than the original dataset, allowing us to choose a smaller batch size for the constrained datasets without degrading performance.

Comment

Dear Authors,

I would like to thank the authors for the response and would like to keep my score unchanged.

Best regards,

Reviewer EijC

Comment

We sincerely thank the reviewer for reading and responding to our rebuttal. We would be happy to address any remaining concerns they might have.

Author Response

We thank the reviewers for recognizing the strengths of our contribution and providing valuable feedback. We believe that we have addressed your concerns and hope that you will reconsider our paper in light of our rebuttal. To better assist your evaluation, we summarize three shared concerns in this global rebuttal and outline our responses below. Please see our individual rebuttals directly following your review.

  • Related work: We thank the reviewers for bringing some related works to our attention. In each individual rebuttal, we have clarified how our work differs from these references in several key ways. We will incorporate these points into an expanded related work section in the final version, alongside our existing discussion of related work, to better contextualize how our approach relates to (and differs from) previous methods tackling similar problems.

  • Significance of experiments: We believe that our current experiments in the paper have validated the utility and effectiveness of our constrained diffusion framework. To further strengthen this message, we have prepared new results from training constrained latent diffusion models on the more challenging Image-Net dataset; see them in the attached PDF. In our individual rebuttals, we have also discussed how the efficiency of training constrained models is comparable to that of training unconstrained models, and further discussed the choices of different hyperparameters and their effects. In particular, we have included a table in the attached PDF showing how the primal/dual batch sizes affect the performance of the constrained model. Based on the reviewers' concerns, we will include these new results and an expanded hyperparameter sensitivity analysis in the appendix of the final version.

  • Other clarifications: Reviewer Rjho brings up an important clarifying question regarding the choice of constraint thresholds and their relation to the desired properties of the constrained model. We have addressed this by pointing out the relationship between the constraint thresholds $b_i$ and the weights that the model learns for each constrained distribution $q^i$. Reviewer Zh95 raises an important question about the generality of our constrained diffusion model. We have discussed how other diffusion processes can be incorporated into our framework, and we will discuss such inclusions in the final version.

If you have any further questions, please feel free to post them, and we would be glad to address them.

Best,

The authors of Submission21568

Comment

Dear Reviewers and Area Chairs,

We thank the reviewers as most of their concerns are important clarifying questions about our method and its relation to existing literature. We believe these are not weaknesses in our approach or the paper's soundness, as we have addressed in the rebuttals. In light of the discussions, we summarize the strengths of our paper in four aspects below:

Originality: Utilizing constrained optimization with ELBO constraints to train diffusion models is unique to our work. Our rigorous theoretical characterization of the solution and convergence guarantees are also novel. In our rebuttals, we have further clarified key ways in which our approach stands apart from previous work referenced by reviewers.

Significance: The significance of our work lies in providing a principled framework to ensure that certain biases in the data are not replicated by the generative model. The implications of our approach for promoting sample diversity and for extending our framework to state-of-the-art conditional diffusion models (see our responses to reviewers Rjho and c1z8, respectively) are directions that build on the foundation laid in this work.

Quality: The theory and experiments validate our approach and reviewers appear to agree the paper is technically sound. We've addressed concerns regarding the scope of experiments by providing additional empirical results on more challenging data (noted by reviewer c1z8 in their response).

Clarity: We appreciate the reviewers noting that the paper is well-written, well-motivated, and easy to follow. We have further addressed all clarifying questions from reviewers (noted by reviewers Zh95, c1z8 in their responses). We will include additional details to aid in reproducibility and clarity of experiments, namely hyper-parameter sensitivity analysis (like we provided in rebuttals for batch size) and comparison of computation costs.

We humbly ask you to consider these points in the final evaluation of our paper. We would be happy to address any other concerns or questions.


Thank you for your consideration,

Authors of Submission 21568

Final Decision

The paper proposes a method for training diffusion models to satisfy constraints based on the KL divergence to certain reference distributions. The problem is reformulated using Lagrangian duality and a dual ascent algorithm is proposed to solve it. A rigorous optimality analysis is given under clearly stated assumptions.

Reviewers felt the paper addressed an important problem and found the paper to be sound, thorough, methodologically interesting, and rigorous. Weaknesses included questions about the running-time overhead of dual training and questions about the experiments, including comparisons to baselines, comparisons of FID score to published results, and performance on more challenging data sets.

During the rebuttal, the authors clarified a number of points:

  • They used biased subsets of existing benchmarks, so their FID scores are not comparable to published results.
  • Quantitative comparisons to a baseline of models fine-tuned without constraints already appeared in the paper.
  • They added results on a more challenging data set, ImageNet.
  • They clarified that the running time overhead is small in practice because the primal variables only need a few training iterations each time the dual variables are updated.

Based on these responses, two reviewers raised their scores. In the end, the reviewers unanimously recommend acceptance. Reviewer Rjho did not raise their score, but the meta-reviewer feels that the weaknesses they raised were addressed by the rebuttal.

The meta-reviewer also looked at the paper and was impressed with the scope and rigor of the theoretical results, which go beyond other papers in this area. For example, in addition to the convex analysis in the space of distributions and score functions, the authors give a convergence/optimality analysis that considers the score function parameterization under clearly stated assumptions about the parameterization error.