PaperHub
Score: 6.8 / 10
Decision: Poster · 4 reviewers
Ratings: 4, 4, 4, 5 (min 4, max 5, std 0.4)
Confidence: 3.3
Novelty: 3.0 · Quality: 3.0 · Clarity: 3.3 · Significance: 2.5
NeurIPS 2025

Composition and Alignment of Diffusion Models using Constrained Learning

OpenReview · PDF
Submitted: 2025-05-12 · Updated: 2025-10-29

Keywords: Diffusion Models, Constrained Optimization, Alignment, Composition, Generative Models

Reviews and Discussion

Review (Rating: 4)

This paper proposes a constrained framework that unifies the alignment and composition of diffusion models, aimed at tackling the trade-offs that arise when optimizing for multiple rewards or combining multiple models. The authors give theoretical results for alignment under multiple reward constraints and for composition of multiple pretrained models within the canonical framework of Lagrange duality. Experiments show that the aligned or composed models satisfy the constraints effectively when using the proposed constrained method.

Strengths and Weaknesses

Strengths:

  1. The paper provides a comprehensive theoretical framework for constrained alignment and composition of diffusion models, including characterization of the solutions and strong duality results. The proofs are detailed and well-structured, which adds significant value to the work.

  2. The proposed method unifies alignment and composition tasks, offering a principled alternative to existing ad hoc approaches.

  3. The experimental results on fine-tuning alignment and product composition are consistent with the theory, satisfying the constraints effectively.

Weaknesses:

  1. The strong duality claims in Theorems 2 and 4 are not rigorously justified. While the KL-regularized constrained optimization problem is convex in the space of diffusion trajectories, the authors project this result onto a non-convex parameter space of score networks without formally proving that the parameterized family is expressive enough to represent all feasible distributions. This step implicitly assumes a surjective mapping from score functions to path distributions.

  2. Lemmas 1 and 2 assume that the diffusion models being compared share the same noise and variance schedule, which I think is impractical. It would be valuable if the authors could discuss whether the results can be extended to more general settings, or at least clarify the practical implications of violating this assumption.

  3. The empirical evaluation is quite limited. The experimental section only compares the proposed method against equal-weight baselines for both alignment and composition. Although these baselines are simple and commonly used, they are far from state-of-the-art. For example, recent constrained or flow-based composition baselines are not compared, even though the related-work section cites them.

  4. For high-dimensional images, the authors skip all primal updates and drive the dual variables solely with a surrogate product score to avoid the cost of annealed MCMC. However, the paper does not provide any theoretical or empirical analysis of the approximation error introduced by this shortcut. As a result, it is unclear whether this practical modification substantially weakens the guarantees of the method.

  5. The method involves expensive components such as repeated KL divergence estimation between large diffusion models and, in the full algorithm, annealed MCMC sampling. The authors acknowledge that the approach is “computationally costly,” but do not provide any quantitative reporting of GPU hours, wall-clock time, or memory usage. It would be valuable if the authors could include such metrics.

Questions

See "weaknesses".

Limitations

yes

Final Justification

The theoretical contributions of the paper are solid. However, the empirical evaluation remains relatively limited, especially in comparison to other diffusion-related papers at NeurIPS. For this reason, I will keep my score (weak accept), but I will increase my confidence from 3 to 4.

Formatting Issues

No concern.

Author Response

Re: Weakness 1

Our strong duality results show that there is no duality gap when the score networks are sufficiently expressive. This can be justified by the widely used practice of training diffusion models with over-parameterized networks (e.g., U-Nets or transformers). When Slater's condition holds for the convex hull of score networks $\overline{\mathrm{conv}}(\mathcal{S})$, we can apply the known strong duality result for non-convex constrained learning (e.g., see Proposition 3.2 of [1]) to our constrained alignment and composition problems. Hence, the mapping from score networks to path distributions need not be surjective, and strong duality holds, i.e., $P^\star = D^\star$.

When the score networks $s_\theta$ are parametrized by a class of functions with limited expressiveness, we can use the duality gap analysis in non-convex constrained learning (e.g., see Proposition 3.3 of [1]) to characterize the duality gap

$$P_\theta^\star - D_\theta^\star \leq O(\nu),$$

where $\nu$ is a parametrization gap that satisfies

$$\| p(\cdot; s) - p(\cdot; s_\theta) \|_1 \leq \nu \quad \text{for any } s \in \mathcal{S}.$$

Thank you for raising this important point. We will include all of these points formally in the final version.
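For readers of this thread, a minimal restatement of the two quantities being related, using the unparameterized reward-alignment problem as an illustration (notation is ours; the precise constraint sets are those defined in the paper):

$$P^\star = \min_{p} \; D_{\mathrm{KL}}(p \,\Vert\, q) \quad \text{s.t.} \quad \mathbb{E}_{x \sim p}[r_i(x)] \geq b_i, \quad i = 1, \ldots, n,$$

$$D^\star = \max_{\lambda \geq 0} \; \min_{p} \; \Big[ D_{\mathrm{KL}}(p \,\Vert\, q) - \sum_{i} \lambda_i \big( \mathbb{E}_{x \sim p}[r_i(x)] - b_i \big) \Big].$$

Weak duality always gives $D^\star \leq P^\star$; the claim above is that this gap vanishes (or is bounded by $O(\nu)$ under limited expressiveness).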

Reference

[1] (Chamon et al.) Constrained Learning With Non-Convex Losses


Re: Weakness 2

In both lemmas, this assumption can be relaxed, leading to different, but still tractable, sums. In Lemma 1, the path-wise KL can be decomposed into a sum of $T$ KL divergences:

$$D_{\mathrm{KL}} \left( p_{0:T}(\cdot ; s_p) \,\Vert\, p_{0:T}(\cdot; s_q) \right) = \sum_{t=1}^T \mathbb{E}_{x_{0:T} \sim p} \left[ D_{\mathrm{KL}}\left(p(x_{t-1} \mid x_t; s_p) \,\Vert\, p(x_{t-1} \mid x_t; s_q) \right) \right].$$

Importantly, $p(x_{t-1} \mid x_t; s_p)$ and $p(x_{t-1} \mid x_t; s_q)$ are Gaussian distributions. If the two models share a variance and noise schedule, the KL between Gaussians simplifies to the squared norm of the difference between the means, which is what we present in the paper. However, if the noise and variance schedules are different, the KL between the two Gaussians

$$p(x_{t-1} \mid x_t; s_p) = \mathcal{N}\left(\sqrt{\frac{\alpha_{t-1}}{\alpha_t}}\, x_t + \beta_t\, s_p(x_t, t),\ \sigma_t^2 I\right),$$

and

$$p(x_{t-1} \mid x_t; s_q) = \mathcal{N}\left(\sqrt{\frac{\alpha'_{t-1}}{\alpha'_t}}\, x_t + \beta'_t\, s_q(x_t, t),\ {\sigma'_t}^2 I\right),$$

is still tractable and given by:

$$D_{\mathrm{KL}} \big( p(x_{t-1} \mid x_t; s_p) \,\Vert\, p(x_{t-1} \mid x_t; s_q) \big) = \frac{1}{2} \left[ \frac{1}{{\sigma'_t}^2} \left\| \left( \sqrt{\frac{\alpha_{t-1}}{\alpha_t}} - \sqrt{\frac{\alpha'_{t-1}}{\alpha'_t}} \right) x_t + \beta_t\, s_p(x_t, t) - \beta'_t\, s_q(x_t, t) \right\|^2 + d \log \left( \frac{{\sigma'_t}^2}{\sigma_t^2} \right) + d\, \frac{\sigma_t^2}{{\sigma'_t}^2} - d \right].$$

If the variance and noise schedules are the same, i.e., $\alpha_t = \alpha'_t$ and $\sigma_t = \sigma'_t$, the above reduces to:

$$\frac{\beta_t^2}{2 \sigma_t^2} \left\| s_p(x_t, t) - s_q(x_t, t) \right\|^2.$$
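As a concrete illustration of the closed form above, here is a minimal PyTorch-style sketch of the per-step Gaussian KL (function and variable names are ours, not the authors' code; the means would be assembled from the schedule terms shown above):

```python
import math
import torch

def per_step_gaussian_kl(mu_p: torch.Tensor, mu_q: torch.Tensor,
                         sigma_p: float, sigma_q: float) -> torch.Tensor:
    """KL( N(mu_p, sigma_p^2 I) || N(mu_q, sigma_q^2 I) ), one value per batch element.

    mu_p, mu_q: (batch, d) means of the two reverse-step Gaussians.
    sigma_p, sigma_q: their scalar standard deviations.
    """
    d = mu_p.shape[-1]
    sq_dist = ((mu_p - mu_q) ** 2).sum(dim=-1)      # ||mu_p - mu_q||^2
    var_ratio = (sigma_p ** 2) / (sigma_q ** 2)     # sigma_p^2 / sigma_q^2
    return 0.5 * (sq_dist / sigma_q ** 2 + d * var_ratio - d - d * math.log(var_ratio))

# With a shared schedule (sigma_p == sigma_q), this reduces to sq_dist / (2 * sigma_q**2),
# i.e., the squared-norm-of-mean-difference form used in the paper.
```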

The derivation in Lemma 2 becomes much more cumbersome in this relaxed setting; however, the point-wise KL is still tractable.

We note that the "same schedule" assumption holds for all of our experimental settings. However, we agree with the reviewer that in some practical settings this assumption may be violated. We will remark on this relaxation in the final version.


Re: Weakness 3

We believe that, aside from the commonly used equal-weights baseline we compare to, existing methods in the literature are not applicable to the settings that we consider and are generally not comparable to our constrained learning framework. This includes papers that propose constrained methods: in all the works that we are aware of, what they mean by constraining the diffusion process is significantly different from our definition. We now discuss the most relevant papers that we have cited to explain why they are not suitable baselines for this work. If the reviewer has specific existing approaches in mind as meaningful baselines, please mention them so we can discuss and potentially include them.

  • In [1], which has the most compatible setting, the authors average model weights to sample from the geometric mean distribution. They further propose a weighted average with potentially different weights and suggest finding weights that minimize a specific objective (e.g., image reward), using a heuristic algorithm they call "Greedy Souping". This is not usable in our setting, since there is no clear choice for the objective in either composition or alignment with multiple rewards.

  • In [2], the authors propose a superposition method to sample from a mixture of diffusion models with arbitrary weights (although they only use equal-weight mixtures and do not discuss different weights). They also devise a method to sample points that have equal likelihood under different models, which is fundamentally different from product composition.

  • Works like [3,4,5,6] all discuss constrained sampling from diffusion models, but the nature of their constraints is completely different from ours: they mainly involve sampling from a constrained set, enforced by projecting onto a feasible set at each diffusion time step. It is not clear how to apply these methods to reward constraints or how to use them to constrain the distance to a reference model.

  • The other papers we have cited either propose methods to improve alignment with a single reward using a KL regularizer, like [7], or enforce very specific constraints by adding additional losses with fixed weights to the objective, which implicitly enforces the constraint, like [8,9]. These methods are tailored to the constraints they are designed for; they do not generalize to arbitrary reward functions and do not give us a way to constrain closeness to a model.

References

[1] (Biggs et al.) Diffusion Soup: Model Merging for Text-to-Image Diffusion Models
[2] (Skreta et al.) The Superposition of Diffusion Models Using the Itô Density Estimator
[3] (Christopher et al.) Constrained Synthesis with Projected Diffusion Models
[4] (Liang et al.) Multi-Agent Path Finding in Continuous Spaces with Projected Diffusion Models
[5] (Narasimhan et al.) Constrained Posterior Sampling: Time Series Generation with Hard Constraints
[6] (Zampini et al.) Training-Free Constrained Generation with Stable Diffusion Models
[7] (Fan et al.) DPOK: Reinforcement Learning for Fine-tuning Text-to-Image Diffusion Models
[8] (Giannone et al.) Aligning Optimization Trajectories with Diffusion Models for Constrained Design Generation
[9] (Chen et al.) Towards Aligned Layout Generation via Diffusion Model with Aesthetic Constraints.


Re: Weakness 4

We agree with the reviewer that, from a theoretical standpoint, using the surrogate score is not an exact way to sample from a product distribution. We note that it is a commonly used approach in the literature (e.g., see [2]); more generally, both classifier and classifier-free guidance rely on a similar approximation.
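For context, a hedged note on what the surrogate refers to in this line of work (weights and notation here are ours): the score of the product distribution at noise level $t$ is approximated by a weighted sum of the component scores,

$$\tilde{s}^{(\lambda)}(x_t, t) \;=\; \sum_i \lambda_i\, s_i(x_t, t),$$

which in general is not the true score of the noised product distribution; the two coincide only in special cases, such as the Gaussian setting mentioned below.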

Furthermore, in constrained product composition for text-to-image diffusion, our results show that even with the approximation error, KL divergence constraints lead to more balanced representation of the concepts in concept composition and more balanced reward levels in composing reward fine-tuned models, which we believe is strong evidence of the effectiveness of our approach.

As we discuss in our paper, [1] shows that annealed MCMC sampling with the surrogate score can yield samples from the correct product distribution. However, the full implementation needed to reproduce their relevant results, which we could then apply in our constrained learning framework, is not publicly available*, and reproducing their results from an incomplete implementation is beyond the scope of our paper.

Regarding the theory, it can be shown that in the case where the initial distributions at time $t = 0$ are Gaussian, the surrogate score and the true score coincide for all values of $\lambda$. Anything beyond this simple setting quickly becomes intractable to analyze.

We thank the reviewer for highlighting this approximation error and agree that analyzing it will provide a stronger theoretical justification of our product composition experiments. We plan to more thoroughly study and address the effects of using the surrogate score for product composition in future work.

References

[1] Du et al. Reduce, Reuse, Recycle: Compositional Generation with Energy-Based Diffusion Models and MCMC
[2] Du et al. Compositional Visual Generation with Energy Based Models

*Note: There is a GitHub repo for the paper [1], but it does not include the implementation needed to do MCMC sampling with pre-trained text-to-image diffusion models.


Re: Weakness 5

All experiments were run on a single Nvidia A6000 GPU.

  • For Alignment, there is little additional time overhead compared to baselines like AlignProp. For example, for the experiment in Figure 3, the runtime is 33 minutes for both the constrained and unconstrained methods, and for the experiments in Figure 4, the constrained method takes 1h4m versus 1h for the unconstrained one. Existing approaches already estimate the KL and sample batches to evaluate and backpropagate through the reward. The only additional computation in our method is the dual updates (a generic sketch follows this list), which add negligible time.

  • For Composition, there is no meaningful comparison to the equal-weights baseline, since in that case the weights are not learned. For constrained composition, it takes around 5–10 dual updates for the dual variables to converge; composing the 5 fine-tuned Stable Diffusion models takes roughly 9 minutes in total, and concept composition roughly 2 minutes.
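To make the dual update mentioned above concrete, here is a generic projected-ascent sketch for reward constraints of the form $\mathbb{E}[r_i(x)] \geq b_i$ (names and step size are illustrative, not the authors' exact rule):

```python
def dual_ascent_step(lambdas, avg_rewards, thresholds, eta=0.1):
    """One projected gradient-ascent step on the dual variables of the reward
    constraints E[r_i(x)] >= b_i.

    lambdas:     current multipliers, one per reward constraint.
    avg_rewards: Monte Carlo estimates of E[r_i] under the current model.
    thresholds:  the constraint levels b_i.

    lambda_i grows while constraint i is violated (estimate below b_i), shrinks
    once it is satisfied, and is projected back onto lambda_i >= 0.
    """
    return [max(0.0, lam + eta * (b - r))
            for lam, r, b in zip(lambdas, avg_rewards, thresholds)]
```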

We thank the reviewer for highlighting this and we will include the above and additional runtime details in the final version.

Comment

Thank you for the thoughtful rebuttal. I find the theoretical contributions of the paper to be solid, and your responses have addressed my earlier concerns. That said, the empirical evaluation remains relatively limited, especially in comparison to other diffusion-related papers at NeurIPS. I believe it would be difficult to substantially improve the empirical side within the rebuttal stage. For this reason, I will keep my score, but I will increase my confidence.

Review (Rating: 4)

This work proposes addressing the problems of reward alignment and composition in diffusion models through the lens of constrained learning via Lagrangian duality. The authors present a coherent formulation that yields both structural insights into the solution and practical algorithms for solving the optimization problem.

Strengths and Weaknesses

Strengths

  • The Lagrangian formulation offers a clean and principled approach to the problem, and the technical work appears solid.
  • The method allows users to specify thresholds instead of weights, which is arguably more intuitive and user-friendly.

Weaknesses

  • While the constrained learning perspective is appealing, the technical novelty appears limited. The analysis seems relatively straightforward, and it's unclear whether the proofs or techniques used contain significant new contributions. It would be helpful if the authors clarified this point.
  • Although the title and framing suggest a unified treatment of reward alignment and product composition, the paper handles these two problems separately. The connection between them seems limited to the fact that a similar analytical approach is applied in both cases.
  • The motivation for using the path-wise KL divergence in alignment is somewhat unclear. Similar to composition, the final distribution $p_0$ seems more important, so why is the path-wise divergence more appropriate? Furthermore, the formulation appears to rely on a discrete-time variant; can this be extended to continuous time?

Minor

  • Section 4.1, line 211: I believe it should be problem (UR-C), not (UR-A).

Questions

Please see the weaknesses above. In summary:

  • Are there new technical contributions beyond the formulation itself?
  • Why is path-wise KL divergence used for alignment rather than divergence at the final distribution?

Limitations

Yes.

Final Justification

Overall, this work on reward alignment and composition using a Lagrangian formulation is interesting and solid, but not particularly remarkable in terms of novelty and potential impact. As such, I have recommended a borderline accept.

Formatting Issues

None.

Author Response

Re: Weakness 1
An integral part of our approach lies in Lemma 2, which allows us to evaluate the KL divergence between the marginals of a backward diffusion process (i.e., the point-wise KL). As far as we know, our work appears to be the first to provide a tractable characterization of the point-wise KL and to highlight its difference from the path-wise KL.

We argue that in composition, constraining the KL divergence between the final marginal distributions at $t = 0$ is more natural, since we care about the distributions of the generated samples and not the whole diffusion process. At first glance, computing this KL seems intractable, since in general it would require marginalizing the backward diffusion processes with arbitrary score functions to derive $p_0$.

We would like to highlight that the proof is quite involved as it requires characterizing the derivative of the KL between marginals in continuous time, and then bridging the gap to discrete time and using properties of the KL along with novel technical analysis to bound the discretization error. We expand on this further below.

Furthermore, we would argue that even some parts of the analysis that are relatively straightforward from a technical standpoint (e.g., strong duality), are still important and novel, since the results give new insight into the solutions of the constrained problem, connecting our work to the broader model composition and alignment literature.

Proof Summary of Lemma 2:
In the proof, we begin with the intuition that if the initial distributions at time $t = T$ are the same and the two backward processes are close, then the final distributions at time $t = 0$ will also be close. We then generalize a result from [1] to formalize this in continuous time, giving us an analogue of Lemma 2 in continuous time.

Then, we use results from the literature on the discretization of Langevin dynamics, along with additional theoretical analysis of our own, to prove that the continuous-time point-wise KL computation extends to discrete-time diffusion processes, where we characterize the discretization error to be of order $O(1/T)$. We refer the reviewer to Appendix C.2 in the supplementary material for more details.

Reference

[1] (S. Lyu) Interpretation and generalization of score matching


Re: Weakness 2
By unified treatment, we mean that the same constrained learning framework can be applied to both problems. Beyond the similar analytical approach, we also use essentially the same primal-dual algorithm to tackle both, the only difference being a workaround in composition to minimize the Lagrangian. Furthermore, as we briefly mention in the last paragraph of Section 2, the problem formulations for composition and alignment should be viewed as canonical formulations, and similar approaches can be readily adapted to solve combinations of these, for example:

$$\begin{aligned} \min_{u \geq 0,\, p} \quad & u \\ \text{subject to} \quad & D_{\mathrm{KL}}(p \,\Vert\, q_i) \leq u \quad \text{for } i = 1, \ldots, m, \\ & \mathbb{E}_{x \sim p} [r_j(x)] \geq b_j \quad \text{for } j = 1, \ldots, n, \end{aligned}$$

where the goal is to find a model that is close to multiple pre-trained models and also satisfies one or more reward constraints.

In order to emphasize this unified treatment, we present an example of the combined constrained alignment and composition formulation above. We conducted an experiment using point-wise KL constraints with respect to pre-trained Stable Diffusion and a model fine-tuned on the aesthetics reward, along with a constraint on the saturation reward. The resulting constrained model achieves less than 10% reward constraint violation and similar KLs with respect to both pretrained models.

| Constraint | Dual Variable | Initial Value | Final Value | Slack |
|---|---|---|---|---|
| Saturation | 1.080 | 0.100 | 0.533 | 0.033 |
| KL (pretrained) | 0.493 | 0.000 | 0.101 | 0.004 |
| KL (aesthetics) | 0.507 | 0.260 | 0.097 | 0.000 |

Re: Weakness 3
The main reason is that computing the point-wise KL in the alignment problem is intractable, at least with our suggested approach. However, in the alignment setting, the KL divergence objective acts as a regularizer that prevents excessive deviation from the pre-trained model, and the path-wise KL is commonly used to fulfill this role. Furthermore, under a mild assumption, it can be shown that using the point-wise or path-wise KL leads to the same solution in alignment. We expand on these points below:

As mentioned in Lemma 2, computing point-wise KL requires 'valid' score functions, i.e., gradients of log-likelihood of some forward process.

We might reasonably assume that the score of a pre-trained model, $s_q(x, t)$, approximates a valid score function, since it was learned by minimizing the distance to the true score of some data distribution. However, for $s_p(x, t)$, which we update at each primal step of training, there is no longer a guarantee that it is a valid score function. This is why we need Lemma 3, which gives us a way to find the Lagrangian minimizer without explicitly computing the KL between $p_0(\cdot; s_p)$ and $p_0(\cdot; s_q)$.

Crucially, Lemma 3 assumes we have access to samples from the Lagrangian minimizer distribution ($q_{\mathrm{AND}}^{(\lambda)}$ in the case of composition). This is fine in composition, since we have ways of sampling from this Lagrangian minimizer. However, for alignment this is no longer the case, since we have no way of directly sampling from:

$$q_{\mathrm{rw}}^{(\lambda)} \propto q\, e^{\lambda^\top r}.$$

Luckily, unlike in composition, in alignment there is only a single KL objective, so the point-wise and path-wise KL lead to similar solutions:

$$p_0^\star(x_0) = \int q_{0:T}(x_{0:T})\, e^{\lambda^\top r(x_0)}\, dx_{1:T} \approx \int q_0(x_0)\, q(x_1 \mid x_0) \cdots q(x_T \mid x_{T-1})\, e^{\lambda^\top r(x_0)}\, dx_{1:T} = q_0(x_0)\, e^{\lambda^\top r(x_0)},$$

where we have used the fact that the pre-trained model $q_{0:T}(x_{0:T})$ has been trained to match a forward diffusion process; therefore, we assume it is approximately equal to the forward process, which is Markovian in forward time and can thus be decomposed as above. Note that $q_0(x_0)\, e^{\lambda^\top r(x_0)}$ is the exact solution if we use the point-wise KL as the objective.

We thank the reviewer for bringing up this point, since emphasizing the distinction between point-wise and path-wise KL is a central aspect of the paper.

Regarding the second concern: yes, the formulation can easily be extended to continuous time. In fact, in the proof of Lemma 2, we first prove the result in continuous time and then bound the discretization error, extending the proof to the discrete-time diffusion setting.

Comment

Thanks for your detailed response, for highlighting the intricacies in the proofs, and for clarifying why the path-wise KL was used in alignment.

Review (Rating: 4)

This paper introduces a constrained optimization framework for aligning and composing diffusion models using reverse KL divergence and reward constraints. It leverages strong duality to design a primal-dual algorithm that enables fine-tuning with multiple rewards while preserving proximity to pretrained models.

优缺点分析

Strengths:

  • The constrained alignment method is principled and interpretable, offering clear control over reward satisfaction while avoiding overfitting.
  • The approach scales well to multiple reward signals, avoiding exhaustive hyperparameter searches required by weighted objectives.
  • The use of strong duality and primal-dual optimization brings theoretical rigor and enables efficient training with provable guarantees.

Weaknesses:

  • The paper lacks experiments beyond text-to-image diffusion, limiting generalizability across modalities or task types.
  • The choice and tuning of reward functions is still nontrivial, and their quality directly affects alignment; this is underexplored.
  • While the method avoids explicit weight tuning, automated threshold selection is heuristic-based and may not always generalize well.

Questions

How sensitive is the method to the choice or scaling of reward functions, and how does the heuristic for automatic threshold selection perform across different tasks or domains?

Can the proposed alignment and composition framework be extended to unconditional generation tasks, or is it inherently tied to reward-conditioned settings like text-to-image generation?

Limitations

yes

Final Justification

My concerns and questions have mainly been addressed, and I raise my score to 4.

Formatting Issues

no problem

Author Response

Re: Weakness 1

Our proposed framework and theoretical analysis do not depend on any specific modality or task type. Given our theoretical guarantees, we would expect experiments in other modalities to provide results similar to those presented for images in the paper. The reason we focus on text-to-image diffusion is that it is currently the most widely used application of diffusion models: standard pre-trained models and reward functions for this modality are readily accessible. Furthermore, the reviewer seems to have focused mostly on alignment, but we emphasize that constrained alignment, product and mixture composition, and even the combination of composition and alignment (we give an example in our response to reviewer QwoA) together cover a wide variety of tasks to which our framework can be applied.

To further address the reviewer's concern regarding generalizability, we have conducted concept composition experiments for text-to-audio generation as an example of another modality. We briefly outline the setting and our results below:

We treat a text-to-audio diffusion model (in this case AudioLDM), conditioned on different inputs that each represent a concept, as the models to be composed. We apply our constrained learning framework to find the optimal weights to compose these two models, and use the CLAP score [1] to measure the similarity between the generated audio samples and the text prompts representing each model.

Minimum CLAP scores across prompts for each method:

| Method | Minimum CLAP Score |
|---|---|
| Combined prompting | 0.816 |
| Equal Weights | 1.57 |
| Constrained (Ours) | 1.92 |

Similar to Section 5.2 (II), where we did concept composition for images, we observe that using our constrained approach, the minimum CLAP score across prompts increases compared to the two baselines. The constraints ensure closeness to each model, which in turn results in a more equal representation of the concepts.

Reference

[1] CLAP: Learning Audio Concepts From Natural Language Supervision (Elizalde et al.)


Re: Weakness 2

We provide a framework that can be used for alignment with arbitrary reward functions. To evaluate the performance of our proposed approach, we measure whether each expected reward exceeds its chosen threshold, as well as the closeness to the pre-trained model.

We agree with the reviewer's point that, depending on the choice of reward functions, the generated sample distributions can differ; however, this is independent of the merit of our constrained framework. Hence, whether an increased reward leads to generations of higher quality (and what quality means in this context) is not the focus of this work.

We do note that some commonly used aesthetic rewards tend to overemphasize certain qualities like saturation or contrast in images. This is another motivation for our constrained alignment approach: by constraining simple saturation and contrast rewards alongside the original aesthetic reward, we can mitigate these issues of the aesthetic reward (see Figure 3 in the paper).


Re: Weakness 3

We would like to clarify that our automated threshold selection is not purely heuristic, as it is informed by our theoretical analysis. There are two types of constraints that we consider in the paper, which we now explain below:

  • KL divergence constraints in the constrained composition problem:
    While our framework supports choosing fixed thresholds, we note that the values of the KL divergence are not as interpretable as rewards. This is why we present the epigraph formulation in the paper, which essentially finds a minimum shared threshold $u$ for all the KL constraints (see the restatement after this list). This epigraph formulation could be viewed as "automated threshold selection." We have theoretically analyzed the solution of this problem and discussed its properties, noting that the solution is equally close in KL divergence to each of the models, which is desirable for composition. Hence, we do not view it as a heuristic, since we present a theoretical justification and analysis for it.

  • Reward constraints:
    We agree with the reviewer that choosing reward thresholds is more interpretable, avoiding exhaustive tuning of multiple reward weights. To ensure that the chosen thresholds lead to a feasible problem, we suggest normalizing the rewards using the mean and standard deviation of reward values estimated from the pre-trained model. This helps alleviate the issue of rewards that have different scales.
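For reference, the epigraph formulation mentioned in the first bullet can be restated as (our notation, with $q_i$ denoting the models being composed):

$$\min_{u \geq 0,\, p} \; u \quad \text{subject to} \quad D_{\mathrm{KL}}(p \,\Vert\, q_i) \leq u \quad \text{for } i = 1, \ldots, m,$$

so the minimal shared threshold $u$ is found jointly with $p$ rather than being hand-tuned.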


Re: Question 1

The choice or scaling of the reward functions mainly informs what the constraint threshold for each reward should be. As long as the chosen thresholds are feasible, the dual variables (i.e., the learned weights for every reward) converge to the values that result in satisfying all the constraints.

We note that, depending on the reward functions chosen, different scales might make choosing feasible thresholds less straightforward. To alleviate this in practice, we propose to use reward normalization by estimating the mean and variance of reward values on a few batches sampled using the base pre-trained model. We then observe that setting the threshold for each reward to $\text{mean} + \alpha \cdot \text{std}$, with $\alpha$ between 0.5 and 1, usually works well in practice.
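A minimal sketch of this normalization-based threshold choice (the function names, sampling interface, and default $\alpha$ are our illustrative assumptions, not the authors' code):

```python
import torch

@torch.no_grad()
def estimate_thresholds(sample_batch, reward_fns, n_batches=4, alpha=0.75):
    """Set each reward threshold to mean + alpha * std, estimated under the
    pre-trained (base) model.

    sample_batch(): callable returning a batch of samples from the base model.
    reward_fns:     callables mapping a batch of samples to per-sample rewards.
    """
    thresholds = []
    for reward in reward_fns:
        vals = torch.cat([reward(sample_batch()) for _ in range(n_batches)])
        thresholds.append((vals.mean() + alpha * vals.std()).item())
    return thresholds
```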

Regarding "automatic threshold selection" for composition, we maintain that it is theoretically justified and is not a heuristic. We refer to our comments on Weakness 3 for a detailed discussion. Furthermore, our experiments on constrained composition for audio diffusion models verify that the applicability of our approach (including the automated threshold selection) is not domain-dependent.


Re: Question 2

The proposed framework is not limited to conditional generation tasks. In fact, the illustrative examples shown in Figures 1 and 2 are from constrained unconditional models trained to generate data from Mixture of Gaussian distributions. For the main experiments, we focus on the text-to-image conditional setting, since it is a commonly used application of diffusion models.

We would appreciate clarification on what the reviewer means by "reward-conditioned settings" in this context. We note that the rewards we used for alignment include both rewards that measure similarity between text-image pairs (like HPS) and rewards that only take images as input (like saturation and local contrast rewards), which could be used for unconditional generation tasks.

Comment

I thank the reviewer for their detailed feedback and the additional experiments on the text-to-audio diffusion model. My concerns and questions have mainly been addressed, and I raise my score to 4.

Review (Rating: 5)

The paper proposes a unified constrained optimization framework for alignment and composition of diffusion models. Instead of weighting competing objectives, the proposed framework imposes explicit constraints (e.g., minimum reward levels or maximum KL distance to each source model). The paper theoretically proves strong duality for both constrained optimization problems, characterizes the solutions, and implements a primal–dual training loop for score-based diffusion models. Experiments on Stable-Diffusion-v1.5 show improved balance across rewards and fairer representation of composed models relative to equal-weight baselines.

优缺点分析

Strengths:

  1. The paper is well written and easy to follow
  2. The paper introduces a principled constrained optimization formulation that unifies reward alignment and model composition satisfying strong duality—bringing theoretical clarity and practical tractability to both problems.
  3. Compared to weight-based approaches, constrained formulations offer more interpretable hyperparameters and can eliminate tuning entirely in composition tasks.

Weaknesses:

  1. While the constrained alignment formulation is built around explicitly enforcing reward thresholds, the experiments do not systematically analyze how varying these thresholds affects model behavior. For example, it would be informative to report sensitivity curves showing how sample quality, reward satisfaction, or KL divergence vary as the reward constraint levels are tightened or relaxed.
  2. (suggestion) It would be valuable to discuss whether a setting exists where both alignment to external rewards and composition of multiple source models are required simultaneously—for instance, composing reward-specialized models while enforcing a minimum aggregate reward level. Such a setting could further test the expressive power and scalability of the proposed framework.

Questions

  1. What does each letter in abbreviations like UR-A stand for?
  2. Do direct reward alignment and composition of reward-specific models yield different outcomes in practice?

Limitations

N/A

Final Justification

I lean towards acceptance. The paper is clearly written and introduces a principled constrained optimization formulation that unifies reward alignment and model composition under strong duality, offering theoretical clarity, practical tractability, and more interpretable hyperparameters than weight-based methods. While the empirical evaluation appears limited, I think the conceptual novelty and potential impact are sufficient to justify acceptance.

Formatting Issues

N/A

Author Response

Re: Weakness 1

What we observed by varying the reward constraint thresholds in our experiments is that, for thresholds up to 1.0 (meaning mean + 1.0 * std for each reward), the model was typically able to satisfy the constraints with minimal violation. Another trend we observed is that increasing the thresholds usually yields constraints that are harder to satisfy, leading to higher Lagrange multipliers and a higher KL divergence to the pre-trained model.

An advantage of our constrained approach is that the Lagrange multipliers give information about the sensitivity of the objective with respect to relaxing the constraints, i.e., if the multiplier for a certain reward ends up being much higher than the rest, that constraint is particularly hard to satisfy. Consequently, even slightly relaxing the threshold for the corresponding reward can lead to a much smaller KL objective.
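A hedged way to make this sensitivity interpretation precise (standard Lagrangian sensitivity analysis; notation is ours, and it holds approximately under strong duality and suitable regularity):

$$\lambda_i^\star \;\approx\; \frac{\partial P^\star}{\partial b_i},$$

where $P^\star$ is the optimal KL objective and $b_i$ the threshold of the $i$-th reward constraint, so relaxing the threshold with the largest multiplier yields the largest first-order decrease in the optimal KL.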

We thank the reviewer for the suggestion, and will include ablations on the constraint levels that illustrate the above points (see tables below).

Table 1. MPS reward alignment with saturation and contrast constraints, for varying thresholds.

| Constraint | Threshold | Slack | Dual Variable | $D_{\mathrm{KL}}$ |
|---|---|---|---|---|
| contrast | 0.250 | -0.245 | 0.282 | 0.177 |
| contrast | 0.500 | -0.985 | 0.000 | 0.296 |
| contrast | 1.000 | -0.381 | 0.000 | 0.332 |
| saturation | 0.250 | -0.126 | 0.081 | 0.177 |
| saturation | 0.500 | 0.060 | 0.006 | 0.296 |
| saturation | 1.000 | 0.052 | 1.195 | 0.332 |

Table 2. Pickscore reward alignment with saturation and contrast constraints, for varying thresholds.

| Constraint | Threshold | Slack | Dual Variable | $D_{\mathrm{KL}}$ |
|---|---|---|---|---|
| contrast | 0.250 | -0.684 | 0.000 | 0.136 |
| contrast | 0.500 | -1.011 | 0.000 | 0.109 |
| contrast | 1.000 | 0.661 | 0.192 | 0.293 |
| saturation | 0.250 | -0.025 | 0.014 | 0.136 |
| saturation | 0.500 | -0.060 | 0.000 | 0.109 |
| saturation | 1.000 | 0.062 | 1.020 | 0.293 |

Re: Weakness 2
As the reviewer suggests, composing reward-specialized models while enforcing a minimum aggregate reward level is a great example of combining constrained alignment and composition. To demonstrate the viability of this approach, we conducted an experiment with two KL constraints, one with respect to pre-trained Stable Diffusion and one with respect to a model fine-tuned on the aesthetics reward, along with a constraint on the saturation reward. The resulting model achieves less than 10% reward constraint violation and similar KL divergences with respect to both pre-trained models, as seen in the table of results below:

| Constraint | Dual Variable | Initial Value | Final Value | Slack |
|---|---|---|---|---|
| Saturation | 1.080 | 0.100 | 0.533 | 0.033 |
| KL (pretrained) | 0.493 | 0.000 | 0.101 | 0.004 |
| KL (aesthetics) | 0.507 | 0.260 | 0.097 | 0.000 |

We thank the reviewer for bringing up this important setting and will include the above example in the final version.


Re: Question 1
The first letter indicates parameterization: U and S in UR-A and SR-A denote Unparameterized and Score function parameterized problems, respectively.
The second letter, R or F (as in UR-C or UF-C), stands for Reverse or Forward KL divergence constraints.
The final letter, A or C, signifies Alignment or Composition. We will clarify these abbreviations in the final version.


Re: Question 2
Based on our theoretical analysis of the solutions for the Composition and Alignment problems, it is straightforward to show that the solution in both cases will end up being in the form

$$p^\star \propto p_{\text{pre}}\, e^{\sum_i \lambda_i r_i},$$

but with potentially different $\lambda$'s.

There is no theoretical guarantee that the optimal dual multipliers, λ\lambda, will be the same or similar for direct reward alignment and for the composition of reward-specific models. This is reflected in our experiments, where we observe that these two approaches lead to different final results in terms of expected reward values.

Regarding where each approach would be useful, we note that reward alignment requires a reference model and access to each reward function, while composition can be used when we have access to models fine-tuned for specific rewards or tasks but do not have access to the rewards or task data themselves.

We thank the reviewer for raising this interesting question, and we will comment on this in the final version.

Comment

Thanks for your detailed rebuttal. I will keep my positive score.

Final Decision

This paper introduces a constrained optimization framework that unifies reward alignment and composition of diffusion models under a Lagrangian formulation.

The reviewers all agreed that this was a good contribution. The approach was seen as technically solid and well written, with empirical results demonstrating improved balance across multiple rewards and better representation of composed models relative to simple baselines. At the same time, the reviewers raised several points that should be addressed in the camera-ready. The empirical evaluation remains somewhat narrow, with limited exploration beyond text-to-image diffusion and few comparisons against more advanced baselines. Clarifications are also needed regarding the assumptions (e.g., shared noise schedules, and how the convexity arguments carry over to parameterized score networks). The rebuttal addressed many of these concerns, and we hope to see these additions in the final version of the paper.