PaperHub

ICML 2025 (Poster)
Overall rating: 5.5/10
4 reviewers; ratings: 4, 2, 3, 3 (min 2, max 4, std 0.7)

Mechanisms of Projective Composition of Diffusion Models

Submitted: 2025-01-15 | Updated: 2025-07-24
TL;DR

We theoretically study compositions of diffusion models.

Abstract

Keywords
diffusion models, composition, theory

Reviews and Discussion

Review (Rating: 4)

The paper proposes a theory for understanding composition in diffusion models and how it can produce samples that are out of distribution for each of the constituent models. Their key insight is that composition of distributions is ill specified unless tied to a projection that specifies which attribute we would like to compose. This leads to the idea of projective composition, which can be realized with the composition operator given sufficient conditions, which are related to the factorability of the distributions we would want to compose. An explicit construction of this composition is provided, which is similar to Bayesian composition except for the replacement of the unconditional score with a background score. It is shown that it is also sufficient for this factorability to be present in some feature space, greatly generalizing the theory, although this result does not provide an explicit construction for sampling from the composed distribution.
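The explicit construction summarized above (replacing the unconditional score of the Bayes composition with a background score) can be sketched numerically. This is an illustrative reconstruction written for this summary, not the paper's code; the function names and toy Gaussian scores are stand-ins:

```python
def composed_score(x, t, component_scores, background_score):
    """Combine per-component scores around a background score:
        s_comp = s_bg + sum_i (s_i - s_bg).
    With the unconditional score as s_bg this is the Bayes composition;
    the construction described in the review swaps in the score of a
    background (e.g. empty-scene) distribution."""
    s_bg = background_score(x, t)
    return s_bg + sum(s(x, t) - s_bg for s in component_scores)

# Toy check with 1-D Gaussian scores: the score of N(mu, 1) is mu - x,
# so composing N(1, 1) and N(2, 1) around an N(0, 1) background yields
# the score of N(3, 1) at x = 0, i.e. 3.0.
make_score = lambda mu: (lambda x, t: mu - x)
s = composed_score(0.0, 0.0, [make_score(1.0), make_score(2.0)], make_score(0.0))
# s == 3.0
```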

update after rebuttal

The empirical evidence provided in the rebuttal addresses my concerns surrounding the empirical evidence. I have increased my score from 3 to 4.

Questions for Authors

Please see my comments under “Experimental Design or Analyses” - I am curious about how consistent the results provided in Figures 3 and 5 are.

Claims and Evidence

The theoretical results are well supported with proofs. Some empirical results are given, but they appear anecdotal and lack statistical analysis.

Methods and Evaluation Criteria

The main contribution of this work is theoretical, so the limited empirical results do make sense modulo the issues described above.

Theoretical Claims

The theorems that appear in the main body of the work all make intuitive sense. I have not thoroughly checked the proofs.

Experimental Design and Analyses

As mentioned above, I think the empirical results are more anecdotal than scientific as is. While not the main contribution, some basic analysis could be performed (e.g. something like “out of 100 identical trials, composition with an empty background was successful 100 times and Bayesian composition was successful 12 times, which is significantly improved with a p-value of …”).
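The suggested analysis could look like the following sketch, which runs a one-sided Fisher exact test on the hypothetical counts above (100/100 vs 12/100 successes). The function and the numbers are illustrative only, not results from the paper:

```python
from math import comb

def fisher_exact_one_sided(a, b, c, d):
    """One-sided Fisher exact test for the 2x2 table [[a, b], [c, d]]:
    the probability, under the hypergeometric null with fixed margins,
    of a table at least as extreme (first cell >= a)."""
    n = a + b + c + d
    row1, col1 = a + b, a + c
    denom = comb(n, row1)
    p = 0.0
    for k in range(a, min(row1, col1) + 1):
        if row1 - k > n - col1:  # table infeasible for these margins
            continue
        p += comb(col1, k) * comb(n - col1, row1 - k) / denom
    return p

# Hypothetical counts from the suggestion above: empty-background
# composition succeeds 100/100 times, Bayes composition 12/100 times.
p = fisher_exact_one_sided(100, 0, 12, 88)  # vanishingly small p-value
```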

Supplementary Material

No.

Relation to Prior Literature

This work provides theoretical foundations for understanding the composition of diffusion models. While composition has been shown to be possible in prior work, the aim of this work is to provide additional theoretical understanding of why and under what conditions it is expected to work.

Missing Essential References

Not to my knowledge.

Other Strengths and Weaknesses

N/A

Other Comments or Suggestions

See my suggestions under "Experimental Designs Or Analyses".

Author Response

We thank the reviewer for their support for our paper and helpful suggestions.

  • The reviewer suggests a quantitative analysis of our CLEVR experiments. We agree that this is an excellent idea and have performed the analysis. The results are shown in the table below and will be included in the camera-ready version should the paper be accepted.
  • To produce the table below, we generated 100 samples using each composition method, and manually counted (to avoid any potential error in using a classifier) the objects in correct locations (i.e. locations corresponding to the conditioners of the distributions being composed) in each generated image. In the table below we record the histogram of object counts in correct locations.
  • Regarding the reproducibility of Figures 3 and 5, we provide additional samples in Figures 8 and 10, respectively, in the appendix. Further length-generalization is also explored in Figure 9. In addition, the new table below quantitatively confirms the reproducibility of the results of Figure 3 (when attempting to compose 3 single-object distributions as in Figure 3, the empty-background (projective) composition correctly produced images containing 3 objects in 99/100 trials, while the Bayes composition never produced an image containing 3 objects in 100 trials).

Composition of location-conditioned CLEVR distributions

  • N denotes number of distributions being composed (hence the expected number of objects) -- we test N=1 through N=6
  • "Single-object empty" composes single-object object distributions with an empty background
  • "Single-object Bayes" composes single-object object distributions with an unconditional background
  • "Bayes-cluttered" composes 1-5 object distributions (with location label assigned to a single randomly-chosen object) with an unconditional background
  • The table shows the histogram of manual counts, that is, each column lists the number of images that contained the given number of objects in correct locations.
Style: Single-object empty

| N | 0  | 1   | 2   | 3  | 4  | 5  | 6  |
|---|----|-----|-----|----|----|----|----|
| 1 |    | 100 |     |    |    |    |    |
| 2 |    |     | 100 |    |    |    |    |
| 3 |    |     | 1   | 99 |    |    |    |
| 4 |    |     |     | 2  | 98 |    |    |
| 5 |    |     |     |    | 2  | 98 |    |
| 6 |    |     |     |    |    | 3  | 97 |

Style: Single-object Bayes

| N | 0  | 1   | 2  | 3 | 4 | 5 | 6 |
|---|----|-----|----|---|---|---|---|
| 1 |    | 100 |    |   |   |   |   |
| 2 | 1  | 67  | 32 |   |   |   |   |
| 3 | 36 | 62  | 2  |   |   |   |   |
| 4 | 77 | 23  |    |   |   |   |   |
| 5 | 66 | 32  | 2  |   |   |   |   |
| 6 | 34 | 63  | 3  |   |   |   |   |

Style: Bayes-cluttered

| N | 0 | 1   | 2   | 3   | 4   | 5  | 6  |
|---|---|-----|-----|-----|-----|----|----|
| 1 |   | 100 |     |     |     |    |    |
| 2 |   |     | 100 |     |     |    |    |
| 3 |   |     |     | 100 |     |    |    |
| 4 |   |     |     |     | 100 |    |    |
| 5 |   |     |     |     | 2   | 98 |    |
| 6 |   |     |     |     |     | 2  | 98 |
Reviewer Comment

Thank you for the clarification and the additional data. Since my primary concern surrounded the lack of proper empirical analysis, I am satisfied by the table provided by the authors and will update my score.

Author Comment

Thank you for your support, and the good suggestion to improve our empirical analysis!

Review (Rating: 2)

The authors present a formalization of compositionality in diffusion models. Diffusion models trained separately for particular objects and for the background can be joined together in various ways. The authors explore these different ways and point out the correct way of composing them. The authors suggest a particular way of composition if a collection of distributions satisfies a set of conditional independencies, and then generalize that by using diffeomorphisms, showing that if such a diffeomorphism exists, their composition still holds. The authors show various illustrations on the CLEVR dataset and an example from a text-conditional diffusion model.

Questions for Authors

See strengths and weaknesses

Claims and Evidence

.

Methods and Evaluation Criteria

There are no quantitative results in the paper.

Theoretical Claims

I cannot find any particular issues with the theorems and proofs in the work, but I find the requirement of the existence of a diffeomorphism in 6.2 strong enough that I question the usefulness of the result.

Experimental Design and Analyses

See also strengths and weaknesses: The authors mostly focus on the CLEVR dataset for examples, and all experimental examples are qualitative.

Supplementary Material

I reviewed Appendices B, C, D, and H.

Relation to Prior Literature

.

Missing Essential References

NA

Other Strengths and Weaknesses

  • The idea of formalizing compositionality of diffusion models is relevant and an interesting topic
  • The authors mostly focus on the CLEVR dataset for examples and all experimental examples are qualitative. While the field is perhaps not developed to a point where there is a set benchmark, the work of Du et al (2023) serves as major inspiration for the authors, and it would certainly make sense to run the same experiments reported in that paper for a quantitative comparison.
  • While theorem 5.3 is interesting, it is of limited use other than for datasets such as CLEVR, where objects are easily separated spatially. The authors do present theorem 6.2, and while I cannot find any issues with the proof per se, its use is extremely limited. The assumption of the existence of a diffeomorphism that perfectly separates all variables is very strong. Effectively, the difficulty has now been moved into the assumption, and the authors make no effort to investigate when this assumption is valid. There is moreover the issue that there is no guarantee the reverse process is correct. Altogether, this brings into question how useful the result in 6.2 is.
  • In this situation, it would be good to have a strong, quantitative experimental evaluation to demonstrate the use of such a theorem, but as mentioned earlier, evaluation on natural image datasets in particular is lacking.

Other Comments or Suggestions

.

Author Response

We thank the reviewer for their time, insightful questions, and constructive critiques.

Overall, we want to emphasize that the goal of this paper is to understand and predict when composition will work — and just as importantly, when it will fail. That is, we want to theoretically explain prior empirical observations about when composition worked or failed; we do not aim to introduce any new methods.

We paraphrase and respond to your specific questions below.

Q1: In real-world settings, when can we expect the existence of a diffeomorphism that separates the conditions we want to compose?

  • First, on a technical note: we actually only require “C^1 diffeomorphisms”, i.e. the feature map and its inverse should be differentiable. We will clarify this in the revision.
  • Our “diffeomorphic” assumption is very closely related to existing assumptions in the literature on “disentangled feature representations.” For example, the long line of work on learning disentangled representations implicitly assumes that such a disentanglement is (at least approximately) possible (e.g. [1] on VAEs and [2] on GANs). That is, if we are in a setting where we have a neural network that maps to and from a “disentangled” feature space (e.g. a VAE or a BiGAN), then this neural network defines our requisite C^1 diffeomorphism (technically this assumes the encoding and decoding networks are differentiable everywhere, which we can guarantee e.g. if the network uses smooth activation functions).
  • Finally, we do not believe that disentangled representations always exist for any distributions we might wish to compose, and we are equally interested in understanding these failure cases. For example, “style” and “content” features are typically believed to be disentangled in the existing literature, and thus we expect style+content compositions to work. On the other hand, some concepts may be impossible to disentangle in any reasonable feature space. In such cases, a diffeomorphism may not exist, and we do not expect these concepts to compose well (an example is the horse+dog composition in Figure 6).

Q2: How robust is the theory? Is it really necessary to perfectly satisfy Factorized Conditionals?

  • Although our theory technically requires perfect independence, which is indeed a strong condition, our CLEVR experiments empirically study a case where the conditions hold only approximately, and explore both the robustness of the theory as well as its limits in this imperfect case (please see response to reviewer nnqo for further detail).

Q3: What are the practical implications/usefulness of the result in Lemma 6.3, which says that, even if projective composition is possible at t=0, reverse diffusion may not correctly sample from it?

  • The fact that reverse-diffusion sampling may not work even when composition is possible at t=0 explains the “negative result” in Figure 5, and may help explain other failures of composition in the literature.
  • Most notably, this result helps explain empirical findings in Du et al. (2023), who showed that HMC sampling (which in particular allows sampling directly at t=0) was necessary to enable successful composition in many cases. Our theory helps explain why HMC sampling worked when standard diffusion sampling did not. We discuss this further in Appendix J.1.

Q4: Text-to-image evaluations?

  • Our goal is primarily to theoretically explain existing empirical evaluations in the literature. In text-to-image settings, the Bayes composition (used in Du et al. (2023) and other works) is often approximately projective, as discussed in Section 5.4. Therefore, existing empirical results in text-to-image settings are typically already constructed in a way that is compatible with our theory. We therefore accept the existing experimental results of Du et al (2023) and others and seek to understand/explain them (both successes and failures) through our theory. Of course, there is much more to study and explore empirically in text-to-image settings that we hope to explore in future work.

Q5: Quantitative evaluation?

  • We performed some additional quantitative evaluations of our CLEVR experiments: please see the table of results and description of the experiments in our response to Reviewer DBEj.

If the reviewer’s concerns have been adequately addressed, we kindly ask they consider raising their score to support acceptance.

References:

[1] Isolating Sources of Disentanglement in VAEs. RTQ Chen, X Li, RB Grosse, DK Duvenaud. NeurIPS 2018.

[2] A style-based generator architecture for generative adversarial networks. T Karras, S Laine, T Aila. CVPR 2019.

Reviewer Comment

I do not have fundamental concerns regarding correctness in this paper, but I remain skeptical of its usefulness and its experimental validation given such a strong assumption. The response to Q1 was useful, and I understand the connection between this assumption and the pictures of horses and dogs. However, I think such theory rooted in strong assumptions invites, and to some extent requires, a strong experimental evaluation, which the authors seem to defer to other work (i.e. response Q4).

Moreover, the experimental conclusions identified from other work are anecdotal, since there is no way to verify whether the assumptions are satisfied in those cases beyond intuition. This also limits its use to the community, since, again, it is difficult for practitioners to assess in which cases this theory can be applied, beyond intuition.

I would welcome any experimental results or literature that attempts to make that statement rigorous in some way.

Author Comment

We appreciate your engagement. You raise some important questions regarding evidence for and connections to disentangled representations, which we will try to address here.

  1. What is the precise connection between the notion of disentanglement and Factorized Conditionals?
  2. How can we measure “disentanglement” quantitatively?
  3. What experimental evidence is available for disentanglement, and for which concepts?
  • Disentanglement is somewhat difficult to precisely define as we discuss next. However, to quote Karras et al. [2]: “There are various definitions for disentanglement, but a common goal is a latent space that consists of linear subspaces, each of which controls one factor of variation.” This definition is a necessary condition for Factorized Conditionals.
  • Regarding quantitative metrics for disentanglement: Note that there is a fundamental barrier to rigorously testing “disentanglement”-type assumptions in high dimensions, since it is impossible to test independence of two arbitrary high-dimensional random variables in poly(dimension) time. (This follows from e.g. cryptographic PRFs.) Nevertheless, there is a large body of work towards designing disentanglement metrics appropriate for “real-world” distributions (e.g. the disentanglement metrics introduced by BetaVAE and FactorVAE [3]; MIG in [1]; etc.). Several of these metrics are effectively one-sided tests of our Factorized Conditionals assumption: for example, if the FactorVAE metric reports high entanglement, then our Factorized Conditionals assumption must be false. It is therefore reassuring that the FactorVAE metric reports low entanglement for several realistic datasets [3]. Furthermore, [4] shows that many of the most common disentanglement metrics are fairly correlated with each other.
  • Many datasets have been investigated in the disentanglement literature. For example, [1] studies CelebA, 3D Faces, dSprites, [2] studies CelebA, FFHQ, [3] studies 3D Shapes, 3D Faces, CelebA, [4] studies Cars3D, Shapes3D, MPI3D, and [5] studies dSprites, smallNORB, Cars3D, Shapes3D. All provide qualitative and quantitative (via the various metrics described above) evidence of disentanglement of various concepts present in the datasets. For example, [2] investigates disentanglement between the 40 attributes labeled in the CelebA dataset (such as “BlackHair”, “Eyeglasses”, etc.); suggesting for example that composition of BlackHair+Eyeglasses is likely to work. Also, [6] qualitatively explores disentanglement between style and content for style transfer.
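To make the one-sided-test idea above concrete, here is a minimal plug-in sketch of the MIG metric from [1] over discrete codes. This is an illustrative implementation written for this discussion, not the reference code, and the toy data at the end is hypothetical:

```python
import random
from collections import Counter
from math import log

def mutual_info(xs, ys):
    """Plug-in mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def entropy(xs):
    n = len(xs)
    return -sum((c / n) * log(c / n) for c in Counter(xs).values())

def mig(latents, factors):
    """Mutual Information Gap: for each ground-truth factor, the gap
    between the two most-informative latent dimensions, normalized by
    the factor's entropy. A high MIG suggests a disentangled code; a
    low MIG is one-sided evidence against factorization assumptions."""
    gaps = []
    for f in factors:
        mis = sorted((mutual_info(z, f) for z in latents), reverse=True)
        gaps.append((mis[0] - mis[1]) / entropy(f))
    return sum(gaps) / len(gaps)

# Toy check: latent 0 copies the factor, latent 1 is independent noise,
# so the gap for this factor should be close to 1.
rng = random.Random(0)
factor = [rng.randrange(4) for _ in range(2000)]
noise = [rng.randrange(4) for _ in range(2000)]
score = mig([factor, noise], [factor])
```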

Finally, we would like to contextualize our work by mentioning that the theoretical understanding of compositional generation is at a very early stage: prior to our work, there was not even a formal definition of composition which could capture our applications. Moreover, it was not known whether any reasonable assumptions exist which would imply correct composition. Thus, part of our contribution is identifying a “natural” assumption under which composition works. The value of this assumption, we believe, is that it tells us “one possible reason” that composition can work in practice. We agree that it is an important question to bring these assumptions closer to reality, and we hope our work inspires future work in this direction. We hope you will agree that our work is a good first step.

References:

[1] Isolating Sources of Disentanglement in VAEs. RTQ Chen, X Li, RB Grosse, DK Duvenaud. NeurIPS 2018.

[2] A style-based generator architecture for generative adversarial networks. T Karras, S Laine, T Aila. CVPR 2019.

[3] Disentangling by Factorising. H Kim, A Mnih. ICML 2018.

[4] DisDiff: Unsupervised Disentanglement of Diffusion Probabilistic Models, T Yang et al. NeurIPS 2023.

[5] Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations. F Locatello et al.

[6] Content and Style Disentanglement for Artistic Style Transfer. Kotovenko, D. ICCV 2019.

Review (Rating: 3)

This paper gives a rigorous theoretical framework for understanding composition in diffusion models, with a focus on out‐of‐distribution extrapolation and length‐generalization. The authors introduce the notion of “projective composition,” which formalizes the idea that a composed distribution should, when viewed through specified projection functions, match the marginals of the component distributions. They derive conditions—most notably via the Factorized Conditionals assumption—under which linear score combination (and its feature-space analogue) yields a correct composition. This paper supports its theoretical results with experiments on synthetic CLEVR data, demonstrating instances of length-generalization and discussing practical sampling challenges.

Questions for Authors

  1. Can you provide additional empirical evidence on real-world datasets to assess whether the Factorized Conditional assumption holds approximately in practice?
  2. How robust are your theoretical results if the independence assumptions are only approximately satisfied? Could the framework be extended to account for partial dependencies?
  3. Could you elaborate on potential strategies to mitigate the sampling challenges (as noted in Theorem 6.1 and Lemma 6.3) in practical implementations of your composition operator?
  4. Are there any plans to integrate or test your framework with more complex, high-dimensional real-world image datasets to further validate its practical impact?

Claims and Evidence

The main claims are that: first, prior definitions (simple product and Bayes composition) fail to capture the desired out-of-distribution behavior; second, projective composition, as defined via appropriate projection functions, can correctly compose diffusion models, and, third, under Factorized Conditional assumptions, the proposed composition operator yields a distribution with the intended marginals. These claims are supported by theoretical results (e.g., Theorem 5.3 and Theorem 6.1) and illustrated through synthetic experiments. Nevertheless, while the derivations are insightful, some proofs are only sketched and the reliance on strong assumptions (e.g., perfect factorization) may limit the generality of the evidence.

Methods and Evaluation Criteria

The paper combines theoretical analysis with experiments on a controlled synthetic dataset (CLEVR). The methods involve defining novel composition operators and establishing conditions for their correctness via rigorous proofs. The evaluation criteria are appropriate for a theory-focused work, though the empirical validation remains limited to synthetic settings. A broader set of experiments on more complex, real-world data would help corroborate the practical relevance of the theoretical findings.

Theoretical Claims

The paper establishes several non-trivial theoretical claims regarding the behavior of composition operators in diffusion models. The formal definition of projective composition (Definition 4.1) and subsequent results (e.g., Theorem 5.3 on the correctness of composition under Factorized Conditionals, and Theorem 6.1 in feature space) are substantial contributions. Nevertheless, the proofs are sometimes only outlined, and some underlying assumptions (such as exact independence across masked coordinates) might not hold in practice.

Experimental Design and Analyses

Experiments on the CLEVR dataset illustrate key phenomena such as length-generalization and the sensitivity of composition to background choice. Although these experiments effectively demonstrate the theory in a controlled environment, the experimental section is relatively narrow in scope. Extending the experiments to more realistic datasets could help strengthen the overall impact.

Supplementary Material

The supplementary material offers additional proofs and experimental details that support the main text. While it is comprehensive, some parts are highly technical and could benefit from clearer explanations to aid reproducibility and understanding.

Relation to Prior Literature

The work is well-situated within the literature on diffusion models, compositional generation, and generative modeling in general. It builds upon and extends prior methods such as those by Du et al. (2023) and Liu et al. (2022), offering a novel perspective by formally addressing the limitations of existing composition definitions. The paper, in addition, relates to literature on disentangled representations, which underpins its Factorized Conditional assumption.

Missing Essential References

Despite the fact that the paper cites a wide range of related works, a deeper discussion of literature on disentangled feature learning and alternative composition strategies (especially in the context of real-world image synthesis) could further contextualize the contributions.

Other Strengths and Weaknesses

Strengths:

  1. Introduces a novel and formal definition of composition (projective composition) that addresses clear limitations in prior work.
  2. Provides a set of theoretical results that illuminate when and why linear score combination can yield correct composition in diffusion models.
  3. Connects theoretical insights with empirical observations on synthetic data, offering useful perspectives on sampling challenges.

Weaknesses:

  1. The Factorized Conditional assumption, critical for the theoretical guarantees, may be too strong and not fully reflective of practical scenarios.
  2. Experimental validation is limited to synthetic datasets, leaving open questions about applicability in more complex, real-world settings.
  3. Some proofs and technical derivations are only sketched, which could hinder reproducibility and complete understanding.

Other Comments or Suggestions

The paper could benefit from clearer exposition in some of the more technical sections, as well as from an expanded experimental section that explores the framework’s applicability beyond synthetic examples. Detailed discussion on potential methods to address the identified sampling challenges could also strengthen the work.

Author Response

We thank the reviewer for their support for our work and insightful questions, to which we respond individually below.

Weaknesses

Q1: The Factorized Conditional assumption, critical for the theoretical guarantees, may be too strong and not fully reflective of practical scenarios.

  • Theorem 6.1 shows that it is enough to satisfy Factorized Conditionals in some feature space, even if the assumption is not satisfied in pixel space: that is, as long as there exists some feature map which “disentangles” features in the appropriate sense, then distributions will compose correctly. We further discuss the empirical evidence for such disentangled feature spaces, as well as the robustness of the theory to approximate satisfaction of the conditions, in our answers to Q1 and Q2 below.

Q2: Experimental validation is limited to synthetic datasets, leaving open questions about applicability in more complex, real-world settings.

  • Our goal is primarily to theoretically explain existing empirical evaluations in the literature. In text-to-image settings, the Bayes composition is often approximately projective, as discussed in Section 5.4. We therefore accept the existing experimental results of Du et al (2023) and others using the Bayes composition, and seek to understand/explain them (both successes and failures) through our theory. Of course, there is much more to study and explore empirically in text-to-image settings that we hope to explore in future work.

Q3: Some proofs and technical derivations are only sketched, which could hinder reproducibility and complete understanding.

  • We provide complete proofs of all claims in the Appendix. In particular, Theorem 5.3 is sketched in the main text but proved formally in Appendix G. Theorem 6.1 and Lemmas 6.2, 7.1, and 7.2 are proved in Appendices H, I, and J.

Questions

Q1: Can you provide additional empirical evidence on real-world datasets to assess whether the Factorized Conditional assumption holds approximately in practice?

  • This question is closely related to the existing literature on “disentangled feature representations.” This long line of work (e.g. [1] on VAEs and [2] on GANs) implicitly assumes that such a factorized representation is (at least approximately) possible, and there is substantial empirical evidence supporting this for at least some concepts. For example, “style” and “content” features are typically believed to be disentangled in the existing literature, and thus we expect style & content to form Factorized Conditionals. We mention this connection in Section 7, but will elaborate on it in the revision.

Q2: How robust are your theoretical results if the independence assumptions are only approximately satisfied? Could the framework be extended to account for partial dependencies?

  • Currently, the theory requires that the independence assumptions be satisfied exactly, but developing robust versions is an important direction for future work.
  • Empirically, we use the CLEVR experiments to probe the robustness of the theory. In the CLEVR setting, Factorized Conditionals holds only approximately, due to the possible occlusions and shadowing effects between different objects. Our experiments show that projective composition is approximately, but not exactly, achieved. To push the limits of this robustness, in Figure 9 we attempt to length-generalize up to 9 objects (which works up to about ~6 objects and then degrades).

Q3: Could you elaborate on potential strategies to mitigate the sampling challenges (as noted in Theorem 6.1 and Lemma 6.3) in practical implementations of your composition operator?

  • Yes! As you note, Lemma 6.3 tells us that even if projective composition is possible at t=0, reverse diffusion may not correctly sample from it. Practically, this suggests that non-diffusion sampling methods that enable sampling directly at t=0, such as variants of Langevin dynamics, may be necessary to achieve projective composition in practice (when it is possible at t=0). This is consistent with empirical findings in Du et al. (2023), who showed that HMC sampling was necessary to perform composition in many cases. We discuss this further in Appendix J.1.
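The kind of sampler alluded to here can be sketched in one dimension. This is a hypothetical illustration (the composed score is a toy Gaussian stand-in; step size and iteration counts are arbitrary), not the paper's or Du et al.'s implementation:

```python
import random
from math import sqrt

def langevin_sample(score, x0, step=1e-2, n_steps=2000, seed=0):
    """Unadjusted Langevin dynamics run directly at t = 0:
        x <- x + step * score(x) + sqrt(2 * step) * noise.
    Unlike reverse diffusion, this only ever queries the score at
    t = 0, which is what is needed when the projective composition is
    valid at t = 0 but not along the diffusion path."""
    rng = random.Random(seed)
    x = x0
    for _ in range(n_steps):
        x += step * score(x) + sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
    return x

# Stand-in composed score: components N(1, 1) and N(2, 1) combined
# around an N(0, 1) "background" score, i.e.
# s(x) = s_bg + (s_1 - s_bg) + (s_2 - s_bg) = 3 - x, the score of N(3, 1).
score = lambda x: 3.0 - x
samples = [langevin_sample(score, 0.0, seed=s) for s in range(200)]
mean = sum(samples) / len(samples)  # should be close to 3
```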

Q4: Are there any plans to integrate or test your framework with more complex, high-dimensional real-world image datasets to further validate its practical impact?

  • We agree this is an important area for future work; we consider the present work as the first step in this direction.

References:

[1] Isolating Sources of Disentanglement in VAEs. RTQ Chen, X Li, RB Grosse, DK Duvenaud. NeurIPS 2018.

[2] A style-based generator architecture for generative adversarial networks. T Karras, S Laine, T Aila. CVPR 2019.

Review (Rating: 3)

This paper proposes a new theoretical framework for analyzing a special type of composition in diffusion models, specifically focusing on two previously discovered phenomena in diffusion model composition: out-of-distribution (OOD) extrapolation and length-generalization. The theoretical framework aims at the product-style compositions implemented with diffusion models via a linear combination of scores. Prior studies propose to describe the composed distribution as a simple product of two distributions, or as the Bayes composition of them. Yet, the paper uses the CLEVR experiment as an intuitive illustration to show that these two definitions cannot really cover OOD composition results and thus fail at length-generalization in the CLEVR experiments. Based on this, the paper defines a new form of distribution composition: Projective Composition. Intuitively, it requires the composed distribution to be the "same" as each single distribution when viewed through a projection defined for each single distribution. This Projective Composition can describe real OOD and length-generalization. The paper further defines a Composition Operator to compose a set of distributions, and a Factorized-Conditionals condition on a set of distributions and projections, e.g. that the projections are disjoint maskings of the coordinates. The paper further shows that when Factorized-Conditionals is satisfied, the reverse-diffusion SDE using compositional scores following the Composition Operator will satisfy the desired Projective Composition. The paper then argues how the successful OOD settings in the CLEVR experiment approximately satisfy Factorized-Conditionals. Moreover, the paper discusses how a similar analysis can be extended to feature space. In feature space, they show that, under the Factorized-Conditionals constraint, the Composition Operator also defines a Projective Composition; however, how to generate such a Projective Composition via diffusion sampling is unknown within the theoretical framework. At last, the paper discusses how the proposed theoretical framework can help understand other empirical findings in diffusion model composition.

Questions for Authors

Q1: Can the authors explain the intuitive meaning of Lemma 6.3 and offer some remarks on it?

Q2: In appendix B, training EDM2 on CLEVR, what is the conditional variable? Is the generation trained to be conditioned on the number of objects, shape of objects, color of objects, position of objects, or a mixture of them?

Q3: In appendix B.2, after training the diffusion model, how to use the trained diffusion model to get/define a generation model following a certain p_i?

Q4: How practical will the definition of Factorized-Conditionals be in other real-world diffusion composition, and how far it can be generalized to other successful compositions, such as composing different text prompts with/without different region masks to generate an image?

Claims and Evidence

The claims are clear and convincing.

Methods and Evaluation Criteria

The evaluation criteria make sense.

Theoretical Claims

I have checked the proofs of most of the theorems except for Lemma 6.3. I haven't fully grasped the intuitive meaning of Lemma 6.3 and would appreciate remarks from the authors.

Experimental Design and Analyses

I have checked appendix B for the experiment details.

  1. There are questions related to the conditional variable of the trained EDM2 on CLEVR: what is the conditional variable - is the generation trained to be conditioned on the number of objects, shape of objects, color of objects, position of objects, or a mixture of them? This question arises since the paper seems to require distributions that can control position, or color, or number of objects.
  2. After training the diffusion model, in B.2, how to use the trained diffusion model to get/define a generation model following a certain p_i?

补充材料

I have read all parts of the supplementary material.

与现有文献的关系

The paper may inspire future work to design new sampling methods for feature-space decomposition, or to propose theoretical frameworks for understanding feature-space composition.

遗漏的重要参考文献

NA

其他优缺点

Strengths:

  1. The paper is well written and easy to follow, with motivation, theorems, remarks, proof sketches, and empirical demonstrations.
  2. The paper studies the novel theoretical problem of understanding the OOD and length-generalization phenomena in diffusion model composition. The interesting notion of projective composition is analyzed in both pixel space and feature space.

Weakness:

  1. The main theoretical results only cover sampling in pixel space, and a theoretically successful result is lacking in feature space. Yet feature-space composition is an important application of diffusion model composition.
  2. Although the paper discusses in B.2 how the CLEVR settings approximately satisfy Factorized-Conditionals, it is unclear how practical the definition of Factorized-Conditionals will be in other real-world diffusion compositions, and how far it can be generalized to other successful compositions, such as composing different text prompts with/without different region masks to generate an image.

其他意见或建议

NA

作者回复

We thank the reviewer for their support for our work and insightful questions, to which we respond individually below.

Weaknesses:

Q1: The main theoretical results only cover sampling in the pixel space, and a theoretically successful result is lacking in the feature space. Yet, feature space composition is an important application in diffusion decomposition.

  • This is certainly true. In fact, we show theoretically that in feature space, diffusion sampling may not work even when projective composition is possible at t=0 (Lemma 6.3). This is consistent with empirical findings in Du et al. (2023), who showed that HMC sampling (which in particular allows sampling directly at t=0) was necessary to enable successful composition in many cases. Our theory helps explain why HMC sampling worked for Du et al. when standard diffusion sampling did not. We discuss this further in Appendix J.1. This result contributes to our overall goal of understanding when composition will work and, just as importantly, when it may fail.

Q2: [It is unclear how practical the definition of Factorized-Conditionals will be in real world composition.]

  • Please see response to Q4.

Questions:

Q1: Can authors explain the intuitive meaning of Lemma 6.3 and discuss some remarks on it?

  • Lemma 6.3 intuitively says that, even if projective composition is possible at t=0, reverse diffusion (or indeed any annealing method) may not be able to correctly sample from it. Specifically, the lemma proves (via a counterexample) that a set of distributions can all vary smoothly in time while their composition changes extremely abruptly, making any annealing-based sampling method very challenging.
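As a toy 1-D illustration of this intuition (not the paper's actual counterexample; all distributions and numbers below are invented for illustration): take two Gaussian mixtures that each put almost all their weight on slightly offset modes near 0, and a tiny weight on exactly coinciding modes at 10. Each mixture evolves smoothly as Gaussian noise is added, yet the mean of their normalized product jumps abruptly between the two mode pairs as the noise level crosses a threshold.

```python
import numpy as np

def mixture_pdf(x, means, weights, var):
    """Gaussian-mixture density with a shared per-component variance."""
    return sum(w * np.exp(-(x - m) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
               for m, w in zip(means, weights))

def product_mean(noise_var, base_var=1e-3):
    """Mean of the normalized product p1_t * p2_t, evaluated on a grid.
    Adding Gaussian noise of variance noise_var to each mixture just
    inflates every component's variance, so each p_i evolves smoothly."""
    v = base_var + noise_var
    x = np.linspace(-3.0, 13.0, 20001)
    p1 = mixture_pdf(x, [0.0, 10.0], [0.99, 0.01], v)   # heavy mode at 0
    p2 = mixture_pdf(x, [0.6, 10.0], [0.99, 0.01], v)   # heavy mode at 0.6
    prod = p1 * p2  # unnormalized product composition
    return float(np.sum(x * prod) / np.sum(prod))
```

At low noise (e.g. `noise_var = 0.05**2`) only the exactly aligned low-weight modes at 10 overlap, so the product concentrates near 10; at moderate noise (`0.3**2`) the high-weight offset modes dominate and the product's mean snaps to about 0.3, even though both p_i change smoothly with the noise level throughout.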

Q2: In appendix B, training EDM2 on CLEVR, what is the conditional variable? Is the generation trained to be conditioned on the number of objects, shape of objects, color of objects, position of objects, or a mixture of them?

  • Appendix B includes two different conditioning setups. In Figures 7, 8, and 9 we condition on the 2D location of the object (or the location of one randomly chosen object, for multi-object distributions). In Figure 10, we condition on the color of the object. In all experiments we condition on only a single attribute (either location or color) at a time, with all other attributes sampled randomly and not conditioned on. Thanks for the question; we will clarify these points in the final draft!

Q3: In appendix B.2, after training the diffusion model, how is the trained model used to obtain/define a generative model following a certain p_i?

  • For the location-conditional models, the p_i's correspond to different location conditioners. Specifically, in these experiments, we choose a fixed set of locations i that we wish to compose, and obtain the score of p_i by forwarding our conditional diffusion model conditioned on location i.
  • Similarly, for the color-conditional models, the p_i's correspond to different color conditioners. There are only 8 colors, so we assign a p_i to every possible color and obtain the score of p_i by forwarding our conditional diffusion model conditioned on color i.
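To make the composition step concrete, here is a minimal sketch (not code from the paper) of how per-condition scores could be combined. The combination rule is the standard sum-of-corrections form from the compositional-diffusion literature, with a generic background score standing in for the paper's background distribution; `gaussian_score` and the specific parameter values are purely illustrative stand-ins for forward passes of a conditional score network.

```python
def gaussian_score(mu, var):
    """Score function (gradient of log density) of a 1-D Gaussian N(mu, var)."""
    return lambda x: (mu - x) / var

def composed_score(cond_scores, bg_score, x):
    """Compose conditional scores over a background score:
        s_comp(x) = s_bg(x) + sum_i (s_i(x) - s_bg(x)).
    Taking s_bg to be the unconditional score recovers the usual
    Bayes / product-of-experts composition."""
    return bg_score(x) + sum(s(x) - bg_score(x) for s in cond_scores)

# Two conditional scores composed over a broad background distribution:
s1 = gaussian_score(-1.0, 1.0)   # score of p_1 = N(-1, 1)
s2 = gaussian_score(3.0, 1.0)    # score of p_2 = N(3, 1)
bg = gaussian_score(0.0, 4.0)    # background score, N(0, 4)
```

For Gaussians the composition is again Gaussian (here with precision 1 + 1 - 1/4 = 7/4 and mean 8/7), so the composed score can be checked analytically; in practice each s_i would instead be a forward pass of the conditional network, evaluated at the same noisy input x_t.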

Q4: How practical will the definition of Factorized-Conditionals be in other real-world diffusion composition, and how far it can be generalized to other successful compositions, such as composing different text prompts with/without different region masks to generate an image?

  • This question is closely related to the existing literature on “disentangled feature representations.” For example, “style” and “content” features are typically believed to be disentangled in the existing literature, and thus we expect style and content to form Factorized Conditionals. Regarding composing different text prompts, we expect similar intuitions about disentanglement to carry over; see our Figure 6, for example, which composes using different text prompts.
  • Regarding region masks, we believe that either explicit masks or simply text-conditioning that includes location information can indeed be very helpful for achieving Factorized Conditionals (please see Section 7 for further detail).

We thank the reviewer again, and hope these responses are helpful.

最终决定

This paper presents a theoretical study of the possibility of composing the score functions of a series of diffusion models to generate new data with mixed properties. The authors propose a projective composition condition and Factorized Conditionals to allow the composition of score functions. Meanwhile, there are fruitful discussions of failures of composition and the relation to independent coupling, as well as Bayes composition.

The strength of the paper is clearly appreciated by the reviewers: 1) clear presentation with extensive discussions; 2) focus on a relevant and interesting topic; 3) correct and sound theoretical results. Therefore, a positive recommendation is formed. However, there are also weaknesses that prevent a direct strong-accept recommendation: 1) the assumptions are only vaguely validated on a particular dataset, limiting broader practicality in the current scope; 2) the assumptions are relatively strong, and no relaxation is immediately possible in the current submission; 3) there are no numerical results demonstrating theory-motivated methods.