How Diffusion Models Learn to Factorize and Compose
We carry out experiments on conditional DDPMs using synthetic datasets that demonstrate their ability to learn factorized representations and compositionally generalize.
Abstract
Reviews and Discussion
This paper investigates the capabilities of diffusion models, particularly Denoising Diffusion Probabilistic Models (DDPMs), in learning factorized representations and achieving compositional generalization. The authors aim to quantify this by analyzing the mechanisms through which the model is trained, supporting the hypothesis that the architecture of diffusion models has an inductive bias towards such factorized representations. Their results on a toy dataset suggest that, to achieve out-of-distribution compositional generalization, the training set must: i) contain at least a few compositional examples of the factors, and ii) present the factors independently of each other across the full range of their variability. If either condition is not met, the model fails to generalize out of distribution.
Strengths
- The manuscript is particularly well-written and presented, with clear and thorough explanations. The methodology and experiments are relevant and consistent, and they are both well-explained and well-illustrated.
- The originality of this work lies in explaining the emergence of factorization in the learned manifold from the perspective of percolation theory. This supports the hypothesis that the training set needs a certain level of correlation among the components of independent features to achieve a faithful representation, which in turn leads to compositional generalization.
Weaknesses
The only weakness I can think of in this work is that it is conducted in simplistic toy settings. However, I don't believe this discredits the work and interesting experiments presented in the paper. Additionally, the authors acknowledge this issue and suggest exploring more naturalistic and relevant experiments in future studies.
Some typos:
- Line 7: the abbreviation ‘DDPMs’ is used for the first time in the abstract; consider spelling it out (Denoising Diffusion Probabilistic Models)
- Line 198: at the end of the sentence, there is an extra ‘not’
Questions
N/A
Limitations
See “Weaknesses”
We thank the reviewer for taking the time to provide us with constructive feedback. Please find our responses to the specific concerns and questions below.
Weaknesses
- We thank the reviewer for the very positive feedback. As noted in the global response section “Regarding the toy setting,” using a simple toy setting allows us to carefully examine various effects. Future investigations on more realistic datasets are necessary but could introduce many competing effects. Despite the simplicity, our study provides valuable insights into compositionality and generalization in Diffusion models. Future research should explore why Diffusion models cannot encode continuous latent features continuously and how percolation theory of manifold formation can be applied to natural image data.
- We thank the reviewer for pointing out our typos. We will fix them in the manuscript.
Thank you to the authors for their detailed feedback. I remain fully convinced of the contribution of this article, despite the simplicity of the toy dataset used. A controlled and simplified setting is essential for investigating hypotheses, and while extending the work to more naturalistic data would be valuable, a clear starting point is necessary.
This paper investigates how and when diffusion models learn factorized representations of composable features. To this end, the authors construct controllable synthetic datasets by compositionally combining 1D and 2D Gaussian data and examine the factorized representation and the compositional generalization capability of the diffusion model. Systematic analysis on the controllable dataset indicates that the diffusion model learns orthogonal but not necessarily parallel representations and is capable of compositional generalization to OOD samples if a few compositional examples are provided. Additionally, the authors draw a connection to percolation theory, suggesting that a certain amount of correlated data is required to learn factorized representations.
Strengths
- The paper is well-written and easy to follow.
- The simple yet well-controlled experiments and comprehensive analysis provide a clear understanding of factorization and compositionality in diffusion models.
- Connecting the empirical findings to percolation theory improves the understanding of the emergence of factorization.
Weaknesses
- The experiments are conducted solely on a simple dataset. The 2D Gaussian Addition dataset features only basic additive compositions of 1D Gaussian sprites. It remains unclear whether the paper's claims extend to more complex compositions, such as multiplicative compositions involving scale, color, etc.
- Some of the conclusions, such as those regarding the properties and requirements for compositional generalization, have already been addressed in previous work [1]. [1] provides theoretical analyses and conditions for sufficient compositional support. Therefore, the experiments and conclusions from section 3.2 provide limited additional insights.
[1] Wiedemer et al., “Compositional Generalization from First Principles”, in NeurIPS 23.
Questions
- Are the conclusions and implications specific to the diffusion model? Although the paper claims to investigate the compositionality of diffusion models, the experiments and analysis do not seem diffusion-specific. Other general generative models might exhibit similar behavior.
- In the Gaussian Bump + 1D Gaussian Stripes experiment (Figure 4(g)), what is the accuracy when only 2D Gaussian Bump data is used? Comparing this value would help identify the performance boost from compositional generalization.
- How did you format the inputs when training the diffusion model on 1D Gaussian stripe data? Did you input a null value for the missing coordinate?
- How does the model generate OOD samples in 2D Gaussian Bump Data? From the experiments on the 2D Gaussian Addition dataset, it is concluded that the model cannot interpolate well on unseen examples, as implied by the low accuracy in the intersection areas. By adding 1D Gaussian Addition data, the model learns from single 1D stripe patterns and composes that information to generate OOD samples in the test regions. However, in 2D Gaussian Bump Data + 1D Gaussian Addition data, the model never observed the 2D Gaussian Bump samples in the test regions. How could adding 1D Gaussian Addition data lead to improvement?
- In Figures 4(c), 4(d), and 4(e), interestingly, the model trained only on 1D Gaussian data already predicts reasonably well for one of the two coordinates (but very poorly for the other). Is there any particular reason for this result?
Limitations
The conclusion section describes the limitation regarding the restricted scope and simplicity of the toy dataset. While the study provides valuable insights, the findings are based on synthetic datasets with simple structures, which may not fully capture the complexities of real-world data.
We thank the reviewer for taking the time to provide us with constructive feedback. Please find our responses to the specific concerns and questions below.
Weaknesses
- We appreciate the reviewer’s feedback. In our paper, we explored both additive and multiplicative composition with the 2D Gaussian Addition (addition of two 1D Gaussian Stripes) and the 2D Gaussian Bump (multiplication of two 1D Gaussian Stripes) datasets. Fig. 4(g) shows that the model can generalize to new 2D Gaussian Bumps given all 1D Stripes and a few 2D Bumps. This multiplicative composition is similar to transformations like scale, color, and style, as it involves "masking" one 1D Stripe with another.
- We thank the reviewer for bringing Ref. [1] to our attention. While it similarly concludes that gaps in the support of the training dataset hinder learning, it lacks insights from a manifold formation perspective. Their experiments use sprites with a mix of (semi-)continuous and categorical latent variables, showing general results across all types. Missing support in categorical variables makes sense, as a model cannot generate images of monkeys after only seeing cats and dogs. However, with continuous latent features like the x- and y-positions, one would expect interpolation to be possible (e.g., generating a sprite at an intermediate position after seeing it at two neighboring positions). Our experiments focus on the model’s ability to represent continuous latent features. In Sec. 3.1, we found that the model represents continuous features similarly to categorical variables, with some overlap. Consequently, as shown in Sec. 3.2, the model struggles with interpolation. The main takeaway is that the model excels at composition but fails at interpolation. Contrary to popular belief, Diffusion models can perform well at compositional generalization, but they struggle with continuous latent features, which hinders interpolation and robust generalization.
Questions
- While we believe some observations may apply to other generative models, we refrain from general claims, focusing our study on Diffusion models and their factorization and compositionality. Similar studies on vision- or language-based generative models are beyond our current paper's scope.
- We thank the reviewer for the constructive suggestion. We refer the reviewer to the global response section on “Notes on additional figures - Additional Figure 2: Data Efficiency Scaling”.
- When generating 1D Gaussian Stripes, we embed the data image (32 x 32) into a larger canvas (44 x 44), centering it and creating a 6-pixel border. We generate 2D Gaussian Additions centered in this extended border region, most of which (except the corner ones) will have one Gaussian Stripe partially visible in the 32 x 32 space. We then crop the central 32 x 32 pixels, keeping the label as the center of the 2D Gaussian Addition in the extended space. Thus, 1D Gaussian Stripes have one coordinate outside the 32 x 32 image space, preserving the 2D structure for both 1D and 2D data points (a minimal sketch of this procedure follows this list). Details of the data generation process are in Sec. C.1 of the Appendix.
- The experimental setup is detailed in Sec. C.1, Fig. 8 of the Appendix. The model effectively learned spatial information from the 1D Gaussian Stripes. Given a few examples of 2D Gaussian Bumps, it learned to multiplicatively compose the 1D Gaussian Stripes into Bumps. Combined with the results in response to Question 2, this demonstrates the model's ability to transfer knowledge across different forms of compositionality.
- We appreciate the reviewer's detailed observation. The 1D models in Fig. 4(c)-(e) are trained on equal numbers of vertical and horizontal 1D Gaussian Stripes, providing the same number of training examples for learning x and y. While the exact reason for the higher accuracy on one coordinate is unclear, we observed that the model often defaults to generating 1D Stripes when it fails to generate the intersection of two Stripes, possibly skewing the accuracy distribution between x and y.
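To make the data-formatting answer above concrete, here is a minimal sketch of the embed-and-crop procedure (the 32 x 32 / 44 x 44 sizes and 6-pixel border are from the text; the function name, sigma, and example centers are our own illustration):

```python
import numpy as np

def make_sample(cx, cy, canvas=44, crop=32, sigma=1.0):
    """Render a 2D Gaussian Addition centered at (cx, cy) on the extended
    canvas, then crop the central crop x crop window. Centers in the
    6-pixel border yield images where only one 1D stripe is visible."""
    border = (canvas - crop) // 2
    xs = np.arange(canvas, dtype=float)
    gx = np.exp(-((xs - cx) ** 2) / (2 * sigma ** 2))  # 1D Gaussian in x
    gy = np.exp(-((xs - cy) ** 2) / (2 * sigma ** 2))  # 1D Gaussian in y
    img = gx[None, :] + gy[:, None]                    # additive composition
    img = img[border:border + crop, border:border + crop]
    return img, (cx, cy)   # label kept in extended-canvas coordinates

bump, _   = make_sample(cx=22.0, cy=22.0)  # both coordinates inside the crop
stripe, _ = make_sample(cx=22.0, cy=2.0)   # y-center in the border: 1D stripe
```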
Thank you to the authors for the detailed response.
I believe most of my concerns have been addressed. Although the authors employed controlled and simple toy datasets, I find their experimental setting and comprehensive analysis sufficient to investigate their hypotheses and provide insights regarding the compositionality of diffusion models. Therefore, I raise my score to weak accept.
This paper investigates, on a very simple toy dataset, how conditional diffusion models learn factorized representations of the data, and the extent to which they can compositionally generalize out of distribution. Additionally, the authors make a connection to percolation theory in physics.
Strengths
The motivation and research questions are very interesting and relevant for the community. They are outlined in a compelling way in the introduction. The experiments are carefully designed and rather interesting.
Weaknesses
Main high-level weaknesses:
- This paper takes a promising approach to a crucial research question, but to me it does not deliver. Although using toy data allows for broader experimentation, the evaluation is overall quite limited. For this reason, though the authors make an effort to clarify the relevance of their results for realistic settings (Sec. 4.1), this still sounds unconvincing.
- Clarity, especially in the presentation and discussion of the results. I found the results and conclusions significantly more difficult to parse than would be expected from toy experiments.
Expanding on the above:
- In general, the design and motivation for the experimental study is a bit lacking. I appreciate toy experiments, but they should reflect more realistic cases as much as possible, and in this case we are assuming that the model observes basically all information necessary to reproduce the data (as opposed to typical cases where the conditioning signal has significantly less information than the data itself). The authors should provide a solid justification for this choice, and more in general argue how such a toy scenario may be informative for more complex settings. The most natural next step would be to include a dataset that comes significantly closer to realistic settings, although of course some investigations will probably be impossible in that case. The trade-off between relevance and controllability is a hard one, and the current paper seems to be heavily on the latter side.
- Even sticking to the current toy data, a broader evaluation would be possible and useful. There are several degrees of freedom that can be explored further, e.g., the UNet input noise level at evaluation time, the mutual information between condition and data (which could range from perfect - the current unrealistic case - to zero - the unconditional case), different compositionality patterns (as done in some references in the paper), the layer of the UNet at which representations are extracted.
- The abstract states "paving the way for future research aimed at enhancing factorization and compositional generalization". This may be an overstatement, given my point above. What are concretely actionable insights from these experiments?
- The experiments here investigate the representations in the last layer of a UNet. Why this choice? Representations in diffusion models could also be considered to be the activations at different layers, especially the bottleneck, which has been investigated in the literature. Another representation can be the latent variable deterministically corresponding to the data using the probability flow ODE.
- The bottleneck idea is mentioned only in Appendix A.1, and in a negative way. This reinforces my belief that this toy scenario is too far from realistic settings, where the bottleneck is widely used as representation in diffusion models. However, let me still point out that I find the toy scenario a very interesting and promising direction.
- As far as I can tell, there is no mention of the noise schedule for training DDPM.
- When evaluating the representations in the UNet, what is the noise level in the input? I would expect this to significantly affect the representation (especially since you're using the last UNet layer and the UNet is trained to predict noise -- but this is just a hunch).
- The models here are basically trained to convert the conditioning pair to an image that is deterministically determined from such a pair. I would be a lot more interested in investigating the representations of an unsupervised model trained on such data, where the labels are inaccessible but can be used for evaluation -- similarly to how e.g. disentanglement is evaluated.
- Alternatively, to mimic real-world conditional (e.g. text-to-image) generation, there could be some stochasticity involved, such that the observed data is not trivially obtained from the conditioning signal.
- To make the experiments and results fully understandable and accessible to a wider range of researchers, I would strongly recommend including a quick introduction to the relevant concepts from geometry/topology.
- In addition, the individual results subsections seem to lack a clear structure, which makes them not particularly easy to follow. Some more intuitive explanation, as well as highlighting the main takeaways from each experiment, might help.
A few minor or more detailed comments:
- In the contributions: "differing values of the same feature are also treated similarly" - what does this mean?
- In Section 3.1 there's mention of x and y: are these actually the bump center coordinates, since those are the ground-truth generative factors of the data?
- At the beginning of the results section, the dataset is modified to have a torus topology. Why not define the dataset like this in the first place?
- Line 124. "If the model were to parameterize x and y independently with two ring manifolds, we expect to see a Clifford torus in the neural activations, rather than a different geometry such as the 3D torus." What does it mean exactly that the model parameterizes x and y? My interpretation is that, when generating data conditional on x and y (which I take to be the means), we can observe how the activations in the pre-determined UNet layer change as we vary x and y.
- Line 127 and following: "we first confirm that the model indeed learns a torus representation of the dataset". What would the model alternatively learn? Since we're considering such simple toy datasets, I think it's expected that we can get a full intuition of what is going on. In my opinion, this is not the case. The following lines also involve technical terms from topology that are not properly introduced.
- It's unclear what exactly effective dimensionality is, why it is important here, and what we can learn from it.
- The conclusions drawn in the last paragraph of page 4 are not very clear. For example: "These results suggest that x and y are independently encoded in pairwise orthogonal subspaces, but different values of x’s and values of y’s respectively are not encoded in the same way, i.e. in parallel subspaces".
- line 170: "compositionally generalizing out of the training distribution as well as its ability to spatially interpolate in a single variable alone". What is the precise difference between these two scenarios? In general, interpolation and composition don't seem to be accurately defined (although they can be implicitly inferred e.g. by lines 172-179).
Questions
See weaknesses.
Limitations
The limitations are mentioned in the discussion section. Unfortunately, I believe some of them may be too large to ignore.
We thank the reviewer for taking the time to provide us with constructive feedback. Please find our responses to the specific concerns and questions below.
Weaknesses
Expanding on high-level
- We kindly refer the reviewer to the first two sections of the global response.
- We appreciate the reviewer's suggestions. To summarize: a) input noise level for the UNet, b) imperfect conditioning label, c) different compositionality patterns, d) different layers of the UNet. We have explored a), c), and d). For a), we studied the bottleneck and layer 4 at various noise levels (diffusion timesteps) and found minimal differences between representations at different timesteps, with the bottleneck having a diminishing signal. Thus, we used the final-timestep output from layer 4 for our analysis. For c), as shown in Fig. 4, the model learned both multiplicative and additive compositionality given the full range of 1D Gaussian Stripes and a few compositional examples. For d), we investigated all UNet layers and found that layer 4 reflects the latent structure of the dataset better than other layers, so we used it for our analysis. We have not considered b). While interesting, this would not allow precise prompting of the model to output a specific Gaussian at a desired location, complicating evaluation. As a reminder, our focus is on whether the model can, even given perfect label information, organize the learned data into an efficient and meaningful representation for generalization. The pursuit of b) falls outside our research scope.
- We thank the reviewer for this feedback. We will modify this sentence in the abstract.
- Although Diffusion model bottlenecks have been shown to encode semantic information, in our case, layer 4 acts as the bottleneck due to the skip connection. This does not affect our main point, as we are focused on the factorization of the model's learned representation.
- We kindly refer the reviewer to the global response section on “Notes on additional figures - Additional Figure 1: Impact of Annealed Noise Schedule on Learning Rates”.
- See response to “Expanding on high-level” 2.
- Investigating representations learned by unsupervised models is intriguing. Previous work has focused on disentanglement in unconditional Diffusion models, but evaluating compositional generalization without explicit conditional input is challenging. In terms of stochasticity, our conditional training uses classifier-free guidance, dropping labels 10% of the time to co-train conditional and unconditional models (a minimal sketch of this label dropout follows this list). Moreover, we know that the model isn't solely relying on the conditional input, as it sometimes fails to learn even with perfect conditions (e.g., Fig. 4(c)-(e)). Future work should nevertheless evaluate unconditional models or conditional models with imperfect input for their factorization and compositionality.
- (Points 8-9) We will refine our manuscript to reflect these suggestions.
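For reference, a minimal sketch of the label-dropout scheme mentioned above (the model/scheduler interfaces and the zero null token are illustrative assumptions, not our exact implementation):

```python
import torch
import torch.nn.functional as F

def cfg_training_step(model, scheduler, x0, cond, num_steps=1000, p_uncond=0.1):
    """One conditional DDPM training step with classifier-free guidance:
    with probability p_uncond the label is replaced by a null (zero) token,
    co-training the conditional and unconditional models."""
    b = x0.shape[0]
    t = torch.randint(0, num_steps, (b,), device=x0.device)
    noise = torch.randn_like(x0)
    x_t = scheduler.add_noise(x0, noise, t)        # forward process q(x_t | x_0)
    drop = torch.rand(b, device=x0.device) < p_uncond
    cond = torch.where(drop[:, None], torch.zeros_like(cond), cond)
    return F.mse_loss(model(x_t, t, cond), noise)  # epsilon-prediction loss
```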
Minor
- By the original sentence, we meant that different values of the same feature (e.g., two different values of x) are encoded similarly to how different features (x vs. y) are encoded, treating x and y more like categorical than continuous variables. We will clarify this in the original sentence.
- We use x and y to denote the features, and the center locations of the Gaussian bumps in a data image to denote the ground-truth factors. The latter are intrinsic to the dataset, while the former are learned by the model.
- Although periodic boundary conditions could be enforced in all datasets, the resulting torus manifold is nonlinear and less straightforward to analyze using linear methods. Otherwise, there's no fundamental difference between datasets with and without these conditions.
- By “parameterize x and y,” we mean how the model represents the two features. Jointly, the model may learn a 2D look-up table (a 3D torus), while independently, it may learn x and y as two 1D rings (a 4D Clifford torus); the two embeddings are written out after this list. Our geometry/topology test distinguishes between these scenarios.
- Our tests detect whether the learned representation is a 3D torus or a Clifford torus. This requires the object to be a torus, which isn't always obvious during training. The persistent homology results in Fig. 2(b) help clarify this (see the persistent-homology sketch after this list). We will refine the manuscript to be more precise.
- Effective dimension, computed by the participation ratio (lines 134-135; see the sketch after this list), measures the intrinsic dimensionality of learned representations. Plotting it over training epochs helps us understand whether the model learns a Clifford torus (dimensionality 4) and whether it does so by first learning a 3D torus (dimensionality 3). Fig. 2(c) shows an effective dimensionality higher than 4, indicating independent learning of x and y. The Fig. 2(d) eigenspectra and Fig. 2(g) PCA projections suggest the learned representation isn't a perfect Clifford torus but is higher-dimensional and cone-shaped. We will add more details on this in the manuscript.
- The model treats x and y as categorical variables with non-zero overlaps. This means a given value of x is treated as a separate category from a neighboring value of x, rather than as adjacent points along a continuous variable. Technically, if x and y were continuous, different values should be encoded in parallel subspaces, as in a Clifford torus. Instead, the model learns a hyper-factorized version, representing x and y as categorical. We will clarify this in the manuscript.
- "Compositional generalization" means combining components in ways not seen in the training set (e.g., given , output ). Interpolation refers to combining values (e.g., output given and ). These are distinct forms of generalization, which we probe separately in Sec. 3.2. We will define these clearly at the beginning of Sec. 3.2.
Thank you for your thorough and thoughtful reply. Based on your clarifications, I am raising my score to borderline accept, as I would no longer strongly oppose acceptance of this paper.
However, I still have reservations about the motivation and the experimental setting. While toy experiments can be valuable, they need to offer insights that are likely to extend to more complex and realistic scenarios. For example, synthetic images can serve as proxies for real images, and small real images can approximate larger ones. In this case, it remains unclear what real-world scenarios these extremely simplified experiments are meant to approximate. When I asked in my initial review, "What are concretely actionable insights from these experiments?" it was a genuine concern. While toning down the language and claims in the paper would be a positive step, I believe it is essential to clarify this point in the final version, should the paper be accepted. Otherwise, I would encourage a deeper reconsideration of these issues, as this work has the potential to be a strong contribution.
Additionally, I believe the paper would greatly benefit from being more accessible and clear, which would also enhance its impact across various subfields. As I mentioned, providing some background on relevant concepts from geometry and topology, or at least offering more intuitive explanations, would be very helpful. For example, intrinsic dimensionality is a concept that many in the machine learning community likely understand intuitively, but the paper would be improved by including both an intuitive explanation and a precise mathematical definition (perhaps in the appendix). Similar clarity should be provided for other concepts that may not be familiar to researchers outside of geometry, such as persistent homology, persistence diagrams and how to interpret them, the role of orthogonal/parallel subspaces in this context, etc. This would be fine at a geometry-centered workshop, but not at the main conference, unfortunately.
At the very least, I would strongly recommend that the authors incorporate the updates they promised in the rebuttal to me and the other reviewers, and to assume less prior knowledge of topology, given that these fields are not highlighted in the title or keywords.
Just a minor additional point about related work (this is of course just a suggestion): you could consider the empirical result in Träuble et al., 2020 (Sec. 4.3), and in object-centric learning, the theoretical results in Wiedemer et al., 2023 and empirical in Dittadi et al., 2021.
Thanks again for the discussion!
In this work, the authors investigate how diffusion models achieve compositional generalization. Through controlled experiments on conditional DDPMs with 2D Gaussian data, the authors find that these models learn semantically meaningful, factorized manifold representations of composable features. These representations are orthogonal for independent feature variations but not aligned for different values of the same feature, resulting in superior compositionality but limited interpolation over unseen feature values. The study reveals that a small number of compositional examples can enhance this capability and links the formation of these representations to percolation theory in physics. This work provides insights into the mechanism of compositionality in diffusion models, guiding future research to improve their factorization and generalization for real-world applications.
Strengths
- The work investigates the factorization and compositionality of diffusion models, which is an interesting and valuable problem.
Weaknesses
- The major concern is that the work only considers a highly reduced setting, which makes it hard to validate whether the conclusions generalize to real-world applications. Since the setting is oversimplified, it's also unclear how the conclusions can be extended to popular applications like conditional text-to-image generation.
- The performance of the diffusion model is mostly evaluated with customized metrics, without comparison to other models. It's hard to get a sense of how well or poorly the model performs in terms of these metrics.
Questions
- Given that the dataset contains only ~100K synthetic images, the model could simply memorize the data. This makes it hard to argue that the generalization behavior of the model extends to real-image diffusion models.
- Why is the output of layer 4 used to investigate the internal representations? Did the authors try representations from other layers?
Limitations
The work sufficiently addressed the limitations.
We thank the reviewer for taking the time to provide us with constructive feedback. Please find our responses to the specific concerns and questions below.
Weaknesses
- We thank the reviewer for the feedback; we kindly refer the reviewer to the global response section “Regarding the toy setting” addressed to all reviewers.
- Due to the simplicity of our toy setting, we designed custom metrics tailored to our objective of investigating factorization and compositionality in Diffusion models. Specifically, the task and the metrics target the model's ability to recover the spatial location of the Gaussian bumps (a minimal sketch of such a metric follows).
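A minimal sketch of such a metric (decoding by intensity center of mass and the tolerance value are our own illustrative choices):

```python
import numpy as np

def center_accuracy(generated, target_xy, tol=1.0):
    """Decode the bump center of a generated image as its intensity center
    of mass and score it against the conditioned (x, y) label."""
    ys, xs = np.indices(generated.shape)
    w = generated.clip(min=0)
    w = w / w.sum()
    decoded = np.array([(xs * w).sum(), (ys * w).sum()])
    return float(np.linalg.norm(decoded - np.asarray(target_xy)) <= tol)
```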
Questions
- Naively, when a model resorts to memorization, it typically suffers from poor performance on out-of-distribution task sets. We have carefully selected our task and model such that the task is not trivial and the model cannot simply memorize all the training data. This is shown through the out-of-distribution performance evaluation in, for example, Fig. 5. We note that even given a subset of the original dataset, the model has similar in-distribution and out-of-distribution performance, which means it is not simply memorizing the in-distribution data but rather learning the correct representation.
- We chose layer 4 of the UNet as our internal representation because it reflects the latent structure of the dataset better than other layers. Specifically, we consistently noticed that the bottleneck layer of the UNet gives a diminishing signal due to the skip connections, so we chose the layer immediately following the bottleneck. Moreover, we investigated all layers of the UNet and found that layer 4 gives the strongest semantic signal.
I thank the authors for answering my questions. Though, as also pointed out by other reviewers, the paper only considers a toy setting and lacks experiments on larger, more practical settings. After reading the discussions between the authors and reviewers (including me), I believe the paper does provide some interesting observations of the underlying representation structures learned by diffusion models, and it could potentially guide the development of more efficient and powerful diffusion models. Therefore I have raised my score to 5, meaning I won't go against accepting the paper.
We thank all the reviewers for the constructive feedback and suggestions. Below are some of the recurring concerns/questions that we would like to address to all reviewers.
Regarding the toy setting
It is a tradeoff between a simple setting, which allows for controlled studies to isolate all possible effects, and a more complex setting that mimics popular application regimes. We agree with the reviewers that in larger, more realistic settings, richer dynamics absent from the toy setting would emerge. However, this does not undermine the relevance of our observations or the importance of starting with simple, controlled scenarios, especially given the complexity of neural networks.
In fact, we expect many insights from our toy experiments to hold for larger models. For example, our experiments showed that the model's inability to interpolate well is due to the categorical-like encoding of continuous latent features, highlighting the limited ability of Diffusion models to encode continuous quantities that are key to generating images consistent with real-world physics. Our observation of the connection between percolation theory and manifold formation should also be relevant to more realistic datasets, where the measure of "overlap" is more abstract (e.g., cosine similarity between images). Using Gaussian bumps, a well-studied system in percolation theory with computable thresholds, validated our observations (a minimal sketch of the percolation picture follows this paragraph). Extending this overlap measure to natural image data and studying manifold formation in more realistic settings is a promising future direction.
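To illustrate the percolation picture, a minimal sketch (our own illustration: bump centers sampled uniformly, with two bumps linked when their centers lie within a distance r, a proxy for overlap):

```python
import numpy as np

def largest_cluster_fraction(centers, r):
    """Fraction of points in the largest connected cluster; this jumps
    toward 1 as data density crosses the continuum-percolation threshold."""
    n = len(centers)
    parent = list(range(n))
    def find(i):                      # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(centers[i] - centers[j]) <= r:
                parent[find(i)] = find(j)
    sizes = np.bincount([find(i) for i in range(n)])
    return sizes.max() / n

centers = np.random.default_rng(0).uniform(0, 32, size=(300, 2))
print(largest_cluster_fraction(centers, r=2.0))
```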
Finally, interpretability studies of Diffusion models on natural/synthetic datasets show varying degrees of success, especially in compositionality and factorization. Even slightly more realistic synthetic datasets than Gaussian datasets (e.g., dSprites, CLEVR) have led to conflicting conclusions, as noted in the Related Work section. This underscores the need for careful and controlled experiments, motivating our study of the Gaussian dataset, which has a low-dimensional, intuitive latent representation. This allowed us to analyze its geometry and topology to study its factorization explicitly. More importantly, our case study of the Gaussian dataset revealed phenomena of greater scientific interest than we initially expected. These observations, impossible without the toy setting, provide a foundation for understanding larger, real-world models. Future work should aim to validate and extend these observations to larger models with more realistic datasets.
Regarding the specific experimental setup
The task of image reconstruction of Gaussians given perfect conditional information (center coordinates) is inspired by traditional cognitive psychological studies. In these studies, subjects perform simple tasks while their behavior and brain activities are recorded. For tasks involving continuous latent factors (e.g., angles, color hues), humans and animals can represent these data using continuous attractors (e.g., rings or lines), which are efficient and robust. Our study aims to determine if Diffusion models can similarly learn these continuous attractors in a factorized and generalizable manner, akin to biological brains. Thus, we designed our task to mimic cognitive experiments, analyzing the "brain activity" (Sec. 3.1) and "behavior" (Sec. 3.2) of the Diffusion model during the task. Our findings reveal that, unlike biological brains, Diffusion models do not learn continuous manifolds representing continuous data variation, making them less efficient and robust.
Notes on additional figures
We have included two additional figures in the new supplementary material attached below.
Additional Figure 1: Impact of Annealed Noise Schedule on Learning Rates
We included a figure showing how an annealed noise schedule affects the learning rates of different concepts. Using a noise schedule similar to the original DDPM paper, we analyzed the relationship between learning rates and signal-to-noise ratios. We hypothesize that high-frequency details (e.g., moles) are drowned out more quickly with an aggressive noise schedule, while low-frequency features (e.g., hair color) persist longer and are easier to learn. This aligns with existing observations in the literature that low-frequency features are learned before high-frequency ones. To verify, we trained 1D conditional Diffusion models on sinusoidal data with high- and low-frequency components. Figure 1(b) shows the model learns the low-frequency component faster and more accurately due to the annealed noise schedule. Figure 1(c) illustrates that more noising steps result in diminishing signal-to-noise ratios in most noised image samples, making high-frequency details harder to learn (a minimal sketch of this SNR argument follows). We previously omitted this from the main paper as it is not central to our main message.
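A minimal sketch of the signal-to-noise argument (the linear beta schedule follows the original DDPM paper; the component amplitudes are illustrative assumptions):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)     # linear DDPM schedule
alpha_bar = np.cumprod(1.0 - betas)    # cumulative product \bar{alpha}_t

def component_snr(amplitude):
    """SNR of a sinusoidal component a*sin(ft) at each timestep under
    x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) eps: its mean power a^2/2
    is scaled by abar_t while the noise power is 1 - abar_t."""
    return (amplitude ** 2 / 2) * alpha_bar / (1.0 - alpha_bar)

# A weak high-frequency detail (smaller amplitude) drops below SNR = 1 at
# an earlier timestep than the dominant low-frequency structure.
t_low  = int((component_snr(1.0) < 1.0).argmax())
t_high = int((component_snr(0.2) < 1.0).argmax())
print(t_high, "<", t_low)
```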
Additional Figure 2: Data Efficiency Scaling
We included a figure on the data efficiency scaling of Diffusion models across various datasets. Our original paper (Sec. 3.2) shows that combining 2D Gaussian data with 1D Gaussian data reduces the number of compositional examples needed for compositional generalization. Figure 2 in the supplementary material compares data efficiency between models trained on i) 2D Gaussian Bumps, ii) 2D Gaussian Bumps + 1D Gaussian Stripes, and iii) 2D Gaussian Bumps + 2D Gaussian Additions + 1D Gaussian Stripes. Figures 2(a)-(c) show that models trained on ii and iii achieve higher accuracy with fewer compositional examples. We also show in Figure 2(d) that, for a given dataset size, models trained with 1D Gaussian Stripes are more data-efficient, with the data needed to reach an accuracy threshold growing linearly rather than quadratically. This signifies that the model can not only learn multiple forms of compositionality, but do so more data-efficiently when the dataset is mixed with the set of 1D components, which suggests a potentially more data-efficient training approach for Diffusion models.
This paper investigates the internal mechanisms of diffusion models, focusing on their ability to learn factorized representations that support compositional generalization. The work provides valuable insights into the inductive biases of diffusion models through controlled experiments on synthetic datasets and offers a normative explanation inspired by percolation theory.
The reviewers had mixed opinions about the reliance on toy datasets, with some concerns about the generalizability of the findings to more realistic settings. However, Reviewer VwNJ strongly supported the paper, noting its straightforward characterization of conditions under which factorization and composition are achieved, and believes it is worth sharing with the community. Reviewer un3W, while acknowledging some limitations, also agrees that the paper has value and supports its acceptance, with the suggestion that the authors explicitly acknowledge the scope of their experiments and improve the clarity of their presentation, particularly in sections involving technical concepts from topology and geometry.
Based on the overall positive feedback and the constructive discussions among the reviewers, I recommend accepting this paper. I encourage the authors to consider Reviewer un3W's suggestions for improving the clarity and accessibility of their work, especially for readers less familiar with the technical terminology used.