Energy-Based Conceptual Diffusion Model
We propose Energy-Based Conceptual Diffusion Models (ECDMs), a framework that unifies concept-based generation, conditional interpretation, concept debugging, intervention, and imputation under a joint energy-based formulation.
Abstract
Reviews and Discussion
This paper proposes a concept-based diffusion model that enables conditional generation, concept interpretation and debugging, as well as image operations like intervention and imputation.
Strengths
- The paper is clearly written and well-organized, making complex concepts more accessible.
- The authors conduct thorough experiments across various datasets and tasks, providing clear comparisons to existing benchmarks.
- The concept-based framework is versatile and seems applicable to any conditional data generation task.
- The framework enhances the interpretability of the elements and features in the generated images.
Weaknesses
- I have to mention that I have not previously conducted research on concept-based generation, but the significance of this work within the broader field of generative models is unclear to me. It appears to be a straightforward combination of concept bottleneck models and standard conditional diffusion models.
- The concept-based generation method described in (11)-(13) resembles a Gibbs sampling or coordinate-wise algorithm, but equation (11) focuses on maximizing the mapping energy $E_{\text{map}}(y, c)$ rather than the entire joint energy $E_{\text{joint}}(y, c, x)$. This raises questions about the rationale behind this approach, as $E_{\text{joint}}$ incorporates dependencies on the concept $c$ in both terms. Additionally, maximizing with respect to the binary vector $c$ suggests an integer programming problem, which the paper does not sufficiently address regarding efficiency.
Questions
- Is the code for the model available now?
- How is the number of concepts in the concept vector determined? Is it fixed?
- How is the concept embedding modeled?
- What distinguishes the generation process described in equations (12) and (13) from that of a standard conditional diffusion model? It seems the only change is replacing the conditioning input $y$ with the processed conditioning input $c$ obtained from $y$.
Q4. What distinguishes the generation process described in Equations (12) and (13) from that of a standard conditional diffusion model? It seems the only change is replacing the conditioning input $y$ with the processed conditioning input $c$ obtained from $y$.
This is a good question. After training with the joint energy-based objective, our model performs concept-based joint generation by minimizing the joint energy. As clarified in W2.1, this joint sampling process can be further simplified for computational efficiency and ease of implementation.
Note that Eqns. (12) and (13) are only part of the generation process. As mentioned in the response to W2.1 above, in practice, one can alternate between interpretation (using Eqn. (15) or (17) in the paper) and generation (using Eqns. (11)-(13) in the paper) until convergence. This is the key difference between standard conditional diffusion models and our ECDM's generation.
The key distinction between our ECDM's formulation (Eqns. (6), (10)-(13), (17)-(19)) and a standard conditional diffusion model is that the concept variable $c$ in ECDM is not merely a processed condition.
Note that concept-based generation is only a small part of our ECDM's contribution. Our ECDM goes far beyond generation and unifies five different tasks, i.e., concept-based generation, conditional interpretation, concept debugging, intervention, and imputation, in a single probabilistic framework. For example:
- Intervention. During generation, one can easily intervene on the concepts to fix any incorrect generation.
- Interpretation. Given a generated image, one can infer the concepts $c$ to check what concepts are expressed in the image, thereby interpreting the generation process. One can then perform intervention (mentioned above) based on the inferred $c$.
- Debugging. Given the input $y$ and the generated image $x$, one can debug which concepts are generated incorrectly by comparing the concepts that are actually expressed (i.e., $c$ inferred from $x$) with the concepts that should be generated (i.e., those implied by $y$). One can then perform intervention (mentioned above) based on the debugging results.
In the tasks above, the concept vector $c$ is not merely a "processed conditioning input"; it is also an interpretable variable that enables flexible compatibility measurement and joint modeling of the instruction, concept, and generation within our energy-based framework.
W2.1. The concept-based generation method described in (11)-(13) resembles a Gibbs sampling or coordinate-wise algorithm, but equation (11) focuses on maximizing the mapping energy rather than the entire joint energy...the rationale behind this approach...as $E_{\text{joint}}$ incorporates dependencies on the concept $c$ in both terms...
We are sorry for the confusion. In concept-based joint generation, we perform concept inference and image generation by minimizing the joint energy $E_{\text{joint}}(y, c, x)$. This process entails the minimization of both the mapping energy $E_{\text{map}}(y, c)$ and the concept energy $E_{\text{concept}}(c, x)$.
In the full model, to minimize the joint energy $E_{\text{joint}}(y, c, x)$, we alternate between
- Eqn. (17) to infer $c$ from the generated image $x$,
- Eqn. (11) to infer $c$ from the instruction $y$, and
- Eqns. (12)-(13) to infer $x$, until convergence. This is indeed similar to Gibbs sampling, but it is slightly different. For example, Gibbs sampling involves computing the conditional probability of one variable given all other variables. By contrast, in Eqn. (11), we infer $c$ from $y$ alone (i.e., via $p(c \mid y)$) rather than given all other variables (i.e., via $p(c \mid y, x)$).
In practice, we find that simply alternating between minimizing Eqn. (11) (mapping energy) and Eqn. (13) (concept energy) can already provide satisfactory results with improved computational efficiency.
Note that in Eqns. (12)-(13), the diffusion-style sampling dynamics of the image $x$ necessitate computing the gradient with respect to $x$ at each sampling step. Here the mapping energy $E_{\text{map}}(y, c)$ in $E_{\text{joint}}$ does not depend on $x$, and is therefore not involved in Eqn. (13).
W2.2. Additionally, maximizing with respect to the binary vector $c$ suggests an integer programming problem, which the paper does not sufficiently address regarding efficiency.
This is a good question. Note that while the concept label $c$ is binary, our ECDM's predicted concept probability $\hat{c}$ is a real value in the range $[0, 1]$. Therefore, we can use gradient descent to compute the gradient of the energy (including $E_{\text{map}}$) w.r.t. $\hat{c}$ and update $\hat{c}$ iteratively to infer the concepts. In this case, it is not an integer programming problem and is therefore very efficient.
For Questions:
Q1. Code Availability
Thank you for your interest in our code. We assure you that we will make our code open-source and available to the wider research community after acceptance, thereby facilitating the reproducibility of our results. So far, we have finished cleaning up the source code and will release it if the paper is accepted.
Q2. How is the number of concepts in the concept vector determined? Is it fixed?
The number of concepts is determined by the concept annotations provided by each specific dataset. The overall number of concepts is fixed during training and inference but can be adjusted for different datasets. We also include more details on the datasets in the Experiment Setup section (Section 4.1) and Appendix C.
Q3. How is the concept embedding modeled?
Thanks for mentioning this. For a given concept phrase (e.g., "black bird wings"), we first extract the textual embedding using the text encoder. This textual embedding is then projected (through a learnable projection neural network) into a positive embedding $e^{+}$ and a negative embedding $e^{-}$. The final concept embedding is a combination of $e^{+}$ and $e^{-}$, weighted by the concept probability $\hat{c}$, i.e., $e = \hat{c}\, e^{+} + (1 - \hat{c})\, e^{-}$. This combined concept embedding is further utilized in all five unified tasks (i.e., generation, interpretation, debugging, intervention, and imputation).
Thank you for your insightful and constructive feedback. We are glad that you found our proposed framework "versatile"/"applicable to any conditional data generation task"/"enhances the interpretability", our paper "clearly written"/"well-organized", and our experiments "thorough". Below we address your questions one by one in detail.
W1. ... the significance of this work within the broader field of generative models ... a straightforward combination of concept bottleneck models and standard conditional diffusion models.
We would like to clarify that our ECDM's primary contribution lies not only in combining existing models but in developing a unified probabilistic framework that unlocks new capabilities. We provide more details below.
Fundamental Difference between ECDM and Concept Bottleneck Models (CBMs) Combined with Conditional Diffusion Models. Conventional CBMs typically predict a set of concepts from an image input and then predict the class label based on these predicted concepts, i.e., predicting concepts $c$ and labels $y$ given an image $x$. Similarly, conditional diffusion models typically generate an image $x$ given an input text or class label $y$. In contrast to such "sequential" modeling, our proposed ECDM jointly models the relationships among class-level instructions $y$, concept sets $c$, and the generated image $x$ within an energy-based framework.
Consequently, our energy-based framework facilitates a more flexible and unified inference process. Given any subset of these three elements (the instruction $y$, the concepts $c$, and the generated image $x$), the joint framework can infer the remaining elements by composing energy functions and deriving conditional probabilities. This unique characteristic allows us to achieve generation, interpretation, and intervention within a unified framework. These capabilities extend beyond what either component could achieve independently. Unifying these tasks under a flexible framework gives the interpretability and generative modeling communities deeper insight into how diffusion models generate images and how they comprehend and incorporate concepts during generation, through a human-understandable probabilistic interpretation; this holds significant value for the community. Our novelty and contributions include:
- Unifying Five Tasks in a Single Framework. Our ECDM framework unifies concept-based generation, conditional interpretation, concept debugging, intervention, and imputation under a joint energy-based formulation. These five tasks encompass a typical workflow of text-to-image diffusion models, and unifying them enhances generation quality, boosts interpretability, and enables interpretable intervention. In our proposed energy-based modeling of the diffusion model, we incorporate large-scale pretrained diffusion models (e.g., Stable Diffusion) within our framework to achieve a unified interpretation of these diffusion models, an area that previous methods have explored less extensively.
- Probabilistic Interpretations by Energy Functions. With ECDM's unified framework, we have developed a set of algorithms to compute various conditional probabilities by composing the corresponding energy functions. These conditional probabilities provide theoretical support for concept-based interpretation of the generation process, as opposed to mere visualizations, and enable flexible inference of different elements (as mentioned in point 1) for diverse tasks.
- Better Experimental Results. Empirical results on real-world datasets demonstrate ECDM's state-of-the-art performance in terms of image generation, imputation, and their conceptual interpretations.
Dear Reviewer 5CJZ,
Thank you for your time and effort in reviewing our paper.
We appreciate your valuable comments and suggestions, and we firmly believe that our response and revisions can fully address your concerns. We are open to discussion (before Nov 26 AOE, after which we will not be able to respond to your comments unfortunately) if you have any additional questions or concerns, and if not, we will be immensely grateful if you could reevaluate your score.
Thank you again for your reviews which helped to improve our paper!
Best regards,
ECDM Authors
I thank the authors for their detailed response. However, I still feel that the novelty of this work is limited, given the previous results on diffusion models and concept bottleneck models. Thus, I maintain my score.
Thank you for providing your feedback and outlining your remaining concerns. Since the discussion period has been extended by six additional days (until December 2nd AoE), we would like to take this opportunity to further explain that our approach is not merely a combination of CBM and diffusion models and enjoys novel capabilities.
Fundamental Difference between our ECDM and Concept Bottleneck Models (CBMs).
Conventional CBMs predict a set of concepts from an input image and then use these predicted concepts to determine the class label. CBMs remain an important and active area of research. However, despite significant progress in this field, a major research gap remains:
Most CBMs are discriminative models. Previous work has primarily focused on discriminative settings (e.g., modeling $p(c \mid x)$ and $p(y \mid c)$), while the generative setting (e.g., modeling $p(c \mid y)$ and $p(x \mid c)$) has been largely unexplored.
This gap limits the extension of CBMs' interpretability to generative tasks, where the following capabilities are all critical for advancing generative model development:
- Human-understandable explanations,
- Interfaces for integrating human expertise,
- Interpretations that can enhance generation quality,
- Tools for model intervention.
In contrast, our ECDM enables all four capabilities above under a unified framework.
Note that simply marrying CBMs with generative models fails to provide faithful interpretations with minimal cost. For example:
- Predicting concepts solely from instructions and using them for generation models a fixed mapping between classes and concepts. This approach overlooks the dynamic interplay between generated images, concepts, and instructions. Such unidirectional learning may introduce biases, as it does not allow the generated images to influence or modify the concepts being used. Consequently, it fails to accurately reflect the foundational basis of the image generation process. This limitation prevents verifying whether the generated images faithfully reproduce the concepts intended by the instructions, indicating a critical need for mechanisms to detect and correct these biases.
- Predicting concepts from the generated images using additional discriminative conceptual models cannot ensure that these predicted concepts were actually involved in the generation process, since such post hoc predictions are not inherently part of the generative process.
- Inserting a concept prediction layer into the diffusion UNet (e.g., CBGM) requires training the model from scratch again, overlooking the abundant visual information and rich interpretive information embedded in large-scale pretrained diffusion models (e.g., Stable Diffusion).
Fundamental Difference between our ECDM and Conditional Diffusion Models.
Conditional diffusion models (CDMs) generate images based on specific conditions, such as class labels or textual instructions. Unfortunately, these models fail in terms of:
- Human-Understandable Interpretability, which is essential for recognizing and understanding the generation behavior of CDMs, and
- Transparent Control, which means the capability of precise and transparent control over the generated outputs, further supporting the development of advanced editing methods.
Our ECDM enables both Human-Understandable Interpretability and Transparent Control under a unified framework.
Note that simply marrying CDMs with concept-based models fails to provide meaningful Human-Understandable Interpretability and Transparent Control. For example:
- The explanations given by previous interpretable CDMs are not always human-understandable or sufficiently informative. For instance, in the energy-based CDM literature, the number of decomposed and visualized concepts is typically fewer than ten, and these visualized concepts are not guaranteed to represent the key factors driving the image generation process. Furthermore, increasing the number of decomposed concepts not only leads to prohibitively high inference costs, making the approach impractical, but also tends to yield abstract concepts that are difficult for humans to understand, typically rendering them meaningless.
- Simply inputting human-understandable concepts into CDMs also does not guarantee faithful or interpretable generation. This approach neither ensures nor monitors the involvement of these concepts during the generation process. Without understanding how or why the model generates specific outputs under given conditions, it becomes difficult to diagnose and correct errors when generation fails. For example, if the model generates an incorrect image, we cannot identify the root cause of the error or determine how to intervene effectively.
Our Proposed ECDM.
Our proposed Energy-Based Conceptual Diffusion Model (ECDM) addresses key research gaps, bridging the worlds of conceptual interpretability and generative modeling.
Specifically, we:
- Formulate Conceptual Interpretation under a Joint Energy-Based Framework. We propose a novel joint energy-based framework for generative CBMs that models the interactions among instructions, concepts, and image generations in an integrated manner. This enables:
- Faithful reflection of concept probabilities: During generation and interpretation, concept probabilities are jointly influenced by both the input instructions and the generated images. This ensures the probabilities accurately reflect the underlying generative process.
- Deep involvement of concept interpretations: Concepts are actively involved in the generation process because our model minimizes the joint energy. This core objective inherently requires the concepts to play a significant role in both generation and interpretation.
- Derive Conditional Probabilities by Composing Energy Functions. By systematically deriving new conditional probabilities within the joint energy framework, our model extends beyond concept-based joint generation to support a wide range of tasks, including interpretation, debugging, corrective intervention, and interpretable imputation. This is made possible by leveraging faithful conceptual probabilities and embeddings to perform diverse tasks within a unified interpretable framework. If CBMs had been naively added to the framework, these capabilities, which rely on non-binary concept-image interactions to perceive and derive concept probabilities, could not have been achieved.
- Seamlessly Incorporate Pretrained Diffusion Models. We integrate large-scale pretrained diffusion models into our energy-based framework by reformulating and unifying their training objectives under the energy-based framework (see Appendix A). Rather than merely utilizing pretrained network features, we propose a novel formulation that enables:
- Direct and deep involvement in interpretation: The pretrained diffusion model becomes a critical part of the interpretability process by contributing to minimizing the joint energy through the concept energy network.
- Interpretation of pretrained model outputs: Our framework can interpret images generated by pretrained diffusion models, not just those fine-tuned within our model.
- Efficient training and sampling: By harmonizing the pretrained diffusion model into our framework, we only need to optimize a small set of embeddings to repurpose it as a strong energy estimator. This dramatically improves computational efficiency (see Appendix D: Computational Efficiency Analysis).
Therefore, our ECDM has successfully addressed the drawbacks associated with integrating CBMs and generative models, culminating in the development of this unified and faithful framework. This highlights the versatility of our approach, as noted in your previous comment: "versatile and seems applicable to any conditional data generation task." It also emphasizes our model’s flexible interpretability, as you observed: "enhances the interpretability of the elements and features in the generated images."
Thank you again for providing feedback during the rebuttal period. We hope this additional explanation further clarifies the novelty and added value of our proposed ECDM. We believe our contributions significantly advance the state of the art by providing both theoretical insights and practical tools. Furthermore, we are eager to hear your thoughts regarding "previous results on diffusion models and concept bottleneck models," and we would be happy to provide a more targeted, point-by-point response accordingly.
The paper introduces a framework that integrates diffusion models and Concept Bottleneck Models in an energy-based model structure. ECDM aims to address interpretable control in current diffusion models. It allows concept-based generation, interpretation, debugging, intervention, and imputation. ECDM unifies tasks through energy networks, which enables modifications of the generated images based on probabilistic estimates. The model is evaluated on datasets such as AWA2, CUB, and CelebA-HQ in terms of concept accuracy, class accuracy, and FID scores against existing diffusion models.
Strengths
- ECDM combines diffusion models with concept bottlenecks in a way that supports both generative and interpretive tasks.
- The model allows users to modify generated images based on specific concept-level controls, which is a practical tool.
- The experiments on multiple datasets show quantitative improvements in image quality and concept alignment.
Weaknesses
- The paper lacks comparisons with some related methods like COMET or CBGM, which are relevant energy-based interpretive frameworks for diffusion models.
- While FID, class, and concept accuracy are used, other metrics like diversity or user-study-based interpretability scores could further validate the model's effectiveness.
- Code for reproducibility is not provided.
- The experiments rely on a fixed pretrained stable diffusion model, while other models are not explored.
- Limitations: The method struggles with precise regional control over concept-based edits. Also, the energy-based approach is computationally intensive, especially during joint optimization steps.
Questions
- Concepts like "pivotal inversion" and "energy matching inference" could be better explained for clarity.
For Questions:
Q1. Further clarification for "Pivotal Inversion" and "Energy Matching Inference".
We apologize for any unclear parts in our explanation and are pleased to provide more detailed explanations below.
The intuition behind ECDM's interpretation task is that, under our joint energy-based formulation, the diffusion model's sampling trajectory conditioned on the instruction $y$ and the trajectory conditioned on the optimal concept set $c$ should be alike. To substantiate this intuition, two steps are required: (1) pivotal inversion, which aims to simulate the sampling trajectory conditioned on the instruction $y$; (2) energy matching inference, which optimizes the concept probability $\hat{c}$ to find the most compatible concept set (i.e., the concept set that is most compatible with the generated image) by minimizing the distance between the sampling trajectory conditioned on the concepts and the one obtained from pivotal inversion.
We explain these two steps individually below.
Pivotal Inversion: The goal of pivotal inversion is to simulate how the pretrained diffusion model samples an image directly conditioned on the instruction. For instance, consider an image of a "black billed cuckoo" bird generated from the instruction "A photo of the bird black billed cuckoo" using pretrained Stable Diffusion 2.1. Pivotal inversion utilizes reverse DDIM [8] to derive a set of latents that represent how the image is gradually denoised from Gaussian noise conditioned on this instruction. This derived set of latents serves as pivots that illustrate the model's original sampling trajectory and, within our formulation, simulates the energy landscape of the external energy model (pretrained diffusion model). This facilitates the matching process in the subsequent step.
Energy Matching Inference: The goal of this energy-matching step is to determine the most compatible concept set, i.e., the concept set that is most compatible with the generated image; this is done using the simulated trajectory from pivotal inversion. Specifically, we freeze all learned embeddings as well as the concept energy network, and optimize only the concept probability $\hat{c}$. The optimization target is to minimize the distance between the sampling trajectory conditioned on the concept set and the fixed pivotal trajectory simulated in the previous step. By minimizing this distance, we align the energy landscape of the concept energy network with that of the external energy network to obtain the most compatible concept set, which is why this process is called energy matching inference. Once the most compatible concept set is found, these concepts can be used to interpret the generated image.
We hope this further explanation of these two concepts clarifies the overall idea. If you need additional clarification or have further questions, we would be more than happy to provide any additional details required. These further clarifications have been incorporated in Appendix D.1.
[1] Ismail, Aya Abdelsalam, et al. "Concept Bottleneck Generative Models." The Twelfth International Conference on Learning Representations. 2024.
[2] Du, Yilun, et al. "Unsupervised Learning of Compositional Energy Concepts." Advances in Neural Information Processing Systems 34 (2021): 15608-15620.
[3] Liu, Nan, et al. "Unsupervised Compositional Concepts Discovery with Text-to-image Generative Models." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
[4] Hao, Shaozhe, et al. "ConceptExpress: Harnessing Diffusion Models for Single-image Unsupervised Concept Extraction." ECCV. 2024.
[5] Su, Jocelin, et al. "Compositional Image Decomposition with Diffusion Models." ICML. 2024.
[6] Hertz, Amir, et al. "Prompt-to-prompt Image Editing with Cross Attention Control." arXiv preprint arXiv:2208.01626 (2022).
[7] Xu, Xinyue, et al. "Energy-based Concept Bottleneck Models: Unifying Prediction, Concept Intervention, and Probabilistic Interpretations." The Twelfth International Conference on Learning Representations. 2024.
[8] Song, Jiaming, Chenlin Meng, and Stefano Ermon. "Denoising Diffusion Implicit Models." arXiv preprint arXiv:2010.02502 (2020).
W3. Code Availability.
Thank you for your interest in our code. We assure you that we will make our code open-source and available to the wider research community after acceptance, thereby facilitating the reproducibility of our results. So far, we have finished cleaning up the source code and will release it if the paper is accepted.
W4. The experiments rely on a fixed pretrained stable diffusion model, while other models are not explored.
This is a good question. Our model does not assume any specific model architecture and is therefore compatible with any pretrained diffusion model.
We follow the convention in the field and choose Stable Diffusion as the base model because it is the most commonly used pretrained large text-to-image diffusion model. Thousands of works have used Stable Diffusion as their only base model [3, 4, 5]. Note that many open-source pretrained diffusion models are actually finetuned from Stable Diffusion. Therefore, we believe the results are representative and generalize across different pretrained models.
Nonetheless, we are happy to further validate our framework if you have any specific open-source pretrained diffusion model in mind. We would be very happy to include additional results before the discussion period ends in late November.
W5. Precise regional control and computational expense.
Thank you for mentioning this. As mentioned in our Conclusion and Limitations section, precise regional control and computational expense are two of our ECDM's limitations.
Precise Regional Control. Enhancing precise regional control in concept-based editing is indeed an intriguing direction for future work. This challenge could potentially be addressed by incorporating attention-based regional editing techniques, such as prompt-to-prompt [6]. This is definitely interesting future work, but is out of the scope of our paper.
Computational Cost. We would like to clarify that our ECDM introduces low computational overhead compared to existing methods. Specifically:
- Mapping Energy Network. We sample from the mapping energy network using the Gradient Inference technique, as outlined in ECBM [7]. This sampling procedure requires approximately 10 to 30 steps, taking around 10 seconds of wall-clock time.
- Concept Energy Network. For the generative concept energy network, we model the diffusion model as an implicit representation of the energy function, making the diffusion model's sampling algorithm applicable to our framework. We utilize the standard diffusion sampling algorithm (i.e., DDIM [8]) to generate an image from the concept energy network. This process involves approximately 50 steps and takes around 3 seconds of wall-clock time on an NVIDIA RTX 3090. Therefore, the computational overhead remains comparable to that of standard diffusion models.
We agree that accelerating the sampling process of our framework is another promising area for future research and could potentially be addressed by adapting variational inference methods. However, we decided to leave this to future work, as it is beyond the scope of this paper.
Thank you for your valuable comments. We are glad that you found our model "a practical tool" that "supports both generative and interpretive tasks", and that our ECDM "shows quantitative improvements in image quality and concept alignment". Below we address your questions one by one.
W1. Comparisons with some related works (COMET or CBGM).
Thank you for mentioning this. We would like to clarify that our setting focuses on concept-based generation and interpretation given a pretrained large diffusion model. Therefore CBGM [1] and COMET [2] (both cited in our paper) are not applicable in this setting. Specifically:
- CBGM [1] involves training a new diffusion model from scratch using a modified Diffusion UNet. In contrast, we focus on augmenting an existing pretrained large diffusion model (e.g., Stable Diffusion) to enable concept-based generation, intervention, and interpretation. Therefore CBGM is not applicable to our setting.
- CBGM [1] is not an energy-based model, which is different from our ECDM.
- In this paper, we focus on the text-to-image generation setting, where the input is free-form text, and the output is an image. CBGM [1] is a conditional diffusion model that takes class labels as input. Therefore CBGM is not applicable to our setting.
- COMET [2] is an unsupervised, unconditional diffusion model that does not take any input (neither class labels nor text). Therefore COMET is not applicable to our setting either.
- Since COMET is an unsupervised learning model, the visual concepts decomposed by COMET do not have ground truth. Therefore, it is not possible to evaluate COMET in our setting.
We have included the discussion above in the revision as suggested (Appendix E.1).
W2. Incorporation of other metrics.
Thank you for your suggestions.
Additional Experiments and Metrics. Following your suggestion, we ran additional experiments with another metric, the Inception Score (IS). The results are presented in the tables below.
Table A. Results on the CUB dataset.
| Model | FID | IS | Class Accuracy | Concept Accuracy |
|---|---|---|---|---|
| SD-2.1 | 29.55 | 5.40 | 0.5033 | 0.9222 |
| PixArt-α | 46.85 | 3.82 | 0.1208 | 0.8231 |
| TI | 23.36 | 5.41 | 0.6397 | 0.9496 |
| ECDM (Ours) | 22.94 | 5.63 | 0.6492 | 0.9561 |
Table B. Results on the AWA2 dataset.
| Model | FID | IS | Class Accuracy | Concept Accuracy |
|---|---|---|---|---|
| SD-2.1 | 37.79 | 14.78 | 0.8935 | 0.9850 |
| PixArt-α | 59.71 | 13.47 | 0.9008 | 0.9764 |
| TI | 29.63 | 14.79 | 0.9142 | 0.9800 |
| ECDM (Ours) | 28.91 | 14.93 | 0.9200 | 0.9801 |
Table C. Results on the CelebA-HQ dataset.
| Model | FID | IS | Class Accuracy | Concept Accuracy |
|---|---|---|---|---|
| SD-2.1 | 53.47 | 3.36 | 0.4881 | 0.8079 |
| PixArt-α | - | - | - | - |
| TI | 53.47 | 3.36 | 0.4881 | 0.8079 |
| ECDM (Ours) | 52.89 | 3.51 | 0.5017 | 0.8182 |
These tables show that even in terms of IS, our method still outperforms all baseline methods, indicating consistently improved image quality. These results and the necessary discussions have been incorporated into the revised version of our paper in the Experiment section.
FID Already Evaluates Diversity. We would like to clarify that the FID metric already evaluates the diversity of our generated images. Specifically, FID is computed as the distance between the distribution of generated images and the distribution of real images, using feature vectors from a pretrained Inception network. Therefore, a small FID means that our ECDM's generated images are as diverse as the real images.
Thank you again for your comments, and we are open to considering additional metrics that might enhance the evaluation process and provide further insights into our model's performance. If you have specific metrics in mind, we would be very happy to explore them further during the discussion period.
Dear Reviewer grAS,
Thank you for your time and effort in reviewing our paper.
We appreciate your valuable comments and suggestions, and we firmly believe that our response and revisions can fully address your concerns. We are open to discussion (before Nov 26 AOE, after which we will not be able to respond to your comments unfortunately) if you have any additional questions or concerns, and if not, we will be immensely grateful if you could reevaluate your score.
Thank you again for your reviews which helped to improve our paper!
Best regards,
ECDM Authors
I thank the authors for the rebuttal and efforts. Some of my concerns, such as the metrics and term clarifications, are addressed. Comprehensively considering the contributions of the paper, the current experiment volume (only on the pretrained stable diffusion model), and the perspective from other reviewers, I keep my score for now.
This paper introduces Energy-based Conceptual Diffusion Models (ECDMs), which integrate diffusion models and Concept Bottleneck Models within an energy-based framework. The key contribution is providing a unified approach for concept-based generation, interpretation, debugging, intervention, and imputation. The method enables both high-quality image generation and human-interpretable control through concepts. The authors demonstrate effectiveness on three datasets (CUB, AWA2, CelebA-HQ) through quantitative and qualitative evaluations.
Strengths
- Novel integration of concept bottleneck models with diffusion models through an energy-based framework
- Comprehensive theoretical framework with detailed proofs
- Multiple practical applications (generation, interpretation, debugging, intervention)
- Strong empirical results across different datasets
- Clear improvement over baseline methods in both generation quality and concept accuracy
Weaknesses
The paper fails to acknowledge pioneering work on energy-based diffusion models, particularly "Diffusion Recovery Likelihood" and "Cooperative Diffusion Recovery Likelihood", and also fails to include a wide range of works using EBMs as compositions, such as "A Theory of Generative ConvNet", "Implicit Generation and Generalization in Energy-Based Models", etc.
Questions
- How does the method scale with an increasing number of concepts?
- What is the computational overhead compared to standard diffusion models?
- Could the framework be extended to handle continuous concept values rather than binary?
- How robust is the concept interpretation when handling out-of-distribution samples?
Details of Ethics Concerns
NA
Q4. How robust is the concept interpretation when handling out-of-distribution samples?
This is a good question. Inspired by your comment, we conducted additional experiments regarding out-of-distribution samples, and the results are included in Figure 7 and Appendix B.2.
Additional Results on Out-of-Distribution Samples. Specifically, the experiments are conducted on the TravelingBirds dataset, following the robustness experiments of CBM [7]. We provide bird images under significant background shift to our model for concept interpretation. In this case study, our model can still accurately infer the corresponding concepts of the bird "Vermilion Flycatcher" (e.g., "all-purpose bill shape" and "solid belly pattern"). These findings demonstrate our model's robustness when facing domain shifts.
Why ECDM Is Robust for Out-of-Distribution Samples. Typical methods tend to suffer from spurious features, e.g., irrelevant backgrounds. In contrast, the concept-based modeling framework of our ECDM ensures the robustness of the interpretations. Specifically, ECDM forces the model to learn concept-specific information and use these concepts to generate images and interpret these images; this way, ECDM focuses more on the genuine attributes of the target object and is less influenced by irrelevant, spurious features, such as irrelevant backgrounds. As a result, our ECDM enjoys robustness when dealing with out-of-distribution samples. For example, when interpreting a water bird with a spurious land background, our ECDM focuses only on the concepts of the water bird in the foreground and, therefore, will not be fooled by the spurious features in the background.
We have incorporated the discussion above in our revised paper (e.g., Appendix B.2) as suggested.
[1] Gao, Ruiqi, et al. "Learning Energy-Based Models by Diffusion Recovery Likelihood." International Conference on Learning Representations. 2021.
[2] Zhu, Yaxuan, et al. "Learning Energy-Based Models by Cooperative Diffusion Recovery Likelihood." The Twelfth International Conference on Learning Representations. 2024.
[3] Xie, Jianwen, et al. "A Theory of Generative Convnet." International conference on machine learning. PMLR, 2016.
[4] Du, Yilun, and Igor Mordatch. "Implicit Generation and Modeling with Energy Based Models." Advances in Neural Information Processing Systems 32 (2019).
[5] Xu, Xinyue, et al. "Energy-based Concept Bottleneck Models: Unifying Prediction, Concept Intervention, and Probabilistic Interpretations." The Twelfth International Conference on Learning Representations. 2024.
[6] Song, Jiaming, Chenlin Meng, and Stefano Ermon. "Denoising Diffusion Implicit Models." arXiv preprint arXiv:2010.02502 (2020).
[7] Koh, Pang Wei, et al. "Concept Bottleneck Models." International conference on machine learning. PMLR, 2020.
Thank you for your encouraging and valuable comments. We are glad that you found our method "novel" and has "multiple practical applications", our theoretical framework "comprehensive" "with detailed proofs", our empirical results "strong", and our improvement "clear". We will address each of your comments in turn below.
W1. Acknowledging pioneering work on energy-based diffusion models and works using EBM as compositions.
Thank you for pointing us to these interesting papers. Following your suggestions, we have cited and discussed these pioneering papers in our revision (in the related works section).
We recognize the importance of the works [1, 2, 3, 4] you have mentioned. For example, the pioneering works on energy-based diffusion models, i.e., Diffusion Recovery Likelihood and its extension [1, 2], train EBMs with diffusion recovery likelihood to facilitate training and sampling on high-dimensional datasets.
We also note that for these works,
(1) The number of supported concepts is fixed and limited (e.g., only a small, fixed set of concepts, compared to the full sets of annotated concepts in our ECDM), and hence not sufficiently informative as interpretations.
(2) More importantly, these works aim at compositional generation with deterministic concepts and therefore fail to provide probabilistic interpretation, which is the focus of our ECDM.
Therefore, these methods are not applicable to our setting.
In contrast, our ECDM explicitly considers human-understandable probabilistic concept explanations in its design by jointly modeling the input instruction $y$, the associated concepts $c$, and the generated image $x$ during the generation process within a unified energy-based framework.
For Questions:
Q1. How does the method scale with an increasing number of concepts?
Thank you for mentioning this. Our model is efficient and scales linearly with the number of concepts in terms of both computation and the number of model parameters.
For example, with a smaller concept set the parameter size is 27.57M (excluding all frozen pretrained components), and with a larger concept set it is 110.99M. Note that the computational cost and number of parameters of all frozen pretrained components are fixed (i.e., constant).
We have included a scaling analysis in Figure 8 of the updated Appendix D. We can see that the number of model parameters scales linearly with the number of concepts.
Q2. What is the computational overhead compared to standard diffusion models?
Thank you for mentioning this.
Mapping Energy Network. We sample from the mapping energy network using the Gradient Inference technique, as outlined in ECBM [5]. This sampling procedure requires approximately 10 to 30 steps, taking around 10 seconds of wall-clock time.
Concept Energy Network. For the generative concept energy network, we model the diffusion model as an implicit representation of the energy function, making the diffusion model's sampling algorithm applicable to our framework. We utilize the standard diffusion sampling algorithm (i.e., DDIM [6]) to generate an image from the concept energy network. This process involves approximately 50 steps and takes around 3 seconds of wall-clock time on an NVIDIA RTX 3090. Therefore, the computational overhead remains comparable to that of standard diffusion models.
Q3. Could the framework be extended to handle continuous concept values rather than binary?
This is an insightful question and points to an interesting extension of our ECDM.
- Our framework naturally supports normalized continuous-valued concepts. For example, by normalizing the continuous concept value to the range $[0, 1]$, the concept probability $\hat{c}$, which is already a real (continuous) number in $[0, 1]$ used for mixing the positive/negative concept embeddings $e^{+}$/$e^{-}$, can be substituted by this value and thus integrated into our framework.
- Our framework can be further extended to support unnormalized continuous-valued concepts. For example, we can learn a unit concept embedding $e_{u}$ that represents the unit value of a certain concept, and a continuous magnitude $m$ that represents the actual magnitude of the concept. With $e_{u}$ and $m$, we can then replace the final concept embedding (in Line 232-238) with $m \cdot e_{u}$, as summarized below. All other components of our ECDM can remain unchanged.
We agree that extending our method toward continuous concepts would be an interesting future work, and we have included the discussion above in our revised paper (Appendix E.2).
Dear Reviewer oJRi,
Thank you for your time and effort in reviewing our paper.
We appreciate your valuable comments and suggestions, and we firmly believe that our response and revisions can fully address your concerns. We are open to discussion (before Nov 26 AOE, after which we will not be able to respond to your comments unfortunately) if you have any additional questions or concerns, and if not, we will be immensely grateful if you could reevaluate your score.
Thank you again for your reviews which helped to improve our paper!
Best regards,
ECDM Authors
The paper introduces the concept bottleneck model into the diffusion generation process.
Strengths
- The paper is well-written and easy to read.
- The paper introduces the concept bottleneck model into the generative diffusion model, so the model now has a probabilistic interpretation of the generated images.
Weaknesses
- Although I appreciate the idea of using concept sets to explain the generation, the proposed formulation does not make sense to me. The paper transforms a text embedding into a probability vector that represents the concepts. The point is that it is a deterministic mapping. For instance, "polar bear" outputs a deterministic vector that represents "paws", "furry", and "big". In many situations, we would expect a polar bear to be of different sizes, so the "big" dimension should vary within [0,1] across different polar bear images. In other words, I am expecting that the concept probability vector changes with the generated image (not only the input text prompt).
- Given a binary concept label set, I am wondering what the optimal output of the concept energy model with input $y$ is? It seems that a binary output is also expected from $y$ to minimize the loss? If yes, can we just use a logic mapping, i.e., "polar bear" -> paws=1, big=1? (So we do not have to learn the first energy model.) If not, can you provide any explanation why it would not learn a binary vector given the target is binary?
If these two concerns are addressed, I am happy to raise my score.
Questions
as above
W2. "Given a binary concept labels set, I am wondering the optimal output of the concept energy model with input y? It seems that a binary output is also expected from y to minimize the loss? Can you provide any explanation why it would not learn a binary vector given the target is binary?"
This is a good question. Our ECDM's formulation and learning process goes beyond a simple binary logical mapping; instead, they involve probabilistic interactions among concepts, instructions, and generated images.
As shown in the response to W1 above, given the same text prompt (e.g., "A photo of the animal Polar Bear"), our ECDM can generate different concept probabilities according to the different generated images. Therefore, it is not a simple binary logical mapping from the text prompt (and its embedding) to the concepts; they also depend on the images.
Formally, we model the joint energy function as
$$E_{\text{joint}}(y, c, x) = E_{\text{map}}(y, c) + E_{\text{concept}}(c, x),$$
with the mapping energy function $E_{\text{map}}(y, c)$ measuring the compatibility between the instruction $y$ and the concepts $c$, and the concept energy function $E_{\text{concept}}(c, x)$ measuring the compatibility between the concepts $c$ and the image $x$.
- Therefore, given the input $y$, the optimal concepts depend not only on the mapping energy function $E_{\text{map}}(y, c)$ but also on the concept energy function $E_{\text{concept}}(c, x)$. Since the inferred concepts also change with respect to the image $x$, the resulting probability $\hat{c}$ for each concept will be a real value between $0$ and $1$.
- Since different images may depict animals of different sizes, concepts such as "big" are actually real values in $[0, 1]$, as you mentioned. With our ECDM inferring concepts from both the input text $y$ and the image $x$, the resulting $\hat{c}$ for each concept will be a real value between $0$ and $1$.
Last but not least, we would like to thank you again for your insightful comments and for keeping the communication channel open. Please do not hesitate to let us know if you have any follow-up questions. We will be very happy to provide more clarifications.
Thank you for your valuable reviews. We are glad that you found our model "provides probabilistic interpretation" and our paper "well-written and easy to read". Below, we address your questions one by one:
W1. "... The point is that it is a deterministic mapping. ... I am expecting that the concept probability vector changes with generated image (not only the input text prompt)."
Thank you for this insightful comment. We would like to clarify that (1) our ECDM's concept probability does vary with both the text instructions and the generated image and that (2) it provides probabilistic interfaces rather than a deterministic mapping.
Why Our ECDM's Concept Probability Varies with the Generated Image Too. This is actually a key property enabled by our ECDM's joint energy-based modeling. Specifically, we model the joint energy function as
$$E_{\text{joint}}(y, c, x) = E_{\text{map}}(y, c) + E_{\text{concept}}(c, x),$$
with the mapping energy function $E_{\text{map}}(y, c)$ measuring the compatibility between the instruction $y$ and the concepts $c$, and the concept energy function $E_{\text{concept}}(c, x)$ measuring the compatibility between the concepts $c$ and the image $x$.
This joint formulation of energy functions, particularly the concept energy function, allows the concept probability $\hat{c}$ in the inference process to change with the generated image in our model's interpretation task, which is the main focus of our paper. Specifically, we optimize the concept probability $\hat{c}$ using the concept energy network (Eqn. (17)) given a generated image for interpretation, with a probability range of $[0, 1]$. This non-binary probability indicates "to what degree the diffusion model generates the image based on these specific concepts." For example, given a "polar bear" image generated by Stable Diffusion, our ECDM will infer a high probability for the concept "arctic" and a low probability for "forest" using Eqn. (17), suggesting that the Stable Diffusion model may generate this image based on "arctic" rather than "forest".
Additional Experiments: Concepts "Water" and "Arctic". Inspired by your comments, we conducted additional experiments to observe the changes in this non-binary concept probability (Figure 6 of Appendix B.1). Given the same prompt, "A photo of the animal Polar Bear", the diffusion model generates two different "Polar Bear" images: The top image does not have a "water" and "arctic" background, while the bottom image has a "water" and "arctic" background. Our ECDM correctly infers that the probabilities of the concepts "water" and "arctic" in the top image are 0.1233 and 0.0363, respectively, much smaller than those in the bottom image (0.9543 and 0.8015, respectively).
Additional Experiments: Concept "Big". For the concept "big," we can also see meaningful variation in the inferred probabilities, 0.9067 (top image) versus 0.9922 (bottom image), meaning that our ECDM is more certain that the bottom image is a "big" polar bear, but is less certain about the top image since it only shows the head of the bear.
Therefore, our ECDM's concept probability vector does adjust with the generated image in interpretation. We have incorporated your valuable insight and further discussions into Appendix B.1.
Dear Reviewer AYc9,
Thank you for your time and effort in reviewing our paper.
We appreciate your valuable comments and suggestions, and we firmly believe that our response and revisions can fully address your concerns. We are open to discussion (before Nov 26 AOE, after which we will not be able to respond to your comments unfortunately) if you have any additional questions or concerns, and if not, we will be immensely grateful if you could reevaluate your score.
Thank you again for your reviews which helped to improve our paper!
Best regards,
ECDM Authors
I thank the authors for their responses. But I am more interested in the generation part of the method. In text-to-image generation, does the method generate a deterministic probability vector from text? Can we change a single dimension (e.g., 1 -> 0 gradually) to get a slightly different image with the corresponding concept changes? As for the binary question, it is still unclear why the objectives won't lead to binary solutions (in the training and generation phases, not the interpretation phase). Since your target is binary and it is labeled correctly, why is the optimal solution not binary?
Thank you for your feedback and for providing additional clarifications regarding your questions. Below, we will address them point by point in detail.
Q1: Can we change a single dimension (e.g., 1 -> 0 gradually) to get a slightly different image with the corresponding concept changes?
Yes, our ECDM can change a single dimension (e.g., 1 -> 0 gradually) to get a slightly different image with the corresponding concept changes. When performing generation, the images produced by our model vary depending on the adjustment of the concept probabilities. This feature is enabled by the novel joint energy-based modeling approach of our ECDM, which facilitates a rich representation of the relationships between concepts and further supports concept-based generation by leveraging these modeled interactions through the minimization of the joint energy.
Additional Experiments to Demonstrate this Ability. Inspired by your detailed comments, we conducted additional experiments to visualize the variation of the generated image according to the adjustments of concepts; the new results are included in Figure 6 of Appendix B.1.
Given the same prompt, "A photo of the animal horse", we adjusted the probabilities of the concepts "white" and "brown". Specifically, we gradually decreased the probability of the concept "white" from to , simultaneously increased the probability of the concept "brown” from to , and then performed joint generation.
As shown in Figure 6 of Appendix B.1, our ECDM accurately reflected these concept probability changes, producing images of a horse with the corresponding colors. When the probability of "white" was set to 1 and "brown" to 0, the model generated a purely white horse. As the probability of "white" gradually decreased and that of "brown" increased, the generated horse images gradually shifted in color, eventually producing a purely brown horse.
Therefore, our ECDM's generated image does adjust with the concept probability vector in generation. This further validates that our model does not merely learn a deterministic mapping. We have incorporated your valuable insight and further discussions into Appendix B.1.
Q2: In text-to-image generation, does the method generate a deterministic probability vector from text?
Thank you for your further clarification. The concept probability vector is also a probabilistic vector given the text: the concept probability depends on both the image and the instruction during training and generation. We refer the reviewer to our response to Q3 below for further details.
Q3: Why won't the objectives lead to optimal solutions in simple binary form during training and generation?
Thank you for your detailed follow-up question. We would like to kindly clarify that in the training and generation stage, the optimal solution is still not in simple binary form.
Generation. During generation, our model focuses on minimizing the joint energy $E_{\text{joint}}(y, c, x)$, i.e., "given instruction $y$, find the image generation $x$ and concept set $c$" that minimize this joint energy. In this way, $x$ and $c$ will mutually affect each other.
For example, our full model's generation is equivalent to:
- (A) given $y$, generate $c$;
- (B) given $c$, generate $x$;
- (C) given $x$ and $y$, adjust $c$;
- (D) repeat steps (B) and (C) until convergence.
While the $c$ generated in step (A) may be binary, $c$ is no longer binary after alternating between steps (B) and (C). The key is that the generation of $x$ is stochastic (probabilistic), i.e., the generated images will differ depending on the initial noise of the diffusion model. Therefore, the final $c$ depends not only on $y$ but also on $x$, making it probabilistic and non-binary.
Training. Similarly, in training, our model focuses on optimizing the estimation of the joint energy, instead of simply producing binary predictions.
For the mapping energy network $E_{\text{map}}(y, c)$, the optimization target is to minimize the energy estimate (a scalar value) for each correct instruction-concept pair, and to increase it for incorrect pairs, using contrastive divergence, rather than predicting a binary output. Consequently, we aim for optimal concept embeddings that minimize the energy of the correct input combination to reduce the loss, rather than directly predicting the correct binary label.
Regarding the concept energy network $E_{\text{concept}}(c, x)$, the input consists of the combined concept embedding $e$ and the corresponding ground-truth images $x$, rather than a binary prediction. Similarly, we optimize the concept embedding so that the pretrained energy estimator assigns the lowest energy to the correct concept-image combination, ensuring accurate compatibility estimation between concepts and images.
This training process is non-binary under the energy-based formulation, forcing the network to learn more complex relationships among the instructions, concepts, and image generations. As a result, the training process does not learn a binary vector because (1) the target is contrastive energy minimization, not binary classification; and (2) the joint minimization of both energy networks forces the concept embedding to be a complex, non-binary vector.
Conclusion. Therefore, if we train and generate using the mapping energy $E_{\text{map}}(y, c)$ alone, the output $c$ would be binary; if we train and generate considering the joint energy $E_{\text{joint}}(y, c, x)$, the final $c$ will not be binary. This holds for both training and generation.
If the model had only learned fixed deterministic and binary mappings as the optimal solution during the training process, the additional experiments on
- concept-based intervention in the response to Q1 and Figure 6 of Appendix B.1 and
- energy matching conceptual interpretation in Appendix B.2
could not have been successful, as both of them heavily rely on leveraging non-binary concept-image interactions to perceive and derive concept probabilities.
Again, we are immensely grateful for your follow-up comments and keeping the communication channel open. If you feel that we have adequately addressed your concerns, we would appreciate your consideration in adjusting our score.
We thank all reviewers for their valuable comments. We are glad that they found the problem we solve "novel"/"has multiple practical applications"/"a practical tool"/"supports both generative and interpretive tasks" (oJRi, grAS), our proposed method "has probabilistic interpretation"/"versatile"/"enhances the interpretability" (AYc9, 5CJZ), our theoretical analysis "comprehensive"/"detailed" (oJRi), our paper "well-written"/"easy to read"/"well-organized" (AYc9, 5CJZ), and our experiments "thorough and clear" (5CJZ), and agreed that our ECDM has "strong empirical results", "clear improvement" and "improvements" in both "image quality" and "concept alignment" (oJRi, grAS).
Below we address the reviewers’ questions. We have also updated the main paper and the Appendix (with the changed part marked in blue).
The submission presents a generative model for an image and its "concepts" jointly, given an instruction label, in the hope of supporting better interpretability, diagnosis, instruction following, and modified/controlled image generation through the various derived conditional distribution samplers. Reviewers acknowledge the general idea and the capabilities, while also raising some insufficiencies in the discussion of related work and the use of only a pretrained Stable Diffusion model, and asking for finer demonstrations of controllability and more evaluation metrics. The authors have addressed some, but concerns remain (e.g., only using one pretrained model) and the reviewers did not update their neutral-to-negative scores.
In addition, I also found the mathematical formulations confusing. I posted my concerns to the authors: "In Eq. (5), is the E^concept model to be used as the epsilon model in the diffusion formulation? If so, why it does not depend on t (while the right-most expression contains t)? In Eq. (6), how does the l.h.s still a function of x (and t, if you answered yes to the first question) while you take expectations w.r.t x and t on the r.h.s? Particularly, how can you recover Eq. (5) from the definition in Eq. (6)? In the following description on training the energy model, if you are training it using maximum likelihood, why can you only take the expectation of the energy under the data distribution, while omitting the expectation of the energy under the "model" distribution (the distribution that the energy function defines) from the loss, which should be there in the standard energy-based training loss? Is there any specialty here?" The authors replied, but without sufficiently detailed justification on the equations, deductions, and methods, and my concerns persist. In addition, Eq. (2) also seems inaccurate as there should be an appropriate time weighting to the noise prediction loss at each t to make it an ELBO. Therefore, the submission does not seem like a serious draft for publication. I hence recommend a reject.
Additional Comments on Reviewer Discussion
(Already covered in Metareview)
Reject