Adapting Diffusion Models for Improved Prompt Compliance and Controllable Image Synthesis
We introduce a new framework for modeling the joint distribution of images and conditioning variables by adapting Stable Diffusion to enhance prompt compliance, controllability and editing of images.
Abstract
Reviews and Discussion
This paper proposes a new family of diffusion models, FG-DM, for better prompt compliance. Unlike traditional diffusion models, the FG-DM models the joint distribution of images and conditioning variables, such as semantic, sketch, depth, or normal maps, via a factor graph. These extra factors largely boost prompt compliance compared to using text only. Users can also easily edit the target image by modifying the intermediate conditions. This work shows better results than vanilla diffusion.
Strengths
By decomposing the text into different factors such as segmentation maps, depth maps, etc., and using these factors as conditions for generation, this method achieves better prompt compliance. Additionally, the introduction of intermediate factors makes the generated images easier to edit.
Weaknesses
Since intermediate factors are generated sequentially, this method needs more time for generation. The novelty of this method is limited.
Questions
- How much more time do we need for the conditional generation chains? (for example, how much more time do we need when we use one or two conditions compared to original SD?)
- If we use an off-the-shelf condition generator, e.g. a text-to-depth-map generator, and feed the generated depth map to ControlNet for generation, what is the advantage of the proposed method?
Limitations
Yes
We thank the reviewer for their positive comments and useful feedback on the paper.
Q1. How much more time do we need for the conditional generation chains? (for example, how much more time do we need when we use one or two conditions compared to original SD?)
Regarding time, we make several considerations. First, the fact that the conditions can be synthesized at lower resolution makes their synthesis much faster than vanilla SD. SD typically requires 3+ seconds per image, while each condition typically requires less than 1 second. Hence, the increase in overall synthesis time is small. This is shown in Table 4 of the main paper, where generating an image with SD requires 3.25 seconds, while generating an image with one condition using the FG-DM takes 4.45 seconds for 10 timesteps and 4.81 seconds for 20 timesteps. For two conditions, these numbers become 4.9/5.26 seconds for t=10/t=20 timesteps. Second, a comparison to vanilla SD is somewhat unfair because the whole point is to increase the prompt compliance of vanilla SD. A fairer comparison is to popular methods that pursue this goal, such as Attend-and-Excite (A-E), which the table shows to require 36 seconds per image. Third, while the implementation of SBPC with the FG-DM achieves good results for single-image synthesis (N = 1), the bottom half of the table shows that its true potential is unlocked by larger batch sizes. Both the SD and FG-DM implementations of SBPC achieve a much higher average object recall in this setting. However, for SD, sampling a batch of high-resolution images requires 23s per prompt, while the FG-DM is 4x faster (when using 10 DDIM steps for the segmentation mask and the same 20 steps for image synthesis) and achieves the same object recall.
Q2. If we use an off-the-shelf condition generator, e.g. a text-to-depth-map generator, and feed the generated depth map to ControlNet for generation, what is the advantage of the proposed method?
Table 8 and Table 13 in the Appendix compare to a sequential implementation trained with the segmentation condition, where the models are trained independently and used as the reviewer suggests. They show that joint training and joint inference of image and factor synthesis improve the quality of the generated images as compared to sequential operation.
Q3. The novelty of this method is limited.
Regarding novelty, we would note the following facts.
- First, the paper is the first to propose Sampling-Based Prompt Compliance (SBPC), which is shown to be much more effective than the existing Inference-Based Prompt Compliance (IBPC) approach. We show that IBPC methods are slower and underperform the proposed implementation of SBPC with the FG-DM.
- Second, we do introduce the attention distillation loss, which is shown to make a difference, both in the quantitative ablations of Table 2 of the main paper and the qualitative examples that we now show in the rebuttal pdf file (also see reply to Reviewer cyg3).
- Third, the concept of the FG-DM is novel and could have many other applications beyond improved prompt compliance. For example, its modularity enables more effective continual learning schemes, where factors are added or updated for specific conditions without the need to fully retrain the model. These types of benefits are not easy to measure/quantify without devoting a complete paper to the formulation of the continual learning problem, introducing experimental protocols, baselines, etc. We intend to consider the issue in future work.
- Finally, the paper shows some results that we consider somewhat surprising (and thus "novel"), e.g. 1) that SD can be prompted, without even changing the autoencoder, to produce segmentation maps or other conditions, and that this even maintains the generalization ability of SD, or 2) that, as is now shown in the rebuttal pdf, model inversion and editing methods work as well for conditions as for image synthesis (see reply to Reviewer cyg3). Overall, we believe that all these contributions make the paper quite novel, and the FG-DM an interesting new direction for further research by the text-to-image synthesis community.
Thank you for your response. You have addressed most of my concerns.
This paper proposes a method for sequentially generating images using a frozen SD. Starting from a text prompt, the process iteratively generates several visual conditions and images, with each step depending on the previous ones. The model uses the VAE in SD to encode and decode visual conditions and employs a condition-specific adapter to introduce information into the SD for generating these conditions. This approach makes the image generation process more controllable, flexible, and easier to edit.
Strengths
- The iterative generation method using a stable diffusion model is impressive. It provides strong control over the output while maintaining flexibility, aided by classifier-free guidance training.
- The new method allows for flexible editing by modifying visual conditions and letting the model generate realistic content based on these changes.
Weaknesses
- The newly introduced iterative process adds an extra loop to each generation step.
- Some components and experimental details are not well-explained.
Questions
- The assumption behind attention distillation suggests that word tokens and visual regions remain constant, even if the visual regions are non-semantic, like poses. Why does this happen? It seems odd that the attention score of a general word like "person" can help generate the poses of a person.
- Why do you use independent noise for different factors?
- The experimental details are unclear. Does the model run with a single visual condition for each result, or does it generate all conditions sequentially? It would help to clearly present the generation process for each experiment, including Tables 1-5 and Figures 5-7.
- How do you implement null conditions for different variables?
- You claim that after choosing the best segmentation factor, the image synthesis only needs to run once. Why is this? Does the segmentation factor determine the quality of the generated images, or is your goal just to ensure the object appears in the generated image?
Limitations
The authors have adequately addressed the limitations.
We thank the reviewer for their positive comments and useful feedback on the paper.
Q1. The newly introduced iterative process adds an extra loop to each generation step.
Although the proposed FG-DM adds an extra loop in the generation process, in Table 4 we show that using lower resolution and fewer timesteps for conditional generation has no impact on overall FG-DM performance. Hence, the increase in inference time over vanilla SD is small (4.45 secs for FG-DM vs 3.25 secs for vanilla SD). However, this is not a fair comparison since the object recall of the FG-DM (67.8) is much higher than that of SD (59.8). Note that the whole point is to improve prompt compliance. For reference, the popular Attend-Excite method [4] has a much lower improvement in recall (63.6) for a much larger inference time (36 secs). On the other hand, the table shows that the FG-DM is 4 times faster than SD under a fair comparison, where the two methods have equal recall. Please see the reply to Reviewer t2kD for a more detailed discussion.
Q2. Some components and experimental details are not well-explained.
Good point about the missing details. Tables 1-5 and Figures 5-7 use a single visual condition for image generation. The results presented in Figure 2 and Figure 19 follow the chain segmentation → pose → image (image conditioned on both). In general, the FG-DM can be trained with a single condition or multiple conditions. However, the complexity (training, inference, and model size) increases with the number of conditions. Hence, the best model configuration is a trade-off between meeting application requirements and complexity. The critical condition for improved prompt compliance is segmentation, since it allows the measurement of object recall. It is also important for editing, since it allows moving and resizing objects. In the rebuttal pdf we show that it can even be combined with editing methods like LEDITS++ to enable text-based object editing operations. The other conditions are important for some image editing applications, e.g. changing object depths, people's poses, or sketching, and can also contribute image quality improvements, as shown in Figures 2-3. Overall, the FG-DM is a very flexible framework, allowing different factors to be mixed and matched so as to best meet application needs. We will add all details to the final revision.
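To make the chained generation concrete, the sketch below (illustrative names, not the released code) shows how a segmentation → pose → image chain could be sampled, with each factor conditioned on the prompt and on all previously generated conditions:

```python
def sample_chain(factors, prompt):
    """Illustrative sketch of FG-DM chained sampling for an ordered list of
    factors, e.g. [seg_factor, pose_factor, image_factor]. Each `factor` is a
    stand-in for one diffusion factor (an adapter on frozen SD for conditions,
    or ControlNet for the image) whose hypothetical `sample` method conditions
    on the text prompt and on all previously generated variables."""
    generated = []  # conditions produced so far, in chain order
    for factor in factors:
        out = factor.sample(prompt=prompt, conditions=list(generated))
        generated.append(out)
    # The final element is the image; the earlier ones are intermediate conditions.
    return generated[-1], generated[:-1]
```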
Q3. It seems odd that the attention score of a general word like "person" can help generate the poses of a person.
The reviewer has a good point that it does not make a lot of sense to use attention distillation for pose. We do not use it for the pose factor and only use it for conditions that are conditioned by text alone. Note that the pose factor is conditioned by the segmentation, which is itself subject to the attention distillation loss. For multi-condition FG-DMs, the distillation loss is only required for the synthesis of the first condition, which is conditioned by text alone (see Figure 4 of the main paper). For the subsequent factors, which are already conditioned by a visual condition (e.g. pose conditioned by segmentation), the attention loss is not needed. We will clarify this in the revised version.
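For illustration only, here is a minimal sketch of one plausible formulation of such a loss, assuming it penalizes the discrepancy between the text-to-condition cross-attention maps and the corresponding (frozen) text-to-image cross-attention maps; the exact loss used in the paper may differ.

```python
import torch.nn.functional as F

def attention_distillation_loss(cond_attn_maps, img_attn_maps):
    """Hypothetical sketch: encourage the cross-attention maps of the
    condition-synthesis factor to match those of the image branch for the
    same prompt tokens. Both arguments are lists of tensors of shape
    (batch, heads, spatial, tokens), one per matched attention layer,
    assumed here to share spatial resolution."""
    loss = 0.0
    for a_cond, a_img in zip(cond_attn_maps, img_attn_maps):
        # Average over heads; the image branch acts as a detached teacher.
        loss = loss + F.l1_loss(a_cond.mean(dim=1), a_img.mean(dim=1).detach())
    return loss / len(cond_attn_maps)
```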
Q4. Why do you use independent noise for different factors?
The question about independent noise is an interesting one. Independent noise is a natural assumption (after all, the noise is independent even across steps of the same diffusion model factor) and enables a simple training and inference process. It is also not clear what a natural model of the dependencies between the noise of a segmentation mask and an image would be, for example. One could perhaps consider the use of a common noise in all diffusion chains. This would be trivial to implement in the forward chain (just sample noise once per step, instead of per step and per factor) but less trivial for the reverse denoising step, as some regularization would be needed to guarantee that all the chains produce the same noise. Maintaining the consistency of the noise across factors in the denoising process would likely be difficult. So, while we do not rule out that dependent noise may be an interesting possibility, we do not see an obvious way to implement it, and it is not clear that it would be beneficial. We leave this question as a topic for future work.
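To make the independence assumption concrete, here is a minimal sketch (not the actual training code) of one forward noising step for a two-factor chain, where the segmentation and image latents each receive their own Gaussian noise:

```python
import torch

def forward_noise_step(z_seg, z_img, alpha_bar_t):
    """Sketch of forward noising for a two-factor chain with independent
    noise per factor, following the standard DDPM parameterization.
    z_seg, z_img: clean latents of the segmentation and image factors.
    alpha_bar_t: cumulative schedule coefficient at timestep t (scalar tensor)."""
    eps_seg = torch.randn_like(z_seg)   # noise for the segmentation factor
    eps_img = torch.randn_like(z_img)   # independent noise for the image factor
    zt_seg = alpha_bar_t.sqrt() * z_seg + (1 - alpha_bar_t).sqrt() * eps_seg
    zt_img = alpha_bar_t.sqrt() * z_img + (1 - alpha_bar_t).sqrt() * eps_img
    # A shared-noise variant would reuse eps_seg for both factors, but keeping
    # the reverse chains consistent would then need extra regularization.
    return (zt_seg, eps_seg), (zt_img, eps_img)
```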
Q5. How do you implement null conditions for different variables?
We use a null condition to support classifier-free guidance during inference when generating conditions, even for single-condition FG-DMs. For the final factor of the FG-DM (ControlNet), the null condition is implemented as an image filled with zeros used as the conditional input. For each visual condition model (segmentation, depth, sketch, normal), we use the empty prompt ("") as the null condition while training the models.
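As an illustrative sketch (not the actual implementation), classifier-free guidance with these null conditions could look like the following, where the unconditional branch uses the empty-prompt embedding and, for the image factor, an all-zero image in place of the visual condition:

```python
import torch

def cfg_epsilon(eps_model, z_t, t, text_emb, null_text_emb, cond_map, w):
    """Hypothetical classifier-free guidance step for one FG-DM factor.
    `eps_model` predicts noise from the latent, timestep, prompt embedding,
    and visual condition; `w` is the guidance scale."""
    null_cond = torch.zeros_like(cond_map)              # null visual condition
    eps_cond = eps_model(z_t, t, text_emb, cond_map)    # conditional prediction
    eps_uncond = eps_model(z_t, t, null_text_emb, null_cond)
    # Standard CFG combination of conditional and unconditional predictions.
    return eps_uncond + w * (eps_cond - eps_uncond)
```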
Q6. Is your goal just to ensure the object appears in the generated image?
The image synthesis only needs to run once because we are only attempting to improve prompt compliance. As the reviewer notes, the idea is to sample segmentations until we can ensure that the objects appear in the image. We verify the object recall with the segmentation factor for the results presented in Tables 3 and 4 and Figure 7. It is true that we could also sample multiple images given the segmentation, but this would not improve prompt compliance, only perhaps image quality given the segmentation. We do not do this because 1) the existing models already produce high-quality images (it is not clear that the extra computation would be justified), 2) the existing image quality metrics are not fine-grained enough to even know whether this would make a big difference in terms of human judgments of quality for the resulting images, and 3) it can be done for any diffusion model (there is no special advantage of the FG-DM here).
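For clarity, here is a sketch of this sampling-based strategy (illustrative names, not the released implementation): sample a batch of low-resolution segmentation masks, keep the one with the highest object recall with respect to the prompt, and run the image factor only once.

```python
def sample_with_prompt_compliance(seg_factor, image_factor, prompt, objects,
                                  batch_size=8, seg_steps=10, image_steps=20):
    """Sketch of sampling-based prompt compliance (SBPC) with an FG-DM.
    `seg_factor` and `image_factor` are stand-ins for the segmentation and
    image diffusion factors; `objects` are the object names required by the
    prompt; `mask.classes()` is a hypothetical accessor for the classes
    present in a generated mask."""
    # 1. Sample a cheap batch of low-resolution segmentation masks.
    masks = seg_factor.sample(prompt, num_samples=batch_size, steps=seg_steps)

    # 2. Score each mask by object recall: fraction of prompted objects present.
    def recall(mask):
        return len(set(objects) & set(mask.classes())) / max(len(objects), 1)

    best_mask = max(masks, key=recall)

    # 3. Run the expensive image factor only once, on the best mask.
    return image_factor.sample(prompt, condition=best_mask, steps=image_steps)
```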
[4] Chefer, H., et. al.: Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models. SIGGRAPH (2023)
Thank you for your response. Most of my concerns have been addressed.
The paper proposes a unified generation framework that simultaneously generates the image, segmentation map, depth map, etc.
Strengths
The idea of simultaneous generation of images with various conditioning maps is very interesting.
Weaknesses
- Although the method shows good performance in image editing for generated images, it does not include any discussion of whether the framework can be applied to real image editing. Can it be applied to inverted real images? Can the framework still predict the segmentation map and keypoints of given real images? If that is possible, there would be great impact and future usage. If it is applicable to real images, then please show some results similar to ControlNet.
- The manuscript does not contain any qualitative comparison for the ablation study. Please show some visual results of the ablation study.
- Please show generated image results using handcrafted conditions, not conditions edited from generated ones. For example, give hand-drawn sketches (user-provided ones) to the model and show the generated outputs.
Questions
See weaknesses.
Limitations
Yes
We thank the reviewer for their positive comments and useful feedback on the paper.
Q1. Can it be applied to inverted real images? Can the framework still predict the segmentation map and keypoints of given real images? If it is still applicable to real images, then please show some results similar to ControlNet.
The editing question is quite interesting and motivated us to consider aspects that we had not thought about. Since the framework is based on Stable Diffusion, it can be applied for inversion. Figure 2 of the rebuttal pdf provides an answer to our understanding of the questions "Can it be applied to inverted real images? Can the framework still predict the segmentation map and keypoints of given real images?"
We have experimented with editing of both real images and their segmentation masks. The top of Figure 2 refers to inversion of the segmentation mask. We use an off-the-shelf OpenSEED [2] model to extract the segmentation map of a real image (shown on the bottom left of the figure) and apply the FG-DM segmentation factor model for inversion and editing using LEDITS++ [3], a recent method for text-based image editing. We apply LEDITS++ to the segmentation factor to 1) replace the mask of the woman by that of a chimp (third image of the top row) and 2) delete the mask of the umbrella (fifth image). New images (fourth and sixth) are then generated by the image synthesis factor conditioned on the edited segmentation masks. We have found that the inversion and editing of segmentation masks is quite robust. The synthesized masks usually reflect the desired edits. However, because the final image synthesis is only conditioned on these masks, the synthesized image does not maintain the background of the original image. The synthesized image is a replica of the original image at the semantic level (similar objects and layout) but not at the pixel level. In our experiments, this method has high robustness and quality for semantic-level editing. It can also be used for pixel-level editing via copy and paste, as discussed below.
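For readability, the semantic-level editing pipeline above can be summarized by the following sketch; every callable is a placeholder for the named component (OpenSEED, the FG-DM segmentation factor, LEDITS++, the image synthesis factor), not an actual API.

```python
def semantic_level_edit(real_image, edit_prompt,
                        openseed_segment, seg_factor, ledits_pp_edit, image_factor):
    """Sketch of the mask-level real-image editing pipeline (Figure 2, top).
    All arguments after `edit_prompt` are placeholder callables/objects for
    the components named in the text."""
    # 1. Extract a segmentation map from the real image with OpenSEED.
    seg_map = openseed_segment(real_image)

    # 2. Invert the map with the FG-DM segmentation factor, then apply a
    #    text-based edit with LEDITS++ (e.g. replace the woman's mask by a
    #    chimp's, or delete the umbrella's mask).
    seg_latent = seg_factor.invert(seg_map)
    edited_seg = ledits_pp_edit(seg_factor, seg_latent, edit_prompt)

    # 3. Synthesize a new image conditioned only on the edited segmentation,
    #    so the result matches the original at the semantic level (objects,
    #    layout) but not at the pixel level (background is not preserved).
    return image_factor.sample(condition=edited_seg)
```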
We next investigated pixel-level inversion and editing, which is harder. The bottom part of Figure 2 shows the comparison of LEDITS++ editing with inversion by SD and by the image synthesis factor of the FG-DM. For the latter, we apply inversion to the ControlNet image generation factor using the real image and the segmentation mask extracted from it. Then we perform the LEDITS++ edit using the edited mask from the top part of Figure 2 (inverted with the FG-DM segmentation factor) to produce the edited image, as shown in columns 4 and 5. This pixel-level inversion and editing tends to maintain the background of the original image but is much less robust than mask-level editing in terms of editing quality. This can be seen from the images in columns 2 and 3, which show the inversion using SD, which fails to produce a realistic chimp and turns the woman into a stone sculpture. The FG-DM produces more meaningful edits, which nevertheless still have room for improvement, as shown in columns 4 and 5. The last column of the bottom part of Figure 2 shows an added advantage of the FG-DM: the chimp generated in the top portion can be pasted onto the original image, due to the availability of the segmentation mask. In this example the pasting is rough around the object edges, since we have made no attempt to beautify it. We believe that this copy-and-paste technique can produce high-quality images with a bit more work. Nevertheless, these examples show that real images can be edited directly or used as semantic guidance to generate new images following the specified semantic layouts. The pixel-level inversion can also be improved with more work, but not within the time frame of the rebuttal. It will require some additional technical contributions, and we leave it, as well as the optimization of the copy-and-paste technique, as a topic for a subsequent paper on the use of FG-DMs for text-based image editing.
Q2. The manuscript does not contain any qualitative comparison of ablation study. Please show some visual results on ablation study.
Great point about ablations. Figure 1 of the rebuttal pdf shows some qualitative results of the ablation on the impact of the attention distillation loss. There is a clear qualitative benefit in introducing this loss. Without it, the model generates less accurate masks, leading to an unrealistic pizza-making depiction, cart-person relationship, and zebra pair (from left to right). This confirms the quantitative ablation showing the benefits of the attention distillation loss in Table 2 of the main paper, but provides a stronger illustration of the advantages of the loss, which tends to produce more "plausible" scenes. Such plausibility is difficult to measure with quantitative metrics. For example, the CLIP score is not sensitive to the fact that the cart and the person are not interacting in a normal way, or that the pizza-making activity is unrealistic.
Q3. For example, give the hand-drawn sketches (user-identified one) to the model and show the generated outputs.
We note that if hand-drawn sketches are provided, the model reduces to a standard conditional DM. In this case, it inherits all the capabilities of the ControlNet, or whatever DM is used to implement the final factor. Figure 3 of the rebuttal pdf shows some qualitative results on using hand-drawn sketches to synthesize images. These are as good as the ones produced with the ControlNet.
[2] Hao Zhang, Feng Li, Xueyan Zou, Shilong Liu, Chunyuan Li, Jianfeng Gao, Jianwei Yang, Lei Zhang. A Simple Framework for Open-Vocabulary Segmentation and Detection, ICCV 2023
[3] Manuel Brack, Felix Friedrich, Katharina Kornmeier, Linoy Tsaban, Patrick Schramowski, Kristian Kersting, Apolinário Passos. LEDITS++: Limitless Image Editing using Text-to-Image Models, CVPR 2024
Thank you for your detailed rebuttal. Most of my concerns have been addressed.
The one small remaining concern is that there are many concurrent and previous methods which attempt similar tasks. I leave the final decision to the meta-reviewer on whether this manuscript can go beyond the borderline of acceptance in this extremely competitive field.
This paper introduces Factor Graph Diffusion Models (FG-DMs) to address limitations in current generative models. FG-DMs model the joint distribution of images and conditioning variables using a factor graph decomposition, offering advantages like efficient sampling for better prompt compliance, fine-grained editing, and explainability. The method's effectiveness is validated on several datasets.
Strengths
This paper introduces a method to model the joint distribution of images and conditioning variables, enhancing prompt compliance and control over image synthesis.
The Attention Distillation Loss is proposed to enhance the consistency and quality of generated images and conditions.
Weaknesses
The proposed method seems promising; however, it is very important to compare with current SOTA works to justify its effectiveness. The authors only include Stable Diffusion in Table 5, which is not a fair comparison. Please include more recent works, especially since 2023, such as StyleGAN-T (Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis) and Make-A-Scene.
Questions
What's the training cost of your model? Is there a comparison with other methods?
Limitations
Yes.
We thank the reviewer for their positive comments and useful feedback on the paper.
Q1. The proposed method seems promising; however, it is very important to compare with current SOTA works to justify its effectiveness. The authors only include Stable Diffusion in Table 5, which is not a fair comparison. Please include more recent works, especially since 2023, such as StyleGAN-T (Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis) and Make-A-Scene.
Unfortunately, the StyleGAN-T work mentioned by the reviewer only has open-source code; the trained checkpoints are not available. This makes it impossible for us to compare, since we are an academic group and do not have the resources to train the model from scratch. Make-A-Scene has neither code nor checkpoints released, which prevents us from performing quantitative comparisons on the COCO validation dataset. Instead, we compared the qualitative results of the FG-DM with Make-A-Scene and SpaText in Figure 13 of the Appendix (figures taken from the respective papers). Note that the proposed FG-DM synthesizes both segmentation maps and images, while the results of these methods were obtained with manual sketching of the segmentation, which is cumbersome and requires some skill. Nevertheless, the FG-DM generates high-quality images that adhere well to the prompts as compared to these prior works.
In any case, the reviewer has a good point in requesting quantitative comparisons to more recent methods. For this, we compare with the recent 4M model [1], an autoregressive model trained from scratch on both discriminative and generative tasks. In this case, we use the largest model (4M-XL) released by the authors. Table 1 in the rebuttal pdf presents the updated comparison. Figure 1 in the rebuttal pdf shows a qualitative comparison between the FG-DM and 4M. It can be seen that 4M generates images of weaker quality (distorted hands, a missing person's head, deformed zebra bodies) as compared to the FG-DM with/without the attention distillation loss.
Q2. What's the training cost of your model? Is there a comparison with other methods?
Regarding training time, one of the significant benefits of the proposed FG-DM is the short training time, since it only requires training adapters that are added to the pretrained SD model. Unfortunately, detailed comparisons are hard to perform since, as mentioned above, we do not even have the resources needed to train many of the models that have appeared in the literature. For the FG-DM factors, we simply train the adapters for 100 epochs; please see Section A.5.2 in the Appendix of the original paper for a more detailed discussion. The training time per factor is about 1.5 days on 2 A100 GPUs, or 72 A100 GPU hours. Finetuning the ControlNet takes about 2 additional days on 2 A100 GPUs. We find it quite interesting that these simple adaptations work quite well even for the synthesis of conditions, which the SD model was not originally trained to synthesize. For certain conditions, training converges even faster, e.g. for depth maps it only takes 10 epochs and we do not observe any improvement after that.
For reference, training the original SD model requires about 150,000 A100 GPU hours, which is orders of magnitude larger. The FG-DM is also much more efficient than 4M, which requires 1.5 days of training on 64 A100 GPUs for their smallest model (4M-B) and 8 days on 128 A100 GPUs for their largest model (4M-XL), which produces lower quality images, as discussed above.
[1] Mizrahi, D., Bachmann, R., Kar, O.F., Yeo, T., Gao, M., Dehghan, A., Zamir, A.: 4M: Massively multimodal masked modeling. In: Thirty-seventh Conference on Neural Information Processing Systems (2023)
Thanks for the detailed rebuttal. Most of my concerns are properly addressed.
We thank all the reviewers for their time and comments. The rebuttals for the individual reviews are posted below each review. Here we summarize the responses.
- In the rebuttal pdf, we added qualitative results of the ablation study with and without the attention distillation loss (Figure 1). We also added quantitative and qualitative comparisons to the recent 4M-XL model (Figure 1 and Table 1). We show additional qualitative results using hand-drawn sketches (Figure 3).
- We show that the FG-DM is also applicable to inversion and editing of real images and their extracted segmentation maps in Figure 2 of the rebuttal pdf.
- We clarify concerns regarding the novelty, computational cost, and inference time of the FG-DM.
- We apologize for missing details on the experiments and have clarified them in the rebuttal. We will add the details in the revised version.
We thank all the reviewers for their time and useful comments for improving the submission. We also thank all the reviewers for positively acknowledging the rebuttal. Below, we differentiate the contributions of the FG-DM from previous works to alleviate any further concerns.
- As compared to concurrent or existing works, e.g. 4M, which requires training a model from scratch for generating conditions, the proposed FG-DM framework shows that it is possible to efficiently adapt existing models like Stable Diffusion (SD) to improve prompt compliance and controllability (Table 4, rebuttal Table 1). We show that SD can faithfully reconstruct segmentation, depth, normal, or sketch maps (Figures 17, 18), which allows for adaptation of the pretrained model. This has significant practical relevance. The fact that SD can simply be adapted to generate all the conditions, represented as three channels, and generalize to concepts not covered by the finetuning dataset (Figures 3, 5), despite not being trained on them, is a novel and somewhat surprising result.
- A main novelty of the paper is to propose a paradigm shift for prompt compliance research. While prior works focus on inference-based optimizations to improve prompt compliance, we show that sampling-based prompt compliance is substantially faster (4 times) and better (Table 4). Furthermore, any future improvements in the samplers for diffusion models will automatically improve the performance of our sampling-based method. While we only used the standard DDIM sampler in all our experiments, fast sampling is an active area of research, and fast samplers such as PNDM [1] or DDSS [2] can further boost our speed results.
- Prompted by reviewer cyg3, we now show that image inversion and editing techniques such as LEDITS++ can be applied to the FG-DM to edit the conditions and the images directly based on text. Figure 2 of the rebuttal shows that segmentation masks can be directly edited by text prompts using the proposed FG-DM segmentation factor. To the best of our knowledge, this is the first method able to directly edit semantic masks from text prompts. This adds a new editing dimension, which helps to automate the process and avoid the laborious task of manually editing the masks. We believe that this aspect alone can inspire various future papers on the topic of image editing.
- We introduce a novel attention distillation loss that improves the quality of the generated conditions and images, as shown in the paper and rebuttal (Table 2, rebuttal Figure 1).
- We show that the FG-DM also supports training from scratch on datasets such as CelebA, ADE20K, or CityScapes (Figures 14, 15), and provides synthetic data for data augmentation in downstream tasks (Tables 9, 10). Further, the FG-DM can support continual learning, where new factors such as pose can be added with minimal training.
Overall, we believe that all these contributions make FG-DM novel and distinct from existing works, offering valuable insights for further research within the text-to-image synthesis community.
[1] Luping Liu, Yi Ren, Zhijie Lin, Zhou Zhao. Pseudo Numerical Methods for Diffusion Models on Manifolds, ICLR 2022.
[2] Daniel Watson, William Chan, Jonathan Ho, Mohammad Norouzi. Learning Fast Samplers for Diffusion Models by Differentiating Through Sample Quality, ICLR 2022
This paper proposes the FG-DM for better prompt compliance and controllable generation. By using an iterative generation process that generates several visual conditions and images, the FG-DM shows high controllability and improved prompt compliance.
Although this paper has a borderline score (an average score of 5.25), all reviewers reached a positive consensus that the rebuttal addressed their concerns. As I have no significant concerns regarding the paper, I recommend acceptance.