Towards Understanding the Working Mechanism of Text-to-Image Diffusion Model
Abstract
In this paper, we reveal the working mechanism of the text-to-image diffusion model.
Reviews and Discussion
This paper aims to understand two mechanisms of diffusion models. First, the denoising process is analyzed, and it is found that shapes in an image are constructed in the beginning of the denoising process, while textures and details are filled in later. This empirical observation is justified with a mathematical frequency analysis. Second, the role of text conditioning is analyzed and it is found that the [EOS] token, which captures global information of the prompt, is relied on more heavily by the diffusion model. It is also observed that the text prompt is utilized more in the earlier stages of the denoising process. This finding is utilized to speed up diffusion sampling by ~25% while maintaining the image quality and prompt alignment. This is done by only injecting conditional information in the beginning of the denoising process.
Strengths
- Although the finding that shape is constructed in the first few timesteps has been observed many times before, it is nice to have a more principled study with various experiments and mathematical justification.
- The finding that the special [EOS] token is the most relied upon during generation rather than the prompt tokens is an interesting finding that can be used in later studies. For instance, improving prompt alignment, attribute binding, etc.
- The observation that the text prompt is used more in the early denoising process lends itself to a practical application of speeding up inference.
- Multiple architectures and samplers are used in this study, suggesting the generality of these findings.
Weaknesses
- As mentioned in the Strengths section above, the findings are not completely surprising (for instance, the shape reconstruction and reliance on text in the early denoising steps, then detail-filling in the later steps). However, this work takes a principled approach to studying these phenomena, which have largely been exploited in the diffusion application literature (e.g., [1, 2]).
- Limited to no mention of broader impact or limitations. Furthermore, the Conclusion section is just a summary of the paper but does not discuss the implications of these findings.
[1] Meng et al. SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations. In International Conference on Learning Representations.
[2] Hertz et al. Prompt-to-Prompt Image Editing with Cross-Attention Control. In International Conference on Learning Representations.
Questions
- What are some of the limitations and implications of these findings?
- I did not take this into account for my review, but there are many typos in the text and figures which can be corrected for the next version.
- This (https://arxiv.org/pdf/2404.07724) is a concurrent work, so it is not expected for this paper to compare against it. Their finding is that applying guidance in the middle denoising steps improves image quality and distribution coverage. I am curious to hear how the findings of the paper being reviewed can be connected to the phenomena observed there. It might be something to keep in mind for the next version, although it will not be used in assessing this paper.
Limitations
Although there are no societal implications, a discussion of limitations is lacking.
We thank you for your valuable comments. Here we address your concerns as follows.
Q1: “As mentioned in the Strengths section above, the findings are not completely surprising (for instance, the shape reconstruction or reliance on text in the early denoising steps, then detail-filling in the later steps). However, this work takes a principled approach in studying these phenomena which have largely been used in diffusion application literature”
A1: We thank you for appreciating our "principled approach in studying these phenomena". As you mentioned, and as we claim in line 161, these phenomena have previously been observed only through isolated cases. Our contribution is to explore them systematically via frequency analysis. Besides, we also study the working mechanism of text prompts and link it to these observed phenomena.
Q2: What are some of the limitations of these findings?
A2: Regarding the limitations of this paper: owing to the auto-regressive text encoder used by the Stable Diffusion models explored here, it is natural for [EOS] to contain more information. Under a bidirectional text encoder, however, every token should carry a comparable amount of information, so the conclusion that "[EOS] contains more information" may no longer hold.
In fact, for the T2I model PixArt-α [1], which uses the bidirectional text encoder T5 [2], this conclusion indeed no longer holds. However, the other conclusions, "first overall shape, then details" and "the text prompt conveys its information in the early stage of the diffusion process," still hold and have been verified in Table 2, where we apply the proposed sampling strategy to PixArt-α.
We will clarify these limitations in the revised version.
Q3: What are some of the implications of these findings?
A3: The most straightforward implication of our findings is the efficient sampling strategy proposed in Section 6. Besides, we wonder whether these findings could also be exploited during the training stage of T2I models, e.g., using our noise prediction (9) during training to accelerate it. For other conditional generation tasks, e.g., human face generation or subject-driven generation, we find that our sampling strategy is still applicable. Please see the General Response for more details.
We will clarify these implications in the revised version.
Q4: “I did not take this into account for my review, but there are many typos in the text and figures which can be corrected for the next version.”
A4: We will carefully check the typos and revise them accordingly in the revised version.
Q5: “This (https://arxiv.org/pdf/2404.07724) is a concurrent work, so it is not expected for this paper to compare against it. Their finding is applying guidance in the middle denoising steps improves image quality and distribution coverage. I am curious to hear how the findings this paper being reviewed can be connected to the phenomena observed there.”
A5: Thank you for pointing out this interesting reference [3]. Their main conclusion is actually consistent with ours. They divide the denoising process into three stages: early, middle, and late. They claim that guidance (the textual prompt in our paper) in the late stage of conditional generation is useless, which is consistent with our conclusion that textual prompts convey their information in the early stage of the diffusion process. Moreover, they claim that guidance in the early stage of diffusion is strong and makes the generation saturate into a few modes, so that guidance should only be applied in the middle stage of the diffusion process. This is also consistent with our observation that the overall shape is quickly decided by the text prompt in the early stage of the diffusion process, owing to the strong guidance of the text prompt during this stage.
The main difference between the two methods is that they propose removing guidance in the early stage of the diffusion process to improve diversity, while we apply guidance in this stage. Our method improves the alignment of generated images with target prompts, especially for small or medium-sized models with relatively poor alignment, while their method appears to address the issue for models with poor diversity.
We will add this comparison in the revised version.
References:
[1] Chen et al., 2024. PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis.
[2] Raffel et al., 2019. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.
[3] Kynkäänniemi et al., 2024. Applying Guidance in a Limited Interval Improves Sample and Distribution Quality in Diffusion Models.
Thank you for your detailed response. I have also gone through the responses to the other reviews as well as the rebuttal PDF. As other reviewers have also pointed out, some of the findings have been discussed before in the literature (shape-then-detail). Although the findings and applications are fairly narrow in scope, I think they can have an impact on further interesting studies of diffusion models. Moreover, this work takes a principled approach. Thus, I maintain my score of weak accept. I appreciate the response on the limitations of the findings, so I suggest that the updated paper include those.
This paper explores the mechanism in the text-to-image diffusion model, including the generation order of image components, the influence of various tokens, and the steps in which tokens work. These observations bring some insight into understanding the diffusion model. Besides, the authors also design a sampling strategy that accelerates the sampling of the denoising process by 25%+.
Strengths
- The conclusion about the [EOS] token is interesting and has rarely been investigated in previous papers.
- The analytical experiments in this article are sufficient and strongly support its conclusion.
- The writing of this article is very clear.
Weaknesses
- The other conclusions in this paper, e.g., shape first then details, have been discussed in previous works.
- The sampling strategy is more like a sample trick than a method.
Questions
- Is the proposed sampling strategy still feasible for generating tasks that require preserving specific details, e.g., subject-driven generation?
Limitations
I suggest the authors discuss the applicability and limitations of the proposed sampling scheme. For example, can it be applied to human face generation without losing human identity?
We thank you for your valuable comments. Here we address your concerns as follows.
Q1: The other conclusions in this paper, e.g., shape first then details, have been discussed in previous works.
A1: Yes, and we have mentioned this in line 161, footnote 5. However, the existing literature only observes this phenomenon through single cases, whereas we conduct a frequency analysis to systematically explore it, and our contributions also include an exploration of the working mechanism of the text prompt.
Q2: “The sampling strategy is more like a sample trick than a method.”
A2: The sampling strategy is an application of our conclusion that text prompts convey their information in the first few steps of the denoising process. It saves computational cost in a simple yet effective way.
Q3: “Is the proposed sampling strategy still feasible for generating tasks that require preserving specific details, e.g., subject-driven generation?”
A3: Thanks for your suggestion. We implement the proposed sampling strategy on a recent subject-driven generation method, AnyDoor [1], where the visual features of a reference image are used as a condition to guide image generation. We remove the condition from different time steps of the denoising process, similar to Figure 10 in our paper. The generated images are given in Rebuttal Figure 2 of the attached file of the General Response, and they still preserve the specific details of the baseline model (start point a=0) when the start-removing time step is set to 20. Thus our sampling strategy is suitable for this task.
Q4: "I suggest the authors discuss the applicability and limitations of the proposed sampling scheme. For example, can it be applied to human face generation without losing human identity?"
A4: Following your suggestion, we conduct an experiment on the human face generation task using PhotoMaker [2] to verify that our sampling strategy is still applicable. Unlike the T2I task, this experiment includes both text prompts and reference faces as condition information. We apply the proposed sampling strategy (removing all condition information) and find that the generated images, including faces (a=20), are similar to those produced by the baseline method (a=0). The generated images, presented similarly to Figure 10 in our paper, are in Rebuttal Figure 4 of the attached file of the General Response.
References:
[1] Chen et al., 2023. AnyDoor: Zero-shot Object-level Image Customization
[2] Li et al. 2024. Photomaker: Customizing realistic human photos via stacked id embedding.
Thanks for your responses. After reading the rebuttal, I think this paper provides interesting new points about how text prompts work in diffusion models. Thus, I will raise my score.
The paper investigates the denoising process in DPM, identifying that the overall shape of the image is formed early in the process while details are added later. It further examines the influence of different text prompt tokens, finding that the end-of-sequence token [EOS] plays a crucial role in shaping the initial stages of image generation. The authors propose a method to speed up the generation process by removing text guidance after the initial stages, achieving a significant reduction in computational cost.
Strengths
- Comprehensive analysis of the denoising process stages in DPM.
- Detailed exploration of the influence of different tokens in the text prompt.
- Practical application of findings to accelerate the T2I generation process.
- Empirical and theoretical support for the proposed acceleration method.
Weaknesses
- The paper might lack clarity in explaining the theoretical aspects of frequency signal analysis.
- Limited exploration of potential biases introduced by the dominance of the [EOS] token.
- The study may benefit from a broader range of experiments to validate the generalizability of the findings.
Questions
- Can you provide a more detailed explanation of the theoretical aspects of frequency signal analysis used in your study? Specifically, how do the low and high-frequency components influence the denoising process? Including more accessible explanations or visual aids to illustrate the frequency signal analysis could help readers better understand this aspect of your work.
- Your experiments are primarily based on a specific set of text prompts and Stable Diffusion model versions. How do you ensure that your findings generalize across different models and broader text prompt sets?
- The paper uses various metrics like CLIPScore, BLIP-VQA, and MiniGPT4-CoT for evaluation. Can you provide a more detailed explanation of why these particular metrics were chosen and how they comprehensively assess the text-image alignment?
Limitations
- Authors should discuss the robustness of their findings and the need for further experiments across various models and more complex or diverse text prompts to validate their conclusions.
We thank you for your valuable comments. Here we address your concerns as follows.
Q1: The paper might lack clarity in explaining the theoretical aspects of frequency signal analysis.
A1: The theoretical aspects of the frequency analysis are given in Proposition 1, where we prove that the added standard Gaussian noise has almost the same magnitude in every frequency band. This means Gaussian noise carries more high-frequency than low-frequency signal (clarified in line 172 and Figure 2b), since the high-frequency band covers about 80% of the spectrum.
In short, Proposition 1 shows that the added noise mainly consists of high-frequency components, so the high-frequency part of the data is quickly destroyed during the forward (noise-adding) process and is not recovered until the end of the reverse denoising process; the opposite holds for the low-frequency part. Altogether, this explains the "first overall shape, then details" phenomenon.
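As a quick numerical illustration of this point (an illustrative sketch, not code from the paper; the image size and the 20%/80% band split are our assumptions), one can check with a discrete Fourier transform that i.i.d. standard Gaussian noise has a roughly flat power spectrum, so most of its energy lies in the much larger high-frequency band:

```python
import numpy as np

H = W = 256
noise = np.random.randn(H, W)                              # standard Gaussian noise image
power = np.abs(np.fft.fftshift(np.fft.fft2(noise))) ** 2   # power spectrum

# radial frequency of every spectral bin, normalized to [0, 1]
yy, xx = np.mgrid[0:H, 0:W]
r = np.hypot(yy - H / 2, xx - W / 2)
r /= r.max()

low, high = power[r <= 0.2], power[r > 0.2]   # assumed low-/high-frequency split

print("mean power per bin (low) :", low.mean())   # both close to H*W,
print("mean power per bin (high):", high.mean())  # i.e. a roughly flat spectrum
print("fraction of noise energy in the high band:",
      high.sum() / (low.sum() + high.sum()))      # close to 1: mostly high frequency
```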
Q2: “Limited exploration of potential biases introduced by the dominance of the [EOS] token.”
A2: In line 215, we state that "an ablation study in Appendix C verifies the influence of the dominance of the [EOS] token." In Appendix C, we verify that the impact of [EOS] partially originates from its large number of occurrences, but also from the richer information it contains.
Q3: “The study may benefit from a broader range of experiments to validate the generalizability of the findings.”
A3: In this paper, we verify the conclusions on both simple and complicated text prompts, and evaluate them on the UNet-based T2I models Stable Diffusion 1.5 and 2.1 as well as the DiT-based model PixArt-α. To further address your concerns, we conduct experiments on two additional tasks: subject-driven generation and human face generation [1, 2]. For these two tasks, we verify whether the conclusion "textual information is conveyed in the first few steps" still holds. The results are in Rebuttal Figure 2 and Rebuttal Figure 4 in the attached file of the General Response. Please refer to A3 and A4 to Reviewer xbNR for more details.
Besides, as you suggested, we further verify the conclusion "[EOS] contains more information" on the MS-COCO dataset. Please see A5 for details.
Q4: “How do the low and high-frequency components influence the denoising process? Including more accessible explanations or visual aids to illustrate the frequency signal analysis could help readers better understand this aspect of your work.”
A4: How the low- and high-frequency components influence the denoising process is clarified in A1: the forward process adds noise that mainly consists of high-frequency components, so the high-frequency parts of the original image are quickly destroyed and are not recovered until the end of the reverse denoising process; the opposite holds for the low-frequency parts. This phenomenon is also visualized in Figure 2b. We will make this clearer and more readable in the revised version.
Q5: “Your experiments are primarily based on a specific set of text prompts and Stable Diffusion model versions. How do you ensure that your findings generalize across different models and broader text prompt sets?”
A5: As mentioned in A3, we further verify the conclusion that text prompts convey their information in the first few steps of the denoising process on other tasks, which shows that the conclusion generalizes. The models used in this paper include three T2I generative models (Stable Diffusion 1.5, 2.1, and PixArt-α), which are state-of-the-art open-sourced diffusion-based generative models.
As for text prompts, the results in Section 5 are verified on the PromptSet constructed in line 120, which consists of prompts from T2I-CompBench [3], a benchmark dataset used to evaluate the quality of T2I generation. Note that it contains 1,000 natural, complex text prompts generated by an LLM, so the prompts are fairly diverse.
Besides, the experiments verifying our second conclusion, "The information of the text prompt is conveyed during the early stage of the denoising process," in Section 6 are conducted on 30K text prompts from MS-COCO, which are diverse enough and consistent with our conclusion.
To further address your concerns, we conduct additional experiments on 15K pairs of text prompts from the MS-COCO dataset with switched [EOS] to further support the conclusion in Section 5, "[EOS] Contains More Information." The experiment is conducted similarly to Section 5.1 (with switched [EOS]); the results are shown in the table below, and the generated images are in Rebuttal Figure 3 in the attached file of the General Response. The results show that the conclusion still holds on the MS-COCO dataset.
| Metric | Source Prompt | Target Prompt |
|---|---|---|
| Text-CLIPScore | 0.2086 | 0.2696 |
| BLIP-VQA | 0.3735 | 0.5655 |
| MiniGPT-COT | 0.6512 | 0.7479 |
Q6: “Can you provide a more detailed explanation of why CLIPScore, BLIP-VQA, and MiniGPT4-CoT are chosen as metrics?”
A6: As mentioned in line 206, we explain the three chosen metrics in Appendix B, where we describe them in detail. The three metrics are proposed in [3] and fully utilize the strong text-to-image alignment capability of the multi-modal models CLIP, BLIP, and MiniGPT-4. They are standard metrics for measuring text-image alignment, as mentioned in [3]. Please see Appendix B or [3] for more details.
References:
[1] Chen et al., 2023. AnyDoor: Zero-shot Object-level Image Customization
[2] Li et al. 2024. Photomaker: Customizing realistic human photos via stacked id embedding.
[3] Huang et al., 2023. T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation.
This paper studies the role the [EOS] token plays in the generation process of diffusion models. In particular, it finds that diffusion models tend to first generate the low-frequency part of the image at the beginning of the generation process, then gradually add high-frequency signal to it. Experiments show that the low-frequency signal is conditioned on the [EOS] token, while the high-frequency signal can be generated without text guidance. Combined with the aforementioned observation, this paper proposes to remove the conditional term in classifier-free guidance once the low-frequency signal has been generated, to improve generation efficiency.
优点
- This paper offers a new perspective for understanding the role of the textual condition in diffusion models. By exploring how [EOS] influences the generation process of the diffusion model, this paper argues that the conditional part in classifier-free guidance (CFG) might be unnecessary after a certain denoising step.
- Most experiments are inspirational and interesting. By swapping the EOS token and the sentence body, it demonstrates that diffusion models rely on the EOS token to synthesize low frequency part of the image.
- This paper explains the tendency of generating image from low-to-high frequency in diffusion models.
缺点
- It is not clear how the "computational cost" is defined in this paper. If the computational cost is GPU VRAM, then the claimed efficiency improvement might be invalid, as the required GPU VRAM for computing $\epsilon_\theta(x_t, c)$ or $\epsilon_\theta(x_t, \varnothing)$ is unchanged.
- This paper mainly focuses on the role of the [EOS] token in T2I diffusion models while neglecting the [SOS] token, despite the weight of the [SOS] token being significantly higher than the [SEM] and [EOS] tokens (see Figure 3). However, the author(s) claim that [SOS] carries no information due to the autoregressive nature of the CLIP text encoder. Since this claim is not yet supported by other works, the author(s) should have conducted experiments to support it, as there is a chance that the [EOS] and [SOS] tokens together influence the generation process.
问题
- Please clarify how the computation cost is defined in this paper and how the efficiency gain is computed.
局限性
N/A
We thank you for your valuable comments. Here we address your concerns as follows.
Q1: "It is not clear how the 'computational cost' is defined in this paper. If the computational cost is GPU VRAM, then the claimed efficiency improvement might be invalid, as the required GPU VRAM for computing $\epsilon_\theta(x_t, c)$ or $\epsilon_\theta(x_t, \varnothing)$ is unchanged."
A1: In this paper, the "computational cost" refers to the cost of computing the guided noise prediction $\tilde{\epsilon}_\theta(x_t, c)$ defined in (3), which drives the diffusion process. Clearly, obtaining $\tilde{\epsilon}_\theta(x_t, c)$ requires two model evaluations, on $\epsilon_\theta(x_t, c)$ and $\epsilon_\theta(x_t, \varnothing)$.
However, since we have empirically verified that the text prompt has conveyed its information after a certain time step $a$, we suggest substituting $\tilde{\epsilon}_\theta(x_t, c)$ with $\epsilon_\theta(x_t, \varnothing)$ for the denoising steps after $a$ (as in (9)). This means that only one model evaluation is needed for those steps, which reduces the computational cost (no matter which metric it is measured by) compared with the original noise prediction, which requires two model evaluations per step.
To see the saved computational cost clearly, please check "Saved Latency" in Table 2, where we report the latency saved by our method under different settings compared with the original method.
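For illustration, here is a minimal sketch of the resulting sampling loop (written against assumed diffusers-style interfaces for the U-Net and scheduler; it is not our exact implementation, and the step threshold `a` is illustrative):

```python
import torch

@torch.no_grad()
def sample(unet, scheduler, x, cond_emb, uncond_emb, guidance_scale=7.5, a=25):
    for i, t in enumerate(scheduler.timesteps):
        if i < a:
            # early steps: full classifier-free guidance -> two model evaluations
            eps_uncond = unet(x, t, encoder_hidden_states=uncond_emb).sample
            eps_cond = unet(x, t, encoder_hidden_states=cond_emb).sample
            eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
        else:
            # later steps: the prompt has already conveyed its information,
            # so a single unconditional evaluation suffices
            eps = unet(x, t, encoder_hidden_states=uncond_emb).sample
        x = scheduler.step(eps, t, x).prev_sample
    return x
```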
Q2: "This paper mainly focuses on the role of the [EOS] token in T2I diffusion models while neglecting the [SOS] token, despite the weight of the [SOS] token being significantly higher than the [SEM] and [EOS] tokens (see Figure 3). However, the author(s) claim that [SOS] carries no information due to the autoregressive nature of the CLIP text encoder. Since this claim is not yet supported by other works, the author(s) should have conducted experiments to support it, as there is a chance that the [EOS] and [SOS] tokens together influence the generation process."
A2: Thank you for pointing this out. First, due to the auto-regressive encoding process of CLIP, [SOS] should contain no textual information; instead, it serves as a "dummy variable" that adjusts the weights of the other tokens in the cross-attention module [1].
Moreover, following your suggestion, we design an experiment to verify that [SOS] indeed contains no textual information. For text-to-image generation, given a text prompt, we construct two embedding sequences: 1) all 77 positions are filled with the [SOS] embedding from the given text prompt; 2) except for the first [SOS] (used to adjust the attention map), the remaining 76 positions are filled with the [EOS] embedding from the given text prompt. These two constructed sequences contain textual information only from [SOS] and only from [EOS], respectively. We generate images under these sequences, and the results are in Rebuttal Figure 1 in the attached file of the General Response.
As can be seen, the images generated with information from [EOS] are consistent with the target text prompt, whereas this does not happen when only information from [SOS] is injected. This further verifies our conclusion in line 193 that [SOS] only adjusts the cross-attention map but contains no textual information.
Thanks for inspiring this experiment; we have added it to the revised version.
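For concreteness, the following is a minimal sketch of how the two embedding sequences described above could be constructed (illustrative code written against the Hugging Face CLIP text encoder commonly used by Stable Diffusion; the model name, prompt, and tensor shapes are assumptions, not our exact implementation):

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").eval()

prompt = "a photo of an astronaut riding a horse"
tokens = tokenizer(prompt, padding="max_length", max_length=77,
                   truncation=True, return_tensors="pt")
with torch.no_grad():
    emb = text_encoder(**tokens).last_hidden_state      # (1, 77, 768)

# position of the first [EOS] token in the padded sequence
eos_pos = tokens.input_ids[0].tolist().index(tokenizer.eos_token_id)

# (1) every position carries only the [SOS] output embedding
sos_only = emb[:, :1, :].repeat(1, 77, 1)

# (2) keep the first [SOS] (it adjusts the cross-attention map);
#     fill the remaining 76 positions with the [EOS] output embedding
eos_only = emb.clone()
eos_only[:, 1:, :] = emb[:, eos_pos:eos_pos + 1, :]

# `sos_only` / `eos_only` would then replace the usual prompt embedding fed to
# the diffusion U-Net (e.g., via `encoder_hidden_states` in a diffusers pipeline).
```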
Q3: “Please clarify how the computation cost is defined in this paper and how the efficiency gain is computed.”
A3: Please check A1.
References:
[1] Xiao et al., 2023. Efficient streaming language models with attention sinks.
Dear reviewer:
Thank you for reviewing our paper. We are very happy to see your feedback and to address any further concerns.
Thanks
Thanks for the responses during the rebuttal period. The authors have addressed my major concerns. I am happy to increase my ranking of this manuscript as Weak Accept.
Dear Reviewer
We are happy that our reply addresses your concerns. We are sorry to bother you, but it seems that the score has not been updated yet. We would appreciate it if you could update it at your convenience.
Thanks
Thank you for your meticulous review, which played a pivotal role in enhancing the quality of our paper. We apologize for any inconvenience, but as the discussion deadline approaches (Aug 13 11:59 pm AoE), we would like to provide an update on our progress.
We have clarified that the computational efficiency of the proposed sampling method originates from the reduced number of function evaluations, and we have conducted the experiments verifying that [SOS] carries no textual information.
If you require any further clarification or have additional questions, please do not hesitate to reach out. Again, we sincerely appreciate your time and effort in reviewing our paper.
Thanks
General Response:
We thank all reviewers for their valuable comments. A common question is whether our sampling strategy can be applied to other conditional generation tasks.
To verify this, we further apply our sampling strategy to two other conditional generation tasks: subject-driven generation and human face generation. For these two tasks, an extra reference image (a given subject or human face) is used as a condition to guide image generation. Our sampling strategy is applied to the backbone methods AnyDoor [1] and PhotoMaker [2], respectively.
The generated results of these two tasks are in Rebuttal Figure 2 and Rebuttal Figure 4 in the attached file, and they show that our sampling strategy still applies to both tasks, because the images generated with the condition removed in the final stage of diffusion are consistent with the baseline method (similar to Figure 10 in our paper).
The Figures for extra experiments are in the attached file.
References
[1] Chen et al., 2023. AnyDoor: Zero-shot Object-level Image Customization
[2] Li et al., 2024. Photomaker: Customizing realistic human photos via stacked id embedding.
This paper provides a comprehensive analysis of the denoising process stages in DPM, such as how the shape of the image is reconstructed after a few denoising steps, then filled with details. The final image is mostly determined by the special token [EOS]. The paper also proposes accelerating the T2I generation process by removing text guidance, which could speed up sampling by over 25%.
Initially, this paper received mixed reviews. However, after discussions between the authors and reviewers, the paper received unanimous acceptance from all reviewers, leading to an "Accept" decision.