KOALA: Empirical Lessons Toward Memory-Efficient and Fast Diffusion Models for Text-to-Image Synthesis
We present empirical lessons for building an efficient text-to-image model
Abstract
Reviews and Discussion
The paper at hand performs a series of ablation studies evaluating the impact of three design decisions on the quality of text-to-image generation models. Specifically, the authors look at the target of knowledge distillation, the training data, and the choice of teacher for distillation. In the end, the authors present a model named KOALA, building upon the insights gained from those ablations. The model is smaller and more efficient than the comparisons presented in the paper.
Strengths
Improving the efficiency of text-to-image generation models is of key importance to the research and industrial community, as the size and inference speed of current state-of-the-art models are often a limiting factor for real-world applications.
The paper shows positive results and a good set of comparisons to prior art.
The paper touches on important aspects, such as distillation objectives as well as data selection.
Weaknesses
My main concern about the paper is its approachability. The presentation of the paper is subpar. Specifically, the paper is very verbose and repetitive, but then on the other hand imprecise and lacking important details.
The paper could be half its size and get more information across. For example, the paper has an almost two-page discussion on knowledge-based distillation, but I could not find a single precise definition of how the proposed approach actually performs it.
Similarly, the entire discussion on the U-Net architecture could be a single table. This would make it much more approachable and easier to understand.
Questions
It would be great if the authors could provide a short, precise explanation of how the proposed distillation loss works.
Limitations
The limitations are discussed in the paper.
Thank you for thoroughly reviewing our work and for your insightful and helpful suggestions for improving our paper. We provide our response in the following. We have made every effort to address all your concerns. Please don’t hesitate to let us know if your concerns/questions are still not clearly resolved. We are open to active discussion and will do our best to address your concerns during the discussion period.
W1 & Q1. lacking important details & single precise definition of how the proposed approach actually performs
→ We apologize for any ambiguity regarding how the proposed distillation method performs. The main reason we explained the distillation part in almost two pages is that we needed to address two main aspects simultaneously: 1) pruning the SDXL-Base model and building efficient KOALA U-Nets, and 2) exploring the effective feature distillation strategy. Furthermore, our key finding for the distillation is that self-attention is the most crucial part. To provide sufficient evidence, we conducted in-depth qualitative (refer to Fig. 3) and quantitative (refer to Tab. 3 and Tab. 10) analyses, which took up considerable space in the manuscript.
In addition, as you suggested, we introduce a brief summary of self-attention-based distillation:
- For every transformer block (composed of transformer layers) at each stage, we allow the student model to mimic the self-attention features from the teacher model (refer to Fig. 2).
- At the highest feature level (e.g., DW-1 & UP-3), since there are no transformer blocks, we compare only the last output features at that stage (refer to Fig. 2 & Tab. 10d). A code sketch of this procedure is given below.
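To make the two points above concrete, here is a minimal PyTorch-style sketch of the feature distillation loss as we understand it from the description. The MSE penalty, equal weighting across layers, and hook-based feature collection are assumptions for illustration, not the exact implementation.

```python
import torch.nn.functional as F

def koala_feature_distill_loss(sa_feats_student, sa_feats_teacher,
                               last_feats_student, last_feats_teacher):
    """Sketch of self-attention-based feature distillation.

    sa_feats_*:   lists of self-attention feature maps, one per transformer layer,
                  collected (e.g., via forward hooks) from matching stages of the
                  student and teacher U-Nets.
    last_feats_*: last output features at the highest feature level (DW-1, UP-3),
                  which contain no transformer blocks.
    """
    loss = 0.0
    # Student mimics the teacher's self-attention features at every transformer layer.
    for f_s, f_t in zip(sa_feats_student, sa_feats_teacher):
        loss = loss + F.mse_loss(f_s, f_t.detach())
    # At DW-1 / UP-3, compare only the last output features of the stage.
    for f_s, f_t in zip(last_feats_student, last_feats_teacher):
        loss = loss + F.mse_loss(f_s, f_t.detach())
    return loss
```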
Since we don’t have enough space to address all of our findings (e.g., four findings) for the effective distillation strategy, we kept only two points in Sec. 3.1.2 and moved the other findings to Appendix B, along with Table 10. As a result, it may cause some details to be missing.
Let us describe more details of four findings for our distillation strategy:
- Among different feature types for distillation, self-attention shows the best performance (Tab. 3a) and represents more discriminative features (Fig. 3 & Fig. 10 in Appendix).
- Self-attention at the decoder (UP-1 to UP-3) has a larger impact on the quality of generated images than at the encoder (DW-1 to DW-3). Therefore, we keep more transformer blocks in the decoder part when pruning SDXL-Base's U-Net (Tab. 3b).
- When comparing self-attention features, the features of the early layers in a transformer block, which is composed of multiple layers, are more significant for distillation (Tab. 10c). This is supported by the feature cosine similarity analysis (see the sketch after this list), which shows that early layers exhibit more distinct feature discrimination (Sec. B3 & Fig. 11).
- The combination of the last output features (LF) at the highest feature level and the self-attention features at the other stages shows the best results (Tab. 10d).
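The cosine-similarity analysis referred to in the third finding could look roughly like the following (purely illustrative; which tokens, layers, and timesteps are averaged in the paper's actual protocol is not specified in this excerpt):

```python
import torch
import torch.nn.functional as F

def pairwise_cosine_similarity(feats: torch.Tensor) -> torch.Tensor:
    """feats: (N, D) tensor of flattened self-attention features (e.g., one row per
    spatial token or per sample). Returns the N x N cosine-similarity matrix;
    lower off-diagonal similarity indicates more distinct, discriminative features."""
    normed = F.normalize(feats, dim=-1)
    return normed @ normed.t()
```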
W2. the entire discussion on the U-Net architecture could be a single table.
→ Thank you for your suggestion. As you pointed out, we will consolidate the U-Net section into a single, more concise and clear table to make it more approachable and easier to understand.
I would like to thank the authors for their response to both my questions and the questions by the fellow reviewers. After going over the other reviews and considering all the answers, I come to believe that the contributions and presentation of the paper might just pass the bar for acceptance. However, I strongly encourage the authors to significantly improve the organization and conciseness of the paper.
We deeply appreciate your valuable feedback and the thoughtful reconsideration of our paper's score. We will make sure to carefully reflect on your suggestions and update the paper to improve its organization and conciseness. Once again, thank you very much for the time and effort you dedicated to thoroughly reading and reviewing our paper.
This paper presents a set of empirical guidelines to use when distilling Stable Diffusion XL (SDXL) when computational/data resources are limited. The presented guidelines focus on 1) identifying which transformer blocks to drop, 2) what features are the best to distill, and 3) which publicly-available datasets are the best to use. The paper presents analyses for all guidelines, including ablations and comparisons to what other popular distillation approaches do. Insights are extracted from these analyses and used to motivate the guidelines. Performance results are presented showing that the guidelines result in distilled models that are quite competitive with the parent/other distilled models yet are much more computationally efficient.
Strengths
The main topic of this paper -- how to more easily generate cheaper SDXL distillations -- is of much relevance to virtually all associations. This paper does a thorough job of exploring many distillation aspects. The analyses it reports give good insights into which aspects are most important. For example, the feature distillation ablation section shows that self-attention is the best feature to distill. The authors then use those results to motivate removing fewer transformer decoder blocks than encoder ones. Fig 3c is compelling and really drives home the primary role of self-attention in distillation. The suggested recipe for removing blocks goes beyond just a basic approach of removing all blocks in a stage, as some other distillation approaches suggest. I also thought the insights into the various LAION datasets were interesting, and that the conclusion that using a few high-res images with detailed prompts is better than more low-res images with short (or detailed) prompts is useful. Further, the performance results from the distilled models are quite competitive with other variants.
Weaknesses
The guidelines this paper presents are very focused on specifically SDXL. It is unclear if any of them would apply to other transformer-based models. Whatever the next version after SDXL is created, it is unclear how relevant these guidelines would be.
Questions
- Sec 3.1.1 mentions both "block removal" as well as "layer-wise removal". However, the text describes layer-wise removal as "reducing the number of transformer layers". I found this vernacular confusing. It is common practice to let a "layer" mean a "block", so both approaches sound as if they are doing the same thing: removing blocks. I think (?) when the text refers to "block removal" (line 108) it means "removal of 100% of the blocks in a sequence", but when it refers to "layer-wise removal" (line 117) it means "removal of less than 100% of the blocks of the sequence". Is this understanding correct? If so, then I recommend rewriting this section to make it clearer that "block removal" is the special case of "layer-wise removal" when all blocks are removed.
- Table 10d shows that adding DW-1 and UP-3 feature distillation to SA + LF further boosts performance. But the formal paper stops at suggesting SA + LF3 is optimal. Why?
- Nit: The only time I see the "KOALA" acronym explained is implicitly in a couple of the figure captions. Adding it somewhere in the introduction would be nice.
- Nit: Line 199 uses the phrase "in the second row of the table" before any table has been presented to the reader.
Limitations
The paper presents the commonly suggested limitations of stable diffusion -- it doesn't generate text well and has problems with multiplicity.
Thank you for thoroughly reviewing our work and for your insightful and helpful suggestions for improving our paper. We provide our response in the following. We have made every effort to address all your concerns. Please don’t hesitate to let us know if your concerns/questions are still not clearly resolved. We are open to active discussion and will do our best to address your concerns during the discussion period.
W1. Focused specifically on SDXL & whether it would apply to other transformer-based models
→ Please refer to G1 section in the general response.
Q1. Confusion between the terms "block" and "layer"
→ The SDXL-Base architecture consists of convolutional residual blocks and transformer blocks, the latter composed of multiple transformer layers. A potential point of confusion is the distinction between transformer blocks and transformer layers. Our intention is to use 'block' as a higher-level concept, referring to a group of 'layers.' Therefore, we will reiterate this definition in the manuscript to avoid any misunderstanding. Thank you for your comment.
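To make the intended terminology concrete, the hierarchy can be sketched as below (a simplified pseudostructure with hypothetical class names, not the actual SDXL module definitions):

```python
import torch.nn as nn

class TransformerLayer(nn.Module):
    """One 'layer': self-attention + cross-attention + feed-forward (details omitted)."""
    pass

class TransformerBlock(nn.Module):
    """One 'block': a group of several transformer layers at a given U-Net stage."""
    def __init__(self, num_layers: int):
        super().__init__()
        # 'Block removal' drops an entire TransformerBlock, whereas
        # 'layer-wise removal' only reduces num_layers inside a block.
        self.layers = nn.ModuleList(TransformerLayer() for _ in range(num_layers))
```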
Q2. SA+LF combination
→ We apologize for the confusion. The final feature combination for the distillation training is self-attention (SA) from all transformer layers plus the last feature (LF) from the highest feature level (DW-1 + UP-3), as shown in Fig. 2 and Tab. 10d. Using only SA features already shows superior performance compared to using only LF in BK-SDM. However, we have found that adding the last features at DW-1 and UP-3, where there are no self-attention layers, further boosts performance. As for the cause of this confusion: due to space constraints in the manuscript, we moved the optimal feature combination to the appendix along with Tab. 10(d) and omitted mentioning SA+LF in the main text. We will clarify this and update the manuscript. We appreciate your comment once again.
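Written out, the combined objective described above would take roughly the following form (our notation; the loss type and the weights $\lambda$ are assumptions, as they are not given in this response):

$$
\mathcal{L}_{\text{distill}} \;=\; \lambda_{\text{SA}} \sum_{l \in \text{SA layers}} \big\lVert A^{S}_{l} - A^{T}_{l} \big\rVert^{2} \;+\; \lambda_{\text{LF}} \sum_{s \in \{\text{DW-1},\,\text{UP-3}\}} \big\lVert F^{S}_{s} - F^{T}_{s} \big\rVert^{2},
$$

where $A^{S}_{l}$ and $A^{T}_{l}$ are the student's and teacher's self-attention features at transformer layer $l$, and $F^{S}_{s}$, $F^{T}_{s}$ are the last output features at stage $s$; this term would be added to the usual denoising (task) loss.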
Q3. "KOALA" acronym
→ Thank you for your kind advice. As you noted, the explanation of the "KOALA" acronym is either implicitly included in the figure captions or found later in the paper (lines 237-238). We will refine and move the explanation to the introduction section in the final version.
Q4. Tab. 4 and the phrase "in the second row of the table"
→ We intended to use the phrase “in the second row of the table” to indicate that the second data source is demonstrated in the second row of Table 4. However, as the reviewer noted, the current demonstrations might cause confusion for readers. Therefore, we will revise the notation (to ensure consistency of data source notation both in the table and the document) and clarify the phrasing in the final version.
Thank you for your response to my comments. I had failed to appreciate the impact of Table 8 wrt showing generalization of KOALA. I will take it into account in my final rating.
We sincerely thank you for your thoughtful review and for considering the impact of Table 8 in your comment. If possible, we would greatly appreciate it if you could provide further details or explain which specific aspects of Table 8 failed to persuade you. We genuinely hope you might give us the chance to address any concerns. Once again, we are truly grateful for the time and effort you have dedicated to thoroughly reviewing our paper.
As we near the end of the discussion phase, we would like to once again express our sincere gratitude for the time and effort you dedicated to thoroughly reviewing our paper. Your detailed suggestions, such as clarifying notations and refining the formatting of experimental results, are greatly appreciated. We will carefully revise our manuscript to reflect your valuable feedback. Once again, thank you very much for your thoughtful and comprehensive review.
This paper presents KOALA, a pair of efficient text-to-image synthesis models that reduce computational and memory requirements compared to the base model. This paper achieves this through three key innovations: knowledge distillation into a compact U-Net, the strategic use of high-resolution images and detailed captions in training data, and the employment of step-distilled teachers to enhance the learning process. The resulting models, KOALA-Turbo and KOALA-Lightning, demonstrate faster inference times and the ability to generate high-quality images on consumer-grade GPUs.
Strengths
- This paper is easy to follow.
- This paper offers a practical solution for generating high-quality images with reduced computational requirements.
Weaknesses
- Although KOALA achieves good visual quality, the extent of degradation in text rendering is unclear as it has not been compared with a baseline in terms of text rendering capabilities.
- For each baseline, KOALA requires a carefully designed pruning network architecture followed by retraining, which entails a significant acceleration cost compared to training-free methods. However, this paper does not discuss or compare with training-free methods, nor does it compare with step distillation approaches.
Questions
In the "Lesson 3" section, it is puzzling that SDXL-Base, as a teacher model, performs the worst. Intuitively, a better teacher model should have greater potential to distill a superior student model. I am curious if the results presented in the paper could be attributed to a suboptimal distillation approach.
Limitations
The authors have discussed both limitations and potential negative social impacts.
Thank you for thoroughly reviewing our work and for your insightful and helpful suggestions for improving our paper. We provide our response in the following. We have made every effort to address all your concerns. Please don’t hesitate to let us know if your concerns/questions are still not clearly resolved. We are open to active discussion and will do our best to address your concerns during the discussion period.
W1. extent of degradation in text rendering & quantitative comparison
→ Please refer to G2 section in the general response
W2. Comparison with training-free methods
→ Thank you for your valuable insight. To the best of our knowledge, very recent training-free acceleration methods for diffusion models leverage the redundancy of features over the denoising steps. These methods (e.g., DeepCache [1], Cache Me If You Can [2]) cache and retrieve features of previous steps for efficient inference of the current step. In this rebuttal, we compare our KOALA model with DeepCache, which has an official implementation available. Using their code, we measured the inference cost, memory footprint, latency, and performance in terms of HPSv2 and CompBench scores. The results are summarized in the table below. Compared to the original SDXL-Base, DeepCache achieves faster inference speed at the cost of performance and increased memory usage. We speculate that the increased memory usage of DeepCache is due to the caching and retrieval process of features, which requires additional memory. In contrast, when compressing the same SDXL-Base model, our KOALA model reduces both memory usage and latency while achieving superior performance compared to DeepCache. Additionally, due to the step-distilled teacher and longer training, our KOALA model is capable of generating more faithful images with fewer diffusion steps, resulting in lower latency. Furthermore, since the training-free method is orthogonal to our approach, we expect that combining it with ours could exhibit a synergy effect, suggesting a promising future direction.
*Note: KOALA-SDXL-Base and KOALA-Lightning denote models trained with the SDXL-Base teacher (1st row of Tab. 5) for 100K iterations and with the SDXL-Lightning teacher (11th row of Tab. 6) for 500K iterations, respectively.
| Method | Backbone | #Step | Memory(GB) | Latency(s) | HPSv2 | CompBench |
|---|---|---|---|---|---|---|
| SDXL-Base-1.0 | SDXL-Base | 25 | 11.9 | 3.229 | 30.82 | 0.4445 |
| DeepCache [1] | SDXL-Base | 25 | 14.6 | 1.453 | 26.32 | 0.4094 |
| KOALA-SDXL-Base | KOALA-700M | 25 | 8.3 | 1.424 | 27.79 | 0.4290 |
| KOALA-Lightning | KOALA-700M | 10 | 8.3 | 0.655 | 31.50 | 0.4505 |
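For intuition about why training-free caching increases memory while reducing latency, below is a generic sketch of the feature-caching idea; it illustrates the general principle only and is not DeepCache's actual implementation or API.

```python
import torch

class StepFeatureCache:
    """Caches deep U-Net features computed at one denoising step and reuses them
    at the following steps, trading extra GPU memory for skipped computation."""

    def __init__(self, refresh_interval: int = 3):
        self.refresh_interval = refresh_interval
        self.cache: dict[str, torch.Tensor] = {}  # extra memory: stored feature maps

    def should_refresh(self, step: int) -> bool:
        # Fully recompute (and re-cache) deep features every `refresh_interval` steps.
        return step % self.refresh_interval == 0

    def store(self, name: str, feat: torch.Tensor) -> None:
        self.cache[name] = feat.detach()

    def load(self, name: str) -> torch.Tensor:
        return self.cache[name]
```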
W3. Comparison with step-distillation methods
→ We would like to clarify that we have already compared against step-distillation methods, such as SDXL-Turbo and SDXL-Lightning, in Sec. 4.2 and Tab. 6 of our manuscript. Below, we re-summarize the comparison with step-distillation methods, including the more recent work PCM [3], in the following table. We observe that, compared to SDXL-Lightning and PCM at 1024 resolution, our KOALA-Lightning-700M demonstrates competitive performance while achieving lower latency and memory usage with a much smaller model (0.78B vs. 2.56B parameters).
*Table note: SDXL-Turbo generates images at 512×512 resolution, while the other models use 1024×1024 resolution.
| Method | Backbone | Res. | #Step | Param.(B) | Memory(GB) | Latency(s) | HPS | CompBench |
|---|---|---|---|---|---|---|---|---|
| SDXL-Turbo | SDXL-Base | 512 | 8 | 2.56 | 5.6 | 0.245 | 29.93 | 0.4489 |
| SDXL-Lightning | SDXL-Base | 1024 | 8 | 2.56 | 11.7 | 0.719 | 32.18 | 0.4445 |
| PCM [3] | SDXL-Base | 1024 | 8 | 2.56 | 12.1 | 0.884 | 30.01 | 0.4360 |
| KOALA-Lightning (ours) | KOALA-700M | 1024 | 10 | 0.78 | 8.3 | 0.655 | 31.5 | 0.4505 |
→ Furthermore, we have verified the synergy effect between the step-distillation method (e.g., PCM) and our KOALA backbones during this rebuttal period. We performed the step-distillation training using PCM with our KOALA backbones and obtained results as shown in the table below. Due to their efficiency, our KOALA backbones enable PCM to further boost speed-up with only a slight performance drop compared to PCM with the SDXL-Base backbone. Additionally, we presented qualitative comparisons of our PCM-KOALA models with PCM-SDXL-Base in Fig. 1 of the attached general_response.pdf file (Please see the PDF file).
*Table note: SDXL-Turbo and SDXL-Lightning do not provide training code, while PCM releases the official training code, allowing us to train PCM with our KOALA backbones.
| Method | Teacher | Student | #Step | Param.(B) | Memory(GB) | Latency(s) | HPS | CompBench |
|---|---|---|---|---|---|---|---|---|
| PCM [3] | SDXL-Base | SDXL-Base | 2 | 2.56 | 12.1 | 0.345 | 29.99 | 0.4169 |
| PCM [3] | SDXL-Base | KOALA-700M | 2 | 0.78 | 8.2 | 0.222 | 28.78 | 0.3930 |
| PCM [3] | SDXL-Base | KOALA-1B | 2 | 1.16 | 9.0 | 0.235 | 29.04 | 0.4055 |
Q1. a better teacher model should have greater potential to distill a superior student model.
→ We would like to clarify that in the main state-of-the-art comparison table (e.g., Tab. 6) of our manuscript, SDXL-Lightning shows better or comparable performance to SDXL-Base on HPSv2 and CompBench, respectively. Thus, SDXL-Lightning could be considered a better teacher model. Your statement that "a better teacher model should have greater potential to distill a superior student model" aligns with our findings. Indeed, we observed that the KOALA models distilled from SDXL-Lightning achieve better performance than those distilled from SDXL-Base, as shown in Tab. 3.
[1] Ma et al. DeepCache: Accelerating Diffusion Models for Free. CVPR 2024.
[2] Wimbauer et al. Cache Me if You Can: Accelerating Diffusion Models through Block Caching. CVPR 2024.
[3] Wang et al. Phased Consistency Model. arXiv:2405.18407, 2024.
Thank you to the authors for the detailed response. Most of my concerns have been addressed, so I have raised my score.
We deeply appreciate your valuable feedback and the thoughtful reconsideration of our paper’s score. We will ensure that your suggestions, such as the training-free method comparison and text-rendering capability, are carefully reflected upon, and we will update the paper to enhance its completeness. Once again, we are truly grateful for your dedication and the thorough review you have provided.
- The authors propose two different designs for an efficient denoising U-Net based on SDXL.
- The authors present three empirical lessons for developing efficient U-Net designs.
Strengths
- The authors present empirical findings to justify their design considerations and provide further analyses to examine the effects of those choices.
- The authors conduct a systematic analysis with extensive quantitative and qualitative experiments, comparing their results with different baselines.
Weaknesses
- The proposed method is specific to the SDXL U-Net and is based on heuristics, rather than presenting a generalizable approach. This makes the paper resemble a technical report more than an academic paper. It would be beneficial if the authors could further clarify their novelty and contributions in terms of academic research.
- The authors present some failure cases of their methods, including rendering long legible text, complex prompts with multiple attributes, and human hand details. Do the teacher models and other baseline models also suffer from the same problems? Were any of these failure cases aggravated in the proposed model to some extent? If so, what could be the reasons?
Questions
Please refer to the "Weaknesses" section.
Limitations
The authors adequately address the limitations in section 5.
Thank you for thoroughly reviewing our work and for your insightful and helpful suggestions for improving our paper. We provide our response in the following. We have made every effort to address all your concerns. Please don’t hesitate to let us know if your concerns/questions are still not clearly resolved. We are open to active discussion and will do our best to address your concerns during the discussion period.
W1-1. specific to the SDXL U-Net and heuristics, not a generalizable approach
→ Please refer to G1 section in the general response.
W1-2. novelty and contributions to the academic community
→ Novelty: We believe that our self-attention-based distillation approach for text-to-image model compression is novel, supported by two key points.
- To the best of our knowledge, our approach is the first to identify that self-attention is the most crucial element for feature distillation in T2I model compression.
- The self-attention-based distillation approach can be generally applied across various architectural designs of diffusion models (e.g., U-Net-based and transformer-based) that consist of transformer blocks with self-attention layers.
→ Contribution: We believe that we have made two contributions to the academic community from the perspective of practical impact.
An efficient T2I Backbone for Consumer-Grade GPUs:
- Many downstream tasks, such as personalized image generation [1] and image-editing [2], leverage SDXL-Base as a de facto backbone due to its open-source model and superior performance. However, the SDXL model requires more than 11GB of VRAM and thus barely works on consumer-grade GPUs due to its substantial memory cost, and its slower latency may cause delays in the outcomes of new research using it. To overcome this limitation, our efficient KOALA T2I backbones can serve as a cost-effective alternative, facilitating further downstream research, especially for those working in limited-resource environments. Furthermore, our efficient KOALA models can assist practitioners and designers by enabling rapid ideation and supporting their creativity with low latency in a consumer-grade GPU environment. (A usage sketch is given after this list.)
An effective recipe of distillation for T2I model compression:
- We presented three key lessons in our paper: i) self-attention-based distillation strategy, ii) data considerations, and iii) the influence of the teacher model. Additionally, through an in-depth analysis of each lesson, we offer general insights into popular U-Net-based diffusion models, such as the role and computational burden of each layer. We strongly believe that the lessons and insights we provide will benefit researchers and developers striving to make T2I models more efficient.
- In the large language models (LLMs) field, larger models continue to emerge rapidly, and there have been many efforts to compress these models. Similarly, new and larger T2I models have also been developed. However, unlike in the LLM field, there have been relatively few attempts to compress T2I models due to factors such as the high cost of training (i.e., healing the pruned model) and data scarcity (i.e., high-resolution images with expensive copyrights). In this context, we believe that our lessons can be particularly helpful for the community, especially for those working with resource constraints.
- It is important to note that research providing crucial recipes for effective training, alongside proposing entirely novel methods, has historically played a significant role in revolutionizing their respective fields and boosting overall performance. For instance, Improved GANs [3] for GANs, MoCo-v3 [4] for self-supervised learning, DeiT [5] for transformers, and ConvNext [6] for CNNs have all made substantial impacts. Similarly, we hope our work brings valuable insights and benefits to the research community.
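As a usage sketch for the consumer-GPU scenario above, a distilled checkpoint could be loaded through the standard diffusers SDXL pipeline roughly as follows; the repository id, prompt, and step count shown here are assumptions for illustration rather than details taken from the paper.

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Hypothetical checkpoint id -- substitute the actually released KOALA weights.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "etri-vilab/koala-lightning-700m",  # assumed repo id
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")  # the rebuttal tables report ~8.3GB of memory for KOALA-700M

image = pipe(
    "a koala astronaut, photorealistic, highly detailed",
    num_inference_steps=10,  # KOALA-Lightning setting used in the rebuttal tables
).images[0]
image.save("koala_sample.png")
```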
W2. Discussion on failure cases
→ Please refer to G2 section in the general response.
[1] Ruiz et al. DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation. ICLR 2023
[2] Zhang et al. Adding Conditional Control to Text-to-Image Diffusion Models. ICCV 2023
[3] Salimans et al. Improved Techniques for Training GANs, NeurIPS 2016
[4] Chen et al. An Empirical Study of Training Self-Supervised Vision Transformers. ICCV 2021
[5] Touvron et al. Training data-efficient image transformers & distillation through attention. ICML 2021
[6] Liu et al. A ConvNet for the 2020s. CVPR 2022
Thank you for the time and effort you put into your responses. After reviewing the responses, I believe my current rating remains appropriate. I will maintain my rating with increased confidence.
We are sincerely grateful for the time and effort you dedicated to thoroughly reading and reviewing our paper. Your feedback, including the emphasis on novelty and contribution, the generality of the methodology, and the discussion on failure cases, will be carefully reflected upon as we update the paper to enhance its completeness. Once again, thank you very much for your thoughtful and thorough review.
General response
Thank you for thoroughly reviewing our work and for your insightful and helpful suggestions. We have made every effort to address all your concerns. Please let us know if any questions remain unresolved. We are open to discussion and will do our best to address your concerns during the discussion period.
In this general response, we have addressed the common concerns reviewers raised.
G1. Specific to the SDXL U-Net and heuristics, not a generalizable approach (Reviewers dMSC & C2ZB)
→ We respectfully clarify that although our method has been mainly explored with SDXL, the most popular T2I model at the time of submission, we have already applied our self-attention-based distillation scheme to a transformer-based method, PixArt-Σ [1], the only model providing training code at the time of submission. As shown in Tab. 9 of the manuscript (we include the table below for reference), self-attention-based distillation exhibited superior performance compared to other feature types. These results align with those obtained on U-Net-based diffusion models, demonstrating that our self-attention-based distillation scheme has been proven to be a general approach across various architectural designs of diffusion models (e.g., U-Net-based and transformer-based). Given these findings, we can also expect that our self-attention-based distillation can be applied to more recent transformer-based methods with self-attention operations, such as SD3 [2]. Consequently, our approach can serve as a crucial baseline for further T2I model compression research.
Additionally, we would like to explain that we designed an SDXL-tailored method largely because of the heterogeneous structure of SDXL's U-Net. It consists of irregularly combined convolutional and transformer layers, unlike pure transformer-based methods such as PixArt [1,3] and SD3 [2], which build a uniform stack of transformer layers. However, since our method is primarily based on in-depth analysis and understanding of each layer of SDXL, we believe our findings will benefit popular SDXL-like T2I models and their applications.
| KD type in PixArt-Σ | HPSv2 | CompBench |
|---|---|---|
| SA | 25.16 | 0.4281 |
| CA | 24.94 | 0.4279 |
| FF | 24.80 | 0.4191 |
| LF in BK | 21.62 | 0.3527 |
G2. Discussion on failure cases (e.g., text rendering) (Reviewers dMSC & EitJ)
→ While recent text-to-image diffusion models can generate images of unprecedented quality in general, they still exhibit limited performance in certain specialized cases. These cases include rendering long and legible text, handling complex prompts with multiple attributes, and generating intricate structures such as detailed human hands. The authors of SDXL-Base [4] acknowledge these limitations in their paper, and other baseline models we considered also share these limitations, as shown in Fig. 2 of the attached general_response.pdf. Due to the nature of knowledge distillation, our distilled models inevitably inherit these limitations. Among these models, our KOALA model shows inferior text rendering quality compared to the teacher models, as shown in the table below. We conjecture that the main cause of this inferior performance originates from the training dataset (LAION-POP) we used. Although this dataset includes images with high visual aesthetics, the corresponding text prompts rarely describe the text within the images. This leads to difficulties in complex text rendering.
As reviewer EitJ suggested, we quantitatively evaluate the text rendering capability on the MARIO-Eval benchmark [5]. Specifically, we utilize existing OCR tools to detect and recognize text regions in the generated images and measure the performance. Using 5,414 test prompt sets in the benchmark, we generated images with rendered text for each model and compared performance using the OCR metrics, as shown in the table below. Our KOALA models show inferior performance compared to SDXL-Base teacher models. In addition, SDXL models achieve much lower performance compared to the specialized T2I model for text rendering, TextDiffuser [5].
| Metrics | TextDiffuser [5] | SDXL-Base | SDXL-Lightning | KOALA-Lightning-700M | KOALA-Lightning-1B |
|---|---|---|---|---|---|
| OCR(accuracy) | 0.5609 | 0.0181 | 0.0169 | 0.0123 | 0.0153 |
| OCR(precision) | 0.7846 | 0.0223 | 0.0228 | 0.0012 | 0.0024 |
| OCR(recall) | 0.7802 | 0.0463 | 0.044 | 0.0033 | 0.0065 |
| OCR(F-measure) | 0.7824 | 0.0301 | 0.0301 | 0.0018 | 0.0035 |
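For reference, the OCR-based metrics reported above can be computed along the following lines (a simplified, word-level sketch; the detector/recognizer and the exact matching rule used for MARIO-Eval are not specified here, so this matching scheme is an assumption):

```python
def ocr_metrics(detected_words, target_words):
    """Compare OCR output on a generated image against the words the prompt asked
    to be rendered, and return (accuracy, precision, recall, f_measure)."""
    detected = [w.lower() for w in detected_words]
    target = [w.lower() for w in target_words]
    matched = sum(1 for w in target if w in detected)
    precision = matched / max(len(detected), 1)
    recall = matched / max(len(target), 1)
    f_measure = 2 * precision * recall / max(precision + recall, 1e-8)
    accuracy = float(matched == len(target))  # 1 only if every requested word appears
    return accuracy, precision, recall, f_measure
```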
The limitations that our model has inherited remain an open question in the community. To address these limitations, several specialized models have been recently proposed by designing specialized data and architecture. For example, TextDiffuser [5] aims to improve the capability of rendering long, legible text by constructing text-rendered images with OCR annotation, while Paragraph-Diffusion [6] attempts to handle more complex prompts faithfully. Exploring the synergy between these specialized models and our distillation framework could be an interesting direction for future work.
[1] Chen et al. PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation. arXiv 2024
[2] Esser et al. Scaling Rectified Flow Transformers for High-Resolution Image Synthesis. ICML 2024
[3] Chen et al. PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis. ICLR 2024
[4] Podell et al. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis. ICLR 2024
[5] Chen et al. TextDiffuser: Diffusion Models as Text Painters. NeurIPS 2023
[6] Wu et al. Paragraph-to-Image Generation with Information-Enriched Diffusion Model, arXiv 2023
Figures in the attached pdf file
↓ Please refer to the PDF file.
The paper received Weak Accept, Accept, Weak Accept, and Borderline Accept.
The paper presents a detailed empirical study on how to more effectively distill a text-to-image model such as StableDiffusionXL into a more efficient model by carefully evaluating what to drop, what to tune, and how to best leverage data during the distillation phase. Reviewers are generally convinced of the immediate applicability of the method and its potential impact but are worried about longer-term impact, since most of the paper is very specific to a single model. However, there is a small section, which seems to have been overlooked at first, applying the approach to DiT diffusion transformers (PixArt-Σ). Since the general consensus among the reviewers is to accept this paper, I agree with the recommendation, but I also suggest the authors improve their manuscript by emphasizing which recommendations from their work apply generally.