PaperHub
ICLR 2024 · Withdrawn
Overall rating: 4.5/10 from 4 reviewers (min 3, max 5, std. dev. 0.9)
Individual ratings: 5, 3, 5, 5
Average confidence: 4.0

ARTIST: Towards Disentangled Text Painter with Diffusion Models

Submitted: 2023-09-22 · Updated: 2024-03-26

Abstract

Keywords
Diffusion Models, Image generation

Reviews and Discussion

Review
Rating: 5

This work presents ARTIST, a novel framework to address text rendering problems in diffusion models for image generation. The authors employ LLMs to identify users' intentions and introduce a two-stage training scheme to learn text structure and visual quality separately. They achieve up to 15% improvement on various metrics on the MARIO-Eval benchmark, demonstrating the effectiveness of their approach.

Strengths

This paper demonstrates a well-structured and clearly articulated research strategy. The experimental results showcase good empirical performance.

Weaknesses

  • The formatting of the main text has some issues, e.g., the presentation of Fig. 2 and Eq. (2) on page 5. Such formatting problems hinder readability and should be addressed.
  • Visual results of ARTIST and the analysis of the experiments are somewhat lacking. More qualitative results, similar to those in the 2nd column of Fig. 5, should be provided to better illustrate the differences between ARTIST and TextDiffuser.
  • The references need some revisions. For instance, the reference to "Classifier-Free Diffusion Guidance" should be updated to reflect its publication in the NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications. Please ensure that all references are accurate and up-to-date.

Questions

See the weaknesses above.

Review
Rating: 3

This paper proposes a new framework, ARTIST, which includes an additional text diffusion model to learn visual text structure, helping to disentangle the learning of text structure from text aesthetics. The experiments show improved OCR performance compared to previous methods. The paper also introduces a large language model to resolve issues in extracting the target text.

Strengths

  1. The idea of disentangling the learning of text structure and text aesthetics is novel. The experiments also show the efficacy of the proposed method.
  2. Utilizing an LLM helps resolve target-keyword extraction from prompts that do not explicitly mark the target text.

Weaknesses

  1. I am not convinced why an LLM is needed to improve target-text extraction, since it is very easy for the user to specify the target text directly.
  2. The authors did not provide enough information about how well the LLM performs at generating suitable layouts.
  3. The method section is unclear about the input to the model. How does the model decide which mask corresponds to which word? For example, in Figure 2, how are "ARTIST" and "MODEL" assigned to the two mask regions?

Questions

  1. How are the layouts generated using the LLM? Are they limited to standard horizontal text layouts, or can the LLM also generate rotated or curved ones? Could the authors provide the detailed prompts used for the LLM?
  2. Could the authors provide more information about the inputs for the visual examples, e.g., the layouts used in Figures 3 and 4?
  3. How does ARTIST deal with multi-word generation? How does the model decide which mask to generate for each word?
  4. Instead of adding another module to learn the text structure, how does the original model perform when trained with the same two-stage process?

Review
Rating: 5

The authors present ARTIST, which first leverages an external LLM to produce the layout of the target text. The disentangled pipeline then generates the visual text and the image, and fuses both into the final result. ARTIST demonstrates creative and robust text rendering ability on the MARIO-Eval dataset.

Strengths

  • This paper is well-written and easy to follow.
  • Text painting is a challenging task for current T2I models, and the idea of disentangling text and image generation is well-motivated.
  • The authors provide many qualitative examples for visual comparison.
  • The proposed ARTIST achieves notable improvements over the previous TextDiffuser, especially on the crucial OCR metric.

Weaknesses

  • The ARTIST framework seems to be a combination of LLM-based layout generation and TextDiffuser, both of which have been proposed before. I am not sure this meets ICLR's novelty bar.
  • Many metrics are used for the evaluation (FID, CLIP-S, and OCR). Which one is the most appropriate for judging overall performance? Or is there a way to combine them into a single final metric (one possible combination is sketched after this list)?
  • As this is a generative task, a human evaluation should be conducted to compare performance from a human perspective.
  • Since ARTIST relies on an LLM to derive the layout, there should be a discussion of the quality of the generated layouts. Are they visually appealing and reasonably positioned?
  • Missing references on LLM-based layout generation: [NeurIPS'23] Compositional Visual Planning and Generation with Large Language Models / [arXiv'23] Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models
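
To make the combined-metric suggestion above concrete, here is a minimal sketch of one way the three scores could be folded into a single number. Everything in it is a hypothetical illustration, not something from the paper: the weights, the FID normalization cap, and the function name are all assumptions.

```python
def combined_score(fid, clip_s, ocr_acc, fid_cap=100.0, weights=(0.3, 0.3, 0.4)):
    """Hypothetical combined metric: map FID, CLIP-S, and OCR accuracy
    to [0, 1] (higher is better) and take a weighted average.
    All constants are illustrative assumptions, not from the paper."""
    fid_norm = max(0.0, 1.0 - fid / fid_cap)  # invert: lower FID -> higher score
    clip_norm = min(max(clip_s, 0.0), 1.0)    # CLIP-S assumed to lie in [0, 1]
    ocr_norm = min(max(ocr_acc, 0.0), 1.0)    # OCR accuracy is already a rate in [0, 1]
    w_fid, w_clip, w_ocr = weights
    return w_fid * fid_norm + w_clip * clip_norm + w_ocr * ocr_norm

# e.g., FID 30.0, CLIP-S 0.32, OCR accuracy 0.71 -> about 0.59
print(combined_score(30.0, 0.32, 0.71))
```

The weighting is the contentious part: any fixed choice implicitly ranks the metrics' relative importance, which is exactly why the question deserves an answer in the paper.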

Questions

Please see the weaknesses above.

Details of Ethics Concerns

n/a

Review
Rating: 5

This paper studies text rendering in images and proposes disentangled text and visual modules (both fine-tuned from Stable Diffusion), along with an LLM that identifies which text to render and where. In the experimental results, the proposed approach outperforms the baseline, and an ablation study shows the effectiveness of the proposed components.

Strengths

  • The idea of using disentangled modules for text and image, along with an LLM for generating guidance, looks novel.
  • The proposed approach outperforms the baseline, TextDiffuser, in the experimental results.
  • The presented ablation study shows the effectiveness of the proposed components.

Weaknesses

  • The experiments lack an important qualitative evaluation, namely human evaluation. This limits how convincingly the effectiveness of the proposed approach can be shown.
  • Some experimental results are not very convincing; e.g., FID is worse than fine-tuned SD on the MARIO-Eval benchmark, and in Fig. 5 it is not clear that the proposed model is better than TextDiffuser.

Questions

  • When training the visual module, do the targets also include text, or only images?

  • Does the LLM always succeed? Are there any failure cases, and if so, why?

  • What is "s"? It is unclear and should be defined before it is used. Page 4: "... model to extract s which ..."; page 4: "... identify the essential s".

  • You claim that "our computation requirement is still similar to the previous SOTA TextDiffuser". Can you provide actual model sizes, e.g., the number of parameters of each model?

  • In Fig. 6, how can the proposed model without the mask achieve better CLIP and FID scores than the one with the mask?