Prompt-Free Diffusion: Taking “Text” out of Text-to-Image Diffusion Models
We propose Prompt-Free Diffusion, empowered with a new visual encoder, SeeCoder, with which pretrained T2I diffusion models can synthesize high-quality images from reference image inputs, replacing the need for text prompts.
Abstract
Reviews and Discussion
This paper proposes the Semantic Context Encoder (SeeCoder) to replace the CLIP text encoder and achieve image-to-image variation. More specifically, the authors train a transformer-based model to replace CLIP in Stable Diffusion. They show good performance on the image-variation task.
Strengths
- The proposed method is intuitively sound: training an image encoder aligned with Stable Diffusion to achieve image variation.
- Also, as one would expect, it works with diffusion models built on Stable Diffusion, such as some ControlNet checkpoints, which broadens the results.
- They demonstrate good performance on image-variation tasks.
Weaknesses
- The image encoder requires additional training, whereas methods like MasaCtrl achieve image variation without the help of an extra image encoder. MasaCtrl is definitely a missing baseline.
- Many other methods, such as AnyDoor (Zero-shot Object-level Image Customization), achieve strong performance in learning image representations; AnyDoor, for example, also handles virtual try-on tasks.
- The authors should ablate SeeCoder, since it is the main architectural novelty. If that structure matters, they should also compare it against a plain transformer-based block. Without such an investigation in the paper, I am not convinced that this specifically designed SeeCoder is necessary. Could a vision transformer block with a convolution that outputs an N×C feature also work? (See the sketch after this list for the kind of baseline I have in mind.)
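To make the last point concrete, here is a minimal PyTorch sketch of the baseline I have in mind. All names, layer counts, and shapes are my own assumptions, not the authors' code: a stock ViT-style encoder whose tokens are projected by a convolution to the N×C context shape the diffusion U-Net's cross-attention consumes.

```python
import torch
import torch.nn as nn

class NaiveVisualContextEncoder(nn.Module):
    """Hypothetical baseline for the ablation I am asking for: a plain
    ViT-style encoder plus a 1x1 convolution projecting to an N x C
    context. This is NOT SeeCoder; it is the simpler alternative that
    the paper should compare against."""

    def __init__(self, embed_dim=768, context_dim=768, depth=6, num_heads=8):
        super().__init__()
        # Patchify the image into tokens (assumed 16x16 patches).
        self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        # Convolution that equalizes channels to the U-Net context dim.
        self.proj = nn.Conv1d(embed_dim, context_dim, kernel_size=1)

    def forward(self, x):                               # x: (B, 3, H, W)
        tokens = self.patch_embed(x)                    # (B, D, H/16, W/16)
        tokens = tokens.flatten(2).transpose(1, 2)      # (B, N, D)
        tokens = self.blocks(tokens)                    # (B, N, D)
        ctx = self.proj(tokens.transpose(1, 2))         # (B, C, N)
        return ctx.transpose(1, 2)                      # (B, N, C) context
```

If a baseline like this, trained under the same recipe, matches SeeCoder, the specialized Decoder/Query Transformer design needs a stronger justification.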
Questions
- Why this particular design for SeeCoder? Have you tried other structures?
- How does it compare with other methods?
- How does it perform on general images when no cherry-picking is used?
The paper introduces a diffusion-based, "prompt-free" image generation model. The main contribution is the design of the "Semantic Context Encoder", abbreviated "SeeCoder", a "reusable" image encoder that can be plugged into trained diffusion-based image generation models. The proposed model takes no text input, only image inputs ("structural conditioning") such as scribbles, depth maps, and Canny edges.
Strengths
- The authors test their model on a wide range of image generation tasks, including conditional image generation, anime figure generation, and virtual try-on. The results appear competitive with the state of the art.
Weaknesses
- The paper's motivation is not convincing. The paper tries to replace text prompts with image-based "prompts" on the grounds that text prompts are ambiguous and less controllable. However, in many of the presented examples, the image prompts cannot be produced without a specialized tool: depth maps, Canny edges, and human pose skeletons all require running a program on an existing image. Moreover, it is hard to produce a depth map or Canny edge without the target image already in hand, and these prompts cannot be easily altered. The proposed image prompts therefore appear neither as accessible nor as controllable as text prompts.
- The descriptions of the method and the experiments are oversimplified and vague. In particular, I find little explanation of how the method achieves training-free reusability, which is a key point of this paper. I also find little information on how the model comes to understand the various types of "structural conditioning", i.e., image prompts. Many such questions remain after reading the paper and are listed below; they severely hurt its readability.
Questions
- In Section 3.3, the "Decoder" is described as a "transformer-based network with several convolutions (where?), ... uses convolution to equalize channels (whose?), ... 6 multi-head self-attention modules". The module looks like a Transformer encoder rather than a decoder; otherwise, what are the query and the memory?
- Also in Section 3.3, the "Query Transformer ... started with 4 freely-learning global queries and 144 local queries". What exactly are these queries, and where does "local" come from? "The network also contains free-learned query embeddings, level embeddings, and optional 2D spatial embeddings" — I could not find definitions of any of these in the text or in any illustration. (See the sketch after these questions for my best guess at what is meant.)
- As mentioned above, how does the method achieve training-free reusability?
- As mentioned above, how is the model trained to understand, and hence follow, image prompts that come in various forms? Are the image prompt types illustrated in the paper, including depth maps, Canny edges, and scribbles, also present in the training data?
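For reference, here is my best guess, as a minimal PyTorch sketch, at what the Query Transformer in Section 3.3 might be doing. Every name, shape, and attention arrangement below is an assumption on my part rather than the authors' implementation, which is precisely why the paper needs to spell this out.

```python
import torch
import torch.nn as nn

class QueryTransformerSketch(nn.Module):
    """Guess at the 'Query Transformer': 4 learnable global queries and
    144 learnable local queries cross-attend to multi-level features
    from the backbone/decoder, producing a (4 + 144) x C token sequence
    that would replace the CLIP text context fed to the U-Net. The
    'optional 2D spatial embeddings' are omitted here."""

    def __init__(self, dim=768, num_global=4, num_local=144, num_levels=4, heads=8):
        super().__init__()
        self.global_q = nn.Parameter(torch.randn(num_global, dim))   # free-learned queries
        self.local_q = nn.Parameter(torch.randn(num_local, dim))
        self.level_embed = nn.Parameter(torch.randn(num_levels, dim))  # level embeddings
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats):
        # feats: list of per-level feature maps, each (B, N_l, dim), len <= num_levels.
        # Flatten all levels into one memory sequence, tagging each with its level embedding.
        mem = torch.cat([f + self.level_embed[i] for i, f in enumerate(feats)], dim=1)
        B = mem.shape[0]
        q = torch.cat([self.global_q, self.local_q], dim=0)           # (148, dim)
        q = q.unsqueeze(0).expand(B, -1, -1)                          # (B, 148, dim)
        q, _ = self.cross_attn(q, mem, mem)   # queries gather visual context
        q, _ = self.self_attn(q, q, q)        # queries exchange information
        return self.norm(q)                   # (B, 148, dim) prompt-free context
```

If the actual design differs from this guess, the paper should state where, since the reusability claim presumably hinges on the output matching the shape and semantics of the CLIP context.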
This paper discusses the rapid growth of text-to-image (T2I) research, driven by large pre-trained models and personalization techniques. However, a challenge remains in crafting effective text prompts for customized results. To address this issue, the authors propose a novel framework called "Prompt-Free Diffusion" that eliminates the need for text prompts in T2I models. This framework relies solely on visual inputs, using a Semantic Context Encoder (SeeCoder) instead of text encoders like CLIP or LLM. Through experiments, Prompt-Free Diffusion is shown to outperform prior methods, match state-of-the-art T2I models, and extend to applications like anime figure generation and virtual try-on. The authors plan to open-source their code and models.
Strengths
The authors tackle an interesting problem. The problem statement is well defined, but the novelty and contributions are weak.
Weaknesses
- The novelty of this work is very limited. It seems to be a naive architectural variant of the Image-Variation work.
- The methodology section is not clearly written.
- The diagram in Fig. 3 should be more polished and crisper.
Questions
What is the difference between Image-Variation and the proposed method, besides replacing the CLIP image encoder with SeeCoder (consisting of a Backbone Encoder, a Decoder, and a Query Transformer)?
This paper proposes an image generation method called Prompt-Free Diffusion. It relies only on visual inputs to generate new images and removes text prompts from the generation process. To extract rich visual information from the input images, the authors devise a visual encoding module, the Semantic Context Encoder (SeeCoder).
Strengths
I cannot find any strengths in this work.
Weaknesses
- The motivation of this work is questionable. Why take "text" out of text-to-image diffusion models at all? Why not remain compatible with text? Take ControlNet as an example: you can add more conditions (such as Canny edges and depth maps) to the framework while still allowing text prompts. Eliminating text prompts simply limits the potential abilities of the model; when no text prompt is needed, it is fine to set it to an empty string (a minimal example follows this list).
- Only a few groups of qualitative examples are presented, and NO quantitative results are given.
- The effectiveness of the proposed Semantic Context Encoder (SeeCoder) module is not fully validated. The authors should conduct more experiments and an ablation study to prove its value.
- The authors claim that they used an in-house T2I diffuser that outperforms SD1.5, but do not provide any convincing evidence. This is not acceptable.
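To illustrate the point about empty prompts in the first weakness: with the off-the-shelf diffusers ControlNet pipeline, one can already condition on a Canny-edge map while passing an empty text prompt, keeping the text pathway available for users who do want it. The model ids and file names below are placeholders for illustration, not anything from the paper.

```python
import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.utils import load_image

# Load a Canny-edge ControlNet and attach it to a Stable Diffusion 1.5 pipeline.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")

canny_image = load_image("canny_edges.png")  # precomputed Canny-edge map (placeholder path)

# An empty string stands in for "no text prompt"; users who want text control
# can still supply a prompt without changing anything else.
image = pipe(prompt="", image=canny_image, num_inference_steps=30).images[0]
image.save("output.png")
```

The authors should explain what prompt-free generation offers beyond this already-available usage pattern.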
Questions
The authors should resolve the concerns in the Weaknesses section.