PaperHub

Rating: 4.8/10 (withdrawn; 4 reviewers)
Individual scores: 3, 5, 5, 6 (min 3, max 6, std dev 1.1)
Confidence: 3.3 | Correctness: 2.8 | Contribution: 2.3 | Presentation: 2.5

ICLR 2025

MinorityPrompt: Text to Minority Image Generation via Prompt Optimization

OpenReview | PDF
Submitted: 2024-09-26 | Updated: 2024-11-14
TL;DR

We present a novel minority generation approach for text-to-image diffusion models based on prompt optimization.

Abstract

Keywords

text-to-image generation, diffusion models, minority generation

Reviews & Discussion

Review (Rating: 3)

This paper investigates the generation of minority and uncommon samples using pre-trained text-to-image diffusion models. The authors propose a framework to shift the focus of these models from high-density regions towards areas of lower density by minimizing a likelihood metric tailored to capture the uniqueness of noisy intermediate samples. This is done by optimizing a new token embedding on the fly. Additionally, they present techniques to enhance both the quality of generated results and semantic controllability. Qualitative and quantitative comparisons were conducted across three different diffusion models to demonstrate the effectiveness of their approach.

Strengths

  • The paper is well organized and easy to follow.
  • The idea of optimizing a single token embedding to preserve the intended semantics while generating minority features is interesting.

Weaknesses

  • Limited Novelty: The authors adapt an existing idea from [1] for text-to-image models, adding techniques to enhance optimization stability and semantic controllability for generating minority images. However, the core concept remains similar to that of [1].
  • No qualitative or quantitative comparison is provided against the proposed approach in Eq.(5). The authors argue that it has “theoretical issues that limit performance gains,” but there is no supporting evidence in the paper.
  • DDIM+null seems like a strong baseline; however, no qualitative comparison against it is shown.
  • The quantitative results in Table 1 are unconvincing. While the likelihood is lowest, the method shows improved results against the baselines mostly on SD2.1; it is unclear why similar improvements are absent for the other diffusion models.
  • How does the proposed method impact image quality? A systematic evaluation is needed to assess this.
  • Precision and recall are known to be inadequate metrics for diversity evaluation. The authors should consider using [2] to assess their method.
  • Why would CLIPScore improve for the proposed method if the text input remains unchanged?
  • Following my last three comments, I suggest conducting a user study to measure the diversity and quality of your approach compared to other baselines.
  • Would optimizing more than one token lead to better results?
  • On line 296, it’s noted that placing the placeholder string at the end of the prompt yields the best performance. Why might this be the case?

[1] Um et al. (2024). Self-guided generation of minority samples using diffusion models.
[2] Naeem et al. (2020). Reliable Fidelity and Diversity Metrics for Generative Models.

Questions

See weaknesses. My biggest concern is the limited novelty, as the method is a relatively small incremental step over [1].

Review (Rating: 5)

This paper proposes a method to generate more minority instances. The framework appends a trainable token to the prompt and optimizes this token in real time during the sampling process, aiming to generate more minority instances while preserving semantic integrity.
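As a rough illustration of this on-the-fly scheme, here is a toy sketch in which a quadratic stand-in replaces the paper's likelihood objective and a linear map replaces the denoiser; all names (`likelihood_loss`, `sample_with_prompt_opt`, `W`) are hypothetical and not the authors' implementation:

```python
import numpy as np

def likelihood_loss(x_t, token_emb, W):
    # Quadratic stand-in for the likelihood metric the paper minimises;
    # smaller values play the role of "lower density" here.
    residual = x_t - W @ token_emb
    return 0.5 * float(residual @ residual)

def grad_token(x_t, token_emb, W):
    # Analytic gradient of the toy objective w.r.t. the token embedding.
    return -W.T @ (x_t - W @ token_emb)

def sample_with_prompt_opt(x_T, token_emb, W, steps=50, inner=5, lr=0.01):
    """At every sampling step, first take a few gradient steps on the
    learnable token embedding, then apply a (toy, linear) denoising
    update conditioned on the optimised embedding."""
    x, e = x_T.copy(), token_emb.copy()
    for _ in range(steps):
        for _ in range(inner):            # inner token optimisation
            e -= lr * grad_token(x, e, W)
        x = 0.9 * x + 0.1 * (W @ e)       # toy denoiser update
    return x, e
```

The point of the sketch is only the control flow: the rest of the prompt is untouched, and only the single appended embedding is updated during sampling.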

Strengths

  1. This paper proposes a prompt optimization method by placing a learnable token at the end of the sentence to preserve the original semantic information.
  2. The paper explores a self-learning approach to optimize the prompt token, thereby enhancing the model's ability to generate more minority instances.
  3. By setting different objective functions, more functionalities can be achieved.
  4. The article is well-written, with clear and precise explanations.

Weaknesses

  1. As mentioned in Fig. 1, there is ambiguity from biases such as 'man' being associated with 'young'. Why can't we directly use prompt engineering, e.g. using 'old man' as the prompt, to solve the problem you mention?

  2. It is necessary to use prompts corresponding to minority instances to generate images and then observe the advantages of your method over existing methods. Without a detailed prompt, generating any image is reasonable; I cannot consider that a minority-instance scenario, since it only indicates that the model tends to generate certain samples.

  3. Additional experiments on different samplers (ODEs, SDEs) are needed to verify the effectiveness.

  4. This paper introduces additional optimization at sampling time; what is the resulting efficiency overhead?

  5. How does performance change for diffusion models with fewer sampling steps?

Questions

As shown in Weaknesses.

Review (Rating: 5)

This paper investigates the behavior of T2I models in low-density regions of the data distribution and proposes an online prompt-optimization framework to improve minority generation. Concretely, it injects a learnable token into the text encoder that is updated on the fly to optimize a carefully designed objective function, so as to achieve the desired generation result. Through extensive results, the authors show that the proposed method can generate images with high quality and prompt alignment in low-likelihood regions. The authors also explore using this method to mitigate bias in pretrained T2I models.

Strengths

  1. The presentation is clear and easy to follow. The paper provides clear intuition and motivation for the proposed objective by starting with a naive application of previous methods. It provides careful theoretical analysis of the weaknesses of this naive application (Eq. 5) and proposes a novel approach to address them.

  2. The authors provide extensive experiments on three models (SDv1.5, SDv2.0, SDXL-LT), demonstrating that the proposed method generalizes to different model architectures.

Weaknesses

  1. The paper lacks analysis of the statistical significance of the evaluation results. This is particularly relevant as the different metrics have varying scales. One of the authors' major claims is that the proposed method achieves "reasonable generation quality" in the "low likelihood" regime. For example, the paper shows MinorityPrompt has a 0.17 drop in PickScore on SDv1.5 (Table 1). It is hard to judge whether such a difference is major or minor without a standard error or confidence interval: if the stderr is ±0.2, the result is a statistical tie; if it is smaller, it may indicate that MinorityPrompt is worse than the baseline with a sufficiently low p-value.

  2. For the quantitative evaluation, it appears that MinorityPrompt often achieves higher prompt alignment (CLIPScore) at the expense of image quality (PickScore). A similar tradeoff is often achieved through classifier-free guidance (CFG). The authors use a fixed CFG scale of 7.5; they should instead vary the CFG of the base model and establish the frontier of the CLIPScore-image-quality tradeoff. MinorityPrompt may lie outside that frontier and be strictly better, or there may be a CFG value that achieves both higher CLIPScore and higher PickScore than MinorityPrompt. Without this study, the results are inconclusive.

  3. The experiments on fairness are inconclusive, and the authors fail to compare against baselines such as ITI-GEN [1], Fair Diffusion [2], and aDFT [3]. Using a learnable token to achieve fair generation is not a novel idea, so it is important to compare against the existing literature.
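The statistical point in weakness 1 is easy to make concrete. A minimal sketch, assuming (purely for illustration, since no per-image spread is given here) a per-image PickScore standard deviation of 1.0 and a 10k-image sample per method:

```python
import math

def stderr_of_mean(std, n):
    # Standard error of a sample mean.
    return std / math.sqrt(n)

def z_score(diff, se_a, se_b):
    # z statistic for a difference of two independent sample means.
    return diff / math.sqrt(se_a ** 2 + se_b ** 2)

# Hypothetical inputs: the 0.17 PickScore gap, an assumed per-image
# std of 1.0 for both methods, and n = 10,000 images each.
se = stderr_of_mean(1.0, 10_000)  # 0.01
z = z_score(0.17, se, se)         # about 12, far beyond a statistical tie
```

Under this assumed spread the 0.17 gap would be highly significant; with a per-image std of 20 instead, the same gap would be a tie. That is exactly why the reported stderr matters.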

In the absence of human evaluation (which is understandable, as it can be very costly), I would expect more discussion of these numerical metrics and how they translate into the perceptual quality of generated images.
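The frontier study suggested in weakness 2 reduces to a Pareto-dominance check over (CLIPScore, PickScore) pairs collected from a CFG sweep. A minimal sketch; the function names and all numbers are hypothetical, for illustration only:

```python
def dominates(a, b):
    # Point a dominates b if it is at least as good on both metrics
    # and strictly better on at least one (higher is better for both).
    return a[0] >= b[0] and a[1] >= b[1] and a != b

def outside_frontier(candidate, sweep):
    """True if the candidate (CLIPScore, PickScore) pair is not
    dominated by any point of the CFG sweep."""
    return not any(dominates(p, candidate) for p in sweep)

# Hypothetical CFG sweep of the base model, as (CLIPScore, PickScore).
cfg_sweep = [(0.28, 21.9), (0.30, 21.7), (0.32, 21.3)]
```

With these made-up numbers, a point such as (0.31, 21.6) is dominated by no sweep point and lies outside the frontier, while (0.29, 21.5) is dominated by (0.30, 21.7), i.e., some CFG value beats it on both metrics at once.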

[1] Zhang, Cheng, et al. "ITI-GEN: Inclusive Text-to-Image Generation." Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023.
[2] Friedrich, Felix, et al. "Fair Diffusion: Instructing Text-to-Image Generation Models on Fairness." arXiv preprint arXiv:2302.10893 (2023).
[3] Shen, Xudong, et al. "Finetuning Text-to-Image Diffusion Models for Fairness." arXiv preprint arXiv:2311.07604 (2023).

Questions

See weaknesses. Overall, I find the paper well motivated, with a good theoretical foundation. However, the current experiments fail to show the practical significance of the proposed method, especially since statistical significance is not discussed and the paper uses a non-conventional benchmark. I would welcome responses that address weaknesses 1, 2, and 3.

A few additional questions not mentioned in the weaknesses and not taken into consideration in my decision:

  1. The paper uses 10k images for SDv1.5/2.0 and 5k for SDXL-LT. Why was this setup adopted? This is also relevant to weakness 1, as different sample sizes lead to different standard errors/confidence intervals.
  2. How is Figure 4 generated? Are samples randomly picked from different models, or is the latent fixed?

Review (Rating: 6)

This paper presents a method to enable text-to-image diffusion models to generate minority samples, those less common in training data. Specifically, an online prompt optimization framework is developed to encourage the emergence of desired properties by optimizing text embedding of learnable tokens. Subsequently, this framework is tailored into a specialized solver that promotes the generation of minority features by incorporating a carefully crafted likelihood objective. Comprehensive experiments, conducted across various types of T2I models, demonstrate that the proposed approach significantly enhances the capability to produce high-quality minority instances compared to existing samplers.

Strengths

  1. The proposed method is validated with multiple text-to-image diffusion models, showing generalizability across different models, including distilled backbones such as SDXL-Lightning.
  2. The proposed method can effectively encourage the emergence of low-likelihood samples and can be applied to mitigate the bias issue of text-to-image diffusion models, as supported by the quantitative evaluation results.
  3. The proposed method only needs to optimize the learnable tokens, without affecting the semantics of the input text prompt, and can therefore improve diversity without compromising text alignment and image quality too much.
  4. The manuscript presents a detailed analysis of, and effective solutions for, the issues of related work (Um & Ye, 2024).

Weaknesses

The authors claim that the method improves the ability to create minority samples with minimal compromise to image quality, but there are no experimental results to support this point. The manuscript would be stronger if the authors added an image-quality analysis, such as FID comparisons.

Questions

Would it be possible to provide a quantitative analysis of how favoring low-density samples affects image quality?

Withdrawal Notice

I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.