CHATS: Combining Human-Aligned Optimization and Test-Time Sampling for Text-to-Image Generation
We propose CHATS, a novel generative framework that facilitates the collaboration between human preference alignment and test-time sampling.
Abstract
Reviews and Discussion
The paper introduces CHATS, a text-to-image generation framework that integrates human preference alignment with test-time sampling. It employs two distinct models to capture preferred and dispreferred distributions, trained with a new objective based on Direct Preference Optimization (DPO), and uses a proxy-prompt sampling strategy. The main experimental results claim CHATS outperforms traditional methods like Diffusion-DPO across benchmarks.
Update after rebuttal
Thanks to the authors for their efforts in their rebuttal. I decided to keep my initial positive rating.
Questions for Authors
Please refer to the weakness.
Claims and Evidence
The claims of improved performance and data efficiency are supported by experiments on SD1.5, SDXL, and an in-house model, with results in Tables 1-3 showing CHATS outperforming baselines.
Methods and Evaluation Criteria
The proposed CHATS method, combining human preference alignment and test-time sampling, makes sense for improving text-to-image generation quality and alignment.
Theoretical Claims
I reviewed the correctness of the training objective derivation for CHATS in Section A.2 (Appendix A). The logic is sound.
Experimental Design and Analysis
I checked the experimental design in Section 5, focusing on Tables 1-3. The comparison with Standard and Diffusion-DPO baselines across SD1.5, SDXL, and In-house T2I models is valid, and the metrics (HPS v2, ImageReward, PickScore) are well-established.
Supplementary Material
Yes. I reviewed Appendix A (Mathematical Derivations), specifically A.1 and A.2, which detail the global optimum and CHATS training objective.
Relation to Prior Work
CHATS extends DPO (Rafailov et al., 2023) and CFG (Ho & Salimans, 2022) by integrating preference alignment and sampling.
Missing Essential References
No.
Other Strengths and Weaknesses
Weakness:
- The training objective (Section 4.1, Eq. 12) uses Jensen’s inequality to approximate an intractable expectation, but the paper doesn’t quantify the impact of this simplification. Could this lead to suboptimal convergence or bias in the preferred/dispreferred split?
- The small dataset size (7,459 pairs) is a strength for efficiency but a weakness for validation. The paper doesn’t test CHATS on larger, messier datasets.
- The default $\lambda=0.5$ and $\omega=5$ work well, but the ablation (Table 4, Fig. 3) suggests that sensitivity isn't deeply explored.
- Using two models increases the computing cost compared to single-model methods like Diffusion-DPO. The paper doesn’t report runtime or memory metrics.
Other Comments or Suggestions
N/A.
We thank the reviewer for the valuable and insightful feedback!
1. Clarification on approximating an intractable expectation using Jensen’s inequality
We acknowledge the reviewer's concern regarding the use of Jensen's inequality in our derivation of Eq. 12. Specifically, we use Jensen's inequality to upper-bound the intractable expectation in the objective; please refer to Eq. 39 and Eq. 40 in the appendix for the exact form.
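As a generic illustration only (this is the textbook form of the bound, not the exact expression in Eqs. 39-40), Jensen's inequality applied to the convex function $-\log(\cdot)$ lets an intractable log-expectation be replaced by a tractable upper bound:

```latex
% Generic Jensen-type bound: for any distribution q and positive integrand f,
% convexity of -log(.) moves the expectation outside the logarithm.
% The specific f and q appearing in Eqs. 39-40 are defined in the appendix.
-\log \mathbb{E}_{x \sim q}\big[f(x)\big] \;\le\; \mathbb{E}_{x \sim q}\big[-\log f(x)\big].
```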
Although applying Jensen’s inequality in this manner provides an upper bound rather than the exact loss, our extensive empirical evaluations across multiple benchmarks (e.g., Tables 1–3) indicate that the resulting training dynamics lead to consistent improvements in generation quality and robust convergence. In practice, the bias introduced by this approximation is effectively absorbed during optimization, and the desired preferred/dispreferred split is maintained.
While a tighter approximation might further reduce any potential bias, the experimental results confirm that our current approximation does not lead to suboptimal convergence or a misrepresentation of the two distributions. Quantifying the exact impact of this inequality remains challenging due to the inherent complexity of the diffusion process. However, the empirical performance gains observed across diverse datasets and model architectures strongly suggest that the approximation error is minimal.
2. Validation on different datasets
Although the 7,459-pair dataset (OIP) demonstrates our method's high data efficiency, we have also evaluated CHATS on a larger and noisier dataset. Our experiments on the PaP v2 dataset, which comprises 851,293 preference pairs, are reported in Table 5 and confirm that CHATS consistently outperforms baseline methods on both datasets. Furthermore, our results indicate that training with the higher-quality OIP dataset yields better performance, as discussed in "A small high-quality preference dataset is enough" (Line 379, right column). Detailed information about the datasets can be found in the appendix, Line 681-687.
3. More analysis on the sensitivity of $\omega$ and $\lambda$
We conduct additional experiments to further examine the impact of hyperparameters. The results on SDXL are summarized below.
For the guidance scale $\omega$, we obtained the following:
| $\omega$ | HPS v2 on Photo (↑) |
|---|---|
| 2 | 28.83 |
| 3 | 29.36 |
| 4 | 29.81 |
| 5 (default) | 29.62 |
| 6 | 29.61 |
| 7 | 29.39 |
| 8 | 28.72 |
These results indicate that while our default value $\omega=5$ works well, $\omega=4$ yields a slightly higher score. It is important to note that $\omega$ is a user-specified hyperparameter that governs the trade-off between concentrating generation on high semantic-density regions and maintaining output diversity. As such, it is not unique to CHATS and was not extensively tuned in our framework.
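For concreteness, the snippet below sketches how a guidance scale of this kind enters a standard classifier-free-guidance update; this is the generic CFG rule, not the CHATS-specific combination in Eq. 15, and `model`, `cond_emb`, and `null_emb` are placeholder names.

```python
import torch

def cfg_noise_prediction(model, x_t, t, cond_emb, null_emb, omega: float = 5.0):
    """Standard classifier-free guidance: blend conditional and unconditional
    noise predictions with guidance scale omega (larger omega = stronger
    adherence to the prompt, less diversity). `model` is any epsilon-predictor."""
    eps_cond = model(x_t, t, cond_emb)    # conditional prediction
    eps_uncond = model(x_t, t, null_emb)  # unconditional (null-prompt) prediction
    return eps_uncond + omega * (eps_cond - eps_uncond)
```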
Regarding $\lambda$, our experiments yield the following:
| $\lambda$ | HPS v2 on Photo (↑) |
|---|---|
| 0.5 (default) | 29.62 |
| 0.0 | 29.36 |
| -0.1 | 29.34 |
| -0.3 | 29.25 |
These results confirm that the best performance is achieved around $\lambda = 0.5$, which is consistent with our analysis presented in Figure 5 and Lines 765-769 of the appendix.
In summary, our ablation studies demonstrate that CHATS is robust to moderate variations in these hyperparameters. The default settings of $\omega = 5$ and $\lambda = 0.5$ already yield high-quality generation, with minor adjustments giving comparable performance.
4. Computational cost
We would like to point out that Table 6 already reports computational cost in terms of images generated per second, comparing CHATS with single-model methods like Diffusion-DPO. In addition, we have performed supplementary experiments to address the increased cost introduced by the dual-model architecture. As explained in Lines 436-439 (right column), by simultaneously distilling both the guidance scale (i.e., $\omega$ in Eq. 17) and the two models into a single model, the extra inference cost can be completely eliminated while still achieving high-quality generation. For example, the following table shows that our distillation variant (CHATS-distill) not only reduces memory usage and increases throughput relative to CHATS but also maintains the improved HPS v2 score:
| Method | Memory (↓) | Throughput (↑) | HPS v2 on Photo (↑) |
|---|---|---|---|
| Standard | 1× | 1× | 26.88 |
| CHATS | 2× | 0.97× | 29.62 |
| CHATS-distill | 1× | 2× | 29.53 |
Thus, while the dual-model approach in CHATS does introduce additional cost in its raw form, our distillation strategy fully recovers efficiency without compromising the quality of the generated images.
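As a rough illustration of the distillation idea described above (a sketch under our own assumptions, not the paper's actual training code; all names are hypothetical, and the exact guided combination is Eq. 15/17 in the paper), the single student model is simply regressed onto the guided output assembled from the two frozen teachers:

```python
import torch

def distill_step(student, preferred, dispreferred, x_t, t, cond_emb, null_emb,
                 omega: float, lam: float, optimizer):
    """One guidance-distillation step: the student learns to reproduce the
    guided prediction built from the two frozen teachers, so the guidance
    scale and the dual models are baked into one network."""
    with torch.no_grad():
        eps_w = preferred(x_t, t, cond_emb)            # preferred, conditional
        eps_l_uncond = dispreferred(x_t, t, null_emb)  # dispreferred, unconditional
        eps_l_cond = dispreferred(x_t, t, cond_emb)    # dispreferred, conditional
        # Schematic guided target: a CFG-like term plus a lambda-weighted
        # dispreferred correction (placeholder for the paper's Eq. 15/17).
        teacher = eps_l_uncond + omega * (eps_w - eps_l_uncond) \
                  + lam * (eps_l_cond - eps_l_uncond)
    loss = torch.nn.functional.mse_loss(student(x_t, t, cond_emb), teacher)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```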
This paper aims to improve the performance of text-to-image diffusion models by using a human preference dataset. To make better use of DPO and CFG, they propose a training objective that trains a preferred model and a dispreferred model. During the sampling step, they introduce a new guidance method that incorporates both models. Experimental results show performance improvements on several benchmark datasets.
Update after rebuttal
Thank you for the authors' response and the additional responses to further comments. After considering the authors' rebuttal and the reviews from other reviewers, I have revised my rating to a borderline accept. This work is meaningful in that it explores DPO applied to diffusion models sampled with CFG. However, this approach requires twice the memory unless a distillation method is applied. The authors have properly presented and analyzed the strengths of the proposed method, but the need for further hyperparameter tuning and the need for a distillation method are still apparent. Therefore, while I lean towards a positive evaluation, my support is not particularly strong.
Questions for Authors
I would like the response to primarily address the issues and questions raised in the Claims and Evidence section.
Claims and Evidence
The primary claim of the paper is that training and sampling with separate models for preferred and dispreferred outputs improves text-to-image diffusion models aligned with human preferences. The proposed method modifies the DPO objective to accommodate this dual-model setup and introduces a corresponding guidance-based sampling strategy. While the experimental results partially support this claim, I find that the advantages of this approach over existing methods are not clear.
- The modification of the DPO objective to fit the dual-model framework requires further justification. Traditional DPO is derived from a reinforcement-learning setup in which a single policy model implicitly represents the reward function. Splitting it into two models raises the question of whether the theoretical foundations of DPO still hold in this new formulation.
- It is also unclear whether the (dis)preferred model only receives training signals from its corresponding (dis)preferred data and, if so, whether this setup can truly be considered a dual-objective framework.
- A key concern is the convergence properties of the diffusion model under the proposed objective, i.e., where does the optimal policy of the new formulation converge?
- In the sampling step, the derivation of Eq. (15) requires further clarification. For example, in Eq. (16), what does it mean for the dispreferred-model term to receive a signal in the positive direction? A more intuitive or theoretical explanation of the behavior of this term would strengthen the argument.
- The paper introduces proxy prompts but does not provide clear evidence of their effectiveness. Specifically, it is unclear why linear interpolation remains effective in this setting. The lack of a strong theoretical justification for this design choice weakens the argument for its necessity.
- Training and sampling with two models raises concerns about memory efficiency.
Methods and Evaluation Criteria
Similar to my concerns above, it is uncertain whether the proposed dual-model training scheme preserves the theoretical foundations of DPO and whether it leads to a well-defined optimal policy.
Theoretical Claims
I have reviewed the derivations of RLHF and the training objective of CHATS. However, I am particularly interested in whether the convergence point of the loss function from Eq. (9) is consistent with that of existing RLHF or DPO methods. A clearer discussion of how the loss formulation ensures consistency with established preference learning frameworks would strengthen the paper's claims.
Experimental Design and Analysis
I have reviewed the Experiments section of the paper and have the following concerns:
- In Table 4, the ablation study shows that a single model trained on the full dataset already outperforms the baseline. This raises the question of whether the proposed two-model approach provides a significant advantage over a simpler alternative. A clearer rationale is needed to demonstrate the need to train separate preferred and dispreferred models.
- In Figure 3, the paper provides a sensitivity analysis for $\lambda$, but while the main text discusses cases where $\lambda$ is negative, the experiments do not include results for negative $\lambda$. Including this analysis would strengthen the empirical evaluation by providing a more complete picture of the behavior of the method.
Supplementary Material
I did a rough review of the appendix, focusing on the mathematical derivation section.
Relation to Prior Work
The paper proposes a new preference optimization approach for diffusion models, building on methods widely used in large language models (LLMs) and other domains. By extending preference optimization to the diffusion model framework, the work introduces a potentially impactful direction for aligning generative models with human preferences.
If the proposed method demonstrates significant advantages over existing approaches, it could have broader applicability beyond diffusion models, potentially influencing preference optimization strategies in LLMs and other generative models. A stronger discussion on the generalizability of this approach to different architectures would further highlight its relevance to the broader scientific community.
Missing Essential References
As far as I know, the essential papers seem to be well-referenced.
Other Strengths and Weaknesses
I have discussed in above sections.
Other Comments or Suggestions
N/A
Thank you for your helpful feedback and questions! Due to space limitations, we provide responses to your main comments. Further questions can be discussed in subsequent responses.
1. Theoretical foundations & convergence properties of CHATS
Given that DPO is invariant to affine transformations of the reward, for a reward $r(c, x_0)$ the optimal policy takes the standard form $\pi^{*}(x_0 \mid c) \propto \pi_{\mathrm{ref}}(x_0 \mid c)\,\exp\!\big(r(c, x_0)/\beta\big)$ (cf. Eq. 29), with the partition function omitted for simplicity.
CHATS decomposes the reward of traditional DPO (Eq. 30) into two parts, one associated with the preferred model and one with the dispreferred model, with the two partition functions omitted since they are constants for the optimization. Combining these two parts into an effective reward, the optimal distribution factorizes accordingly into a preferred and a dispreferred component.
Under the assumption of $L$-smoothness and using standard gradient descent with step size at most $1/L$, we obtain the descent inequality $\mathcal{L}(\theta_{t+1}) \le \mathcal{L}(\theta_t) - \tfrac{1}{2L}\,\lVert \nabla \mathcal{L}(\theta_t) \rVert^{2}$, which ensures the CHATS loss (Eq. 9) decreases and converges. Since the combination of the two components recovers the same optimal joint distribution as traditional DPO methods [1], CHATS preserves the theoretical foundations of DPO.
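For completeness, the two standard results invoked above are reproduced in generic notation (textbook forms under the usual DPO setup and an $L$-smooth loss; this is not a restatement of the paper's Eqs. 29-30):

```latex
% 1) DPO/RLHF optimum: for reward r(c, x_0) and KL-regularization strength beta,
\pi^{*}(x_0 \mid c) \;=\; \frac{1}{Z(c)}\;\pi_{\mathrm{ref}}(x_0 \mid c)\,
\exp\!\Big(\tfrac{1}{\beta}\, r(c, x_0)\Big),
\qquad Z(c) \text{ the partition function.}

% 2) Descent lemma: if L is L-smooth and \theta_{t+1} = \theta_t - \tfrac{1}{L}\nabla\mathcal{L}(\theta_t),
\mathcal{L}(\theta_{t+1}) \;\le\; \mathcal{L}(\theta_t)
\;-\; \frac{1}{2L}\,\bigl\lVert \nabla \mathcal{L}(\theta_t) \bigr\rVert^{2}.
```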
2. Dual training signals
In our training procedure, we do not split the data into separate (dis)preferred subsets. Instead, when minimizing losses such as those in Eq. 13 and 14, we select a ranked preference pair from the entire dataset. This single pair is then used to simultaneously update both the preferred and dispreferred models within a unified dual-objective framework. This joint training approach ensures that both models receive complementary signals derived from the ranked pair.
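A minimal sketch of this joint update, under our own naming assumptions (`preferred_loss` and `dispreferred_loss` are hypothetical stand-ins for the losses in Eqs. 13-14):

```python
import torch

def chats_training_step(preferred, dispreferred, batch, optimizer,
                        preferred_loss, dispreferred_loss):
    """Both models are updated from the same ranked pair (x_w preferred over x_l),
    rather than from two disjoint datasets, so each receives complementary signal."""
    x_w, x_l, cond = batch["win"], batch["lose"], batch["prompt_emb"]
    loss = preferred_loss(preferred, x_w, x_l, cond) \
         + dispreferred_loss(dispreferred, x_w, x_l, cond)
    optimizer.zero_grad()
    loss.backward()          # gradients flow into both models simultaneously
    optimizer.step()
    return loss.item()
```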
3. Clarification on Eq. 15
As indicated in Lines 220–226 (right column), when $\lambda > 0$, the dispreferred distribution is partially incorporated so that it contributes useful patterns while remaining less influential than the preferred distribution. In other words, the dispreferred model's noise prediction is toned down, partially suppressing undesirable patterns while preserving beneficial information. Conversely, when $\lambda < 0$, the corresponding guidance terms actively push samples away from undesired modes, effectively suppressing the entire output of the dispreferred model, similar to using a null prompt.
4. Justification on proxy prompt
Generative models represent prompts in a continuous embedding space where linear operations reflect meaningful semantic changes. For example, word-embedding arithmetic [2] (e.g., "queen" ≈ "king" – "man" + "woman") shows that semantic attributes can be linearly combined. Thus, interpolating between the prompt embedding and the null-prompt embedding (the proxy prompt used in Eq. 17) not only reliably captures the semantic content but also reduces the number of forward passes from 3 in Eq. 16 to 2 in Eq. 17, cutting inference cost by about one third.
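A schematic of how a proxy prompt can save one forward pass per denoising step (our own sketch with hypothetical names; the actual combination rule is Eq. 17 in the paper, and a single `model` stands in for the dual-model setup):

```python
import torch

def guided_eps_with_proxy(model, x_t, t, cond_emb, null_emb, omega, lam):
    """Instead of three forward passes (conditional, unconditional, and a third
    mixed term), fold the null prompt and the condition into a single proxy
    embedding, so only two passes per denoising step are needed."""
    proxy_emb = lam * cond_emb + (1.0 - lam) * null_emb  # linear interpolation in embedding space
    eps_cond = model(x_t, t, cond_emb)
    eps_proxy = model(x_t, t, proxy_emb)
    return eps_proxy + omega * (eps_cond - eps_proxy)    # CFG-style combination
```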
5. Memory efficiency
While the dual architectures in CHATS introduce additional cost (see Table 6), as noted in Lines 436–439 (right column), this extra inference cost can be completely eliminated via distillation. As shown in the table below, by simultaneously distilling the guidance scale (i.e., $\omega$ in Eq. 17) and the two models into a single one, CHATS achieves both high efficiency and high-quality generation (Model: SDXL).
| Method | Memory (↓) | Throughput (↑) | HPS v2 on Photo (↑) |
|---|---|---|---|
| Standard | 1× | 1× | 26.88 |
| CHATS | 2× | 0.97× | 29.62 |
| CHATS-distill | 1× | 2× | 29.53 |
6. Justification on Table 4
Even though a single model trained on the full dataset outperforms the baseline, our CHATS method consistently delivers further improvements as shown in Table 4. Similar trends are observed with SDXL:
| Config | HPS v2 on Photo (↑) |
|---|---|
| single model (full data), $\omega=5$ | 28.20 |
| two models, $\omega=5$, $\lambda=0.5$ | 29.62 |
7. More analysis on $\lambda$
We show a further ablation of $\lambda$ in the table below (SDXL, $\omega=5$):
| $\lambda$ | HPS v2 on Photo (↑) |
|---|---|
| 0.5 (default) | 29.62 |
| 0.0 | 29.36 |
| -0.1 | 29.34 |
| -0.3 | 29.25 |
We observe that the best choice of $\lambda$ occurs around 0.5, consistent with our analysis in Fig. 5 and Lines 765-769 of the appendix.
References
[1] Diffusion Model Alignment Using Direct Preference Optimization, CVPR'24
[2] Distributed Representations of Words and Phrases and their Compositionality, NeurIPS'13
Thank you for the authors’ response. Some of my concerns have been addressed. In particular, I had overlooked and misunderstood the dual training objective, but the theoretical analysis and explanations provided by the authors help clarify this point. Since this is my primary concern, it brings me to at least a borderline recommendation for this paper. However, I still have some remaining concerns that make me hesitant to move toward acceptance just yet.
First, while it is intuitively clear what guidance the authors want each term in Eq. (15) to provide during sampling, it is theoretically unclear what distribution the guided samples should ultimately follow. For example, classifier-free guidance can be interpreted, starting from classifier guidance via Bayes' rule, as sharpening an implicit classifier and then replacing it with a combination of unconditional and conditional denoising networks. I wonder whether a similar theoretical explanation could be applied here as well.
Additionally, I am not sure that the current explanation of proxy prompts convincingly supports the proposed method. Modern diffusion models use much more complex text encoders than [2], and the claim that the text embeddings follow linear properties based on the analysis of [2] is not particularly convincing. At the very least, there should be experimental evidence to support this property. For example, let $e(p)$ denote the text embedding for prompt $p$. Then, as the authors mentioned, I would be interested to see whether $e(\text{"king"}) - e(\text{"man"}) + e(\text{"woman"})$ aligns with $e(\text{"queen"})$, or whether sampling from this combined embedding with the diffusion model generates images that semantically represent "queen".
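Concretely, such a check could be run along the following lines, sketched here with the CLIP text encoder from Hugging Face `transformers`; the encoder choice and the pooling are illustrative assumptions, not something the paper specifies:

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

def embed(prompt: str) -> torch.Tensor:
    """Return a pooled text embedding for a prompt."""
    tokens = tokenizer(prompt, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = text_encoder(**tokens)
    return out.pooler_output.squeeze(0)  # pooled representation at the EOS token

e = {w: embed(w) for w in ["king", "man", "woman", "queen"]}
combined = e["king"] - e["man"] + e["woman"]
cos = torch.nn.functional.cosine_similarity(combined, e["queen"], dim=0)
print(f"cosine(king - man + woman, queen) = {cos.item():.3f}")
```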
Thank you very much for your thoughtful and insightful comments. We respond to your remaining concerns below.
1. What distribution the guided samples should ultimately follow
Below is a compressed derivation of Eq.15 starting from Bayes’ rule and extending the standard classifier‐free guidance (CFG) derivation to our dual‐model setting.
Start from Bayes' rule for an implicit classifier, $p(c \mid x_t) = p(x_t \mid c)\,p(c)\,/\,p(x_t)$. Since $p(c)$ can be regarded as a constant during sampling, CFG defines the guided distribution by raising this classifier term to a guidance scale $\omega$, i.e., $\tilde{p}_{\omega}(x_t \mid c) \propto p(x_t)\,p(c \mid x_t)^{\omega}$. Substituting the expression for $p(c \mid x_t)$ and omitting constants yields the familiar CFG form $\tilde{p}_{\omega}(x_t \mid c) \propto p(x_t)^{1-\omega}\,p(x_t \mid c)^{\omega}$.
In CHATS, two models are used:
- the preferred model $p^{w}(x_t \mid c)$, and
- the dispreferred model $p^{l}(x_t \mid c)$ (with its unconditional form $p^{l}(x_t)$).
For each model, we can write a classifier-like term via Bayes' rule: one for the preferred model and one for the dispreferred model. Under a simplifying assumption relating the two models, we combine the two signals into a composite log-odds score. Its first term tends to generate features favored by the preferred model while suppressing the background features typically produced by the dispreferred model in its unconditional output (similar to CFG), and its second term further accounts for the shift in the dispreferred model's output when conditioned on $c$, with its impact regulated by the scalar $\lambda$. In this form, the useful information in the dispreferred conditional $p^{l}(x_t \mid c)$ is exploited as well.
Following CFG, we define the CHATS guided distribution by exponentiating this composite score with the guidance scale. Substituting the composite score, expanding the classifier-like terms via Bayes' rule, and grouping terms recovers exactly Eq. 15. The final guided distribution is therefore not merely a sharpened version of the preferred conditional: it also leverages the dispreferred model, whose $\lambda$-weighted term adjusts the output according to how conditioning on $c$ changes the dispreferred model's behavior. This derivation, starting from Bayes' rule applied to both models, provides a theoretical foundation for the CHATS sampling distribution analogous to that of CFG.
2. More evidence on proxy prompt
We appreciate the reviewer's concern regarding the linearity assumption in the text-embedding space, especially given that modern diffusion models use more complex text encoders than those analyzed in [2]. To address this, we performed an experiment to verify whether a random additive fusion of two text embeddings can indeed capture meaningful semantic information, a setting closer to our proxy prompt than the "queen-king" case.
Setup:
We randomly masked certain components of the original prompt by replacing them with a [mask] token and generated images under the following four conditions (Model: SDXL):
1: Using only the unmasked portion of the prompt.
2: Using only the masked components (i.e., the content replaced by [mask]).
3: Using the original, unaltered prompt.
4: Converting both the unmasked and masked components into text embeddings and merging them via element‐wise addition. The resulting fused embedding is then used for image generation.
The qualitative results in this PDF show that using the fused text embedding (Condition 4) captures the intended semantics and produces images of equal or higher quality than those from the original prompt (Condition 3). This preliminary evidence indicates that an additive fusion in the text embedding space effectively integrates semantic features, thereby supporting our proxy prompt approach in Eq.17 even with modern, complex text encoders.
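A minimal sketch of Condition 4 under our own assumptions: `encode` and `generate` are hypothetical wrappers around the model's text encoder and sampler (the real SDXL interface takes additional arguments such as pooled embeddings, omitted here), and the example prompt split is purely illustrative.

```python
import torch

def condition4_image(encode, generate, unmasked_prompt: str, masked_content: str):
    """Condition 4: encode the unmasked portion and the masked-out content
    separately, fuse the two text embeddings element-wise, and generate an
    image from the fused embedding."""
    emb_a = encode(unmasked_prompt)   # e.g. "a [mask] standing in a snowy forest"
    emb_b = encode(masked_content)    # e.g. the masked content: "red fox"
    fused = emb_a + emb_b             # element-wise additive fusion of embeddings
    return generate(prompt_embeds=fused)
```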
This paper presents CHATS, a framework for text-to-image generation (T2I) that enhances both text-image alignment and generation quality. Unlike traditional approaches that separately apply human preference alignment and classifier-free guidance, CHATS integrates both components to optimize text-to-image diffusion models. The proposed method models both preferred and dispreferred distributions and employs a proxy-prompt-based sampling strategy to leverage useful information from both. CHATS demonstrates data efficiency, achieving good performance with minimal fine-tuning data. Experimental results show that CHATS outperforms existing preference alignment techniques on some evaluation metrics.
Questions for Authors
NA
Claims and Evidence
NA
Methods and Evaluation Criteria
yes
Theoretical Claims
NA
Experimental Design and Analysis
- The performance improvement, especially over SDXL + Diffusion-DPO, in Tables 1-3 is marginal. The effectiveness of the proposed method may not be significantly verified.
Supplementary Material
all
Relation to Prior Work
NA
Missing Essential References
[d] Null-text Inversion for Editing Real Images using Guided Diffusion Models. CVPR’23
Other Strengths and Weaknesses
Strengths:
- A new framework combining RL and guidance. The paper introduces an approach that jointly optimizes human preference alignment and test-time sampling, addressing some limitations in existing text-to-image models.
- Data-Efficient Fine-Tuning. CHATS achieves good performance with a small, high-quality fine-tuning dataset, making it more practical and resource-efficient for real-world applications.
Weaknesses:
- Lack of novelty. Human alignment methods for diffusion models [a, b, c] and learning negative/dispreferred concepts for test-time sampling [d] have already been proposed. This method seems to simply combine existing technologies.
- The performance improvement, especially over SDXL + Diffusion-DPO, in Tables 1-3 is marginal. The effectiveness of the proposed method may not be significantly verified.
- The deployment cost may be doubled compared with a single diffusion model, due to the introduction of the second (dispreferred) model. Besides, the efficiency can also be affected, as shown in Table 6.

[a] Diffusion Model Alignment Using Direct Preference Optimization. CVPR'24
[b] Training Diffusion Models with Reinforcement Learning. ICLR'24
[c] DPOK: Reinforcement Learning for Fine-tuning Text-to-Image Diffusion Models. NeurIPS'23
[d] Null-text Inversion for Editing Real Images using Guided Diffusion Models. CVPR'23
Other Comments or Suggestions
NA
Thank you for your comments.
1. Novelty
We respectfully disagree with the assertion that CHATS merely combines existing technologies. In the related work (Lines 135-146, left column), we explicitly differentiate our approach from prior DPO methods for diffusion models [1-3]. In fact, the three papers you cited are evidence that supports our novelty rather than undermining it. The key insight of CHATS is to integrate human-preference-optimization fine-tuning with the sampling process, leveraging their inherent synergy to refine image generation; none of the DPO methods you mentioned explores this.
Moreover, Null-text Inversion [4] concentrates on image editing by optimizing the null-text embeddings using gradient updates to achieve a superior DDIM inversion. In contrast, the proxy-prompt-based sampling strategy employed by CHATS is primarily designed to enhance sampling efficiency by reducing the number of forward passes from 3 in Eq.16 to 2 in Eq.17, while all prompt features remain frozen and untrained. Additionally, [4] neither utilizes human preference data nor trains separate models to explicitly capture the distributions of preferred and dispreferred images, and the dual architectures used by CHATS naturally separate these distributions into two distinct parts. As a result, our method is not simply a combination of existing techniques but rather a novel integration that leverages human preference data to improve the generative process. To the best of our knowledge, such an approach has not been previously explored in the context of text-to-image generation.
2. Clarification on performance improvement
We acknowledge that the numerical improvements, especially when comparing SDXL + Diffusion-DPO with CHATS, may appear modest on individual benchmarks. However, we emphasize that the improvements are consistent across multiple benchmark evaluations and are observed in both diffusion models (SDXL) and flow matching models (In-house T2I). Our extensive evaluations across aesthetic scores, GenEval, and DPG-Bench consistently demonstrate that CHATS improves aesthetic alignment and generation quality. The consistent gains across diverse datasets and model architectures validate the effectiveness of our approach. Moreover, these improvements are achieved with only a small high-quality fine-tuning dataset, highlighting the data efficiency of CHATS.
3. Clarification of deployment cost
While CHATS introduces a dual-model architecture that, if used naively, doubles the model size and slightly decreases throughput compared to a single diffusion model (as shown in Table 6), this cost can be completely eliminated through distillation. As noted in Lines 436–439 (right column), by simultaneously distilling the guidance scale (i.e., $\omega$ in Eq. 17) and the two models into a single model, we achieve both high efficiency and high-quality generation. For example, our distillation variant, "CHATS-distill," attains a memory footprint and throughput comparable to or even exceeding those of the standard model, while retaining the improved HPS v2 score, as demonstrated in the following table:
| Method | Memory (↓) | Throughput (↑) | HPS v2 on Photo (↑) |
|---|---|---|---|
| Standard | 1× | 1× | 26.88 |
| CHATS | 2× | 0.97× | 29.62 |
| CHATS-distill | 1× | 2× | 29.53 |
Thus, while the raw dual-model approach incurs additional computational cost, our results show that a distilled version can match or surpass the efficiency of a single model without compromising the quality gains provided by CHATS.
References
[1] Diffusion Model Alignment Using Direct Preference Optimization, CVPR’24.
[2] Training Diffusion Models with Reinforcement Learning, ICLR’24.
[3] DPOK: Reinforcement Learning for Fine-tuning Text-to-Image Diffusion Models, NeurIPS’23.
[4] Null-text Inversion for Editing Real Images using Guided Diffusion Models, CVPR’23.
This paper introduces CHATS, a framework integrating human preference optimization and sampling guidance for t2i models. Reviewers acknowledged the novel integration approach, data efficiency, and performance improvements across several benchmarks.
Initial concerns regarding theoretical justification were resolved by the authors' rebuttal. The weakness of the increased computational cost (memory, inference time) due to the dual-model architecture was addressed by the authors with CHATS-distill. After carefully reading the paper and all reviews, I generally agree with the reviewers' comments and the discussion outcome. Overall, I think the paper has value in the image-diffusion domain. I recommend that the authors address the reviewers' comments in the revision.