PaperHub
7.0/10 · Spotlight · 3 reviewers (min 4, max 5, std 0.5)
Ratings: 4 / 5 / 4 · Confidence: 3.7
Novelty 3.0 · Quality 3.0 · Clarity 3.0 · Significance 3.0
NeurIPS 2025

Aligning Text-to-Image Diffusion Models to Human Preference by Classification

OpenReview · PDF
Submitted: 2025-05-12 · Updated: 2025-10-29

Abstract

Keywords
Diffusion Models, Human Preference

Reviews and Discussion

Official Review
Rating: 4

This work deals with the alignment of text-to-image generative models. In short, diffusion-based generative models (and, by the way, other generative models such as flow matching, rectified flows, and so on) suffer from a number of problems when generating an image given a prompt. These issues include catastrophic neglect (objects present in the prompt are not generated in the final image), improper attribute binding (textures, colors, and other image characteristics are not respected), incorrect spatial binding (object positions specified by the prompt are not respected), wrong numeracy (e.g. object counts), and many more.

The authors build on fine-tuning methods that were originally devised for LLM alignment, and that have been also recently applied to diffusion models, to derive a new approach that relates alignment to classification accuracy. They amend the existing connection between alignment and semi-supervised learning by introducing a simple form of data augmentation, transforming the problem into a supervised classification task. Leveraging insights on bounds on the AM-Softmax loss and the Diffusion-DPO loss, as well as results indicating that ideal alignment requires the diffusion model to be discriminative, they propose a new loss that is an instance of the Circle Loss, which is used to fine-tune the (conditional) score network based on positive and negative examples, with their associated labels, plus additional parameters to improve the discriminative power of the diffusion model.
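For context, the generic Circle Loss (Sun et al., 2020), of which the ABC loss is described as an instance, has the form below. This is standard background rather than the paper's exact formulation, which substitutes diffusion-model scores and per-prompt margins for the embedding similarities used in the original loss.

```latex
% Generic Circle Loss (Sun et al., 2020), shown for context only; the ABC
% instantiation in the paper may use different score definitions and margins.
\mathcal{L}_{\mathrm{circle}}
= \log\!\Big[\, 1 + \sum_{j=1}^{L} \exp\!\big(\gamma\,\alpha_n^{j}\,(s_n^{j} - \Delta_n)\big)
                 \sum_{i=1}^{K} \exp\!\big(-\gamma\,\alpha_p^{i}\,(s_p^{i} - \Delta_p)\big) \Big],
\qquad
\alpha_p^{i} = \big[\,O_p - s_p^{i}\,\big]_{+},\quad
\alpha_n^{j} = \big[\,s_n^{j} - O_n\,\big]_{+}.
```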

Experiments using standard metrics and a user study indicate that the proposed ABC-loss works well and outperforms existing variants of DPO-based alignment methods. Additional experiments serve the purpose of validating the theoretical analysis, and study the impact of one important hyper-parameter of the proposed method.

Strengths and Weaknesses

  • Strengths:

    • The authors propose a simple solution to a difficult problem, building on existing work on model alignment through direct preference optimization and on valid insights connecting generative modeling and classification

    • The article is well written and easy to follow

    • The experimental validation goes beyond a quantitative/qualitative reporting of alignment results, and provides additional insights on the benefits of aligned models when used as classifiers, as well as an ablation study on an important hyper-parameter

  • Weaknesses:

    • State-of-the-art on text-to-image alignment is assessed superficially. This work focuses only on DPO-based variants and neglects much existing work in the domain. On the one hand, there are inference-time methods, which attempt to align T2I models by steering the generative paths through optimization of the attention layers of the score network, such as [1] and many others; on the other hand, there are fine-tuning methods that do not belong to the DPO family, such as [2], [3], and [4], for example.

    • The quantitative analysis of the performance of the proposed method in Table 1 relies on a protocol that hinders a proper assessment of the absolute performance of compared methods.

    • The metrics used in Table 1 are only a small subset of the many that can be used to assess alignment. See for example [2].

    • The proposed method is only compared to DPO-based alternatives, but not to other existing methods that use a different approach to alignment, such as [1-4] (and many more)

    • The proposed method, in principle, can be applied to other generative models, such as rectified flows. The evaluation is based on “older” models such as SD1.5 and SDXL, whereas the current best practice is to use FLUX or SD3, which use a different generative modeling paradigm (still related to diffusion somehow), and that are known to be more aligned than SD1.5 and SDXL.

    • There is no mention of recent work on autoregressive generative models, which appear to have superior alignment when compared to diffusion models [5, 6].

[1] Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, Daniel Cohen-Or, “Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models”, https://arxiv.org/abs/2301.13826

[2] Kaiyi Huang, Chengqi Duan, Kaiyue Sun, Enze Xie, Zhenguo Li, Xihui Liu, “T2I-CompBench++: An Enhanced and Comprehensive Benchmark for Compositional Text-to-image Generation”, https://arxiv.org/abs/2307.06350

[3] Chao Wang, Giulio Franzese, Alessandro Finamore, Massimo Gallo, Pietro Michiardi, “Information Theoretic Text-to-Image Alignment”, https://arxiv.org/abs/2405.20759

[4] Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, Kimin Lee, “DPOK: Reinforcement Learning for Fine-tuning Text-to-Image Diffusion Models”, https://arxiv.org/abs/2305.16381

[5] Zhiyuan Yan, Junyan Ye, Weijia Li, Zilong Huang, Shenghai Yuan, Xiangyang He, Kaiqing Lin, Jun He, Conghui He, Li Yuan, “GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image Generation”, https://arxiv.org/abs/2504.02782v1

[6] Hu Yu, Hao Luo, Hangjie Yuan, Yu Rong, Feng Zhao, “Frequency Autoregressive Image Generation with Continuous Tokens”, https://arxiv.org/abs/2503.05305

Questions

  • If recent autoregressive generative models are becoming the new standard for image generation, how relevant do you consider alignment approaches for diffusion models in general, and DPO-based fine-tuning approaches in particular, will be?

  • What are the key advantages of DPO-based alignment methods such as ABC when compared to other fine-tuning methods from the literature (e.g., you cite DPOK as reference [14], but do not even refer to it in the main text), or other inference-time methods?

  • How would your approach perform with more natural prompts than the ones you used in your evaluation? Would it be possible to consider using T2I-CompBench++ (reference [2] above) for your evaluation?

  • Could you report absolute metrics rather than percentage of wins in Table 1?

  • In sec 4.1, line 116, you rely on a pretty strong assumption. I realize your work is not the only one requiring such an assumption, but it would be interesting to hear your opinion on how a DPO-based framework such as ABC could cope with the more realistic case of multiple images being aligned to a single prompt. Consider a realistic user prompt: natural language is ambiguous and subject to interpretation, so there might be multiple generated images that could be considered aligned to it.

  • In Theorem 2, lines 142-143, you need to assume $p(y)$ and $p(x)$ to be uniform distributions. This assumption is also pretty strong. Can you comment on what would be necessary if we relax such an assumption?

Limitations

Yes, limitations are discussed.

Final Justification

I think the authors did a great job in addressing my concerns, and they run additional experiments that broaden the position of their work in the current landscape of T2I alignment methods.

Formatting Issues

No issues remarked.

Author Response

We greatly appreciate the reviewers' hard work and thank you for your valuable comments. We address each concern below.


Q: This work only focuses on DPO-based variants and neglects much existing work in the domain. On the one hand, there are inference-time methods, which attempt to align T2I models by steering the generative paths through optimization of the attention layers of the score network

A: We believe this concern arises from a misunderstanding of the comparison scope. As discussed in "Large Language Models Post-training: Surveying Techniques from Alignment to Reasoning," generative modeling methods can be broadly categorized into pre-training and post-training. Pre-training focuses on building general-purpose models from large datasets, while post-training adapts these models to specific downstream tasks.

Due to their distinct goals, evaluation strategies for these two categories differ. Pre-training methods are typically compared across architecture types, datasets, or training paradigms. In contrast, post-training methods—like ours—should be evaluated using the same foundation model and training setup, with only the loss objective varying to ensure a fair comparison.

Most works cited by Reviewer RW4E, except [2,3], focus on pre-training. For instance, [6] proposes an autoregressive model as a new foundation model and compares its generative capabilities with other pre-training approaches. By contrast, our ABC loss is a post-training method applied to a fixed foundation model. Analogously, comparing a fine-tuned GPT-3 with GPT-4o would not inform the quality of the fine-tuning method applied to GPT-3.

Additionally, [2] is an inference-time method without training, and [3] is self-supervised, whereas DPO-based methods—including ours—require supervised training. According to our tests conducted during the rebuttal period, [2] and [3] consistently underperform all DPO-based methods on the same base model. Given their fundamentally different setups, comparing their raw performance to ours is not meaningful for assessing the quality of our approach.


Q: How would your approach perform with more natural prompts than the ones you used in your evaluation? Would it be possible to consider using T2I-CompBench++ for your evaluation?

A: This is a valuable suggestion. However, implementing it poses several technical challenges. T2I-CompBench++ is specifically designed to test compositional generation—handling multiple objects, attributes, and relationships—which makes it more suited for evaluating foundation models developed through pre-training.

In contrast, post-training methods like ours require a well-defined downstream task and a corresponding dataset. The standard procedure involves fine-tuning a pre-trained foundation model using a specific loss and evaluating the resulting checkpoint against either the original model or other post-trained variants.

In our case, we currently lack a task-specific dataset aligned with the compositional goals of T2I-CompBench++. Therefore, although T2I-CompBench++ is a promising benchmark, it is not directly applicable under our current post-training setup. Currently, we make a compromise by taking the checkpoints trained on Pick-a-Pic to conduct this evaluation. Due to the maximum length limitation, we have included the corresponding data in our response to Reviewer qFcD.


Q: If recent autoregressive generative models are becoming the new standard for image generation, how relevant do you consider alignment approaches for diffusion models in general, and DPO-based fine-tuning approaches in particular, will be?

A: Both autoregressive generative models and diffusion models are promising approaches for image generation. The DPO (Direct Preference Optimization) method was initially proposed for aligning autoregressive language models, and Diffusion-DPO adapts this technique for use with diffusion models. The primary contribution of our paper is that we are the first to demonstrate that the alignment task can be transformed into a classification task. Reviewer RW4E can verify that this discovery does not depend on the model type. Both autoregressive and diffusion models can benefit from this finding.

Since Reviewer RW4E has a particular interest in the application of the ABC loss to autoregressive generative models, we believe it would be a great idea to prepare a separate paper discussing how to apply the ABC loss to autoregressive models.


Q: What are the key advantages of DPO-based alignment methods such as ABC when compared to other fine-tuning methods from the literature (e.g., you cite DPOK as reference [14], but do not even refer to it in the main), or other inference-time methods?

A: Both DPO and SFT methods have demonstrated their value in training large language models. SFT-based methods, such as DPOK, require a reward model to fine-tune the diffusion model. However, training a good reward model requires additional time and dataset resources. In contrast, DPO-based methods bypass the need for a reward model when fine-tuning the diffusion model.

Regarding inference-time methods, the running time for inference is significantly longer compared to SFT and DPO-based methods. In general, these three approaches represent different pathways toward achieving better results, and at this stage, it is difficult to predict which will ultimately prevail. This paper focuses solely on demonstrating that the alignment task can be transformed into a classification task through our ABC loss. We do not address whether DPO-based methods will outperform SFT and inference-based methods, as this is beyond the scope of our current work.


Q: In sec 4.1, line 116, you rely on a pretty strong assumption. I realize your work is not the only one requiring such an assumption, but it would be interesting to hear your opinion on how a DPO-based framework such as ABC could cope with the more realistic case of multiple images being aligned to a single prompt.

A: Thank you for your question. We acknowledge the limitation of this assumption. However, we could not avoid it in the current environment for two reasons:

  1. The current preference datasets are built on this assumption.
  2. The evaluation networks are also designed based on this assumption, as they are trained on the preference dataset.

Without new preference datasets and evaluation criteria, it is difficult for methods that handle multiple preferences to demonstrate their superiority and gain acceptance in top conferences like NeurIPS.

Since we have established the connection between alignment and classification, the assumption that "each text prompt corresponds to a single aligned image" can be interpreted as a multi-class classification task, where each sample is assumed to belong to only one class. In contrast, Reviewer RW4E's preference for a scenario where one prompt corresponds to multiple images can be interpreted as a multi-label classification task, where each sample may belong to multiple classes. In the future, we plan to explore the multi-label classification perspective to address this issue.


Q: In Theorem 2, lines 142-143, this assumption is pretty strong. Can you comment on what would it be necessary to do if we relax such an assumption?

A: We apologize for any confusion caused to Reviewer RW4E. In our paper, Theorem 2 is used to illustrate that classification performance directly determines alignment performance. To demonstrate this, we assume that we have a total of $N$ text prompts $\mathrm{y}_i$, each corresponding to an aligned image $\pmb{x}_{\mathrm{y}_i}$. We believe it is reasonable to assume that the prior distribution for both the prompt $\mathrm{y}_i$ and the image $\pmb{x}_{\mathrm{y}_i}$ follows a uniform distribution. If we were to assume otherwise, we would have to acknowledge that certain labels and images are inherently more special than others. However, we currently do not have a compelling reason to support this assumption.


Q: The proposed method, in principle, can be applied to other generative models, such as rectified flows. The evaluation is based on “older” models such as SD1.5 and SDXL, whereas the current best practice is to use FLUX or SD3, which use a different generative modeling paradigm (still related to diffusion somehow), and that are known to be more aligned than SD1.5 and SDXL.

A: Applying DPO-like methods to flow models is not an easy task. Flow models rely on a deterministic generative process based on Ordinary Differential Equations (ODEs), meaning they cannot sample stochastically during inference. In contrast, DPO relies on stochastic sampling to explore the environment, learning by trying different actions and improving based on rewards. This need for stochasticity conflicts with the deterministic nature of flow matching models. To apply DPO to flow models, one would need to conduct an ODE-to-SDE conversion, which transforms a deterministic ODE into an equivalent Stochastic Differential Equation (SDE) that matches the original model’s marginal distribution at all timesteps, thus enabling statistical sampling.
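For readers unfamiliar with the conversion mentioned above, the following is the standard background result from score-based generative modeling (Song et al., 2021), not a derivation from the reviewed paper: a deterministic probability-flow ODE and a stochastic SDE of the form below, integrated backward in time, share the same marginal distributions $p_t(\pmb{x})$, which is what makes it possible to re-introduce stochastic sampling for a model trained under the ODE view.

```latex
% Standard background (Song et al., 2021), not taken from the reviewed paper:
% both dynamics below, run backward in time, share the marginals p_t(x).
\mathrm{d}\pmb{x} = \Big[f(\pmb{x},t) - \tfrac{1}{2}\,g(t)^{2}\,\nabla_{\pmb{x}}\log p_t(\pmb{x})\Big]\mathrm{d}t
\quad\text{(probability-flow ODE)},
\qquad
\mathrm{d}\pmb{x} = \Big[f(\pmb{x},t) - g(t)^{2}\,\nabla_{\pmb{x}}\log p_t(\pmb{x})\Big]\mathrm{d}t + g(t)\,\mathrm{d}\bar{\pmb{w}}
\quad\text{(reverse-time SDE)}.
```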

We recently discovered a concurrent work, "Flow-GRPO: Training Flow Matching Models via Online RL", which claims to be the first method to apply GRPO to flow models in order to address this problem. By employing Flow-GRPO, our approach could potentially be applied to models like FLUX or SD3. However, this paper was released on May 8, 2025, on arXiv, and the code was released on May 15, 2025. As a result, we were not aware of this work before submitting our paper.


Q: Could you report absolute metrics rather than percentage of wins in Table 1?

A: Thank you for the suggestion. Due to length limits, we report the absolute metrics in Table 2 of our response to Reviewer PucN, including appended results for DPOK and D3PO.

Comment

Dear authors, thank you very much for your rebuttal, and hard work in producing additional results (I have read other reviews and your response).

While I appreciate the technical contribution of this work, I do not agree with the narrow angle taken by the authors concerning the empirical evaluation and comparison to alternative methodologies for addressing the T2I alignment problem. In my opinion the evaluation is limited in 1) comparison to other methods (inference-time, other fine-tuning approaches such as MITUNE), 2) prompt complexity (e.g. using natural prompts). Although the authors indicate superior performance to alternatives such as [2] and [3], at the time of this writing, I do not have enough information to properly assess such a claim.

I will remain open during the discussion phase, both with authors and other reviewers, to revise my score and to avoid "blocking" an interesting idea to be published and discussed in the community. However, as of now, I will not change my score.

Comment

Dear Reviewer RW4E,

We sincerely apologize for the delayed response, as we needed additional time to carefully consider how to conduct a fair comparison. We also acknowledge that our initial rebuttal did not provide sufficient evidence to fully address your concerns. We would like to take this opportunity to clarify why the comparisons you suggested were not included. Our decision was based on three main reasons: (1) the referenced methods belong to different methodological categories; (2) cross-category comparisons raise concerns regarding fairness and interpretability; and (3) space limitations prevented us from including additional results.

For the first reason, we note that A&E [1] is a training-free approach designed to improve inference-time performance without modifying model weights. In contrast, methods [3], [4], and ours are fine-tuning-based and rely on data-driven checkpoint updates. Even among fine-tuning methods, the assumptions differ significantly: MITUNE [3] generates pseudo-labeled data automatically using the base model, while method [4] assumes access to human-annotated data and sufficient resources to train a reward model, which is then used to guide the training of the generation model. In resource-constrained environments—especially those with limited GPU memory—method [4] becomes impractical, making DPO-style methods like ours the only viable option. Furthermore, the training data used also differs: DPO is trained on preference triplets (prompt, preferred image, less preferred image), whereas [4] uses prompt-image pairs without explicit preference signals. These differences in both input requirements and training objectives further complicate direct comparison.

For the second reason, even when technically possible, fair comparisons remain challenging. Fine-tuning methods yield new checkpoints that are evaluated under a shared sampling strategy, while training-free methods like A&E [1] modify the sampling process itself, operating on a fixed checkpoint. As such, the natural baselines for A&E [1] are other sampling strategies (e.g., DDPM, DDIM), not fine-tuned models.

To aid your evaluation, we report the performance of our fine-tuned checkpoint under both DDPM and A&E [1]’s sampling strategy, along with the performance of the original (unmodified) SDXL checkpoint under the same conditions. Note that A&E [1] requires manually specified subject tokens to guide inference—information that standard DDPM cannot leverage—further limiting the validity of direct comparisons.

| Model | Color (B-VQA) | Shape (B-VQA) | Texture (B-VQA) | Numeracy (UniDet) | 2D-Spatial (UniDet) | 3D-Spatial (UniDet) | Non-Spatial (CLIP) |
|---|---|---|---|---|---|---|---|
| SDXL+DDPM | 0.5708 | 0.4880 | 0.5600 | 0.5591 | 0.1949 | 0.3551 | 0.3065 |
| SDXL+A&E | 0.6589 | 0.5101 | 0.6737 | 0.5319 | 0.2272 | 0.3602 | 0.3204 |
| ABC+DDPM | 0.6708 | 0.5450 | 0.6866 | 0.5623 | 0.2401 | 0.3697 | 0.3154 |
| ABC+A&E | 0.6936 | 0.5726 | 0.7023 | 0.5744 | 0.2368 | 0.3745 | 0.3256 |

We also compare our fine-tuned model (ABC) against MITUNE [3] under the same DDPM sampling protocol:

| Model | PickScore (P) | HPS (P) | Aesth. (P) | CLIP (P) | PickScore (H) | HPS (H) | Aesth. (H) | CLIP (H) |
|---|---|---|---|---|---|---|---|---|
| SDXL-MITUNE | 21.61±2.22 | 27.04±3.26 | 5.54±0.74 | 30.11±8.56 | 23.06±2.47 | 29.80±2.98 | 5.92±0.99 | 39.96±5.90 |
| SDXL-ABC | 23.79±2.27 | 29.42±3.29 | 6.35±0.89 | 36.81±8.51 | 24.39±2.38 | 30.67±3.08 | 6.54±0.86 | 38.97±6.02 |

We further provide a comparison on T2I-CompBench++, showing that ABC consistently outperforms MITUNE across most dimensions:

| Model | Color (B-VQA) | Shape (B-VQA) | Texture (B-VQA) | Numeracy (UniDet) | 2D-Spatial (UniDet) | 3D-Spatial (UniDet) | Non-Spatial (CLIP) |
|---|---|---|---|---|---|---|---|
| MITUNE+DDPM | 0.6912 | 0.5261 | 0.6608 | 0.4866 | 0.2353 | 0.3472 | 0.3174 |
| ABC+DDPM | 0.6708 | 0.5450 | 0.6866 | 0.5623 | 0.2401 | 0.3697 | 0.3154 |

Overall, our method consistently outperforms MITUNE across a wide range of metrics. We attribute this to two main factors: (1) our dataset includes human-annotated preferences, whereas MITUNE relies on synthetic labels generated by the base model; and (2) our proposed ABC loss is more effective than the pseudo-labeling strategy employed by MITUNE.

We acknowledge that on T2I-CompBench++, ABC underperforms MITUNE on two specific scores. We attribute this to differences in the training datasets: the dataset used for ABC was not specifically designed for the T2I-CompBench++ task, whereas the dataset used in MITUNE was optimized to maximize point-wise mutual information—that is, to emphasize the difference between conditional and unconditional matching scores. While this design benefits pixel-level fidelity, it may lead to weaker semantic representation and alignment, which in turn affects performance on certain benchmarks.

Please feel free to reach out if you have any further questions. We sincerely appreciate your thoughtful feedback.

Best regards,

Authors

Comment

Dear authors, thank you for the additional experiments and discussion. I think this definitely strengthens your paper and provides a broader view of the approach you proposed. I will raise my score.

Thanks

Comment

Dear Reviewer RW4E

We sincerely thank you for the positive feedback and for recognizing the value of the additional experiments and discussions. We are pleased to hear that these additions have helped strengthen the paper and provide a broader perspective on our proposed approach. We greatly appreciate your support and are grateful for the updated score.

Best Wishes!

The Authors

Official Review
Rating: 5

This work proposes a novel connection between preference alignment and classification, i.e. being able to discriminate between preferred and unpreferred samples, and presents an alignment objective for aligning text-to-image diffusion models. The objective extends the Circle Loss objective into the setting of text-to-image diffusion models where the score, provided by the diffusion model, is used to approximate p(y|x). Unlike related approaches like Diffusion-DPO (which ABC, their method, is connected to), the provided objective does not require a reference model during training, and thus provides some memory/compute benefits as well. Results show that ABC outperforms prior alignment approaches according to existing reward models and human annotators.

Strengths and Weaknesses

Strengths

  • Proposes novel connection between classification and alignment, that is justified both theoretically and empirically.

  • Does not require a reference model during training, saving memory & compute

  • Strong, comprehensive results on automated evaluations (e.g. using HPS reward model as judge) and via user study.

Weaknesses

  • Importance of data augmentation is unclear. The authors note that the training loss may diverge without the use of data augmentation. However, the effect of this augmentation is not well documented, e.g. there is no quantitative ablation on the effect of the augmentation.

  • Strong assumptions regarding the nature of preference data, e.g. that each text prompt corresponds to one aligned image, or the data augmentation strategy that prefaces the losing image’s prompt with “The image that aligns less with human preferences”. Real-world preference data is quite noisy, and it’s unclear why a formulation dependent on such strong assumptions is able to outperform other methods like Diffusion-KTO, for instance, which have some noise-resistant properties. Additionally, the data augmentation strategy is questionable, as even though an image is less preferred in a single triplet it may generally be more preferred on aggregate. It is unclear why this strategy works with noisy real-world data.

  • Statistical significance of results. The authors report win-rates in the quantitative preference alignment experiments. However, win-rate can be quite sensitive to the choice of seed, noise, etc, as in order for a method to win it just needs to at least very slightly outperform the other per the reward model score. To fully validate the performance of ABC, statistical significance of these results should be provided, e.g. via confidence intervals.

  • Lack of user study details. The details of the user study are largely omitted in the paper, outside of 2 lines in the checklist. Was this a blind user study? How many annotators were assigned per comparison?

  • [minor editorial] Line 75 is missing a citation for triplet loss.

Questions

  • Can the authors report the statistical significance of the quantitative preference alignment experiments?

  • For the quantitative preference experiment using off-the-shelf reward models, can the authors report the average score (within some confidence interval) to help quantify the improvement of ABC over other methods?

  • Can the authors provide a quantitative ablation studying the effect of the data augmentation strategy?

Limitations

yes

Final Justification

The paper presents an interesting alternative to T2I alignment by formulating it as a classification problem. The results provided in the rebuttal are statistically significant (specifically for SDXL) and show the benefit of this approach. While the loss function is not novel, I believe its application to this new domain and its performance merit acceptance.

Formatting Issues

n/a

Author Response

We greatly appreciate the reviewers' hard work and thank you for your valuable comments. We address each concern below.


Q: Importance of data augmentation is unclear.

A: We apologize for the confusion regarding data augmentation in the preference data. Each dataset item is a triplet $(\mathrm{y}, \pmb{x}^+_{\mathrm{y}}, \pmb{x}^-_{\mathrm{y}})$, where $\mathrm{y}$ denotes the prompt, and $\pmb{x}^+_{\mathrm{y}}$ and $\pmb{x}^-_{\mathrm{y}}$ represent the positive and negative images corresponding to the prompt $\mathrm{y}$, respectively.

In this paper, we propose replacing the alignment loss (DPO loss) with a classification loss (ABC loss). Since the classification loss requires a label for each image, we need to transform the triplet $(\mathrm{y}, \pmb{x}^+_{\mathrm{y}}, \pmb{x}^-_{\mathrm{y}})$, where the two images share the same label $\mathrm{y}$, into a form where each image has its own label. Specifically, we define $\mathrm{y}^+$ as the original prompt $\mathrm{y}$ and construct $\mathrm{y}^-$ by appending “The image that aligns less with human preferences” to $\mathrm{y}$. This reformulation transforms each preference tuple $(\mathrm{y}, \pmb{x}^+_{\mathrm{y}}, \pmb{x}^-_{\mathrm{y}})$ into two supervised examples: $(\mathrm{y}^+, \pmb{x}^+_{\mathrm{y}})$ and $(\mathrm{y}^-, \pmb{x}^-_{\mathrm{y}})$. This is clarified in lines 173 to 176 of the paper.
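To make the reformulation above concrete, here is a minimal sketch of the relabeling step in Python. The function and field names (`prompt`, `preferred`, `rejected`) and the toy file paths are illustrative assumptions, not the authors' actual data schema or code; only the suffix string comes from the description above.

```python
# Minimal sketch of the triplet-to-classification relabeling described above.
# Field names and file paths are hypothetical; only the suffix string comes
# from the paper's description.

NEG_SUFFIX = " The image that aligns less with human preferences"

def relabel_preference_triplet(triplet):
    """Turn one preference triplet (y, x+, x-) into two labeled examples."""
    y = triplet["prompt"]
    pos_example = {"image": triplet["preferred"], "label_prompt": y}               # (y+, x+)
    neg_example = {"image": triplet["rejected"],  "label_prompt": y + NEG_SUFFIX}  # (y-, x-)
    return [pos_example, neg_example]

# Toy usage:
triplet = {"prompt": "a red cube on a blue sphere",
           "preferred": "img_0.png", "rejected": "img_1.png"}
print(relabel_preference_triplet(triplet))
```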


Q: Real-world preference data is quite noisy and it's unclear why a formulation dependent on such strong assumptions is able to outperform other methods like Diffusion-KTO. Additionally, the data augmentation strategy is questionable as even though an image is less preferred in a single triplet it may generally be more preferred on aggregate. It is unclear why this strategy works with noisy real-world data.

A: Yes, this is a great question. Preference data is inherently noisy, making it difficult to establish a clear principle for determining whether one image is preferred over another. Current diffusion alignment methods do not take this issue into account. To address this problem, we believe it is necessary to build a new dataset that provides preference comparison results across different dimensions. However, current alignment methods are typically trained on datasets like Pick-a-Pic, which assumes that one image is better than another with respect to the given prompt.

We do not directly address this issue in the current paper, but we present a promising approach to tackle it. Specifically, we show that the alignment task can be transformed into a classification problem, and classification in noisy environments has been extensively studied. It may therefore be possible to adapt existing methods from noisy classification to address the noisy alignment problem.

As for the noise-resistant properties of our method, they are easy to interpret. The well-known ImageNet classification dataset contains incorrect labels, yet research has shown that simply using cross-entropy loss to train the network still yields state-of-the-art performance, without the need for special handling of mislabeled data. We attribute this noise-resistant property to the classification loss. In this paper, we transform the preference data into a classification form. As a result, even if the data is noisy, the network retains the ability to resist the effects of noise.


Q: Statistical significance of results. To fully validate the performance of ABC, statistical significance of these results should be provided, e.g. via confidence intervals.

A: Thank you for raising this important point. We agree that providing statistical significance is important for validating the reported win rates. While our experiments followed the standard protocols used in prior work, we now include win rates along with variance intervals.

Specifically, we compute these intervals by discarding the top 5% of deviations from the mean, resulting in a range that captures 95% of the scores. This provides a robust estimate of variability and offers a more informative view of our method’s performance.
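For illustration, a minimal sketch of how such a win rate and trimmed interval could be computed from per-prompt reward scores; the bootstrap step and all variable names are our own assumptions based on the description above, not the authors' evaluation script.

```python
# Hypothetical sketch of a win rate with a trimmed variability interval,
# based on our reading of the description above; not the authors' script.
import numpy as np

def win_rate_with_interval(scores_ours, scores_base, n_boot=1000, trim=0.05, seed=0):
    rng = np.random.default_rng(seed)
    wins = (np.asarray(scores_ours) > np.asarray(scores_base)).astype(float)
    # Bootstrap resampling of the per-prompt win indicators.
    idx = rng.integers(0, len(wins), size=(n_boot, len(wins)))
    boot_rates = wins[idx].mean(axis=1)
    mean = boot_rates.mean()
    dev = np.abs(boot_rates - mean)
    keep = dev <= np.quantile(dev, 1.0 - trim)   # drop the largest 5% of deviations
    return 100 * mean, 100 * dev[keep].max()     # mean %, half-width %

# Example with random reward scores:
rate, half_width = win_rate_with_interval(np.random.rand(500), np.random.rand(500))
print(f"{rate:.2f} ± {half_width:.2f}%")
```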

Table 1. Win rate on PartiPrompts (P) and HPS (H) benchmarks for SD1.5 and SDXL-based models.

| Model | PickScore (P) | HPS (P) | Aesth. (P) | CLIP (P) | PickScore (H) | HPS (H) | Aesth. (H) | CLIP (H) |
|---|---|---|---|---|---|---|---|---|
| SD1.5-Base | 60.02 ± 2.40% | 81.51 ± 2.21% | 74.27 ± 2.14% | 59.72 ± 2.45% | 74.83 ± 3.10% | 85.75 ± 2.87% | 68.84 ± 3.31% | 59.65 ± 3.53% |
| SD1.5-DPOK | 57.15 ± 2.45% | 68.19 ± 2.21% | 62.51 ± 2.42% | 55.28 ± 2.45% | 52.17 ± 3.53% | 68.16 ± 3.46% | 65.24 ± 3.49% | 60.95 ± 3.56% |
| SD1.5-D3PO | 58.42 ± 2.43% | 72.75 ± 2.30% | 65.72 ± 2.40% | 53.61 ± 2.45% | 52.33 ± 3.52% | 73.25 ± 3.46% | 63.89 ± 3.51% | 54.14 ± 3.62% |
| SD1.5-DPO | 55.85 ± 2.44% | 73.02 ± 2.38% | 64.90 ± 2.34% | 44.97 ± 2.43% | 53.46 ± 3.51% | 71.50 ± 3.46% | 64.19 ± 3.56% | 52.06 ± 3.57% |
| SD1.5-SPO | 51.16 ± 2.41% | 61.59 ± 2.41% | 47.60 ± 2.45% | 60.02 ± 2.36% | 45.35 ± 3.57% | 54.99 ± 3.53% | 38.08 ± 3.47% | 64.83 ± 3.13% |
| SD1.5-KTO | 57.77 ± 2.41% | 44.72 ± 2.38% | 53.90 ± 2.44% | 47.22 ± 2.45% | 52.28 ± 3.52% | 42.88 ± 3.48% | 52.86 ± 3.57% | 53.93 ± 3.55% |
| SDXL-Base | 74.38 ± 1.84% | 79.26 ± 1.97% | 80.20 ± 1.56% | 52.46 ± 2.04% | 79.35 ± 2.23% | 70.17 ± 2.58% | 72.28 ± 2.86% | 60.38 ± 3.26% |
| SDXL-DPO | 73.22 ± 2.18% | 72.50 ± 2.33% | 68.25 ± 1.39% | 50.51 ± 1.88% | 77.26 ± 3.06% | 69.54 ± 3.42% | 70.19 ± 2.39% | 57.06 ± 3.07% |
| SDXL-SPO | 52.49 ± 2.34% | 40.31 ± 2.44% | 59.93 ± 2.41% | 55.53 ± 2.44% | 51.16 ± 3.34% | 52.41 ± 3.57% | 46.78 ± 3.53% | 59.87 ± 3.58% |
| SDXL-MAPO | 65.35 ± 1.81% | 81.17 ± 2.07% | 72.10 ± 1.82% | 46.97 ± 2.06% | 68.55 ± 2.37% | 64.89 ± 3.11% | 68.18 ± 2.96% | 51.14 ± 3.19% |

Q: Lack of user study details. The details of the user study are largely omitted in the paper, outside of 2 lines in the checklist. Was this a blind user study? How many annotators were assigned per comparison?

A: Thank you for your question. We conducted a user study to compare the proposed ABC method with several baseline approaches. Specifically, we randomly sampled 100 prompts from the PartiPrompts dataset and another 100 prompts from the HPSv2 benchmark. For each prompt, we generated five images using five different methods.

Participants were shown five images per prompt (one from each method) and asked to answer three questions:

  1. Which image is your overall preferred choice?
  2. Which image is more visually attractive?
  3. Which image better matches the text description?

To minimize position bias, the order of images was randomized for each prompt. Each method’s final score was computed as a weighted sum of its win rates under the three criteria, with weights of 30% for general preference, 30% for visual appeal, and 40% for prompt alignment.
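As a concrete illustration of this scoring rule (the function name and example numbers are hypothetical):

```python
# Weighted user-study score as described above: 30% overall preference,
# 30% visual appeal, 40% prompt alignment. Example numbers are hypothetical.
def user_study_score(win_overall, win_visual, win_alignment):
    return 0.3 * win_overall + 0.3 * win_visual + 0.4 * win_alignment

print(user_study_score(0.62, 0.58, 0.66))  # -> 0.624
```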

The study was conducted as a blind evaluation. Annotators were not informed about which method generated each image. We recruited participants from our research group, comprising approximately 100 students, and collected a total of 82 valid responses.

We hope this clarification provides a more complete picture of our user study design and evaluation protocol.


Q: Can the authors report the average score (within some confidence interval) to help quantify the improvement of ABC over other methods?

A: We thank the reviewer for the valuable suggestion. In addition to win-rate comparisons, we now report the absolute scores under the same setting as Table 1.

Table 2. Absolute scores on the PartiPrompts (P) and HPS (H) benchmarks for SD1.5 and SDXL-based models.

| Model | PickScore (P) | HPS (P) | Aesth. (P) | CLIP (P) | PickScore (H) | HPS (H) | Aesth. (H) | CLIP (H) |
|---|---|---|---|---|---|---|---|---|
| SD1.5-Base | 21.25±2.02 | 26.98±2.85 | 5.29±1.17 | 29.57±9.71 | 21.17±2.45 | 27.61±3.10 | 5.45±1.01 | 35.73±7.38 |
| SD1.5-DPOK | 21.57±2.21 | 27.21±3.03 | 5.59±1.21 | 30.01±9.93 | 21.80±2.42 | 28.15±3.17 | 5.60±1.04 | 36.75±8.12 |
| SD1.5-D3PO | 21.41±2.03 | 27.08±2.97 | 5.42±1.13 | 29.87±8.66 | 21.76±2.58 | 28.20±3.23 | 5.54±1.07 | 36.71±7.87 |
| SD1.5-DPO | 21.49±2.19 | 27.16±3.13 | 5.36±1.08 | 29.80±9.46 | 21.71±2.41 | 28.23±3.35 | 5.59±0.99 | 36.66±7.95 |
| SD1.5-SPO | 21.53±2.33 | 27.33±3.74 | 5.89±1.04 | 28.13±9.23 | 21.99±2.76 | 28.53±3.69 | 5.96±1.10 | 33.14±9.15 |
| SD1.5-KTO | 21.46±1.96 | 27.70±3.21 | 5.62±0.99 | 30.78±8.09 | 21.79±2.53 | 28.95±3.12 | 5.62±0.96 | 37.01±8.01 |
| SD1.5-ABC | 21.85±2.04 | 27.97±2.85 | 5.93±1.01 | 31.07±8.14 | 21.97±2.45 | 28.86±3.02 | 5.72±0.87 | 37.18±7.72 |
| SDXL-Base | 22.76±2.45 | 28.49±3.59 | 5.86±1.05 | 35.76±9.66 | 23.26±2.50 | 29.38±3.59 | 6.08±1.03 | 37.24±6.74 |
| SDXL-DPO | 22.94±2.37 | 28.93±3.52 | 6.01±0.98 | 36.01±8.73 | 23.59±2.65 | 29.86±3.40 | 6.14±1.00 | 38.34±5.98 |
| SDXL-SPO | 23.56±2.64 | 29.12±3.52 | 6.26±0.92 | 33.82±9.84 | 23.76±2.70 | 30.30±3.18 | 6.48±0.86 | 37.62±7.14 |
| SDXL-MAPO | 22.82±2.40 | 28.62±3.62 | 5.98±1.04 | 36.58±9.39 | 23.60±2.53 | 29.92±3.52 | 6.19±0.92 | 38.61±7.17 |
| SDXL-ABC | 23.79±2.27 | 29.42±3.29 | 6.35±0.89 | 36.81±8.51 | 24.39±2.38 | 30.67±3.08 | 6.54±0.86 | 38.97±6.02 |
Comment

I would like to thank the authors for their work and for answering all of my questions.

Official Review
Rating: 4

This paper proposes a novel loss function for aligning text-to-image diffusion models with human preferences. The authors hypothesize that conventional DPO-type loss functions are limited due to their reliance on a reference model. Since the reference model itself is not aligned with human preferences, comparisons against it may lead to sub-optimal solutions. To address this, the authors introduce a new loss function called ABC (Alignment by Classification) loss, which does not require a reference model. ABC loss resembles a contrastive loss that directly compares positive and negative pairs. The authors first reformulate DPO with an ideal reference model as a classification problem within the diffusion framework and argue that alignment performance depends on the model’s discriminative capability. Based on this insight, they design the ABC loss. Experimental results demonstrate that the proposed approach achieves better alignment with human preferences compared to existing loss functions.

Strengths and Weaknesses

Pros

  1. The paper proposes a novel loss function for aligning text-to-image diffusion models with human preferences, and it outperforms conventional loss functions.

  2. The semi-supervised learning technique introduced in Section 4.2 is interesting and appears effective.

  3. The authors provide various qualitative examples and compare their method with recent approaches.

Cons

  1. The paper is not easy to read, particularly the theorem sections. It would be helpful if the authors could provide informal interpretations or explanations in simpler terms to improve accessibility.

  2. Due to the large number of notations, the paper is difficult to follow. Including a table summarizing all notations and their meanings would greatly enhance readability.

  3. Additionally, comparisons of training time and analyses of the effect of dataset size are important. More extensive ablation studies exploring different aspects could further strengthen the paper.

Questions

please check the weaknesses

Limitations

Yes

Final Justification

The authors have adequately addressed my concerns, and I will maintain my positive score. However, my confidence in the evaluation remains low.

Formatting Issues

No

Author Response

We greatly appreciate the reviewers' hard work and thank you for your valuable comments. We address each concern below.


Q: The paper is not easy to read, particularly the theorem sections. It would be helpful if the authors could provide informal interpretations or explanations in simpler terms to improve accessibility.

A: We apologize that the theorem sections may have made the paper difficult to follow. We briefly interpret them here. Theorem 1 proves that the AM-Softmax loss is upper bounded by the Diffusion-DPO loss. In other words, minimizing the Diffusion-DPO loss for better alignment will also reduce the AM-Softmax loss, leading to improved classification performance. Simply put, better alignment leads to better classification. Conversely, Theorem 2 shows that, under certain conditions, improved classification leads to better alignment. Together, these two theorems reveal a strong connection between classification and alignment, forming the theoretical foundation of our approach, which replaces the DPO loss with the ABC loss for alignment tasks. We first establish a connection between alignment and classification. This idea is conceptually similar to the ICLR 2025 paper "On a Connection Between Imitation Learning and RLHF," which draws a connection between Reinforcement Learning from Human Feedback (RLHF) and Imitation Learning.
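As background for the Theorem 1 summary above, the standard AM-Softmax loss (Wang et al., 2018) has the form below; whether the paper uses exactly this scale $s$ and margin $m$ parameterization should be checked against its own definitions.

```latex
% Standard AM-Softmax loss (Wang et al., 2018), shown as background for the
% Theorem 1 discussion; the paper's exact parameterization may differ.
\mathcal{L}_{\mathrm{AM}}
= -\frac{1}{n}\sum_{i=1}^{n}
  \log \frac{e^{\,s\,(\cos\theta_{y_i} - m)}}
            {e^{\,s\,(\cos\theta_{y_i} - m)} + \sum_{j \neq y_i} e^{\,s\,\cos\theta_{j}}}.
```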


Q: Due to the large number of notations, the paper is difficult to follow. Including a table summarizing all notations and their meanings would greatly enhance readability.

A: We apologize for the oversight in summarizing the notations. Below is a table that summarizes all the notations used in the paper:

| Symbol | Description |
|---|---|
| $x_0$ | Clean image sampled from the real data distribution $q(x_0)$. |
| $x_y^{\pm}$ | Clean image $x_0$ aligned ($+$) or misaligned ($-$) with prompt $y$ in preference pairs. |
| $x^{\pm}_{t;y}$ | Noisy version of $x_y^{\pm}$ at timestep $t$. |
| $x_{y^{\pm}}$ | The same image $x$ conditioned on aligned ($+$) and misaligned ($-$) prompts $y^{\pm}$. |
| $\epsilon$ | Ground-truth noise sampled from $\mathcal{N}(0, \mathbf{I})$, added during forward diffusion. |
| $\epsilon_\theta(x,t)$ | Noise predicted by the model $\epsilon_\theta$ at timestep $t$, given noisy input $x$. |
| $\epsilon_{\mathrm{ali}}(x,y,t)$ | Noise prediction from the ideal alignment model for input image $x$, prompt $y$, and timestep $t$. |
| $\epsilon_{\mathrm{ref}}(x,y,t)$ | Noise predicted by the reference diffusion model for input image $x$, prompt $y$, and timestep $t$. |
| $\epsilon_{\mathrm{opt}}(x,y,t)$ | Optimal noise prediction as a weighted average over clean images, obtained via marginalization of $p(x \mid y)$. |
| $s_{\theta}(x,y)$ | Score predicted by the training model for image sample $x$ and prompt $y$. |
| $s_{\mathrm{ref}}(x,y)$ | Score predicted by the reference model for image sample $x$ and prompt $y$. |
| $\delta$ | Margin used to enforce separation between aligned ($+$) and misaligned ($-$) scores. |
| $\Delta^{\pm}_{y}$ | Margin offsets applied to the expected scores of aligned ($+$) and misaligned ($-$) images with prompt $y$. |
| $O^{\pm}_y$ | Ideal score targets for aligned ($+$) and misaligned ($-$) images with prompt $y$. |
| $\eta^{\pm}_{y}$ | Scaling factors that modulate the contributions of aligned ($+$) and misaligned ($-$) images with prompt $y$. |
| $\iota(x^-_y, x^+_y, y)$ | Score difference in the ABC loss between aligned ($+$) and misaligned ($-$) images with prompt $y$. |

PS:

  1. $x_0$ can refer to any clean image, while $x_y^+$ and $x_y^-$ refer specifically to images aligned/misaligned with a prompt $y$.
  2. For simplicity, the timestep $t$ is sometimes omitted from the notation when it is clear from context, e.g., $\epsilon_{\mathrm{ali}}(x,y)$.
  3. For simplicity, the prompt $y$ is sometimes omitted from the notation, e.g., $\iota(x^-, x^+, y)$.

Q: Comparisons of training time and analyses of the effect of dataset size are important. More extensive ablation studies exploring different aspects could further strengthen the paper.

A: Thank you for the suggestion. We believe it will significantly improve our paper. Due to the page limitations of NeurIPS, we initially restricted the training dataset to Pick-a-Pic v2, published in "Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation," which is commonly adopted by current diffusion alignment methods. Additionally, given the current research focus on designing various alignment losses, the training overhead for these methods is minimal. As a result, the training time for different alignment methods remains nearly the same for any given stable diffusion model. This helps explain why we did not include an ablation study comparing training times or analyzing the effect of dataset size.


We sincerely apologize for the inconvenience. Following is the supplementary response for Reviewer RW4E, provided here due to the rebuttal length limitation.


Q: How would your approach perform with more natural prompts than the ones you used in your evaluation? Would it be possible to consider using T2I-CompBench++ for your evaluation?

A: Following the official setup, we tested on 8,000 prompts in total, covering attribute binding, relationships, numeracy, and complex compositions. We report the results in the tables below. Higher scores indicate better alignment with the intended composition.

| Model | Color (B-VQA) | Shape (B-VQA) | Texture (B-VQA) | Numeracy (UniDet) | 2D-Spatial (UniDet) | 3D-Spatial (UniDet) | Non-Spatial (CLIP) | Complex (3-in-1) |
|---|---|---|---|---|---|---|---|---|
| SD1.5-Base | 0.3811 | 0.3395 | 0.4192 | 0.4436 | 0.1460 | 0.2912 | 0.3092 | 0.3002 |
| SD1.5-DPO | 0.3943 | 0.3440 | 0.4374 | 0.4523 | 0.1627 | 0.3090 | 0.3091 | 0.3032 |
| SD1.5-SPO | 0.4030 | 0.4001 | 0.4152 | 0.4461 | 0.1471 | 0.2958 | 0.3010 | 0.3131 |
| SD1.5-KTO | 0.4645 | 0.3815 | 0.4730 | 0.4618 | 0.1919 | 0.3318 | 0.3104 | 0.3514 |
| SD1.5-ABC | 0.4647 | 0.4005 | 0.4751 | 0.4570 | 0.1895 | 0.3324 | 0.3106 | 0.3587 |
| SDXL-Base | 0.5708 | 0.4880 | 0.5600 | 0.5591 | 0.1949 | 0.3551 | 0.3065 | 0.4383 |
| SDXL-DPO | 0.6586 | 0.5358 | 0.6521 | 0.5300 | 0.2376 | 0.3668 | 0.3116 | 0.4923 |
| SDXL-SPO | 0.6431 | 0.5200 | 0.6496 | 0.5765 | 0.2298 | 0.3513 | 0.3031 | 0.4424 |
| SDXL-MAPO | 0.6682 | 0.5104 | 0.5650 | 0.5189 | 0.1700 | 0.3507 | 0.3136 | 0.4401 |
| SDXL-ABC | 0.6708 | 0.5450 | 0.6866 | 0.5623 | 0.2401 | 0.3697 | 0.3154 | 0.5051 |
Comment

The authors have addressed my concerns, so I will maintain my score. However, I am unsure if it is entirely appropriate for the authors to use their rebuttal to address another reviewer’s comments in response to mine.

Comment

Thank you for your feedback and for maintaining your score.

Final Decision

This paper proposes a genuinely new approach to the problem of T2I diffusion model alignment (an important problem with wide applications). It recasts the problem as a certain kind of classification problem, using ideas from the Circle Loss. All reviewers appreciated the novelty and interestingness of this idea. A significant advantage of the method is that a reference model is NOT needed during training, which saves a lot of memory, a very significant factor for frontier diffusion models. Objections mainly focused on the robustness of the evaluations, many of which were effectively addressed by the authors with new experiments. Overall, a solid work that is likely to influence new ideas and directions.