PaperHub
Overall rating: 4.8 / 10 (withdrawn; 4 reviewers)
Individual ratings: 6, 5, 3, 5 (min 3, max 6, std 1.1)
Confidence: 2.8 · Correctness: 2.8 · Contribution: 2.3 · Presentation: 2.8
ICLR 2025

Dual Caption Preference Optimization for Diffusion Models

OpenReview · PDF
Submitted: 2024-09-24 · Updated: 2024-12-16

Abstract

Keywords
Preference Optimization, Diffusion Models, Alignment

Reviews and Discussion

Review (Rating: 6)

This paper introduces Dual Caption Preference Optimization (DCPO) to enhance text-to-image diffusion models by aligning them with human preferences. Traditional methods face issues like overlapping distributions and irrelevant prompts. DCPO addresses these using two distinct captions for each image, mitigating conflicts in preference data. The authors introduce the Pick-Double Caption dataset to support this approach. They apply three strategies—captioning, perturbation, and hybrid methods—to generate unique captions. Experiments show DCPO improves image quality and prompt relevance. DCPO outperforms prior models on multiple metrics, validating its effectiveness.

Strengths

As a reviewer from a broader field, I am not very familiar with the specific domain of this paper. Therefore, I am reviewing this paper from a generalist’s perspective. The strengths of this paper are:

  1. It provides sufficient theoretical support for the motivation, which aligns well with the characteristics of ICLR papers.
  2. The issues raised seem quite reasonable.
  3. Extensive quantitative and qualitative experiments support the arguments presented.

Weaknesses

However, I still have a few concerns:

  1. The issues of conflict distribution and irrelevant prompts seem like two aspects of the same problem—both involve a single prompt (C) corresponding to two different images, which can lead to unstable optimization. Therefore, I think they could be consolidated into a single issue.
  2. When comparing generated images, the improvements achieved by the proposed method could be highlighted more clearly; otherwise, it’s often not immediately obvious, as in Figure 1.
  3. In fact, the explanations of conflict distribution and irrelevant prompts in the abstract and introduction are quite obscure and difficult to understand. I had to reread these sections several times, only gaining clarity after reading the methods section. This part may need reorganization.

Questions

The issues of conflict distribution and irrelevant prompts seem like two aspects of the same problem—both involve a single prompt (C) corresponding to two different images, which can lead to unstable optimization. Therefore, I think they could be consolidated into a single issue.

Comment

We greatly value the reviewer’s insightful feedback on our paper and look forward to engaging in a productive discussion, as there is still significant time left in the discussion period. Below, we offer a detailed response to your comments:


W1: The issues of conflict distribution and irrelevant prompts seem like two aspects of the same problem—both involve a single prompt (C) corresponding to two different images, which can lead to unstable optimization. Therefore, I think they could be consolidated into a single issue.


Reply to W1: We appreciate the reviewer’s observation, but conflict distribution and irrelevant prompts address distinct challenges.

Conflict Distribution refers to the similarity between preferred ($x^w$) and less preferred ($x^l$) images, quantified using the correlation between images ($x$) and captions ($z$) via CLIPscore. High similarity can make it harder to distinguish preferences during optimization.

Irrelevant Prompts highlight misalignment between the prompt ($c$) and the less preferred image ($x^l$). Prompts often include details relevant to $x^w$ but not $x^l$, which hinders the model's ability to learn effectively.

For example, even when $x^w$ and $x^l$ have similar CLIPscores, a prompt $c$ may still lack relevance for $x^l$, introducing additional optimization challenges. These two issues complement each other but do not fully overlap, as similarity and prompt relevance impact performance differently.
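
To make the CLIPscore-based notion of conflict concrete, here is a minimal, illustrative sketch; it is not the authors' exact pipeline, and the file names and example prompt are hypothetical:

```python
# Illustrative sketch: measuring prompt-image correlation with CLIP, as used to
# characterize the "conflict distribution". Not the authors' exact pipeline;
# file names and the example prompt below are hypothetical.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, text: str) -> float:
    """Cosine similarity between CLIP image and text embeddings, scaled by 100."""
    inputs = processor(text=[text], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return 100.0 * (img @ txt.T).item()

prompt = "a corgi wearing a red scarf in the snow"              # hypothetical prompt c
score_w = clip_score(Image.open("preferred.png"), prompt)       # CLIPScore(x^w, c)
score_l = clip_score(Image.open("less_preferred.png"), prompt)  # CLIPScore(x^l, c)
# A small gap (score_w - score_l) is what the paper calls conflict distribution.
print(f"gap = {score_w - score_l:.2f}")
```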


W2: When comparing generated images, the improvements achieved by the proposed method could be highlighted more clearly; otherwise, it’s often not immediately obvious, as in Figure 1.


Reply to W2: Thank you for this comment. We have updated Figure 1 in the revised version. Additionally, the reviewer can find more qualitative examples across different benchmarks in Appendix F.


W3: In fact, the explanations of conflict distribution and irrelevant prompts in the abstract and introduction are quite obscure and difficult to understand. I had to reread these sections several times, only gaining clarity after reading the methods section. This part may need reorganization.


Reply to W3: Thank you for your suggestion. We are in the process of re-organizing this section and will add a note here when it is completed on our end.

Comment

Dear Reviewer Y9en,

As the deadline is nearing, we wanted to gently follow up on our recent submission. Your feedback is highly valuable to us, and we would appreciate any updates or further guidance you might have regarding our revisions and responses.

Thank you for your time and consideration.

Comment

W3: In fact, the explanations of conflict distribution and irrelevant prompts in the abstract and introduction are quite obscure and difficult to understand. I had to reread these sections several times, only gaining clarity after reading the methods section. This part may need reorganization.


Reply to W3: Following your suggestions, we have added two additional lines to clarify the explanation of the irrelevant prompts issue in the abstract (L16–L18) and introduction (L101–L107).

We hope our explanations have clarified any unclear aspects of our proposed method. Your questions and feedback have been invaluable, and we sincerely appreciate the opportunity to address your concerns. If you feel that we have adequately resolved your queries and you are satisfied with the discussion, we would be grateful if you could consider revisiting the rating of our work.

Comment

Thank you for the authors' response. I agree with the other reviewers' opinions. Although the authors have explained it, the current motivation remains quite unclear. Including two more intuitive examples in Figure 1 would be helpful. I suggest the authors refine the motivation further.

Comment

We sincerely appreciate your time and thoughtful review!

In our general comments, we provided both theoretical and experimental motivation for our work, along with a series of proofs demonstrating the effectiveness of DCPO compared to diffusion-DPO. As the rebuttal period approaches its conclusion, we would like to respectfully ask if you have any remaining concerns regarding the motivation behind our work, the effectiveness of our method (including the additional results provided), or any other aspects of our submission. We have been more than happy to address any remaining or new concerns during the discussion period.

If our responses have adequately addressed your concerns, we kindly ask you to consider reevaluating our work in light of the updated information.

Thank you once again for your valuable feedback!

Review (Rating: 5)

The paper presents a preference optimization technique called Dual Caption Preference Optimization (DCPO). This method aims at improving text-to-image diffusion models. DCPO tackles issues inherent in current preference datasets, namely conflict distribution and irrelevant prompts, by introducing separate captions for preferred and less preferred images. This dual-caption approach is implemented through three methods: captioning, perturbation, and a hybrid method, all aimed at enhancing the clarity of distinctions between preferred and non-preferred images. Experimental results demonstrate that DCPO outperforms existing models across several benchmarks and metrics, including Stable Diffusion 2.1 and Diffusion-DPO.

Strengths

  1. The dual caption framework is reasonable. DCPO introduces a dual-caption system that effectively addresses the problem of overlapping distributions in existing datasets.
  2. This paper achieves better performance. Demonstrated improvements across multiple metrics (e.g., Pickscore, CLIPscore) and benchmarks (e.g., GenEval) show that DCPO enhances image quality and relevance significantly.
  3. The experimental results are analyzed in detail. The paper includes extensive quantitative and qualitative analysis, supporting the effectiveness of DCPO with various baselines and ablation studies.

Weaknesses

  1. The proposed method depends on the caption quality. The quality of generated captions significantly affects performance, and challenges remain in creating effective captions for less preferred images without straying out-of-distribution.
  2. While DCPO demonstrates quantitative improvements across several metrics, the qualitative results (e.g., Figure 1) indicate that the visual distinctions between images generated by DCPO and baseline methods are not significant. This subtle difference may limit the perceived impact of DCPO in practical applications.
  3. The DCPO has limited generalizability compared to real-world large-scale datasets. Although leveraging preferred and non-preferred images is a novel approach for enhancing diffusion models, high-quality, large-scale datasets from real-world settings often provide stronger improvements in model performance. This reliance on real-world data diminishes the relative advantage of DCPO, potentially limiting the distinctiveness of its contributions in scenarios where comprehensive datasets are available.
  4. The LAION-2B and MSCOCO datasets are widely regarded as benchmarks for image generation tasks, yet they are not discussed or evaluated within this study. The absence of experiments or comparisons involving LAION-2B raises questions about DCPO’s general applicability.

Questions

Please address my concerns above.

Comment

W4: The LAION-2B and MSCOCO datasets are widely regarded as benchmarks for image generation tasks, yet they are not discussed or evaluated within this study. The absence of experiments or comparisons involving LAION-2B raises questions about DCPO’s general applicability.


Reply to W4: LAION and MSCOCO are not equipped with preference data by design, and there is no publicly annotated version of them. Hence, we are unable to perform experiments on them. Our practice is well aligned with existing works such as [1], [2]. Still, for image generation purposes, we evaluated DCPO and Diffusion-DPO on the FID metric to provide a broader perspective on their performance, and DCPO consistently outperformed Diffusion-DPO.

[1] Diffusion-DPO: https://arxiv.org/abs/2311.12908

[2] MaPO: https://arxiv.org/abs/2406.06424

Comment

We sincerely thank the reviewer for their insightful feedback on our paper. We look forward to engaging in a constructive discussion, as there is still plenty of time remaining in the discussion period. Below, we present a detailed response to your comments:


W1: The proposed method depends on the caption quality. The quality of generated captions significantly affects performance, and challenges remain in creating effective captions for less preferred images without straying out-of-distribution.


Reply to W1: One of the key experiments in our paper investigates how the quality of generated captions influences the alignment performance of diffusion models. We first define high-quality captions as those with a higher correlation between image $x$ and caption $z$. In Table 5 in Appendix B, we reported the average CLIP score as a standard correlation metric.

The results in Table 5 in Appendix B indicate that captions generated by LLaVA have better quality, evidenced by a higher average CLIP score compared to Emu2. Specifically, the difference in correlation scores between preferred and less preferred captions is approximately 4 for LLaVA and 3 for Emu2, both of which are larger than the gap for the original prompt (~2, as explained in Lines 340–357). However, despite the higher caption quality of LLaVA, the performance of DCPO-c-Emu2 remains comparable to DCPO-c-LLaVA, as shown in Tables 1 and 2.

This suggests that while caption quality plays a role, it is not the primary determinant of performance. Instead, the critical factor is the difference in correlation between captions and the preferred and less preferred images. Evidence for this conclusion arises from the superior performance of DCPO-h, which achieves the largest correlation distance among DCPO-c and DCPO-p techniques, as shown in Figure 4. This supports the hypothesis that maximizing the correlation difference, rather than caption quality alone, is one of the keys to optimizing alignment performance.


W2: While DCPO demonstrates quantitative improvements across several metrics, the qualitative results (e.g., Figure 1) indicate that the visual distinctions between images generated by DCPO and baseline methods are not significant. This subtle difference may limit the perceived impact of DCPO in practical applications.


Reply to W2: Thank you for this comment. We have updated Figure 1 in the revised version. Additionally, the reviewer can find more qualitative examples across different benchmarks in Appendix F.


W3: The DCPO has limited generalizability compared to real-world large-scale datasets. Although leveraging preferred and non-preferred images is a novel approach for enhancing diffusion models, high-quality, large-scale datasets from real-world settings often provide stronger improvements in model performance. This reliance on real-world data diminishes the relative advantage of DCPO, potentially limiting the distinctiveness of its contributions in scenarios where comprehensive datasets are available.


Reply to W3: To the best of our knowledge, real-world datasets do not have a preference structure. Preference data requires selecting preferred and less preferred images from multiple images generated for the same prompt based on human judgment. While real-world datasets like MSCOCO and LAION-2B are valuable, they are not suitable for preference optimization due to the lack of this structure.

Preference datasets are specifically used in post-training, a step following pre-training, to enhance the performance of pre-trained models, as outlined in Diffusion-DPO. For preference optimization, a dataset $D = \{c, x^w, x^l\}$ is required, where $x^w$ and $x^l$ represent the preferred and less preferred images for the same prompt $c$. This specialized data structure is essential for methods like DCPO to achieve their objectives.

To demonstrate DCPO's generalizability, we fine-tuned Stable Diffusion 2.1 using Diffusion-DPO and DCPO on another high-quality preference dataset - Rapidata Image Generation Preference Dataset (RIGPD) [1]. The table below shows that DCPO variants consistently outperform Diffusion-DPO across multiple benchmarks, including Geneval, Pickscore, HPSv2.1, ImageReward, and CLIPscore.

| Method (SD2.1) | Geneval (Overall) | Pickscore | HPSv2.1 | ImageReward | CLIPscore |
|---|---|---|---|---|---|
| Diffusion-DPO | 0.4813 | 20.34 | 25.10 | 55.4 | 26.84 |
| DCPO-c | 0.4867 | 20.44 | 25.43 | 55.7 | 26.86 |
| DCPO-h | 0.4978 | 20.42 | 25.10 | 55.6 | 26.91 |

These results highlight DCPO's versatility and superior performance across benchmarks.

[1] Rapidata Image Generation Preference Dataset (RIGPD): https://huggingface.co/datasets/Rapidata/700k_Human_Preference_Dataset_FLUX_SD3_MJ_DALLE3

Comment

Dear Reviewer CPj9,

As the deadline is nearing, we wanted to gently follow up on our recent submission. Your feedback is highly valuable to us, and we would appreciate any updates or further guidance you might have regarding our revisions and responses.

Thank you for your time and consideration.

Comment

Thank you for your response. Most of my concerns have been addressed. However, after reviewing all the reviewers' discussions, I still have concerns regarding the paper's motivation and novelty, so I will maintain my original score.

Comment

Thank you for your response. We kindly ask the reviewer to clarify the concerns regarding motivation and novelty and highlight any similar work. We believe our paper's novelty is its key contribution, and we are actively conducting a new experiment to address any ambiguity. We also hope the reviewer will consider the extensive efforts and experiments we have already undertaken to address these concerns.

Comment

We sincerely appreciate your time and thoughtful review!

In our general comments, we provided both theoretical and experimental motivation for our work, along with a series of proofs demonstrating the effectiveness of DCPO compared to diffusion-DPO. As the rebuttal period approaches its conclusion, we would like to respectfully ask if you have any remaining concerns regarding the motivation behind our work, the effectiveness of our method (including the additional results provided), or any other aspects of our submission. We have been more than happy to address any remaining or new concerns during the discussion period.

If our responses have adequately addressed your concerns, we kindly ask you to consider reevaluating our work in light of the updated information.

Thank you once again for your valuable feedback!

Review (Rating: 3)

This paper proposes an improved method for aligning text-to-image diffusion models using human-labeled preference datasets. Instead of using the original caption that was used to generate the preferred and less preferred image pair, the method proposes to generate new captions from the generated images or the original captions so as to increase/decrease the text-image alignment for the generated images, which makes the distribution difference between preferred and less preferred data larger than when using the original captions.

Experiments using the Diffusion-DPO method are conducted on various caption-image combinations, which show that adding perturbed captions for the less preferred image helps the fine-tuned model achieve better performance on automatic metrics, including per-image metrics such as HPSv2 as well as side-by-side metrics using GPT-4o as judge.

Strengths

The paper is well written, with the right amount of detail in both the main text and the appendix. The proposed method is clear and relatively straightforward to implement.

On a popular open-source diffusion model (SD 2.1), several experiments are done to ablate the design details of the proposed approach. The set of metrics used is comprehensive, including both single-sided evaluation such as HPSv2 and side-by-side evaluation such as the one using GPT-4o as judge.

Weaknesses

The motivation behind the proposed approach is not clear to me. For the conflict distribution challenge, when the distribution overlap becomes larger, the dataset poses a harder problem for the model to optimize, but it isn't necessarily an issue as long as the two distributions are not identical. When the diffusion models' quality gets better, the two distributions will inevitably become more and more similar, as both preferred and less preferred images from an optimized model will be closer to real human preference. So it's more the nature of the task itself, unless the task is defined differently.

From the description of L175-L180, the irrelevant prompt problem is hardly a problem either. It is inherently part of the objective in Eqn (1), where one way of minimizing Eqn (1) is to decrease $\log p_{\theta}(x_{0:T}^l \mid z^l)$, which makes the model less likely to generate the less preferred image. So to me this is a desired behavior instead of a problem.

By changing the captions, the authors turn a preferred/less-preferred pair into two separate samples. In this sense it is no longer the original DPO problem, yet there is no clear connection between the original DPO formulation and the new problem, e.g., is the new one an upper bound of the original, so that minimizing the new problem potentially minimizes the original one? Or why would solving the new problem necessarily give better results than the original DPO?

The change of captions made the problem closer to the KTO problem referenced in the paper, where text-image data are labeled by like and dislike binary labels. Please describe the connection and difference between the modified problem represented by the new data and the KTO problem formulation above.

It is great to conduct extensive experiments on SD 2.1, but the paper will be stronger if there are experiment results on other diffusion models, even if the experiments are not as complete as on SD 2.1.

Questions

Although the experiments suggest the proposed approach is better, it is unclear to me why this would be the case; any proofs or intuitions would help readers better understand it.

Several papers appear multiple times in the References section, please dedupe.

Comment

W3: From the description of L175-L180, the irrelevant prompt problem is hardly a problem either. It is inherently part of the objective in Eqn (1), where one way of minimizing Eqn (1) is to decrease $\log p_{\theta}(x_{0:T}^l \mid z^l)$, which makes the model less likely to generate the less preferred image. So to me this is a desired behavior instead of a problem.


Reply to W3: We observed that the prompt $c$ often does not serve as a meaningful caption for less preferred images (see Figure 2 in the main text and Figure 10 in Appendix C) because it may contain irrelevant information. We refer to this as the "irrelevant prompt" problem. The key question is: why is an irrelevant prompt problematic?

U-Net predicts the added noise for both preferred and less preferred images conditioned on a prompt, as shown in Equation 2. If the prompt is irrelevant for the less preferred images, U-Net’s ability to make accurate predictions becomes limited. This issue is supported by findings in the DALL-E 3 paper [1], which demonstrates that more relevant captions improve U-Net’s performance in predicting noise. Building on this, we generated relevant captions for the less preferred images using a captioning method that conditions on the original prompt.

During direct preference optimization, $\log P(x^w \mid c)$ increases while $\log P(x^l \mid c)$ decreases, where $x^w$ and $x^l$ represent the preferred and less preferred images generated for the same prompt $c$. Theoretically, if a prompt $c_1$ has a stronger correlation with both the preferred and less preferred images, the optimization process is more effective, resulting in better model performance. Intuitively, the diffusion model better understands why humans prefer image $x^w$, because the prompt $c_1$ contains more relevant information. Thus, prompts with high correlation to both $x^w$ and $x^l$ can influence $\log P(x \mid c)$ more effectively during optimization. The results in Table 4 provide strong evidence supporting this explanation.

Inspired by this behavior, we hypothesized that generating a more suitable caption for less preferred images, distinct from the original prompt, helps decrease the likelihood of less preferred images during optimization. This aligns with our findings and further validates the importance of addressing the irrelevant prompt issue.

[1] DALL-E 3: https://cdn.openai.com/papers/dall-e-3.pdf
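
As a complement to the captioning step described in the reply above, the following is a minimal sketch assuming a public LLaVA checkpoint; the prompt template, helper name, and file names are our own illustration, not the paper's exact setup:

```python
# Sketch of generating a caption z^l for the less preferred image, conditioned on
# the original prompt c. Checkpoint and prompt template are assumptions for
# illustration; the paper's exact captioning setup may differ.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

def caption_conditioned_on_prompt(image_path: str, original_prompt: str) -> str:
    """Describe the image while staying close to the wording of the original prompt."""
    question = (
        f"This image was generated for the prompt: '{original_prompt}'. "
        "Describe what the image actually shows, reusing the prompt's wording where accurate."
    )
    chat = f"USER: <image>\n{question} ASSISTANT:"
    inputs = processor(images=Image.open(image_path), text=chat, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=80, do_sample=False)
    return processor.decode(output[0], skip_special_tokens=True).split("ASSISTANT:")[-1].strip()

# z^l: a caption that is actually relevant to the less preferred image
z_l = caption_conditioned_on_prompt("less_preferred.png", "a corgi wearing a red scarf in the snow")
```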


W4: By changing the captions, the authors turn a preferred/less-preferred pair into two separate samples. In this sense it is no longer the original DPO problem, yet there is no clear connection between the original DPO formulation and the new problem, e.g., is the new one an upper bound of the original, so that minimizing the new problem potentially minimizes the original one? Or why would solving the new problem necessarily give better results than the original DPO?


Reply to W4: In the DCPO framework, we modify the original prompt by replacing it with a caption generated by a captioning model $Q_\phi$. This modification enables better alignment with the less preferred images while keeping DCPO in the original task space.

To establish the equivalence with the original Diffusion-DPO problem, we theoretically demonstrate that Diffusion-DPO provides a lower bound of the DCPO loss function. In the DCPO framework, the caption $z$ is generated by the captioning model $Q_\phi$ conditioned on the image $x$ and the original prompt $c$, i.e., $z \sim Q_\phi(z \mid x, c)$. Importantly, in scenarios where $Q_\phi$ generates a caption $z$ that closely matches the original prompt $c$, we effectively have $z \simeq c$. In this case, substituting $z$ for $c$ in the DCPO loss function produces the original Diffusion-DPO loss function (refer to Equation 9 in Appendix A).

This theoretical equivalence ensures that minimizing the DCPO loss function aligns with minimizing the Diffusion-DPO loss. Moreover, the modification introduced by $Q_\phi$ helps address challenges arising from irrelevant prompts in the original framework, which we have shown to improve alignment and performance on various benchmarks. Thus, solving the DCPO problem not only retains the original Diffusion-DPO framework's objectives but also provides additional flexibility for addressing inherent limitations, leading to better results.
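
As a reader aid, a schematic side-by-side of the two objectives is given below; this is our paraphrase (omitting the per-timestep weighting) of the standard Diffusion-DPO objective and of the dual-caption variant described above, not a verbatim copy of Equation 9:

$$\mathcal{L}_{\text{Diffusion-DPO}}(\theta) = -\,\mathbb{E}\Big[\log \sigma\big(-\beta\,(\Delta_\theta(x^w, c) - \Delta_\theta(x^l, c))\big)\Big], \qquad \mathcal{L}_{\text{DCPO}}(\theta) = -\,\mathbb{E}\Big[\log \sigma\big(-\beta\,(\Delta_\theta(x^w, z^w) - \Delta_\theta(x^l, z^l))\big)\Big],$$

$$\text{where}\quad \Delta_\theta(x, z) = \big\|\epsilon - \epsilon_\theta(x_t, t, z)\big\|_2^2 - \big\|\epsilon - \epsilon_{\text{ref}}(x_t, t, z)\big\|_2^2 .$$

Setting $z^w = z^l = c$ recovers the Diffusion-DPO form, which is exactly the equivalence argued above.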

Comment

We sincerely thank the reviewer for their thoughtful feedback and the time dedicated to evaluating our paper. We hope our responses below address the concerns raised and support a potential reconsideration of the score.


W1: The motivation behind the proposed approach is not clear to me. For the conflict distribution challenge, when the distribution overlap becomes larger, the dataset poses a harder problem for the model to optimize, but it isn't necessarily an issue as long as the two distributions are not identical.


Reply to W1: Suppose we aim to optimize a policy model $p_\theta$, such as a large language model, using preference optimization methods. For this, we require a dataset $D$ that includes both preferred ($x^w$) and less preferred ($x^l$) responses generated for the same prompt $c$. The standard method involves generating multiple responses for a single input, ranking these responses through human evaluation or automated judge models (e.g., GPT-4o), and selecting the best response as preferred and the least desirable as less preferred. This methodology is outlined in the OpenAI paper on Reinforcement Learning from Human Feedback (RLHF) [1].

Also, in commonly used preference datasets such as UltraFeedback-binarized [2][3] and UltraFeedback-PairRM [4][5], a similar approach is utilized with RLHF. However, we observed that existing vision-based preference datasets, such as Pick-a-Pic v2, demonstrate a high correlation between preferred and less preferred images, a phenomenon we refer to as "conflict distribution."

Our hypothesis is that increasing the divergence between the correlation distributions of preferred and less preferred images can enhance the performance of diffusion models optimized using direct preference optimization methods. By addressing this "conflict distribution," our approach aims to make the optimization process more effective and improve overall model performance.

[1] RLHF: https://arxiv.org/pdf/2203.02155

[2] UltraFeedback Binarized: https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized

[3] Zephyr: https://arxiv.org/abs/2310.16944

[4] UltraFeedback PairRM: https://huggingface.co/datasets/princeton-nlp/llama3-ultrafeedback

[5] SimPO: https://arxiv.org/abs/2405.14734


W2: When the diffusion models' quality gets better, the two distributions will inevitably become more and more similar, as both preferred and less preferred images from an optimized model will be closer to real human preference. So it's more the nature of the task itself, unless the task is defined differently.


Reply to W2: The goal of preference optimization is to train a diffusion model to distinguish between preferred and less preferred images generated for the same prompt, where the less preferred image exhibits some differences from the preferred image. Theoretically, if the preferred ($x^w$) and less preferred ($x^l$) images are too similar, the values of $\log P(x^w \mid c)$ and $\log P(x^l \mid c)$ will converge, leading to a loss value close to zero. This would hinder the model's ability to effectively learn the preferred distribution.

Moreover, during optimization, the diffusion model is designed to increase the likelihood of $x^w$ while decreasing the likelihood of $x^l$. However, when there is a high degree of similarity between preferred and less preferred images, the model may become confused, struggling to differentiate between the two distributions effectively.

To address the potential convergence of the preferred and less preferred distributions as models improve, we propose an alternative evaluation framework. Instead of assigning binary preference scores to image pairs (e.g., $I_g = 1$ and $I_b = 0$), we suggest adopting a soft-scoring system that captures varying degrees of alignment with human preference across multiple images. For example, in a set of four generated images, scores could indicate preference levels such as $\{I_1: 0.7,\ I_2: 0.5,\ I_3: 0.1,\ I_4: 0.6\}$, where $I_1$ represents the most preferred image and $I_3$ the least preferred.

Using the FLUX model, we tested this framework on 100 samples, generating $I_1$ and $I_2$ from different seeds. The average difference across 100 samples was 0.4, demonstrating that current models like FLUX still struggle to produce multiple well-aligned images from the same prompt, further supporting the relevance of our proposed framework.

Comment

I really appreciate the replies to my questions. I want to get a clear story on the motivation and intuition of this work; so far it's still confusing. Let's keep discussing. The arguments on W1 and W2 seem self-contradictory. In reply to W1, the authors claim they observed a high correlation between preferred and less-preferred images. In reply to W2, the authors observed that the advanced model FLUX struggles to produce multiple well-aligned images, and the average difference is 0.4. Does this suggest the image correlations are not high? Or is the correlation strong on worse models but weak on more recent, almost-SOTA models?

Comment

Thank you for your response.

In response to W1, we mention that "Our hypothesis is that increasing the divergence between the correlation distributions of preferred and less preferred images can enhance the performance of diffusion models optimized using direct preference optimization methods. By addressing this 'conflict distribution,' our approach aims to make the optimization process more effective and improve overall model performance." The distribution we refer to here is that of the CLIPScore, which we call the conflict distribution throughout our paper. To clarify, we refer to the distribution of the correlation measured by CLIPScore, which takes as input a text and an image and returns their alignment as measured by CLIP. We find that this correlation is high for both {preferred image, caption} and {less-preferred image, caption} in existing preference datasets, i.e., the CLIPScore of {preferred image, caption} and that of {less-preferred image, caption} are similar.

In response to W2, we argue that existing SOTA models, such as FLUX, are also unable to generate multiple images aligned with the prompt. This was to make the point that, despite advancements in T2I generation, there is still a substantial gap in generating well-aligned images (as indicated by the 0.4 difference in our results).

We hope this clarifies any confusion. Please let us know if you have more questions. Thank you.

Comment

W5: The change of captions made the problem closer to the KTO problem referenced in the paper, where text-image data are labeled with binary like/dislike labels. Please describe the connection and difference between the modified problem represented by the new data and the KTO problem formulation above.


Reply to W5: Assume a preference dataset $D = \{c, x^w, x^l\}$, where $x^w$ and $x^l$ represent the preferred and less preferred images for the prompt $c$. Diffusion-KTO [1] proposes optimizing a diffusion model using only a single preference label based on whether an image $x$ is suitable or unsuitable for a given prompt $c$. Thus, Diffusion-KTO utilizes a dataset $D = \{c, x\}$, where $x$ is a generated image corresponding to prompt $c$.

This hypothesis is fundamentally different from ours. While Diffusion-KTO focuses on binary preferences (like/dislike) for individual image-prompt pairs, our approach involves paired preferences. We do not claim that having two preferences is problematic; rather, we observed that using the same prompt $c$ for both preferred and less preferred images may not be ideal. To address this, we propose optimizing a diffusion model using a dataset $D = \{z^w, z^l, x^w, x^l\}$, where $z^w$ and $z^l$ are captions generated by a captioning model $Q_\phi$ for the preferred and less preferred images with respect to the original prompt, respectively.
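
To make the contrast in data formats concrete, here is a toy sketch; the class and field names are ours and purely illustrative:

```python
# Toy illustration of the three data formats discussed above; names are our own.
from dataclasses import dataclass

@dataclass
class DiffusionDPOSample:      # D = {c, x^w, x^l}: one prompt, two ranked images
    prompt: str                # c
    preferred_image: str       # path to x^w
    less_preferred_image: str  # path to x^l

@dataclass
class DiffusionKTOSample:      # D = {c, x}: one image with a binary desirability label
    prompt: str                # c
    image: str                 # x
    desirable: bool            # like / dislike

@dataclass
class DCPOSample:              # D = {z^w, z^l, x^w, x^l}: each image gets its own caption
    caption_preferred: str        # z^w ~ Q_phi(z | x^w, c)
    caption_less_preferred: str   # z^l ~ Q_phi(z | x^l, c)
    preferred_image: str          # x^w
    less_preferred_image: str     # x^l
```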

However, we compared DCPO with Diffusion-KTO on various benchmarks, and the results are presented in the following table. This comparison highlights how the differences in problem formulation influence the model's performance and further emphasizes the distinct nature of our proposed approach relative to Diffusion-KTO.

| Method (SD2.1) | Geneval (Overall) | Pickscore | HPSv2.1 | ImageReward | CLIPscore |
|---|---|---|---|---|---|
| Diffusion-DPO | 0.4857 | 20.36 | 25.10 | 56.4 | 26.98 |
| Diffusion-KTO | 0.5008 | 20.41 | 24.80 | 55.5 | 26.95 |
| DCPO-h | 0.5100 | 20.57 | 25.62 | 58.2 | 27.13 |

[1] Diffusion-KTO: https://arxiv.org/abs/2404.04465


W6: It is great to conduct extensive experiments on SD 2.1, but the paper will be stronger if there are experiment results on other diffusion models, even if the experiments are not as complete as on SD 2.1.


Reply to W6: In response to the reviewer's suggestion, we conducted additional experiments to evaluate the performance of DCPO on the SDXL model. Because SDXL is a large model, we perform LoRA fine-tuning with minimal hyper-parameter search.

The results presented in the following table demonstrate that DCPO outperforms Diffusion-DPO on metrics such as Pickscore, Geneval, HPSv2, and CLIPscore. Additionally, DCPO achieves performance comparable to Diffusion-DPO on ImageReward. These findings highlight the generalizability and effectiveness of DCPO across different diffusion models.

| Method (SDXL) | Geneval (Overall) | Pickscore | HPSv2.1 | ImageReward | CLIPscore |
|---|---|---|---|---|---|
| Diffusion-DPO | 0.5645 | 21.77 | 28.64 | 71.2 | 28.61 |
| DCPO-c | 0.5758 | 21.87 | 28.65 | 71.2 | 28.63 |
| DCPO-h-weak | 0.5704 | 21.87 | 28.64 | 71.2 | 28.62 |
| DCPO-h-medium | 0.5700 | 21.86 | 28.64 | 71.2 | 28.63 |
| DCPO-h-strong | 0.5696 | 21.86 | 28.64 | 71.2 | 28.62 |

Q1: Although the experiments suggest the proposed approach is better, it is unclear to me why this would be the case; any proofs or intuitions would help readers better understand it.


Reply to Q1: The Diffusion-DPO objective is designed for preference optimization using a dataset $D = \{c, x^w, x^l\}$, where $x^w$ and $x^l$ represent preferred and less preferred images for a given prompt $c$. DCPO enhances this framework by replacing the prompt $c$ with a caption $z$, generated by a captioning model $Q_\phi$, which provides more relevant and contextually aligned information about each image. This adjustment improves performance by enabling better contextual conditioning, where $z^l$ captures specific details relevant to $x^l$ that $c$ might lack. It also reduces noise by eliminating irrelevant details in $c$, supporting the model in distinguishing $x^w$ from $x^l$ more effectively. Additionally, conditioning on $z^l$ simplifies dependencies, ensuring the model focuses entirely on the necessary information for preference evaluation. These improvements result in better preference optimization, as demonstrated by the stronger correlation between captions and their corresponding images compared to the original prompts.


Q2: Several papers appear multiple times in the References section, please dedupe.


Reply to Q2: Thank you for pointing this out! We’ve fixed the issue in the updated version.

Comment

Dear Reviewer f715,

As the deadline is nearing, we wanted to gently follow up on our recent submission. Your feedback is highly valuable to us, and we would appreciate any updates or further guidance you might have regarding our revisions and responses.

Thank you for your time and consideration.

Comment

We sincerely appreciate your valuable time and thoughtful review.

Regarding your concern about using different captions for the preferred and less preferred images, we would like to clarify that our approach aligns with prior work. As a side note, we found that a recent paper accepted at ICML proposes DOVE, a loss function with a design similar to our DCPO, to address the limitations of the DPO loss function in language models. Similarly to our method, DOVE generates distinct instruction inputs for the less preferred examples, resulting in different inputs for the preferred and the less preferred cases. We kindly refer you to [1], and its Figure 1 and Equation 2, for reference.

As the rebuttal period is near its conclusion, we respectfully ask if you have any further concerns regarding the motivation behind our work, the effectiveness of our method (including the additional results provided) or any other aspects of our submission. We have been more than happy to address any remaining questions or new ones during the remaining discussion period.

Thank you again for your constructive feedback!

[1] DOVE: https://openreview.net/pdf?id=AzMnkF0jRT

Review (Rating: 5)

This work first presents the conflict distribution issue in preference datasets, where preferred and less preferred images generated from the same prompt exhibit significant overlap. For this issue, they introduce the Captioning and Perturbation methods: generating a caption based on the image and the prompt, and creating three levels of perturbation of the prompt. They also explore the irrelevant prompt issue in previous DPO methods and propose Dual Caption Preference Optimization (DCPO) to improve diffusion model alignment. Lastly, they show promising results compared to existing methods.

Strengths

  1. The paper is well-organized and easy to follow. Figures are clear to read, such as Figure 2.
  2. The story is complete: they propose hypotheses and then use experimental results to verify them in Sec 3.3 with clear ablation studies.
  3. The problem setup is clear. They also provide enough details to reproduce the work.

Weaknesses

  1. My biggest concern is about the generalization of the proposed method as diffusion models develop. For example, in Figure 2, it is easy to distinguish the preferred and less-preferred images, as the latter does not even align with the original prompt. What if the model's development is already beyond the alignment stage? The current positive/negative samples are only about alignment; what about more advanced differences if both have sufficient alignment?

  2. Line 188-189, could you explain more details on how to get the preferred and less-preferred images? Human annotation?

  3. It would be beneficial to highlight the difference between medium and strong perturbation. Do we have a way to quantify the difference between them? Are they controllably generated? Why do we need medium perturbations? Would weak/strong be enough?

  4. In terms of GPT-4o evaluation, does it matter whether the images are shown together or separately? And what about the order of showing them to GPT-4o if shown separately?

Questions

See weaknesses.

Comment

Thank you for your valuable feedback. We have thoughtfully addressed your comments below and look forward to discussing any remaining concerns during the discussion period to facilitate a positive re-evaluation of the score.


W1: My biggest concern is about the generalization of the proposed method as diffusion models develop. For example, in Figure 2, it is easy to distinguish the preferred and less-preferred images, as the latter does not even align with the original prompt. What if the model's development is already beyond the alignment stage? The current positive/negative samples are only about alignment; what about more advanced differences if both have sufficient alignment?


Reply to W1: Thank you for your question. In this study, we follow the prevailing paradigm in preference data, where for a given text prompt $P$, an image $I_g$ is deemed preferred, while an image $I_b$ is less favored. We first observe that even leading models, such as FLUX, currently struggle to generate multiple aligned images from the same prompt. This current limitation highlights the significant potential for improvement in producing images with robust alignment.

Assuming future advancements close this alignment gap, we propose an alternative evaluation framework: rather than assigning binary preference scores to image pairs (i.e., $I_g = 1$ and $I_b = 0$), we envision a soft-scoring system that captures varying degrees of alignment with human preference across multiple images. For instance, if four images are generated, they could be ranked with scores indicating preference levels, such as $\{I_1: 0.7,\ I_2: 0.5,\ I_3: 0.1,\ I_4: 0.6\}$, where $I_1$ is the most preferred and $I_3$ the least preferred.

To test this approach, we generated 100 samples $I_1$ and $I_2$ using different seeds with FLUX and asked two volunteers to assign soft scores. The observed average difference of 0.4 demonstrates that current models like FLUX face challenges in producing multiple well-aligned images from the same prompt, emphasizing the importance of our proposed framework.


W2: Line 188-189, could you explain more details on how to get the preferred and less-preferred images? Human annotation?


Reply to W2: In our experiments, we use the Pick-a-Pic v2 dataset, which is human-annotated. This dataset is constructed such that, for a given prompt, two images are generated; the user then chooses the preferred image between the two.


W3: In terms of GPT-4o evaluation, does it matter whether the images are shown together or separately? And what about the order of showing them to GPT-4o if shown separately?


Reply to W3: We follow standard practice, as in Diffusion-DPO [1] and MaPO [2], by showing two images side by side for comparison. To address positional bias in GPT-4o's evaluations, we alternate the positions of the images across different criteria (explained in Section 3).

The results below show that DCPO consistently achieves better performance than Diffusion-DPO, even when positional bias is accounted for.

| Model | General Preference (Win Rate %) | Visual Appeal (Win Rate %) | Prompt Alignment (Win Rate %) |
|---|---|---|---|
| SD2.1-DCPO-h | 58% | 64.5% | 56.5% |
| SD2.1-DPO | 42% | 35.5% | 43.5% |

This demonstrates that DCPO provides more reliable results under unbiased conditions.
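
For illustration, a minimal sketch of such a side-by-side judgment with position swapping is shown below; the judging prompt, criterion wording, and file names are our assumptions rather than the paper's exact protocol:

```python
# Sketch of a side-by-side GPT-4o comparison with alternating image order to
# control positional bias. The judging prompt and file names are assumptions;
# the paper's exact evaluation protocol may differ.
import base64
from openai import OpenAI

client = OpenAI()

def b64(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def judge(prompt: str, first_img: str, second_img: str, criterion: str) -> str:
    """Ask GPT-4o which of two images better satisfies the criterion for the prompt."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Prompt: {prompt}\nCriterion: {criterion}\n"
                         "Reply with 'first' or 'second' for the better image."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64(first_img)}"}},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64(second_img)}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip().lower()

# Swap positions across the two calls and aggregate to cancel positional bias.
verdict_ab = judge("a corgi in the snow", "dcpo.png", "dpo.png", "general preference")
verdict_ba = judge("a corgi in the snow", "dpo.png", "dcpo.png", "general preference")
```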

Comment

W4: It would be beneficial to highlight the difference between medium and strong perturbation. Do we have a way to quantify the difference between them? Are they controllably generated? Why do we need medium perturbations? Would weak/strong be enough?


Reply to W4: In our study, perturbations are controllably generated using DIPPER, a model that allows configuration via a 'lexicon diversity' parameter, which ranges from 1 to 100. Lexicon diversity measures the likelihood of a perturbation introducing alternative expressions or synonyms. By setting this parameter to 40, 60, and 80, we create weak, medium, and strong perturbations, respectively.

Qualitatively, weak perturbations replace certain words in the original caption with synonyms, medium perturbations combine synonym swapping with the reordering of words, and strong perturbations paraphrase the caption into entirely different sentence structures.

Quantitatively, Figure 4 illustrates the differences across these levels. Based on the CLIPScore distribution of the original captions (turquoise histogram in (a)), we observe that stronger perturbations shift the CLIPScores further left. Medium perturbations (c) remain closer to the original captions in terms of mean and standard deviation compared to strong perturbations (d), demonstrating that medium perturbations achieve a balance between preserving the original meaning and introducing diversity.
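
A minimal sketch of this controllable perturbation step is below, assuming the publicly released DIPPER checkpoint; the control-code input format follows the DIPPER authors' reference code and, like the example caption, should be treated as an assumption rather than the paper's exact configuration:

```python
# Sketch of caption perturbation with DIPPER at three lexical-diversity settings.
# Checkpoint name and control-code format follow the DIPPER reference
# implementation and are assumptions; the paper's exact setup may differ.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("google/t5-v1_1-xxl")
model = T5ForConditionalGeneration.from_pretrained("kalpeshk2011/dipper-paraphraser-xxl")

def perturb(caption: str, lex_diversity: int) -> str:
    """Paraphrase a caption; 40/60/80 correspond to weak/medium/strong perturbation."""
    # DIPPER is steered with control codes of the form "lexical = X, order = Y".
    control = f"lexical = {lex_diversity}, order = 0 <sent> {caption} </sent>"
    inputs = tokenizer(control, return_tensors="pt")
    output = model.generate(**inputs, do_sample=True, top_p=0.75, max_new_tokens=64)
    return tokenizer.decode(output[0], skip_special_tokens=True)

caption = "a corgi wearing a red scarf in the snow"   # hypothetical example
weak, medium, strong = perturb(caption, 40), perturb(caption, 60), perturb(caption, 80)
```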


[1] Diffusion-DPO: https://arxiv.org/abs/2311.12908

[2] MaPO: https://arxiv.org/abs/2406.06424

Comment

Dear Reviewer tyap,

As the deadline is nearing, we wanted to gently follow up on our recent submission. Your feedback is highly valuable to us, and we would appreciate any updates or further guidance you might have regarding our revisions and responses.

Thank you for your time and consideration.

Comment

Thank you for your replies to my questions. I have read all the reviews and the discussions between reviewers and authors. I agree with Reviewer f715 and am also getting confused about the motivation. Therefore, I may tend to downgrade my rating if this is not clarified.

I want to get a clear story on the motivation and intuition of this work; so far it's still confusing. Let's keep discussing. The arguments on W1 and W2 seem self-contradictory. In reply to W1, the authors claim they observed a high correlation between preferred and less-preferred images. In reply to W2, the authors observed that the advanced model FLUX struggles to produce multiple well-aligned images, and the average difference is 0.4. Does this suggest the image correlations are not high? Or is the correlation strong on worse models but weak on more recent, almost-SOTA models?

Comment

Thank you for your response.

In response to W1, we mention that "Our hypothesis is that increasing the divergence between the correlation distributions of preferred and less preferred images can enhance the performance of diffusion models optimized using direct preference optimization methods. By addressing this 'conflict distribution,' our approach aims to make the optimization process more effective and improve overall model performance." The distribution we refer to here is that of the CLIPScore, which we call the conflict distribution throughout our paper. To clarify, we refer to the distribution of the correlation measured by CLIPScore, which takes as input a text and an image and returns their alignment as measured by CLIP. We find that this correlation is high for both {preferred image, caption} and {less-preferred image, caption} in existing preference datasets, i.e., the CLIPScore of {preferred image, caption} and that of {less-preferred image, caption} are similar.

In response to W2, we argue that existing SOTA models, such as FLUX, are also unable to generate multiple images aligned with the prompt. This was to make the point that, despite advancements in T2I generation, there is still a substantial gap in generating well-aligned images (as indicated by the 0.4 difference in our results).

We hope this clarifies any confusion. Please let us know if you have more questions. Thank you.

Comment

We sincerely appreciate your time and thoughtful review!

In our general comments, we provided both theoretical and experimental motivation for our work, along with a series of proofs demonstrating the effectiveness of DCPO compared to diffusion-DPO. As the rebuttal period approaches its conclusion, we would like to respectfully ask if you have any remaining concerns regarding the motivation behind our work, the effectiveness of our method (including the additional results provided), or any other aspects of our submission. We have been more than happy to address any remaining or new concerns during the discussion period.

If our responses have adequately addressed your concerns, we kindly ask you to consider reevaluating our work in light of the updated information.

Thank you once again for your valuable feedback!

Comment

Thank you to the reviewers for their comments and feedback.


  • We found a recent COLM 2024 paper in which the authors, in the future work section, suggested aligning diffusion models using two distinct captions and revising the DPO loss function. In DCPO, the captioning model functions similarly to the Refiner described in their work. Further details are in Appendix D of the paper:

[1] From r to Q∗: Your Language Model is Secretly a Q-Function: https://arxiv.org/abs/2404.12358


Our motivations follow the same spirit, and we elaborate on them as follows. In preference datasets, preferred and less preferred images must exhibit clear differences. If these differences are minimal, the DPO loss function collapses theoretically, because $\log P(x^w \mid c) \simeq \log P(x^l \mid c)$ when $x^w \simeq x^l$, reducing the loss to near zero. To address this, current preference datasets select different images as preferred and less preferred for a given prompt $c$ (addressing W1 and W2 for reviewer f715).

An unexplored area in preference optimization for diffusion models is the effect of improving captions and using distinct captions. We approached this by enhancing the original captions through captioning models. As shown in Table 4, improved captions enhance the performance of preference optimization. The reasoning is theoretical: when a caption $c_1$ has higher semantic similarity with the image $x$ than $c_2$, $\log P(x^w \mid c)$ in the first term of the DPO loss improves, leading to better alignment performance. Thus, improving captions boosts DPO performance by improving $\log P(x^w \mid c)$.

Hypothesis 1: If a distinct caption for the less preferred image correctly describes it, this will improve $\log P(x^l \mid c)$ in the DPO loss function.

To test this hypothesis, we conducted the following two Experiments:

Experiment 1: Optimize a diffusion model with DPO using irrelevant captions for less preferred images created by heavily perturbing the prompt.

Experiment 2: Generate captions for less preferred images using a captioning model based on the original caption.

| Method (SD2.1) | Caption of (Preferred, Less Preferred) Images | GenEval (Overall) | Pickscore | HPSv2.1 | ImageReward | CLIPscore |
|---|---|---|---|---|---|---|
| DPO | original caption c for both | 0.4857 | 20.36 | 25.10 | 56.4 | 26.98 |
| DPO (Experiment 1) | original caption c, perturbed original caption | 0.4852 | 20.21 | 25.34 | 53.1 | 26.87 |
| DPO (Experiment 2) | original caption c, generated caption for less preferred | 0.4870 | 20.41 | 25.11 | 56.5 | 26.98 |

As shown in the above table, alignment performance improves when aligned captions are used for less preferred images, supporting Hypothesis 1. This resolves the 'irrelevant prompts' issue in both the optimization process and the datasets.

Observation 1: While alignment performance increases by generating a good caption for less preferred images, the improvement is modest. Our data analysis reveals that the CLIPscores of preferred and less preferred images are often similar (see Figure 3). The similarity arises because the caption for the less preferred image was generated by a captioning model, leading to minimal differences.

Hypothesis 2: Based on observation 1, we hypothesize that increasing the difference in average CLIPscore between preferred and less preferred images will improve the alignment performance of the diffusion model.

To test this, we generated two distinct captions for preferred and less preferred images, respectively. The goal was to improve $\log P(x^w \mid c)$ and $\log P(x^l \mid c)$ in the DPO loss function by having correspondingly more suitable captions. The results in Tables 1 and 2 confirm that DCPO significantly improves alignment performance across different benchmarks.


For Reviewer Y9en - Difference between Conflict Distribution and Irrelevant Prompts: While both issues share similarities, we argue they are not exact duplicates, based on the two experiments mentioned earlier. In Experiment 1, the difference between the CLIPscore of the preferred image with the original caption and that of the less preferred image with the perturbed caption is large, meaning there is no conflict distribution. It shows that having little conflict distribution can in fact lead to a decrease in alignment performance due to irrelevant prompts. In Experiment 2, the alignment performance is now better than the original DPO, but having two similar distributions of CLIPscore between preferred and less preferred images restricts the performance from improving further. Thus, while Conflict Distribution and Irrelevant Prompts overlap in certain aspects, they are in fact distinct problems.


We thank all reviewers for their constructive suggestions. We hope the reviewers take into account the comprehensive analyses and experiments conducted during the rebuttal and reconsider their scores.

Comment

Really appreciate authors' efforts on the clarifications.

Hypothesis 1: I am not sure what "improve" means. I can see that a more relevant caption $z^l$ will increase $\log P(x^l \mid c)$ to $\log P(x^l \mid z^l)$, which increases the loss. So it to some degree sets an upper bound on the original loss, but it is unclear why optimizing this upper bound would necessarily be better than optimizing the original loss.

Overall, I'd hold a neutral position if this paper simply stated that, through experiments, they found that generating $z^l$ and $z^w$ and using them in DPO instead of $c$ yields better performance. It's quite unfortunate that so far none of the motivations sounds reasonable to me. It seems other reviewers are also confused.

Comment

Continuation of Proof 2:


Assuming that the neural network $\boldsymbol{\epsilon}_\theta$ is capable of approximating the optimal predictor $\boldsymbol{\epsilon}_\theta^\ast$, especially as training progresses and the model capacity is sufficient, we can write:

$$\left\| \boldsymbol{\epsilon}^l - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t^l, t, z^l) \right\|_2^2 \approx \left\| \boldsymbol{\epsilon}^l - \boldsymbol{\epsilon}_\theta^\ast(\mathbf{x}_t^l, t, z^l) \right\|_2^2$$

Similarly, for $c$:

$$\left\| \boldsymbol{\epsilon}^l - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t^l, t, c) \right\|_2^2 \approx \left\| \boldsymbol{\epsilon}^l - \boldsymbol{\epsilon}_\theta^\ast(\mathbf{x}_t^l, t, c) \right\|_2^2 .$$

Therefore, the expected squared error satisfies:

$$\mathbb{E}\left[ \left\| \boldsymbol{\epsilon}^l - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t^l, t, z^l) \right\|_2^2 \right] \leq \mathbb{E}\left[ \left\| \boldsymbol{\epsilon}^l - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t^l, t, c) \right\|_2^2 \right]$$

Since the term $\Delta_{\text{less-preferred}}$ in the loss function involves the difference of squared errors, using $z^l$ instead of $c$ for the less preferred sample results in a lower error term:

$$\Delta_{\text{less-preferred}}^{(z^l)} = \left\| \boldsymbol{\epsilon}^l - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t^l, t, z^l) \right\|_2^2 - \left\| \boldsymbol{\epsilon}^l - \boldsymbol{\epsilon}_{\text{ref}}(\mathbf{x}_t^l, t, z^l) \right\|_2^2$$

Comparing with the original:

$$\Delta_{\text{less-preferred}}^{(c)} = \left\| \boldsymbol{\epsilon}^l - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t^l, t, c) \right\|_2^2 - \left\| \boldsymbol{\epsilon}^l - \boldsymbol{\epsilon}_{\text{ref}}(\mathbf{x}_t^l, t, c) \right\|_2^2$$

Assuming the reference model $\boldsymbol{\epsilon}_{\text{ref}}$ remains the same or also benefits similarly from the additional information in $z^l$, the net effect is that the first term decreases more than the second term, leading to a reduced $\Delta_{\text{less-preferred}}$.

[1] Law of Total Variance (conditional variance formula): Ross, S. M. (2014). Introduction to Probability Models (11th ed.). Academic Press.


Proof 3 - Replacing caption cc with the specifically generated caption zwz^w for the preferred image x0wx_0^w increases Δpreferred\Delta_{\text{preferred}}.

To prove that replacing c\mathbf{c} with zwQ(zwxw,c)\mathbf{z}^w \sim Q(z^w|x^w, c), where czw\mathbf{c} \subset \mathbf{z}^w, for x0w\mathbf{x}_0^w also contributes to a better optimized loss L(θ)L(\theta), we examine how this particular substitution affects the loss function.

We let

Rθ(c)=ϵwϵθ(xtw,t,c)22,R_\theta(\mathbf{c}) = \| \boldsymbol{\epsilon}^w - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t^w, t, \mathbf{c}) \|_2^2, Rref(c)=ϵwϵref(xtw,t,c)22.R_{\text{ref}}(\mathbf{c}) = \| \boldsymbol{\epsilon}^w - \boldsymbol{\epsilon}_{\text{ref}}(\mathbf{x}_t^w, t, \mathbf{c}) \|_2^2.

The rate of decrease in RθR_\theta due to zw\mathbf{z}^w is proportional to the model's ability to exploit the additional conditioning. Since ϵθ\boldsymbol{\epsilon}_\theta is learnable,

it can more effectively leverage zw\mathbf{z}^w than ϵref\boldsymbol{\epsilon}_{\text{ref}}, yielding:

ΔRθ=Rθ(c)Rθ(zw)ΔRref=Rref(c)Rref(zw).\Delta R_\theta = R_\theta(\mathbf{c}) - R_\theta(\mathbf{z}^w) \gg \Delta R_{\text{ref}} = R_{\text{ref}}(\mathbf{c}) - R_{\text{ref}}(\mathbf{z}^w).

We further elaborate on why the learnable model's noise prediction residual (RθR_\theta) decreases faster than the reference model's residual (RrefR_{\text{ref}}) when c\mathbf{c} is replaced by zw\mathbf{z}^w. The residuals for the learnable and reference models are defined as:

$$R_\theta(\mathbf{c}) = \| \boldsymbol{\epsilon}^w - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t^w, t, \mathbf{c}) \|_2^2, \qquad R_{\text{ref}}(\mathbf{c}) = \| \boldsymbol{\epsilon}^w - \boldsymbol{\epsilon}_{\text{ref}}(\mathbf{x}_t^w, t, \mathbf{c}) \|_2^2.$$

When $\mathbf{c}$ is replaced with $\mathbf{z}^w$ (where $\mathbf{c} \subset \mathbf{z}^w$), the residuals become:

$$R_\theta(\mathbf{z}^w) = \| \boldsymbol{\epsilon}^w - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t^w, t, \mathbf{z}^w) \|_2^2, \qquad R_{\text{ref}}(\mathbf{z}^w) = \| \boldsymbol{\epsilon}^w - \boldsymbol{\epsilon}_{\text{ref}}(\mathbf{x}_t^w, t, \mathbf{z}^w) \|_2^2.$$

The rate of decrease for each residual is defined as:

$$\Delta R_\theta = R_\theta(\mathbf{c}) - R_\theta(\mathbf{z}^w), \qquad \Delta R_{\text{ref}} = R_{\text{ref}}(\mathbf{c}) - R_{\text{ref}}(\mathbf{z}^w).$$

The quality of conditioning, $Q(\mathbf{c})$, represents how well the conditioning $\mathbf{c}$ aligns with the true noise $\boldsymbol{\epsilon}^w$. We assume that

$$Q(\mathbf{z}^w) > Q(\mathbf{c}),$$

where the improvement in conditioning quality $\Delta Q$ is defined as

$$\Delta Q = Q(\mathbf{z}^w) - Q(\mathbf{c}).$$
Comment

**Proof 2: Replacing the caption $c$ with the specifically generated caption $z^l$ for the less-preferred image $x_0^l$ decreases $\Delta_{\text{less-preferred}}$.**

To analyze how replacing $\mathbf{c}$ with $\mathbf{z}^l$, where $\mathbf{c} \subset \mathbf{z}^l$ and $\mathbf{z}^l \sim Q(\mathbf{z}^l \mid x^l, c)$, for the less-preferred image $\mathbf{x}_0^l$ improves the optimization, we delve into how the loss function is affected by this substitution.

The term relevant to the less-preferred image $\mathbf{x}_t^l$ in the loss is:

$$\Delta_{\text{less-preferred}} = \| \boldsymbol{\epsilon}^l - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t^l, t, \mathbf{c}) \|_2^2 - \| \boldsymbol{\epsilon}^l - \boldsymbol{\epsilon}_{\text{ref}}(\mathbf{x}_t^l, t, \mathbf{c}) \|_2^2.$$

Replacing $\mathbf{c}$ with $\mathbf{z}^l$ modifies the predicted noise term $\epsilon_\theta(\mathbf{x}_t^l, t, \mathbf{c})$ to $\epsilon_\theta(\mathbf{x}_t^l, t, \mathbf{z}^l)$. Since $\mathbf{z}^l$ better represents $\mathbf{x}_t^l$, we have:

$$\| \boldsymbol{\epsilon}^l - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t^l, t, \mathbf{z}^l) \|_2^2 < \| \boldsymbol{\epsilon}^l - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t^l, t, \mathbf{c}) \|_2^2 \quad [\text{Eq.1}].$$

When $\|\boldsymbol{\epsilon}^l - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t^l, t, \mathbf{z}^l)\|_2^2$ becomes smaller, the term $\Delta_{\text{less-preferred}}$ decreases. This leads to $\Delta_{\text{preferred}} - \Delta_{\text{less-preferred}}$ becoming larger, which improves the soft-margin optimization of the loss function $L(\theta)$, as we have shown in Proof 1.

We further elaborate on why [Eq.1] is true. In the context of mean squared error (MSE) minimization, the optimal predictor of $\boldsymbol{\epsilon}^l$ given some information is the conditional expectation:

When conditioned on $(\mathbf{x}_t^l, t, c)$:

$$\boldsymbol{\epsilon}_\theta^\ast(\mathbf{x}_t^l, t, c) = \mathbb{E}\left[ \boldsymbol{\epsilon}^l \mid \mathbf{x}_t^l, t, c \right]$$

When conditioned on $(\mathbf{x}_t^l, t, z^l)$:

$$\boldsymbol{\epsilon}_\theta^\ast(\mathbf{x}_t^l, t, z^l) = \mathbb{E}\left[ \boldsymbol{\epsilon}^l \mid \mathbf{x}_t^l, t, z^l \right]$$

The total variance of $\boldsymbol{\epsilon}^l$ can be decomposed using the Law of Total Variance (conditional variance formula) [1]:

$$\operatorname{Var}\left( \boldsymbol{\epsilon}^l \right) = \mathbb{E}\left[ \operatorname{Var}\left( \boldsymbol{\epsilon}^l \mid \mathbf{x}_t^l, t, c \right) \right] + \operatorname{Var}\left( \mathbb{E}\left[ \boldsymbol{\epsilon}^l \mid \mathbf{x}_t^l, t, c \right] \right)$$

Similarly, when conditioning on $z^l$:

$$\operatorname{Var}\left( \boldsymbol{\epsilon}^l \right) = \mathbb{E}\left[ \operatorname{Var}\left( \boldsymbol{\epsilon}^l \mid \mathbf{x}_t^l, t, z^l \right) \right] + \operatorname{Var}\left( \mathbb{E}\left[ \boldsymbol{\epsilon}^l \mid \mathbf{x}_t^l, t, z^l \right] \right)$$

Since $c \subset z^l$, the information provided by $z^l$ is richer than that of $c$. In probability theory, conditioning on more information does not increase the conditional variance:

$$\operatorname{Var}\left( \boldsymbol{\epsilon}^l \mid \mathbf{x}_t^l, t, z^l \right) \leq \operatorname{Var}\left( \boldsymbol{\epsilon}^l \mid \mathbf{x}_t^l, t, c \right) \quad [\text{Eq.2}]$$

This inequality holds because conditioning on additional information ($z^l$) can only reduce or leave unchanged the uncertainty (variance) about $\boldsymbol{\epsilon}^l$.

The expected squared error when using the optimal predictor equals the expected conditional variance:

$$\mathbb{E}\left[ \left\| \boldsymbol{\epsilon}^l - \boldsymbol{\epsilon}_\theta^\ast(\mathbf{x}_t^l, t, c) \right\|_2^2 \right] = \mathbb{E}\left[ \operatorname{Var}\left( \boldsymbol{\epsilon}^l \mid \mathbf{x}_t^l, t, c \right) \right]$$

Similarly,

$$\mathbb{E}\left[ \left\| \boldsymbol{\epsilon}^l - \boldsymbol{\epsilon}_\theta^\ast(\mathbf{x}_t^l, t, z^l) \right\|_2^2 \right] = \mathbb{E}\left[ \operatorname{Var}\left( \boldsymbol{\epsilon}^l \mid \mathbf{x}_t^l, t, z^l \right) \right]$$

From [Eq.2], we have:

$$\operatorname{Var}\left( \boldsymbol{\epsilon}^l \mid \mathbf{x}_t^l, t, z^l \right) \leq \operatorname{Var}\left( \boldsymbol{\epsilon}^l \mid \mathbf{x}_t^l, t, c \right)$$

Taking expectations on both sides:

$$\mathbb{E}\left[ \operatorname{Var}\left( \boldsymbol{\epsilon}^l \mid \mathbf{x}_t^l, t, z^l \right) \right] \leq \mathbb{E}\left[ \operatorname{Var}\left( \boldsymbol{\epsilon}^l \mid \mathbf{x}_t^l, t, c \right) \right]$$

Therefore,

$$\mathbb{E}\left[ \left\| \boldsymbol{\epsilon}^l - \boldsymbol{\epsilon}_\theta^\ast(\mathbf{x}_t^l, t, z^l) \right\|_2^2 \right] \leq \mathbb{E}\left[ \left\| \boldsymbol{\epsilon}^l - \boldsymbol{\epsilon}_\theta^\ast(\mathbf{x}_t^l, t, c) \right\|_2^2 \right]$$
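The inequality can also be checked numerically. The sketch below (our own toy construction, not the authors' code) builds a setting in which the caption $z$ carries strictly more information about the noise than the prompt $c$, and compares the mean squared error of the two conditional-expectation predictors:

```python
# Toy Monte Carlo check of Eq.2: conditioning on richer information (z "contains" c)
# does not increase the optimal predictor's expected squared error.
# The generative model below is entirely assumed for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
a = rng.normal(size=n)             # component revealed by the prompt c
b = rng.normal(size=n)             # extra component revealed only by the caption z
noise = 0.1 * rng.normal(size=n)   # irreducible residual
eps = a + b + noise                # stand-in for the noise eps^l being predicted

pred_given_c = a                   # E[eps | c]  (c reveals a only)
pred_given_z = a + b               # E[eps | z]  (z reveals both a and b)

mse_c = np.mean((eps - pred_given_c) ** 2)   # ~ Var(b) + Var(noise) ~ 1.01
mse_z = np.mean((eps - pred_given_z) ** 2)   # ~ Var(noise)          ~ 0.01

print(mse_z <= mse_c)              # True: richer conditioning lowers the expected error
```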
Comment

We thank the reviewer for responding to our comment. We would like to further clarify that the Diffusion-DPO loss does not include $\log P(x^l \mid c)$ in its final form. In the Diffusion-DPO paper [1], the authors state that incorporating $\log P(x^w \mid c)$ and $\log P(x^l \mid c)$ into the Diffusion-DPO loss function is intractable and inefficient, as we have reiterated in L205-206. For more details on Diffusion-DPO, we refer the reviewers to Section 4 of the Diffusion-DPO paper. In short, the final form of the Diffusion-DPO loss is similar to our DCPO loss, as shown in L211-214.

[1] Diffusion-DPO: https://arxiv.org/abs/2311.12908


The Diffusion-DPO loss $L(\theta)$ can be expressed as the following equation:

$$L(\theta) = -\mathbb{E}\left[ \log \sigma \Big( -\beta T \omega(\lambda_t) \big( \Delta_{\text{preferred}} - \Delta_{\text{less-preferred}} \big) \Big) \right]$$

where, in actuality,

$$\Delta_{\text{preferred}} = \| \boldsymbol{\epsilon}^w - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t^w, t, \mathbf{c}) \|_2^2 - \| \boldsymbol{\epsilon}^w - \boldsymbol{\epsilon}_{\text{ref}}(\mathbf{x}_t^w, t, \mathbf{c}) \|_2^2$$

and

$$\Delta_{\text{less-preferred}} = \| \boldsymbol{\epsilon}^l - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t^l, t, \mathbf{c}) \|_2^2 - \| \boldsymbol{\epsilon}^l - \boldsymbol{\epsilon}_{\text{ref}}(\mathbf{x}_t^l, t, \mathbf{c}) \|_2^2$$

---

**Our Motivation and Methodology.** We are motivated to demonstrate that a larger difference between $\Delta_{\text{less-preferred}}$ and $\Delta_{\text{preferred}}$, i.e., a much clearer distinction between preferred and less-preferred image-caption pairs, contributes to a more optimized $L(\theta)$ in terms of soft-margin optimization. Our DCPO paradigm enforces this larger difference through the following two mechanisms. On $\Delta_{\text{less-preferred}}$, we replace the original shared caption, a.k.a. the prompt $c$, with a generated caption $z^l$ more suitable for the less-preferred image $x^l$ using $Q(z^l|x^l, c)$, in order to decrease $\Delta_{\text{less-preferred}}$. Likewise, on $\Delta_{\text{preferred}}$, we replace $c$ with an independently generated caption $z^w$ for the preferred image $x^w$ using $Q(z^w|x^w, c)$, in order to increase $\Delta_{\text{preferred}}$. In the following sections, we present formal proofs of why our methodology leads to a more optimized $L(\theta)$ for a diffusion-based model and, consequently, better performance in preference alignment tasks. The proofs will be incorporated into the camera-ready version of the paper.

---

**Proof 1: Increasing the difference between $\Delta_{\text{preferred}}$ and $\Delta_{\text{less-preferred}}$ improves the optimization of $L(\theta)$.**

For better clarity, the loss function $L(\theta)$ can be written as:

$$L(\theta) = -\mathbb{E} \left[ \log \sigma \big( -\beta T \omega(\lambda_t) \cdot M \big) \right]$$

where $\sigma(x) = \frac{1}{1 + e^{-x}}$ is the sigmoid function, which squashes its input $x$ into the range $(0, 1)$, and $M = \Delta_{\text{preferred}} - \Delta_{\text{less-preferred}}$ is the margin between the preferred and less-preferred prediction terms. Characteristically, the gradient of $\sigma(x)$ is at its maximum near $x = 0$ and decreases as $|x|$ increases. A larger margin $M$ makes it easier for the optimization to drive the sigmoid function towards its asymptotes, reducing the loss.

- When $M$ is small ($|M| \approx 0$): the sigmoid $\sigma(-\beta T \omega(\lambda_t) \cdot M)$ is near 0.5 (its midpoint). The gradient of $\log \sigma(x)$ is largest near this point, meaning the model struggles to differentiate effectively between preferred and less-preferred predictions.
- When $M$ is large ($|M| \gg 0$): the sigmoid $\sigma(-\beta T \omega(\lambda_t) \cdot M)$ moves closer to 0 or 1, depending on the sign of $M$. For a well-aligned model, if the preferred predictions are correct, $M > 0$ and $\sigma(-\beta T \omega(\lambda_t) \cdot M)$ approaches 1, thus minimizing the loss.

Intuitively, an ideally large $M$ represents a clear distinction between the preferred and the less-preferred image-caption pairs. Thus, by maximizing $M$, we may push the loss $L(\theta)$ towards its minimum, leading to better soft-margin optimization.
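As a companion to Proof 1, the following sketch (a minimal PyTorch rendering of the loss written above, not the released DCPO implementation) computes $\Delta_{\text{preferred}}$, $\Delta_{\text{less-preferred}}$, the margin $M$, and $L(\theta)$; the callables `unet` and `unet_ref`, the embedding arguments, and the scalar treatment of $\beta$, $T$, and $\omega(\lambda_t)$ are all assumptions made for illustration:

```python
# Minimal sketch of the margin-based loss above (hypothetical interfaces, image-shaped tensors).
import torch
import torch.nn.functional as F

def dcpo_style_loss(unet, unet_ref, x_w_t, x_l_t, t, emb_zw, emb_zl,
                    eps_w, eps_l, beta=5000.0, T=1000, omega_t=1.0):
    # Per-sample squared noise-prediction errors for the preferred pair, conditioned on z^w.
    err_w_theta = ((unet(x_w_t, t, emb_zw) - eps_w) ** 2).mean(dim=(1, 2, 3))
    # Per-sample squared errors for the less-preferred pair, conditioned on z^l.
    err_l_theta = ((unet(x_l_t, t, emb_zl) - eps_l) ** 2).mean(dim=(1, 2, 3))
    with torch.no_grad():  # frozen reference model
        err_w_ref = ((unet_ref(x_w_t, t, emb_zw) - eps_w) ** 2).mean(dim=(1, 2, 3))
        err_l_ref = ((unet_ref(x_l_t, t, emb_zl) - eps_l) ** 2).mean(dim=(1, 2, 3))

    delta_preferred = err_w_theta - err_w_ref
    delta_less_preferred = err_l_theta - err_l_ref
    margin = delta_preferred - delta_less_preferred     # M in the text

    # L(theta) = -E[ log sigma( -beta * T * omega(lambda_t) * M ) ]
    return -F.logsigmoid(-beta * T * omega_t * margin).mean()
```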
Comment

I would like to express my gratitude towards the authors again for clarifying the statements.

The term $\log P(x_l|c)$ is used simply following the authors' original response (stated as hypothesis 1), where $\log P(x_l|c)$ is one term in the loss function. This does not conflict with the fact that this term is eventually reduced to an L2 loss between the noise prediction and the ground-truth noise.

In response to Proof 1, I agree that a bigger $M$ does lower the loss, but it simply means $L(\theta|z_l, z_w) < L(\theta|c)$ if indeed $M$ using $(z_l, z_w)$ is bigger than when using $c$. However, I would stick to my opinion that I am not sure what aspect of the optimization is "improved".

To illustrate my point, let's assume a family of loss functions $L(\theta \mid a, b, c) = a\theta^2 + b\theta + c$. By changing $a, b, c$, we can obtain two losses $L_1(\theta) = \theta^2$ and $L_2(\theta) = 0.5(\theta-1)^2 - 1$, and we can show $L_2 \leq L_1$. Then I do not see why I would say the optimization of $L_2$ is improved compared to $L_1$, as they are two different problems: the optimizer $\theta^*_2$ is not necessarily equal to $\theta^*_1$, nor does the performance of model $\theta^*_2$ have to be better than that of $\theta^*_1$, since we have not even introduced the original task at all.

Back in the original paper's setting, DCPO simply sets up another loss function, different from Diffusion-DPO, and it is possible that this new loss is lower than the original loss for any $\theta$. But just as in the $L_1$ and $L_2$ example above, I do not see why optimizing the DCPO loss would necessarily improve the optimization problem in theory, or lead to "better performance in preference alignment tasks".

One may argue that, through experiments, we see that optimizing the new loss gives better results in terms of various evaluation metrics. But this result still does not prove that the motivation is sound in theory.

Comment

We appreciate the reviewer's feedback and the acknowledgment that optimizing a diffusion model with the DCPO loss function decreases the loss. However, we are somewhat confused by the contradictory comments made throughout the review process.

In the original review, the reviewer stated:

"From the description of L175-L180, the irrelevant problem is hardly a problem either. It is inherently part of the objective in Eqn (1), where one way of minimizing Eqn (1) is to decrease logPθ(x0:Tlzl)\log P_\theta(x^l_{0:T}|z^l), which makes the model less likely to generate the less preferred image. So to me, this is a desired behavior instead of a problem."

In the reply comment, the reviewer reiterated:

"I can see that a more relevant caption will increase logP(xlc)\log P(x_l|c) to logP(xlzl)\log P(x^l|z^l), which increases the loss."

However, in the final reply, the reviewer also explicitly stated:

"In response to Proof 1, I agree that a larger MM does lower the loss, but it simply means L(θzl,zw)<L(θc)L(\theta|z_l, z_w) < L(\theta|c) if indeed MM using (zw,zl)(z^w, z^l) is larger than when using cc."

These comments seem inconsistent, and we would like to ask for clarification. Specifically, does optimizing a diffusion model with DCPO increase or decrease the loss?

Furthermore, the 'loss family' example provided by the reviewer seems unclear to us. After all, Diffusion-DPO and DCPO share the same primary objective: to obtain a diffusion model that achieves better alignment with the given prompt, the less-preferred image, and the preferred image. Our DCPO methodology indirectly uses the prompt by instead using two respectively generated captions for the two images, without any additional fine-tuning or training on image captioning. If the reviewer's concern relates to the validity of the Markov Decision Process (MDP) formulation of diffusion in the DCPO framework, we refer to [1], which demonstrates that conditioning the denoising process with $z^w$ and $z^l$ remains a valid MDP for diffusion.

When $\Delta_{\text{preferred}} - \Delta_{\text{less-preferred}}$ increases, as shown in Proof 1, the loss function decreases, indicating that the neural network $\epsilon_\theta$ is improving its ability to predict the added noise $\boldsymbol{\epsilon}$ during the denoising process. This improvement is critical because accurate noise prediction ensures that each step of the denoising process moves the noisy image closer to its clean form, resulting in higher-quality, more realistic image generation. A lower loss also sharpens the alignment between the model's output and the conditioning input (e.g., captions), as the model better leverages the provided context to guide the generation process. Furthermore, an increase in $\Delta_{\text{preferred}} - \Delta_{\text{less-preferred}}$ reflects greater confidence in distinguishing preferred images from less-preferred ones, which translates to better preference-based optimization. Ultimately, minimizing the DCPO loss not only ensures more precise noise prediction but also improves alignment and image generation performance, enabling the model to produce outputs that are both reliable and realistic.

We kindly ask the reviewer to suggest specific experiments or proofs that would address the concerns.


[1] From r to Q∗: Your Language Model is Secretly a Q-Function: https://arxiv.org/abs/2404.12358

Comment

Continuation of Proof 3:


The residual $R_\theta$ is proportional to the misalignment between $Q(\mathbf{c})$ and $\boldsymbol{\epsilon}^w$:

$$R_\theta(\mathbf{c}) \propto \frac{1}{Q(\mathbf{c})}.$$

Replacing $\mathbf{c}$ with $\mathbf{z}^w$ (higher $Q$) results in a larger proportional reduction:

$$R_\theta(\mathbf{z}^w) \propto \frac{1}{Q(\mathbf{z}^w)} \quad \text{with} \quad \Delta R_\theta \propto \Delta Q.$$

The reference model's residual $R_{\text{ref}}$ depends weakly on $Q(\mathbf{c})$, as it is fixed or less adaptable:

$$R_{\text{ref}}(\mathbf{c}) \propto \frac{1}{Q_{\text{ref}}(\mathbf{c})},$$

where $Q_{\text{ref}}(\mathbf{c})$ is less sensitive to changes in $\mathbf{c}$.

Thus, the proportional improvement in $R_\theta$ due to $\Delta Q$ is significantly larger than for $R_{\text{ref}}$.

The preferred difference term is:

$$\Delta_{\text{preferred}} = R_\theta - R_{\text{ref}}.$$

As $R_\theta$ decreases significantly more than $R_{\text{ref}}$, the gap $R_\theta - R_{\text{ref}}$ becomes larger, increasing $\Delta_{\text{preferred}}$:

$$\Delta R_\theta \gg \Delta R_{\text{ref}} \implies \Delta_{\text{preferred}} \text{ increases.}$$

The learnable model $\boldsymbol{\epsilon}_\theta$ benefits more from the improved conditioning $\mathbf{z}^w$ because of its adaptability and training dynamics. This results in a larger reduction in $R_\theta$ compared to $R_{\text{ref}}$. Mathematically, the relative rate of decrease is:

$$\text{Relative Rate} = \frac{\Delta R_\theta}{\Delta R_{\text{ref}}} \gg 1,$$

which ensures that $\Delta_{\text{preferred}}$ also increases, hence improving the optimization process in $L(\theta)$ and helping the model distinguish predictions on preferred and less-preferred image-caption pairs more effectively.
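The relative-rate claim can be illustrated with a toy calculation (all quality values and the reference model's sensitivity below are assumed for illustration; nothing here is measured from a trained model):

```python
# Toy illustration of Delta R_theta >> Delta R_ref under the 1/Q proportionality model above.
q_c, q_zw = 1.0, 4.0            # assumed conditioning qualities, with Q(z^w) > Q(c)

def r_theta(q):
    return 1.0 / q              # learnable residual: fully sensitive to conditioning quality

def r_ref(q, damping=0.05):
    # reference residual: only weakly sensitive to q (frozen weights, no adaptation)
    return 1.0 / (1.0 + damping * (q - 1.0))

delta_r_theta = r_theta(q_c) - r_theta(q_zw)    # 1.00 - 0.25  = 0.75
delta_r_ref   = r_ref(q_c)   - r_ref(q_zw)      # 1.00 - 0.87 ~= 0.13

print(delta_r_theta / delta_r_ref)              # ~5.8, i.e. well above 1 in this toy setting
```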


We hope our responses have addressed the reviewers' follow-up concerns about our motivation and methodology. All of the formal proofs in these four parts will be included in the finalized version of our paper.

Withdrawal Notice

I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.