SePPO: Semi-Policy Preference Optimization for Diffusion Alignment
We propose SePPO, a method to align diffusion models without reward models or human-annotated data. SePPO outperforms previous methods, achieving a PickScore of 21.57 on Pick-a-Pic and excels in text-to-video tasks.
Abstract
Reviews and Discussion
The paper proposes a method for aligning diffusion models to improve sample quality and diversity in generative tasks. The authors introduce an optimization algorithm that integrates an Anchor-based Adaptive Flipper. To substantiate the claims, the paper presents a series of comprehensive experiments, including quantitative evaluations against several state-of-the-art models and detailed ablation studies.
Strengths
- The paper provides thorough quantitative evaluation using metrics such as PickScore and HPSv2. The use of ablation studies is commendable, clearly delineating the contributions of individual components of the proposed model.
- The results are well-presented, with clear visualizations and comprehensive tables that facilitate an understanding of performance metrics across different models.
- The proposed method is simple and easy to follow.
Weaknesses
- The proposed method is somewhat heuristic, which may limit its contribution. I therefore suggest the authors conduct a deeper theoretical analysis of the proposed method.
- The insights from Theorem 4.1 are quite intuitive and easy to understand. I suggest the authors put Theorem 4.1 in the Appendix.
- The essential reason why randomly sampling previous checkpoints as the reference model with AAF achieves the best performance remains unclear. I suggest the authors theoretically analyze the effectiveness of the proposed method, which could significantly strengthen this work. At a minimum, the authors should analyze in which cases using the latest model as the reference is better than randomly sampling previous checkpoints.
Questions
Please see the Weaknesses.
The paper introduces Semi-Policy Preference Optimization (SePPO) for fine-tuning diffusion models in visual generation, bypassing reward models and human annotations. SePPO uses past model checkpoints to generate on-policy reference samples that replace "losing" images, and focuses on optimizing only "winning" samples. An anchor-based criterion selectively learns from these reference samples, mitigating performance drops caused by their uncertain quality. SePPO outperforms existing methods on text-to-image benchmarks and shows strong results on text-to-video benchmarks.
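To make the described workflow concrete, below is a minimal sketch of one training step under my own reading of the summary. The interface names (`add_noise`, `predict_noise`, `generate`), the uniform checkpoint sampling, the anchor test, and the surrogate objective are all illustrative assumptions, not the paper's actual implementation.

```python
import random
import torch

def denoise_loss(model, prompt, image):
    """Hypothetical per-sample noise-prediction loss, used as a proxy for how
    well a model fits a given image for a prompt. `add_noise` and
    `predict_noise` are assumed model APIs, not the paper's interface."""
    noise = torch.randn_like(image)
    t = torch.randint(0, 1000, (image.shape[0],), device=image.device)
    noisy = model.add_noise(image, noise, t)
    return ((model.predict_noise(noisy, t, prompt) - noise) ** 2).mean()

def seppo_step(model, past_checkpoints, prompt, x_win):
    # 1) Sample a previous checkpoint as the reference policy
    #    (uniform sampling is an assumption).
    ref = random.choice(past_checkpoints)

    # 2) Generate a reference sample that stands in for the "losing" image,
    #    so no human-annotated loser is needed.
    with torch.no_grad():
        x_ref = ref.generate(prompt)

    # 3) Anchor-based Adaptive Flipper (illustrative reading): use the
    #    preferred sample x_win as the anchor and check whether the reference
    #    model fits it better than the current model does.
    with torch.no_grad():
        ref_fits_anchor_better = (
            denoise_loss(ref, prompt, x_win) < denoise_loss(model, prompt, x_win)
        )

    # 4) Placeholder objective (not the paper's loss): always fit the preferred
    #    sample; additionally fit the reference sample only when the flipper
    #    deems it trustworthy.
    loss = denoise_loss(model, prompt, x_win)
    if ref_fits_anchor_better:
        loss = loss + denoise_loss(model, prompt, x_ref)
    return loss
```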
Strengths
- Practical Approach: The SePPO method offers a practical solution for preference alignment without the need for human annotation or a reward model, which significantly reduces labor costs.
- Clear Writing and Presentation: The submission is well-written and formatted, making it easy to follow and understand.
- Effective Sample Filtering: The Anchor-based Adaptive Flipper (AAF) criterion is a useful addition, as it helps to filter uncertain samples and enhances model robustness.
Weaknesses
- Limited Literature Review and Comparison: While preliminary experiments are presented for text-to-video models, the paper lacks a thorough literature review on this topic and has limited comparisons in the text-to-video experiments, making the evaluation seem somewhat incomplete. Improving the Related Work section would strengthen the context and position of SePPO.
- Table Readability: Moving the annotation explanations to the table captions could improve table readability.
- Unclear Justification for Theorem 4.1: Theorem 4.1 is key to instantiating SePPO, yet its rationale remains questionable to me. Specifically, if the reference model assigns a higher probability to the added noise than the current model does, we can only say that the reference model is better at generating this particular image; we cannot draw any conclusion about the quality of the generated image itself (a rough formalization is given below).
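A rough formalization of this concern, with notation assumed by me rather than taken from the paper:

```latex
% Let x_t be the noised latent, \epsilon the injected noise, and c the prompt.
% The questioned reading of Theorem 4.1 is that a likelihood comparison such as
\[
  p_{\theta_{\mathrm{ref}}}(\epsilon \mid x_t, c) \;>\; p_{\theta}(\epsilon \mid x_t, c)
\]
% only says that the reference model fits this particular (x_t, \epsilon, c) better
% than the current model; by itself it implies nothing about the reward or human
% preference for the decoded image.
```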
Questions
Generally speaking, methods like Diffusion-DPO require human-annotated data pairs, whereas SePPO does not. The upper bound of aligning model outputs using annotated data pairs should be higher than that of SePPO, which relies solely on the model itself. Can the authors offer an explanation for this?
This paper proposes SePPO, which leverages two techniques to improve on the previous SPIN-Diffusion method: 1) randomly selecting the reference policy from all previous checkpoints, and 2) a heuristic (anchor-based criterion) to determine whether a reference sample is likely to win or lose. The paper performs experiments on both T2I and T2V tasks to demonstrate the effectiveness of the method by comparing it with several baselines.
Strengths
- The paper performs experiments on both T2I and T2V generation tasks.
- The paper is easy to follow and understand.
Weaknesses
- Technical contributions: The proposed techniques involve 1) randomly selecting the reference policy from the checkpoint at iteration 0 (DITTO) and the latest iteration t - 1 (SPIN-Diffusion), and 2) a heuristic (anchor-based criterion) to determine whether a reference sample is likely to win or lose, with the learning loss adjusted accordingly. Thus, the technical contributions of this work are limited.
- Incorrect Definition of on-policy examples: The definitions of on-policy and off-policy learning are well established in the reinforcement learning literature [1]. Specifically, on-policy learning refers to settings in which the training examples are sampled from the current policy $\pi_\theta$ being learned. However, this work treats the reference samples drawn from $\pi_{\text{ref}}$ as "on-policy" examples (e.g., Lines 250-252), which is incorrect. In fact, both the winning examples and the reference samples are off-policy, since neither is sampled from the current policy $\pi_\theta$ (see the short sketch after this list).
- Limited Performance Improvement: According to Tables 1 and 2, the improvement of SePPO over SPIN-Diffusion is marginal in terms of PickScore and HPSv2; the improvement is only obvious under ImageReward. Additionally, SPIN-Diffusion even outperforms the proposed SePPO by a clear margin in terms of Aesthetic score. Therefore, I would recommend conducting a human evaluation to corroborate the results, as in [2].
- Evaluation protocol for video generation tasks: The metrics used in Table 3 are not sufficiently meaningful. Since the model is trained on ChronoMagic-Bench-150, I recommend reporting results following the evaluation protocol in Tables 3 & 4 of the ChronoMagic-Bench paper [4].
- Missing Citations: Please cite the related works [2] and [3], which tackle T2I and T2V model alignment by learning from reward models.
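On the off-policy point above, a minimal sketch of the distinction (the policy objects and the `generate`/`sample` calls are hypothetical):

```python
# On-policy: training examples are drawn from the policy currently being optimized.
x_on = current_policy.generate(prompt)        # sampled from pi_theta -> on-policy

# Off-policy: examples come from any other distribution, e.g. a human-preference
# dataset or a frozen earlier checkpoint used as the reference model.
x_win = preference_dataset.sample(prompt)     # fixed dataset         -> off-policy
x_ref = frozen_checkpoint.generate(prompt)    # sampled from pi_ref   -> off-policy
```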
Minor points
- Line 025: "winning or losing images" --> "winning or losing examples". Since the proposed method is not limited to image generation, please revise similar errors throughout the paper.
- I recommend not using too many subsubsections. Furthermore, avoid unnecessary line breaks when formalizing the optimization problems (e.g., Equation 5). If you are worried about the page limit, please include more qualitative examples in the main text.
- I suggest selecting a different abbreviation for your method. PPO [5] is widely recognized as a reinforcement learning algorithm that optimizes a policy against a reward signal. Since SePPO belongs to the self-play fine-tuning family and is unrelated to PPO, using this acronym may lead to confusion.
[1] Sutton, Richard S., and Andrew G. Barto. "Reinforcement Learning: An Introduction." MIT Press (2018).
[2] Li et al., "Reward Guided Latent Consistency Distillation", TMLR 2024.
[3] Li et al., "T2V-Turbo: Breaking the Quality Bottleneck of Video Consistency Model with Mixed Reward Feedback", NeurIPS 2024.
[4] Yuan et al., "ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation", NeurIPS 2024.
[5] Schulman et al., "Proximal Policy Optimization Algorithms", arXiv 2017.
Questions
Please refer to the Weakness section.
This paper proposes SePPO, a novel preference optimization method for aligning diffusion models with human preferences without requiring reward models or paired human-annotated data. The key innovations are: 1) Using previous checkpoints as reference models to generate on-policy reference samples, 2) Introducing a strategy for reference model selection that enhances policy space exploration, and 3) Developing an Anchor-based Adaptive Flipper (AAF) to assess reference sample quality. The method shows strong performance on both text-to-image and text-to-video generation tasks, outperforming previous approaches across multiple metrics.
Strengths
- The paper proposes a combination of semi-policy optimization with the AAF mechanism without requiring reward models or paired human-annotated data.
- Comprehensive Empirical Validation: The work provides comprehensive experimental validation across both text-to-image and text-to-video domains.
Weaknesses
- I find the definition of "better" (L277, in bold) confusing, and the same term in Theorem 4.1 seems to lack rigor. I think what the authors mean is "closer" to the preferred sample $x^w$, but being closer to $x^w$ does not necessarily mean better, since it depends on the metric considered. Given a reward function $R$ with $R(x^w) > R(x^l)$, whether a new sample has a higher reward depends on the reward landscape, not on how close it is to $x^w$ (see the toy example after this list).
- Given the first point, I think the main spirit of the proposed method is to fit the preferred distribution, similar to SPIN. In that sense, I am confused about why the proposed method is expected to do better, since its advantage over SPIN is not clearly discussed in the paper. For example, what does the proposed method offer that SPIN lacks?
- Lack of human evaluation.
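A toy example of the reward-landscape point above (my own illustration, not from the paper): with a narrow reward peak at the preferred sample, a point that is metrically closer to $x^w$ than $x^l$ is can still have a lower reward than $x^l$.

```latex
% Toy 1-D reward landscape (illustrative assumption):
\[
  R(x) \;=\; 2\,e^{-(x - x^w)^2 / 0.01^2} \;+\; e^{-(x - x^l)^2},
  \qquad x^w = 0,\quad x^l = 1 .
\]
% Here R(x^w) \approx 2.4 > R(x^l) \approx 1, so the pair is correctly ordered.
% The candidate x = 0.3 satisfies |x - x^w| = 0.3 < |x^l - x^w| = 1, i.e. it is
% closer to the winner than the loser is, yet R(0.3) \approx e^{-0.49} \approx 0.61 < R(x^l).
```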
Questions
See weaknesses.
I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.
We thank the reviewers for their efforts and valuable suggestions. We will improve the paper and address all issues accordingly in the next submission.