Iterative DPO with An Improvement Model for Fine-tuning Diffusion Models
Reviews and Discussion
The paper's motivation is a drawback of standard DPO methods: they are limited by their pre-collected training data.
To be specific, it first uses pre-collected paired data to train a model similar to InstructPix2Pix (the improvement model), which can be used to enhance the quality of input images.
The proposed method then iteratively trains the diffusion model and generates new "bad" samples with it, while the improvement model is used to produce the corresponding "good" data from the new model's outputs.
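As a rough sketch of my understanding of this loop (the interfaces below are illustrative stand-ins, not taken from the paper):

```python
# Illustrative sketch of the iterative DPO loop described above.
# `diffusion_model`, `improvement_model`, and `dpo_update` are hypothetical
# stand-ins for the paper's components; the actual interfaces may differ.

def iterative_dpo(diffusion_model, improvement_model, prompts, num_iterations):
    for _ in range(num_iterations):
        pairs = []
        for prompt in prompts:
            # "Bad" (losing) sample: drawn from the current diffusion model.
            losing = diffusion_model.sample(prompt)
            # "Good" (winning) sample: the InstructPix2Pix-style improvement
            # model, trained on the pre-collected paired data, refines it.
            winning = improvement_model.improve(losing, prompt)
            pairs.append((prompt, winning, losing))
        # Diffusion-DPO update on the freshly generated preference pairs.
        diffusion_model = dpo_update(diffusion_model, pairs)
    return diffusion_model
```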
Strengths
- Reasonable storytelling and clear figures.
- The method is easy to understand and implement.
Weaknesses
Validation
- The performance improvement is very minor: from Table 4, the win-rate improvement over SPIN is at most 0.9%. The paired visual comparisons presented in the figures are also not impressive.
- The validation is not sufficient: performance is only validated on Stable Diffusion 1.5. Many works, including Diffusion-DPO, report results on both Stable Diffusion 1.5 and Stable Diffusion XL. The generation quality of Stable Diffusion 1.5 is poor, so improving on it is much easier than improving on Stable Diffusion XL. This makes me doubt the method's effectiveness on better and larger models.
- The effectiveness of the improvement model is not validated: I did not find any evaluation of how much the improvement model improves the given images (apologies if I missed it). The only relevant evaluation appears to be Table 3, which does not reflect the ability of the improvement model; the improvement model even performs worse than SPIN on the test set. How can we then expect it to improve the quality of generated images?
- Many experimental details are not clearly stated. How is the win rate in Table 4 calculated? (One common protocol is sketched after this list for reference.)
- A user study, including comparisons and details, is lacking.
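For reference, one common way such win rates are computed (an assumption about standard practice, not the paper's stated protocol) is to score each prompt's pair of images with an automatic preference metric, e.g. PickScore, and report the fraction of prompts the candidate model wins:

```python
# Hypothetical win-rate computation against a baseline using a generic
# scorer (e.g., PickScore or HPS); not necessarily the paper's protocol.

def win_rate(scorer, prompts, images_ours, images_baseline):
    wins = sum(
        scorer(prompt, ours) > scorer(prompt, base)
        for prompt, ours, base in zip(prompts, images_ours, images_baseline)
    )
    return wins / len(prompts)
```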
Reasonableness
- Since all "good" images are generated with the improvement model, the generation quality is likely upper-bounded by the improvement model. I am therefore skeptical about the soundness of the proposed method.
- Generating new samples during training takes a lot of computation and time. The authors train the improvement model for 200K steps with batch size 2048, which is very costly compared to standard Diffusion-DPO (trained for only about 2K iterations at a similar batch size of 2048). This high training cost might also be the reason the method is not applied to SDXL or even larger models.
Presentation
- Most presented images are not visually appealing, and their prompts are generally very short, so they cannot demonstrate text-image alignment.
For the above reasons, I think the paper still has a relatively large gap to publication and might not meet the bar of ICLR.
Questions
I suggest the authors validate their method on larger models such as Stable Diffusion XL and Flux; otherwise I remain deeply skeptical of its effectiveness on larger models. They should demonstrate the potential of scaling up their method for practical applications.
In this paper, the authors propose a diffusion model alignment method that works with a limited amount of preference data. Specifically, an improvement model is trained to generate high-quality samples, and DPO is then applied on an extended synthetic dataset.
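For reference, the Diffusion-DPO objective that such methods typically optimize (following the formulation of Wallace et al., 2023; the notation here is assumed, not taken from this paper) is

$$
\mathcal{L}(\theta) = -\,\mathbb{E}\,\log\sigma\Big(-\beta T\,\omega(\lambda_t)\big[\big(\|\epsilon^w-\epsilon_\theta(x_t^w,t)\|_2^2-\|\epsilon^w-\epsilon_{\mathrm{ref}}(x_t^w,t)\|_2^2\big)-\big(\|\epsilon^l-\epsilon_\theta(x_t^l,t)\|_2^2-\|\epsilon^l-\epsilon_{\mathrm{ref}}(x_t^l,t)\|_2^2\big)\big]\Big),
$$

where the expectation is over winning/losing pairs $(x_0^w, x_0^l)$, timesteps $t$, and the corresponding noised latents $x_t^w, x_t^l$; the extended synthetic pairs enter the optimization through this objective.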
Strengths
The method is easy to understand and reasonable, and the visualizations are good.
Weaknesses
- The authors should provide precise definitions and notation for each concept introduced.
- The proposed method introduces a semi-supervised learning framework to utilize a preference dataset for training a "high-quality" model, followed by fine-tuning the diffusion model using generated high- and low-quality data pairs. However, this approach appears to complicate the alignment process for diffusion models. A more straightforward solution would be to train a reward model directly, enabling reinforcement learning without the need to construct a high-quality model and apply DPO in such a rudimentary way.
- Additionally, the performance gains achieved by the proposed method appear limited. As shown in Table 1, the results are only comparable to those of SPIN.
- The authors should refine the contributions section, as the current version lacks clarity and specificity.
- The authors should consider including a comparison with alignment methods based explicitly on a reward model [1].

[1] Zhang, Y., Tzeng, E., Du, Y., & Kislyuk, D. (2024). Large-scale reinforcement learning for diffusion models. arXiv preprint arXiv:2401.12244.
Questions
Please refer to the weaknesses section. Moreover, the authors should explain in detail why they chose such a complicated alignment process instead of directly training a reward model and applying RL.
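Concretely, the reward-model alternative referred to above (e.g., DDPO-style policy-gradient fine-tuning, as in [1]; the formulation below is borrowed from that line of work, not from this paper) treats the denoising chain as an MDP and optimizes roughly

$$
\nabla_\theta J(\theta) \;=\; \mathbb{E}_{c,\; x_{0:T}\sim p_\theta}\Big[\, r(x_0, c)\, \sum_{t=1}^{T} \nabla_\theta \log p_\theta(x_{t-1}\mid x_t, c) \Big],
$$

with $r$ a learned reward model, so no improvement model or synthetic preference pairs would be needed.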
This paper proposes an online DPO method that iteratively improves the preference alignment of the diffusion model and goes beyond the limitation of a fixed-size preference dataset in optimization. The preference pairs are generated from the original diffusion model and a preference improvement model trained on the offline preference dataset. Experiments show the effectiveness of the proposed method.
Strengths
- This paper proposes an online DPO method with a preference improvement model to overcome the size limitation of the preference dataset in optimization.
- Multi-task learning objectives are proposed for the preference improvement model to ensure quality and diversity.
- Experiments show the effectiveness of the proposed method.
Weaknesses
- The main idea of this work is to provide online positive preference samples by constructing a preference improvement model that maps the outputs of the original diffusion model to high-quality ones. It still operates within the static Diffusion-DPO framework without dynamic adaptation, which is somewhat less convincing.
- It is unclear whether the preference improvement model needs to evolve across iterations: a fixed improvement model may struggle to adapt to the shifted distribution of the evolving diffusion model, and the positive samples it subsequently provides may be biased away from the original user preference. Additionally, the fixed improvement model still sets the upper bound for the optimized diffusion model; only the pool of optimization samples is enlarged.
- It is recommended to provide per-iteration results to monitor the diffusion model's behavior.
- The role of guidance weights in producing positive samples is not well illustrated with corresponding qualitative or quantitative results. In short, we have no insight into the behavior of the improvement model with respect to its multi-objective learning. Validation experiments should be provided.
Questions
Is the diffusion model sensitive to the CFG weights of positive samples in preference optimization?
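For reference, the guidance weight $w$ in question enters through the standard classifier-free guidance combination (a standard formulation, not specific to this paper):

$$
\hat{\epsilon}_\theta(x_t, c) \;=\; \epsilon_\theta(x_t, \varnothing) \;+\; w\,\big(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\big),
$$

so a larger $w$ strengthens the conditioning signal (typically improving prompt adherence at some cost to diversity), which is why the sensitivity of preference optimization to the CFG weight of the positive samples is worth reporting.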
This paper introduces an approach for online iterative optimization of diffusion models without requiring extra annotation of the online data. It uses images generated by the improved models as winning samples to keep improving the base model over multiple iterations.
Strengths
The writing is relatively clear, and the investigation is fairly thorough.
Weaknesses
The idea is unreasonable: using the generated images from the improved models as winning samples is not a reasonable approach. Due to the instability of the generative model, even with the same seed, it is difficult to ensure that the images generated by the improved models are consistently better in quality than those from the base model. This introduces a significant amount of noisy data, which can severely affect the quality of model training. Furthermore, the method involves multiple iterations of training, which exacerbates this issue.
Lack of novelty: The paper merely proposes an iterative DPO method without providing a well-thought-out solution to the core problem, making its novelty and contributions quite limited.
Unreasonable experiments: Firstly, a considerable number of the visual examples presented in the paper do not demonstrate the advantages of the proposed algorithm. The generated images do not align well with the prompts, and the improvements over the baseline model seem quite limited. Does this suggest that the method’s effectiveness is rather underwhelming? To verify the validity of this approach, I think it is necessary to conduct experimental analysis on the number of iterations.
Questions
You need to provide a detailed explanation addressing the issues raised in Weakness 1, specifically whether this situation was encountered during implementation.
Also, conduct experiments on the number of iterations to verify the reasonableness of the iterative training process.
Details of Ethics Concerns
This paper has no ethical concerns.
I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.