Self-Play Fine-tuning of Diffusion Models for Text-to-image Generation
Abstract
Reviews and Discussion
The paper introduces a novel method called SPIN-Diffusion for fine-tuning text-to-image diffusion models. SPIN-Diffusion uses a self-play mechanism where the model competes against its earlier versions to iteratively improve its performance. This approach eliminates the need for human preference data, which is a significant requirement for traditional reinforcement learning-based fine-tuning methods. Experiments on various datasets demonstrate that SPIN-Diffusion outperforms existing supervised fine-tuning methods and reinforcement learning-based approaches, achieving higher human preference alignment and visual appeal with less data.
Strengths
S1: The paper presents an innovative self-play fine-tuning method that does not rely on human preference data, addressing a significant limitation in current fine-tuning approaches.
S2: Extensive experiments are conducted, showing that SPIN-Diffusion outperforms both supervised fine-tuning and reinforcement learning-based methods in terms of human preference alignment and visual appeal.
S3: The theoretical analysis provides a strong foundation for the proposed method, demonstrating its convergence and superiority over traditional supervised fine-tuning methods.
S4: The paper effectively communicates the technical challenges and solutions, making the methodology accessible to readers.
Weaknesses
W1: The paper lacks a comparison with traditional fine-tuning methods for diffusion models, e.g., LoRA.
W2: The computational overhead of the self-play mechanism is high, requiring 5-10 times more training time compared to baselines, which might limit its practical application.
W3: The motivation of this method is unclear to me. Clarifying it would paint a more holistic picture of the problem and solution.
W4: The paper assumes that the data distribution can be adequately represented by the parameterized family, which may not hold in all practical scenarios.
W5: The evaluation is primarily focused on a single dataset (Pick-a-Pic), and additional benchmarks could strengthen the generalizability of the results.
W6: The font size of Figure 1 is too small.
Questions
- Why should fine-tuning DMs use RL methods? Could you please discuss the advantages of fine-tuning DMs with RL compared to traditional methods?
- As the authors claimed, "In many datasets including the community-sourced ones featuring custom content, it is often the case to have only one image associated with each prompt. This makes RL fine-tuning infeasible." I do not understand why this approach is considered infeasible. From my perspective, one image and its prompt can be considered as the observation of agents. The other processes are similar to the standard RL paradigm. Therefore, it would be great to discuss this further.
Limitations
Please refer to weaknesses.
Q1: The paper lacks a comparison with traditional fine-tuning methods for diffusion models, e.g., LoRA.
A1: While LoRA is a parameter-efficient fine-tuning method that focuses on reducing trainable parameters under resource constraints, it is orthogonal to SPIN-Diffusion, which uses a self-play mechanism for fine-tuning. Following your suggestion, we provide SFT (LoRA) fine-tuning results below. We can see that full fine-tuning generally surpasses LoRA fine-tuning.
| Method | HPS | Aesthetic | ImageReward | PickScore | Average |
|---|---|---|---|---|---|
| SFT (full) | 0.2749 | 5.9451 | 1.1051 | 21.4542 | 7.1948 |
| SFT (LoRA) | 0.2745 | 5.8573 | 1.1393 | 21.4121 | 7.1708 |
We will add this additional experimental result to the revision.
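For completeness, below is a minimal sketch of how an SFT (LoRA) baseline can be set up with the HuggingFace peft and diffusers libraries; the rank, target modules, and model path are illustrative assumptions rather than our exact experimental configuration.

```python
import torch
from diffusers import UNet2DConditionModel
from peft import LoraConfig, get_peft_model

# Load the SD-1.5 denoising UNet (the module that is fine-tuned).
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet", torch_dtype=torch.float32
)

# Inject low-rank adapters into the attention projections only;
# all original UNet weights stay frozen.
lora_config = LoraConfig(
    r=8,
    lora_alpha=8,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
)
unet = get_peft_model(unet, lora_config)
unet.print_trainable_parameters()  # typically well under 1% of the full UNet
```

This adapter-based setup corresponds to the SFT (LoRA) row in the table above, while the SFT (full) row fine-tunes all UNet parameters.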
Q2: The computational overhead of the self-play mechanism is high
A2: As this is a common concern raised by all reviewers, we have provided a general response regarding the sampling overhead. In summary, we have implemented some of the “advanced sampling acceleration techniques” and reduced the sampling overhead by 83%.
Q3: The motivation of this method is unclear to me
A3: Thank you for your comments. Here is the motivation for our work. The standard SFT method for diffusion models suffers from low alignment with human preferences and low data efficiency for two main reasons: (1) it does not directly optimize for alignment with human preferences, and (2) only one round of training can be performed. To address these limitations, prior works (Fan et al., 2023; Black et al., 2023; Wallace et al., 2023) proposed RL fine-tuning (RLHF) to directly align the diffusion model with human preferences. However, RL fine-tuning methods also have limitations: they either require an external reward function trained on additional data (Fan et al., 2023; Black et al., 2023) or rely on expensive human-annotated winner/loser paired images for each prompt (Wallace et al., 2023). To overcome the drawbacks of both SFT and RLHF, we propose SPIN-Diffusion, which is based on a self-play mechanism. Compared with SFT, our method is more data-efficient: it repeatedly uses the prompts from the SFT dataset to improve the model through self-play. Compared with RLHF methods, our method needs neither external reward models nor expensive human-annotated winner/loser pairs. We will highlight the motivation of our work in the revision.
Q4: The paper assumes that the data distribution can be adequately represented by the parameterized family, which may not hold in all practical scenarios
A4: Thank you for your suggestion. It is indeed a common assumption in diffusion model algorithms that the parameterized family is expressive enough. For example, DDPO (Black et al., 2023), DPOK (Fan et al., 2023), and Diffusion-DPO are all built upon the assumption of an expressive reward model that captures data preferences accurately. To our knowledge, our work provides the first convergence guarantee along this line of research. While this assumption is introduced solely for theoretical analysis, we believe that the expressiveness of large neural networks generally satisfies it in practice.
Q5: The evaluation is primarily focused on a single dataset
A5: We believe this is a misunderstanding. Our evaluation is in fact performed on three benchmarks. As stated in Section 5.1, "We use the Pick-a-Pic test set, PartiPrompts (Yu et al., 2022) and HPSv2 (Wu et al., 2023) as our evaluation benchmarks." Due to the space limit of the main text, additional results for PartiPrompts and HPSv2 are provided in Section B.3 "Evaluation on Other Benchmarks" of our appendix.
Q6: The font size of Figure 1 is too small
A6: Thank you for your feedback. We have increased the font size in Figure 1 for better readability in our revision, which is now available in the uploaded PDF.
Q7: Could you please discuss the advantages of fine-tuning DMs with RL compared to traditional methods?
A7: We do not view our method as an RL fine-tuning method. Here we highlight the advantages of self-play fine-tuning for diffusion models over both SFT and RL fine-tuning. Standard SFT of diffusion models maximizes the log-likelihood of the training data. However, it cannot directly optimize the diffusion model's performance in terms of quality indicators, as log-likelihood is not directly related to any human-perceived quality indicator. RL fine-tuning, on the other hand, overcomes this limitation by maximizing an explicit reward function or by using human-annotated winner/loser image pairs as feedback, thereby optimizing the model to generate images with higher rewards (e.g., aesthetic score, image quality, etc.). In contrast, self-play fine-tuning uses an implicitly defined reward function integrated within the diffusion model's training process, allowing it to fully utilize the SFT dataset without requiring an additional reward model or human feedback.
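To make "implicitly defined reward function" concrete: as in DPO-style objectives, the reward is parameterized by the model itself as a log-density ratio against a reference model, which in the self-play setting is the previous-iteration checkpoint $\theta_k$. Over the denoising trajectory this takes roughly the following form (a schematic in standard notation; see Section 3 of the paper for the exact formulation):

```latex
r_{\theta}(c, x_{0:T}) \;=\; \beta \sum_{t=1}^{T}
\log \frac{p_{\theta}\left(x_{t-1} \mid x_{t}, c\right)}
          {p_{\theta_{k}}\left(x_{t-1} \mid x_{t}, c\right)}.
```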
Q8: I do not understand why this approach is considered infeasible
A8: We would like to clarify that "RL fine-tuning is infeasible" because of the lack of reward information or human feedback. As you mentioned, the prompt and the image can be considered observations, but they do not directly provide a reward. As previously explained, RL fine-tuning methods rely on a reward function to optimize performance or on human-annotated winner/loser image pairs as feedback. In contrast, SPIN/SPIN-Diffusion uses a self-play mechanism that only requires a high-quality SFT dataset, removing the need to train a reward model that accurately reflects human preferences or to collect human-labeled winner/loser pairs. Theoretically, we prove that SPIN-Diffusion performs distribution-level matching to the target data distribution.
Thanks for your responses! I have raised my score.
Thank you for your support!
This paper introduces a method called self-play fine-tuning for diffusion models (SPIN-Diffusion), where the model engages in a competitive process with its earlier versions, driving iterative self-improvement. This method presents an alternative to conventional supervised fine-tuning and RL strategies. Experimental results on the Pick-a-Pic dataset show that SPIN-Diffusion outperforms previous supervised fine-tuning in human preference alignment and visual appeal. Key contributions of this work include the introduction of SPIN-Diffusion and its empirical validation demonstrating superior performance compared to existing fine-tuning methods.
Strengths
- The paper applies SPIN to diffusion models for the first time, to my knowledge.
- Theoretical analysis shows that the proposed approximate SPIN loss is an upper bound of the exact SPIN loss.
Weaknesses
- The main contribution of this work is an approximate SPIN loss compared to the previously proposed exact SPIN loss. The major modification is moving the average over sampling steps outside the loss function, resulting in an upper bound. I kindly argue this improvement is straightforward when transferring SPIN from LLMs to diffusion models, without much insight into the diffusion model itself.
- Since the approximate SPIN loss is proposed for data/memory/time efficiency, I did not see any computing efficiency comparisons between the approximate and exact SPIN losses in the main paper or supplements. As noted by the authors, "it requires additional sampling overhead, approximately 10 times of the training time when using traditional DDIM sampling." This additional computing cost significantly limits the method in practice.
- Assumption 4.1 assumes the loss is monotonically decreasing and convex. It seems that the SPIN loss proposed in Equations 3.8 and 3.9 does not meet this strong assumption, which makes the analysis results less convincing.
- Quantitative performance improvement (Table 1) is mild.
- It's hard to tell which method generates the best pictures in qualitative comparison (Figure 2).
Questions
- The "(ours)" notations in the tables are a bit confusing and may mislead the audience into considering them methods proposed in this work.
After rebuttal: Some of my initial concerns have been addressed, while I still hold concerns about the method insight, time efficiency, and quantitative improvements. By comprehensively considering the cons and pros of this paper, my final rating would be borderline accept.
Limitations
N/A
Q1: The main contribution of this work is an approximate SPIN loss compared to the previously proposed exact SPIN loss. The major modification is moving the average over sampling steps outside the loss function, resulting in an upper bound. I kindly argue this improvement is straightforward when transferring SPIN from LLMs to diffusion models, without much insight into the diffusion model itself.
A1: We appreciate your feedback, but we would like to emphasize that the derivation of the SPIN loss for diffusion models is highly non-trivial. One novel aspect of our approach lies in adapting the SPIN objective for diffusion models by considering the main player (reward function) across the full trajectory (as in Equation 3.2), rather than focusing solely on the final state as done in prior work (e.g., Fan et al., 2023; Black et al., 2023; Wallace et al., 2023). This modification allows us to formulate an exact objective function up to Equation (3.8), which departs from Wallace et al. (2023), who only consider the end state for rewards or preferences.
Moreover, there seems to be a misunderstanding about our approximate training objective. The exact computation of Equation (3.8) is impractical due to two primary constraints: first, the large trajectory length $T$ would require an impractical amount of GPU memory when the loss is summed over all $T$ timesteps; second, the required samples from the reverse process are not readily accessible. Consequently, our approximation strategy adopts an upper bound and uses samples from the forward process as practical surrogates for the unavailable reverse-process samples.
This approximation is directly motivated by the nature of diffusion models, which inherently decouple operations on a per-time-step basis. We provide a theoretical justification for our approximation method in Section 4, ensuring that our approximations are both practical and theoretically sound.
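To illustrate the resulting training procedure, below is a minimal sketch of one SPIN-Diffusion training step under this approximation, reflecting our reading of Section 3.4: real and self-generated images are noised via the forward process at a shared random timestep, per-sample denoising errors are computed under the current and previous-iteration models, and the margin is passed through the logistic loss. Here `encoder_states` stands for the text-prompt embeddings and `scheduler` is assumed to be a DDPM-style scheduler exposing `add_noise`; all names are illustrative, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def spin_diffusion_step(unet, unet_prev, scheduler, x0_real, x0_gen,
                        encoder_states, beta_T):
    """One approximate SPIN-Diffusion step (sketch, not the official code)."""
    bsz = x0_real.shape[0]
    # Forward-process surrogate: a shared, uniformly sampled timestep per pair.
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (bsz,), device=x0_real.device)
    eps_real, eps_gen = torch.randn_like(x0_real), torch.randn_like(x0_gen)
    xt_real = scheduler.add_noise(x0_real, eps_real, t)
    xt_gen = scheduler.add_noise(x0_gen, eps_gen, t)

    def err(model, xt, eps):
        # Per-sample epsilon-prediction (score matching) error.
        pred = model(xt, t, encoder_hidden_states=encoder_states).sample
        return F.mse_loss(pred, eps, reduction="none").mean(dim=(1, 2, 3))

    e_real, e_gen = err(unet, xt_real, eps_real), err(unet, xt_gen, eps_gen)
    with torch.no_grad():  # opponent: frozen checkpoint from the last iteration
        r_real = err(unet_prev, xt_real, eps_real)
        r_gen = err(unet_prev, xt_gen, eps_gen)

    # The current model should denoise real data better than the opponent does,
    # and its own past generations comparatively worse.
    margin = (e_real - r_real) - (e_gen - r_gen)
    return -F.logsigmoid(-beta_T * margin).mean()  # logistic loss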
Thank you again for your suggestion. In the revision, we will highlight the insights from our derivation related to diffusion models.
Q2: Since the approximate SPIN loss is proposed for data/memory/time efficiency, I did not see any computing efficiency comparisons between the approximate and exact SPIN losses in the main paper or supplements. As noted by the authors, "it requires additional sampling overhead, approximately 10 times of the training time when using traditional DDIM sampling." This additional computing cost significantly limits the method in practice.
A2: In terms of training time, our approximate SPIN loss takes approximately 2 times the training time of the SFT loss. The additional sampling overhead, as discussed in our limitations, arises primarily from the generation of synthetic data. This inference process is separate from training and is flexible, parallelizable, and distributable across various computing engines, from large GPU clusters to standard home laptops. To address the sampling overhead, we have explored advanced sampling algorithms and optimizations at both the software and hardware levels. These efforts have successfully reduced the sampling time by 83%. Further details can be found in the general response to all reviewers.
Q3: Assumption 4.1 assumes the loss is monotonically decreasing and convex. It seems that the SPIN loss proposed in Equations 3.8 and 3.9 does not meet this strong assumption, which makes the analysis results less convincing.
A3: We believe this is a misunderstanding. Assumption 4.1 is made on the loss function $\ell$, rather than on the SPIN-Diffusion loss in Equations 3.8 and 3.9. More specifically, we choose $\ell$ to be the logistic loss in the SPIN-Diffusion experiments, which is monotonically decreasing and convex. Other losses such as the correlation loss and the hinge loss also satisfy this assumption. Our theoretical analysis therefore indeed holds for SPIN-Diffusion. We will clarify this in the revision.
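For completeness, the logistic loss and its first two derivatives make both properties immediate:

```latex
\ell(x) = \log\left(1 + e^{-x}\right), \qquad
\ell'(x) = -\frac{e^{-x}}{1 + e^{-x}} < 0, \qquad
\ell''(x) = \frac{e^{-x}}{\left(1 + e^{-x}\right)^{2}} > 0,
```

so $\ell$ is monotonically decreasing and convex, as required by Assumption 4.1.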
Q4: Quantitative performance improvement (Table 1) is mild.
A4: There might be a misinterpretation regarding the performance. SPIN-Diffusion in fact significantly outperforms the baselines SD-1.5 and Diffusion-DPO, with the aesthetic score improving from 5.7691 to 6.2481 and the PickScore reaching 22.0024, even exceeding the winner images from the Pick-a-Pic test set, which have an aesthetic score of 5.985 and a PickScore of 21.87.
The improvement is more obvious in Figure 1 and Tables 3, 4, where the winning rate over SD-1.5 reaches 91.6%.
Q5: It's hard to tell which method generates the best pictures in qualitative comparison (Figure 2).
A5: Images generated by SPIN-Diffusion iterations are generally more aesthetically pleasing. To offer a more objective comparison, we assessed the aesthetic scores of the 3 images in Figure 2. The results indicate that SPIN-Diffusion consistently outperforms SD-1.5, SFT, and Diffusion-DPO in visual quality:
| | SD-1.5 | SFT | Diffusion-DPO | SPIN-Diffusion Iter1 | Iter2 | Iter3 |
|---|---|---|---|---|---|---|
| Boy | 6.171 | 6.096 | 6.072 | 6.158 | 6.407 | 6.831 |
| Castle | 6.180 | 6.346 | 5.995 | 6.886 | 6.993 | 6.940 |
| Eagle | 4.927 | 5.428 | 5.289 | 5.601 | 6.103 | 6.189 |
For additional qualitative comparisons, please refer to Figures 8, 9, 10, and 11 in the appendix.
Q6: The "(ours)" notations in the tables are a bit confusing and may mislead the audience into considering them methods proposed in this work.
A6: Thank you for your suggestion. We've replaced "(ours)" in tables and figures with “(reproduced)” to clearly distinguish between existing checkpoints and our reproductions.
Thank you for the response! The authors addressed my concerns in Q3, Q5, Q6.
I still hold my concerns for Q1, Q2, Q4. For Q2, it's fair that we only consider the method presented in the initial submission, whose efficiency is 146h vs. 20h (SFT). It's a trade-off, whether the performance improvement deserves the much longer training time.
For Q4, if we compare SFT vs SPIN-Diffusion, most of the metrics are close.
Thank you for your reply. We're glad that we have resolved your concerns in Q3, Q5 and Q6. We would like to further clarify and address your remaining concerns in Q4, Q2 and Q1.
For Q4: We would like to clarify that SPIN-Diffusion significantly outperforms SFT on 3 out of 4 metrics, rather than being close to it. Both models use the same base model, SD-1.5, and are trained on the same dataset. SPIN-Diffusion achieves improvements over SD-1.5 that are approximately three times those of SFT in the Aesthetic and PickScore metrics, while also maintaining a lead in HPS and ImageReward.
| | HPS | Aesthetic | ImageReward | PickScore |
|---|---|---|---|---|
| SD-1.5 | 0.2699 | 5.7691 | 0.8159 | 21.1983 |
| SFT | 0.2749 (+0.0050) | 5.9451 (+0.1760) | 1.1051 (+0.2892) | 21.4542 (+0.2559) |
| Diffusion-DPO | 0.2753 (+0.0054 / +8%) | 5.8918 (+0.1227 / -30.3%) | 1.0495 (+0.2336 / -19.2%) | 21.8866 (+0.6883 / +168%) |
| SPIN-Diffusion | 0.2759 (+0.0060 / +20%) | 6.2481 (+0.4790 / +172%) | 1.1239 (+0.3080 / +7%) | 22.0024 (+0.8041 / +214%) |

Note: The first value in parentheses is the absolute improvement over SD-1.5; the percentage indicates the change relative to SFT's improvement over SD-1.5.
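As a worked example for the PickScore column:

```latex
\frac{\Delta_{\text{SPIN}}}{\Delta_{\text{SFT}}}
= \frac{22.0024 - 21.1983}{21.4542 - 21.1983}
= \frac{0.8041}{0.2559} \approx 3.14,
```

i.e., SPIN-Diffusion's gain over SD-1.5 is roughly 3.14 times that of SFT, a +214% relative increase.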
Moreover, according to Kirstain et al. (2023), PickScore's ratings strongly correlate with real users' Elo ratings (0.790 ± 0.054), while ImageReward (0.492 ± 0.086) and HPS (0.670 ± 0.071) correlate less. Therefore, the 214% increase in PickScore gains achieved by SPIN-Diffusion over those of SFT should be considered very significant.
In addition, the other closely related method, Diffusion-DPO, cannot outperform SFT consistently, even though it uses more data. This further suggests the advantage of SPIN-Diffusion.
For Q2: We understand your concern on the performance-efficiency trade-off. However, we want to clarify that the computational time reduction we presented during the rebuttal is solely due to implementation improvements in the stable diffusion (SD) pipeline, leveraging the latest tools from updated versions of PyTorch and the HuggingFace diffusers library. These improvements do not need to change our method at all. Below, we provide a summary of the improvements achievable through basic implementation enhancements:
| | Sampling Time (Per 2048 Images) | Training Time (Per 2048 Images) | Total Time | Data |
|---|---|---|---|---|
| SPIN-Diffusion (Original SD Pipeline) | 342s | 73s | 146h | SFT |
| SPIN-Diffusion (Improved SD Pipeline) | 56s | 73s | 38h | SFT |
| SFT | - | 37s | 20h | SFT |
As you can see, the total computational time is less than twice that of SFT.
In addition, we believe whether the extra time for performance improvement is worthwhile depends on the user and the specific application scenarios. For users with ample computing resources, such as many AI companies, the extra time is usually not the biggest concern, because it can be mitigated by using more GPUs. GPT-4, for example, required millions of GPU hours to achieve industry-leading performance.
For Q1: Regarding your comment that "I kindly argue this improvement is straightforward when transferring SPIN from LLM to diffusion model, without much insights into diffusion model itself," we agree that the SPIN-Diffusion loss can be interpreted in this way; however, there are also significant insights from the diffusion model itself. In terms of derivation, we start with the SPIN objective for diffusion models by considering the main player (reward function) across the full trajectory, as in Equation (3.2). We formulate an exact objective function up to Equation (3.8), and then develop a practical approximation strategy specifically tailored to diffusion models, which is not present in SPIN for LLMs. In terms of theoretical analysis, Theorem 4.2 in our paper suggests that the optimization process of the approximate loss ends when the score matching loss reaches optimality. In contrast, the corresponding analysis in SPIN for LLMs only suggests that the optimization process of the exact loss ends when the model distribution matches the data distribution. This distinction is quite significant because it offers a precise and measurable criterion for convergence within the framework of diffusion models.
Thank you for the further response! My concerns of Q1, Q2, and Q4 still remain. However, by considering the cons and pros of this paper again, I would lift my final rating to 5.
Thank you for increasing the rating. We are committed to incorporating all the clarifications, discussions, and additional experimental results from the rebuttal into the final version to fully address your concerns.
- This paper introduces SPIN-Diffusion, a new self-play fine-tuning technique for diffusion models that improves iteratively by competing with previous versions.
- They show that SPIN-Diffusion outperforms existing supervised and reinforcement learning fine-tuning methods in aligning with human preferences and enhancing visual appeal starting from the first iteration.
- This method is more data-efficient, achieving superior results with fewer data, which is beneficial in cases where there are limited images available per text prompt.
- The paper uses a competitive setup between two iterations of the model to generate and evaluate images, considering all generated images in the evaluation process, not just the final product.
- Their experiments on the "Pick-a-Pic" dataset demonstrate that SPIN-Diffusion consistently surpasses other methods in multiple performance metrics through successive iterations.
- The approach is cost-effective and offers a practical solution for improving diffusion models, particularly useful in environments with restricted data access.
Strengths
- The extension of SPIN to diffusion models is well-formulated for the problem at hand.
- Theoretical explanations are detailed and well-supported.
- The design of an approximate version of the objective function, considering computational efficiency, appears practical.
- The method shows practical utility by outperforming previous methods that required "loser" samples within a few iterations.
Weaknesses
- The sampling overhead is significant, requiring 5-10 times more training time.
- There is insufficient explanation regarding the assignment of the hyperparameter $\beta$ and its variation across iterations.
- If a stronger approximation is applied and the sampling overhead is reduced by focusing on trajectories rather than all time steps, it might be possible to compare improvements with significantly reduced training time. Showing improvements within a setup that grants no more than twice the training time compared to SFT could have demonstrated the efficacy of SPIN-Diffusion more clearly.
Questions
- I would like to see a practical comparison of computation time in GPU hours between SPIN-Diffusion and other methods.
Limitations
- The sampling overhead is significant, requiring 5-10 times more training time.
Q1: The sampling overhead is significant, requiring 5-10 times more training time.
A1: Thank you for highlighting this concern. Since this is a common concern among all reviewers, we have worked on practical solutions to the sampling overhead problem during the rebuttal period and listed the results in the general response. In summary, the sampling time is reduced from the original 43 hours to 7 hours per iteration. Further improvements are possible with more software/hardware/algorithm-level designs.
Q2: There is insufficient explanation regarding the assignment of the hyperparameter $\beta$ and its variation across iterations.
A2: Thank you for your insightful comment regarding the hyperparameter $\beta$. Due to space constraints, we deferred the detailed discussion of hyperparameters to Section B.1 of the appendix. To summarize:
- $\beta$ values: 2000 for the first iteration; 5000 for the second and third iterations.
- Selection process: we conducted a grid search over {2000, 5000, 10000} to determine these values.
- Rationale: our experiments revealed that later iterations typically benefit from more conservative updates, hence the larger $\beta$ value (5000) for the second and third iterations.
Q3: If a stronger approximation is applied and the sampling overhead is reduced by focusing on trajectories rather than all time steps, it might be possible to compare improvements with significantly reduced training time. Showing improvements within a setup that grants no more than twice the training time compared to SFT could have demonstrated the efficacy of SPIN-Diffusion more clearly.
A3: Thank you for your suggestion. We would like to clarify that the current training time of our method is already no more than twice that of SFT. In Section 3.4, in addition to formulating the problem with respect to uniformly sampled timesteps $t$, we further approximate the reverse-process timesteps in the expectation by forward-process timesteps. This allows the application of the same training mechanism as SFT: the inputs to the model are two images (real vs. generated), and a batch consists of noisy samples $x_t$ at randomly sampled timesteps $t$. In addition, in (3.1) and (3.2), when we formulate the problem, we have already used a trajectory-wise characterization of the IPM and the reward function. The trajectory-wise derivation and the approximation in Section 3.4 together ensure that the training time of our algorithm is about twice that of SFT.
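For concreteness, the forward-process surrogate samples can be drawn in closed form (standard DDPM notation, stated here for the reader's convenience):

```latex
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I), \qquad t \sim \mathrm{Unif}\{1, \dots, T\},
```

whereas reverse-process samples would each require a full generation pass.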
In addition, as we mentioned before, we have managed to reduce the sampling time by 83% during our rebuttal, and now the sampling time is less than the training time.
Q4: I would like to see a practical comparison of computation time in GPU hours between SPIN-Diffusion and other methods.
A4: Following your suggestion, we have provided detailed statistics on the overhead and dataset requirements during each stage of training/sampling. The results were all obtained using 8 A100 GPUs (80 GB memory). From the table below, we can see that by applying fast sampling acceleration techniques, as detailed in our general response, we can control the total overhead of our algorithm within a reasonable amount.
| Method | Sampling Time (Per 2048 Images) | Training Time (Per 2048 Images) | Sampling Time (Iter 1) | Training Time (Iter 1) | Sampling Time (Iter 2) | Training Time (Iter 2) | Sampling Time (Iter 3) | Training Time (Iter 3) | Total |
|---|---|---|---|---|---|---|---|---|---|
| SPIN-Diffusion (Initial Submission) | 342s | 73s | 43h | 1h | 43h | 10h | 43h | 6h | 146h |
| SPIN-Diffusion (Fast Sampling) | 56s | 73s | 7h | 1h | 7h | 10h | 7h | 6h | 38h |
| SFT | - | 37s | - | 20h | - | - | - | - | 20h |
Note: The training times listed depend on the number of steps trained during each iteration, which are selected by the validation results.
I appreciate the authors' efforts. My concerns seem to have been addressed. I intend to maintain my current score.
Thank you for your feedback; we appreciate your continued support!
We sincerely thank the reviewers for the constructive feedback! To address some common concerns, we summarize the improvements on sampling overhead that we have done during rebuttal period as follows:
Sampling Overhead: By using batching, torch compilation, and DPMSolver, we reduced the sampling time from 43 hours to 7 hours, achieving an 83% reduction in sampling time.
The improvements are summarized in the table below:
| | Sampling Time Per 2048 Images | Training Time Per 2048 Images | Sampling Time Per Iteration |
|---|---|---|---|
| Initial Submission | 342s | 73s | 43h |
| Revision (Batch + Precompile) | 136s | 73s | 17h |
| Revision (Batch + Precompile + DPMSolver) | 56s | 73s | 7h |
These results were obtained using a machine with 8 × A100 GPUs (80G memory per GPU), with samples distributed to the GPUs via data parallelism. The specific optimizations are explained as follows:
- Batching: Allows the diffusion score network’s input to be a batch of samples instead of a single sample. In our revision, we use a batch size of 64.
- Torch Compile [1]: A feature offered by PyTorch 2.0 that saves time during repeated inference by precompiling the code for efficient execution.
- DPMSolver [2,3]: A high-order diffusion ODE solver introduced in recent research. From our experiments, DPMSolver with 20 steps of reverse sampling exceeds the performance of PNDMSolver (default Stable Diffusion Scheduler) with 50 steps.
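Combining the three optimizations above in the HuggingFace diffusers API looks roughly as follows; the model path and batch size are placeholders, not our exact setup.

```python
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# DPMSolver: 20 reverse steps instead of the default scheduler's 50.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

# Torch compile (PyTorch 2.0+): precompile the UNet for repeated inference.
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

# Batching: denoise 64 prompts per forward pass instead of one at a time.
prompts = ["a photo of an astronaut riding a horse"] * 64  # placeholder batch
images = pipe(prompts, num_inference_steps=20).images
```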
There are other techniques that could potentially further reduce sampling overhead. At the software level, approaches such as memory-efficient attention backends, Nvidia TensorRT, and DeepSpeed Inference can be explored. At the algorithm level, methods like UniPC, EDM, and DEIS offer promising improvements. These techniques are orthogonal to our efforts in improving the performance of fine-tuning diffusion models, and we therefore leave them as future work.
[1] Von Platen et al. Diffusers: State-of-the-art diffusion models.
[2] Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., & Zhu, J. (2022). Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps.
[3] Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., & Zhu, J. (2022). Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models.
The paper proposes an approach for fine-tuning image diffusion models, based on the SPIN (self-play) approach previously proposed for language models. Modifying the method for diffusion requires non-trivial technical innovations. The method is evaluated on several datasets and is shown to outperform alternative fine-tuning methods like SFT and DPO.
After considering the reviews, rebuttals, discussion, and the paper itself, here is a brief list of key pros and cons.
Pros:
- Interesting technical contribution
- Good empirical results on several datasets according to "in-silico" metrics
Cons:
- The method is substantially slower than other fine-tuning approaches -> this was largely addressed in the rebuttal/discussion
- Image quality improvements do not appear to be breathtaking, neither qualitatively nor quantitatively
- No human evaluation
- Only tested with one - and fairly old - base model
Overall, even though the evaluation could be more thorough and human evaluation would help a lot, the paper has its merit, presents an interesting method, a clear story, and a fairly thorough evaluation, so I recommend acceptance.