PaperHub
Rating: 5.6 / 10 (Poster; 5 reviewers; min 5, max 6, std 0.5)
Individual ratings: 5, 6, 5, 6, 6
Confidence: 4.0 | Correctness: 2.8 | Contribution: 2.6 | Presentation: 2.4
ICLR 2025

You Only Sample Once: Taming One-Step Text-to-Image Synthesis by Self-Cooperative Diffusion GANs

OpenReview | PDF
Submitted: 2024-09-27 · Updated: 2025-02-24

Abstract

Recently, some works have tried to combine diffusion and Generative Adversarial Networks (GANs) to alleviate the computational cost of the iterative denoising inference in Diffusion Models (DMs). However, existing works in this line suffer from either training instability and mode collapse or subpar one-step generation learning efficiency. To address these issues, we introduce YOSO, a novel generative model designed for rapid, scalable, and high-fidelity one-step image synthesis with high training stability and mode coverage. Specifically, we smooth the adversarial divergence by the denoising generator itself, performing self-cooperative learning. We show that our method can serve as a one-step generation model training from scratch with competitive performance. Moreover, we extend our YOSO to one-step text-to-image generation based on pre-trained models by several effective training techniques (i.e., latent perceptual loss and latent discriminator for efficient training along with the latent DMs; the informative prior initialization (IPI), and the quick adaption stage for fixing the flawed noise scheduler). Experimental results show that YOSO achieves the state-of-the-art one-step generation performance even with Low-Rank Adaptation (LoRA) fine-tuning. In particular, we show that the YOSO-PixArt-$\alpha$ can generate images in one step trained on 512 resolution, with the capability of adapting to 1024 resolution without extra explicit training, requiring only \textasciitilde10 A800 days for fine-tuning. Our code is available at: [https://github.com/Luo-Yihong/YOSO](https://github.com/Luo-Yihong/YOSO)
Keywords
One-step text-to-image generation; Diffusion Models; Generative Adversarial Networks

Reviews and Discussion

Review (Rating: 5)

The paper proposes a novel one-step generative model named YOSO, which offers higher training stability and mode coverage compared to existing methods. This approach aims to achieve high-quality generation in a single step while maintaining performance on standard metrics. Experiments are conducted on CIFAR-10 and with HPS and AeS evaluation metrics.

Strengths

  1. Applying the Latent Discriminator leads to a stable training process with fast convergence.
  2. The paper tackles a relevant application (i.e., ControlNet) in a one-step setting.

Weaknesses

  1. As mentioned in the paper, "It is hard to extend GANs on large-scale datasets due to training challenges," yet there are existing works that successfully train GANs with large-scale datasets (e.g., GigaGAN with 2.7 billion training image-prompt pairs, Diffusion2GAN with 12 million training image-prompt pairs). It is important to compare these methods. While GigaGAN and Diffusion2GAN do not provide source code, the authors can still conduct comparisons using the COCO 2014 dataset, as GigaGAN provides inference results for this dataset.

GigaGAN: Scaling up GANs for Text-to-Image Synthesis (CVPR 2023)

Diffusion2GAN: Distilling Diffusion Models into Conditional GANs.

  2. Why use the bottleneck layer of the UNet to compute the latent perceptual loss? Some works (e.g., DIFT, PnP, FasterDiffusion) suggest that decoder features are more important than both the encoder and bottleneck features.

DIFT: Emergent Correspondence from Image Diffusion (NeurIPS'23)

PnP: Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation (CVPR 2023)

FasterDiffusion: Faster Diffusion: Rethinking the Role of the Encoder for Diffusion Model Inference (NeurIPS'24)

  3. It would be better to provide experiments comparing with existing SOTA one-step generative methods (e.g., SwiftBrush) to show that YOSO achieves both inference efficiency and high-quality generation.

SwiftBrush: One-Step Text-to-Image Diffusion Model with Variational Score Distillation (CVPR'24)

  4. The effectiveness of IPI should be validated on datasets like COCO 2014 using FID and CLIP score to avoid cherry-picking. FID and CLIP score are widely used in one-step generation methods such as SD-Turbo, InstaFlow, and SwiftBrush.

Questions

  1. The authors did not provide information about the dataset used for the test metrics in Table 2. Although COCO 2014 is mentioned in the appendix, I am not certain if it is the dataset used for these metrics.
  2. The ablation study is unconvincing, as the authors used CIFAR-10 for validation. However, CIFAR-10 has a low resolution and only consists of 10 simple categories, lacking the richness of text prompts. It would be better to conduct the ablation study on the COCO 2014 and COCO 2017 datasets, similar to existing one-step text-to-image generative methods (e.g., InstaFlow, SD-Turbo, SwiftBrush).
Comment

Thank you for your time and effort in reviewing our paper. We hereby address the concerns below.

As mentioned in the paper, "It is hard to extend GANs on large-scale datasets due to training challenges," yet there are existing works that successfully train GANs with large-scale datasets (e.g., GigaGAN with 2.7 billion training image-prompt pairs, Diffusion2GAN with 12 million training image-prompt pairs). It is important to compare these methods.

We clarify that we aim to emphasize that extending pure GANs to large-scale datasets is challenging (rather than impossible), though there are existing works that successfully train GANs on large-scale datasets. In particular, despite its complicated design, GigaGAN takes 4,783 A100 days to train, which is prohibitively expensive. On the other hand, Diffusion2GAN is a diffusion-GAN hybrid model in a similar spirit to our YOSO. However, Diffusion2GAN requires a massive noise-image paired dataset; constructing the pairs alone costs 15 A100 days, not to mention the training cost itself, while our cost of fine-tuning SD is less than 10 A800 days.

Why use the bottleneck layer of the UNet to compute the latent perceptual loss? Some works (e.g., DIFT, PnP, FasterDiffusion) suggest that decoder features are more important than both the encoder and bottleneck features.

Our empirical findings show that using bottleneck features works better than using encoder features in our context. Specifically, when using encoder features, the performance of YOSO-SD-LoRA would drop from 28.33 HPS to 27.95. This may be due to the perceptual loss being more dependent on features from "deeper layers", which differs from performing other downstream image tasks.

Regarding zero-shot FID on the COCO dataset.

Previous work [a,b] suggests that zero-shot COCO FID may not be a good metric for evaluating modern text-to-image models. Pick-a-Pic [a] found that zero-shot COCO FID can even be negatively correlated with human preferences in certain cases. SDXL [b], which is widely known to be a more powerful generative model than SD 1.5, has a worse zero-shot COCO FID score than SD 1.5 (see Appendix F in [b]). In this work, we adopted multiple machine metrics that correlate with human preferences, including Image Reward, HPS, and AeS. These metrics correlate better with human preferences, as detailed in their respective papers [c,d], and our work shows state-of-the-art performance on them.

Still, we are glad to provide a comparison regarding zero-shot COCO FID as follows.

| Model | Latency ↓ | FID-30k ↓ | CLIP Score-30k ↑ |
| --- | --- | --- | --- |
| GANs | | | |
| StyleGAN-T | 0.10s | 13.90 | - |
| GigaGAN | 0.13s | 9.09 | - |
| Original Diffusion | | | |
| SD v1.5 | 2.59s | 13.45 | 0.32 |
| Diffusion Distillation | | | |
| LCM (4-step) | 0.26s | 23.62 | 0.30 |
| Diffusion2GAN | 0.09s | 9.29 | - |
| UFOGen | 0.09s | 12.78 | - |
| InstaFlow-0.9B | 0.09s | 13.27 | 0.28 |
| DMD | 0.09s | 14.93 | 0.32 |
| SwiftBrush | 0.09s | 16.67 | 0.29 |
| YOSO (Ours) | 0.09s | 12.35 | 0.31 |

Results show that our model demonstrates competitive performance in both FID and CLIP scores when compared to modern text-to-image models and other diffusion distillation methods. This achievement is built upon our approach's superior performance in machine metrics that better represent human preferences.
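For reference, a hedged sketch of how such zero-shot FID and CLIP-score numbers are typically computed is shown below. It uses the `clean-fid` and `open_clip` packages; the directory names, CLIP backbone, and prompt-sampling protocol are our assumptions rather than the exact evaluation script used here.

```python
# Hedged evaluation sketch (not the authors' script): zero-shot FID via clean-fid,
# CLIP score via open_clip. Paths and the CLIP backbone are placeholders.
import torch
import open_clip
from PIL import Image
from cleanfid import fid

# FID between generated images and COCO reference images (assumed pre-sampled folders).
fid_30k = fid.compute_fid("outputs/coco30k_generated", "data/coco30k_reference")

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
model = model.to(device).eval()
tokenizer = open_clip.get_tokenizer("ViT-B-32")

@torch.no_grad()
def clip_score(image_paths, prompts):
    """Mean cosine similarity between image and prompt embeddings."""
    sims = []
    for path, prompt in zip(image_paths, prompts):
        image = preprocess(Image.open(path)).unsqueeze(0).to(device)
        text = tokenizer([prompt]).to(device)
        img_emb = torch.nn.functional.normalize(model.encode_image(image), dim=-1)
        txt_emb = torch.nn.functional.normalize(model.encode_text(text), dim=-1)
        sims.append((img_emb * txt_emb).sum().item())
    return sum(sims) / len(sims)
```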

It would be better to provide experiments comparing with existing SOTA one-step generative methods (e.g., SwiftBrush) to show that YOSO achieves both inference efficiency and high-quality generation.

Following your suggestion, we compared with SwiftBrush in terms of zero-shot FID and CLIP Score in the table above. The results show that YOSO is better on both metrics. Besides, we compared YOSO to DMD [e] in our original paper, and DMD is considered a stronger baseline than SwiftBrush. In particular, both DMD and SwiftBrush are built on variational score distillation [f], while DMD has an additional ODE regression loss to enhance performance.

Comment

The authors did not provide information about the dataset used for the test metrics in Table 2. Although COCO 2014 is mentioned in the appendix, I am not certain if it is the dataset used for these metrics.

We evaluate HPS on the HPS benchmark [c], while the Image Reward score, AeS, and CLIP score are evaluated on the COCO2017-5k dataset, consistent with Hyper-SD.

The ablation study is unconvincing, as the authors used CIFAR-10 for validation. However, CIFAR-10 has a low resolution and only consists of 10 simple categories, lacking the richness of text prompts.

There might be some misunderstanding here. We conducted the ablation study on both unconditional generation on CIFAR-10 and text-to-image synthesis on the HPS benchmark [c] and COCO2017-5k. Please note that the HPS benchmark contains a total of 3200 prompts, with 800 prompts for each of the following styles: "Animation", "Concept-art", "Painting", and "Photo". The prompts come from both COCO captions and prompts from real users in DiffusionDB [g]. This indicates that the HPS benchmark can evaluate text-to-image models more thoroughly than simple COCO captions.

[a] Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation, NeurIPS 2023.

[b] SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis, ICLR 2024.

[c] Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis, arXiv.

[d] ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation, NeurIPS 2023.

[e] One-step Diffusion with Distribution Matching Distillation, CVPR 2024.

[f] ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation, NeurIPS 2023.

[g] DiffusionDB: A Large-scale Prompt Gallery Dataset for Text-to-Image Generative Models, ACL 2023.

Comment

I appreciate the authors' response.

The authors have not addressed my concern about why the bottleneck layer is used.

Providing only HPS scores to justify the use of the bottleneck layer is not convincing. Almost no one has studied the bottleneck layer, and the manuscript does not explain why this choice was made. This seems to be a hasty decision. The authors suggest that "this may be due to the perceptual loss being more dependent on features from deeper layers". Since the decoder has more "deeper layers", using the decoder layers should work better.

Qualitative and quantitative ablation experiments should be provided using the encoder, bottleneck, and decoder layers, respectively, reporting metrics such as FID, CLIP Score, and visualization results, not just HPS scores. Additionally, would using the encoder, bottleneck, and decoder layers together to compute the latent perceptual loss be more effective?

Comment

Thank you for your continued feedback. We appreciate the opportunity to clarify and address your concerns.

Why Use the Bottleneck Layer for Latent Perceptual Loss?

First, it is important to highlight that our contribution is proposing the latent perceptual loss (LPL) computed using features from a pre-trained latent diffusion model (DM), which contrasts with the traditional MSE loss. Our choice of using the bottleneck layer for computing LPL is motivated by its ability to provide a deeper and more compact feature representation. This compactness allows the perceptual loss to capture higher-level, abstract features, leading to improved perceptual quality in the generated images.

While the decoder layers are indeed deeper in terms of network depth, their outputs are closer to the original latent space. This proximity may cause the decoder features to focus more on reconstructing low-level details rather than capturing the high-level perceptual aspects, which might not be as beneficial for perceptual loss purposes.
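As a concrete illustration of this design choice, below is a minimal sketch (not the released implementation) of a latent perceptual loss that matches bottleneck (mid-block) features of a frozen, diffusers-style SD UNet. The forward hook, model ID, and the way the two latents are fed to the UNet are our assumptions.

```python
# Minimal LPL sketch under the assumptions stated above: match mid-block features
# of a frozen pre-trained UNet between the generated latent and the target latent.
import torch
import torch.nn.functional as F
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet")
unet.requires_grad_(False).eval()

features = {}
unet.mid_block.register_forward_hook(
    lambda module, inputs, output: features.update(mid=output))

def latent_perceptual_loss(pred_latent, target_latent, t, text_emb):
    """Bottleneck-feature MSE; gradients flow to pred_latent through the frozen UNet."""
    def mid_feat(z):
        unet(z, t, encoder_hidden_states=text_emb)  # populates features["mid"]
        return features["mid"]
    f_pred = mid_feat(pred_latent)
    with torch.no_grad():
        f_target = mid_feat(target_latent)
    return F.mse_loss(f_pred, f_target)
```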

Ablation Studies on Different Layers.

It is a good suggestion. We conducted additional ablation experiments using features from the encoder, bottleneck, and decoder layers individually, as well as combining all layers for computing the LPL. Due to the limited time during the rebuttal period, we based our experiments on the LCM method for efficiency, using LoRA for fine-tuning SD 1.5 with 3k training iterations.

The quantitative results are summarized below:

1-step generation (NFE = 1):

| Method | HPS ↑ | CLIP Score-5k ↑ | NFE |
| --- | --- | --- | --- |
| LCM, MSE | 15.13 | 0.242 | 1 |
| LCM, LPL-Encoder | 19.71 | 0.278 | 1 |
| LCM, LPL-Decoder | 20.05 | 0.281 | 1 |
| LCM, LPL-All (Encoder + Bottleneck + Decoder) | 20.13 | 0.276 | 1 |
| LCM, LPL-Bottleneck | 20.19 | 0.281 | 1 |

2-step generation (NFE = 2):

| Method | HPS ↑ | CLIP Score-5k ↑ | NFE |
| --- | --- | --- | --- |
| LCM, MSE | 20.56 | 0.293 | 2 |
| LCM, LPL-Encoder | 23.66 | 0.303 | 2 |
| LCM, LPL-Decoder | 23.35 | 0.298 | 2 |
| LCM, LPL-All (Encoder + Bottleneck + Decoder) | 24.00 | 0.302 | 2 |
| LCM, LPL-Bottleneck | 24.26 | 0.305 | 2 |

We also include visual comparisons among variants in Figure 11 located in the Appendix.

From both the quantitative results and qualitative comparisons, we observe:

  • Using bottleneck features for computing LPL achieves the best performance, outperforming other variants in terms of HPS and CLIP Scores.
  • LPL consistently outperforms MSE, regardless of which feature layer is used, demonstrating the effectiveness of the latent perceptual loss as a whole.
  • While combining all layers (Encoder + Bottleneck + Decoder) does improve over using individual encoder or decoder layers, it does not surpass the performance of using the bottleneck layer alone. This suggests that the bottleneck layer captures the most relevant features for the perceptual loss in our context.

Regarding Evaluation Metrics. We would like to reiterate that zero-shot FID is not a reliable metric for evaluating text-to-image models, as discussed in prior works [a, b]. CLIP Score is a more appropriate metric in this context, and we have provided the additional CLIP Scores as you recommended.

[a] Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation, NeurIPS 2023.

[b] SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis, ICLR 2024.

Comment

I have a different perspective on the view that the output of decoder layers is closer to the original latent space. In SD, the decoder consists of multiple layers at different scales. Only the output of the final layer is closer to the original latent space, while the outputs of the other layers in the decoder are similar to the bottleneck's output and even contain richer information (see Figure 2 in [A] and Figure 3 in [B]). It would be more appropriate to use the outputs of different-scale layers in the decoder to compute the perceptual loss.

[A] DeepCache: Accelerating Diffusion Models for Free.
[B] Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation.

Which scale layer's output from the decoder is used in the new table? The output of the final layer in the decoder is closer to the latent code and is not suitable for computing the perceptual loss.

Most of the existing one-step methods [C, D, E] report the FID metric, which has reference value.

[C] SwiftBrush : One-Step Text-to-Image Diffusion Model with Variational Score Distillation
[D] Adversarial Diffusion Distillation
[E] Distilling Diffusion Models into Conditional GANs

Comment

Thank you for your valuable feedback. We are glad to provide further clarification.

I have a different perspective on the view that the output of decoder layers is closer to the original latent space. In SD, the decoder consists of multiple layers at different scales. Only the output of the final layer is closer to the original latent space, while the outputs of the other layers in the decoder are similar to the bottleneck's output and even contain richer information (see Figure 2 in [A] and Figure 3 in [B]). It would be more appropriate to use the outputs of different-scale layers in the decoder to compute the perceptual loss. Which scale layer's output from the decoder is used in the new table? The output of the final layer in the decoder is closer to the latent code and is not suitable for computing the perceptual loss.

Clarification on Our View of Outputs of Decoder Layers.

We clarify that our statement that "the outputs of decoder layers are closer to the original latent space" is made in comparison to the encoder layers or the bottleneck layer. Intuitively, as we progress deeper into the decoder, the feature representations tend to more closely approximate the original latent space. For instance, as observed in Figure 3 of [B], the features from decoder layer 4 are significantly closer to the original latent space compared to those from layer 1. Similarly, the features from layer 7 are closer to the original latent space than those from layer 4.

Decoder Layers Used in Our Experiments.

We compute the latent perceptual loss using all decoder layers except the last one.

We agree that decoder layers may contain richer details, but retaining too many details might make the network focus on reconstructing low-level details when computing the perceptual loss, which could be detrimental.

To investigate this further, we conducted additional experiments using different-scale decoder layers for computing the LPL, similar to the setting in [B]. The results are presented in the tables below.

1-step generation (NFE = 1):

| Method | HPS ↑ | FID-5k ↓ | CLIP Score-5k ↑ | NFE |
| --- | --- | --- | --- | --- |
| LCM, MSE | 15.13 | 87.47 | 0.242 | 1 |
| LCM, LPL-Encoder | 19.71 | 37.90 | 0.278 | 1 |
| LCM, LPL-Decoder (layers = 1) | 20.15 | 39.16 | 0.281 | 1 |
| LCM, LPL-Decoder (layers = 4) | 19.40 | 42.45 | 0.277 | 1 |
| LCM, LPL-Decoder (layers = 4-8) | 18.71 | 43.71 | 0.273 | 1 |
| LCM, LPL-Decoder (layers = 4-10) | 19.45 | 43.66 | 0.279 | 1 |
| LCM, LPL-Decoder | 20.05 | 40.51 | 0.281 | 1 |
| LCM, LPL-All (Encoder + Bottleneck + Decoder) | 20.13 | 37.30 | 0.276 | 1 |
| LCM, LPL-Bottleneck | 20.19 | 37.51 | 0.281 | 1 |

2-step generation (NFE = 2):

| Method | HPS ↑ | FID-5k ↓ | CLIP Score-5k ↑ | NFE |
| --- | --- | --- | --- | --- |
| LCM, MSE | 20.56 | 30.69 | 0.293 | 2 |
| LCM, LPL-Encoder | 23.66 | 26.33 | 0.303 | 2 |
| LCM, LPL-Decoder (layers = 1) | 24.35 | 25.06 | 0.305 | 2 |
| LCM, LPL-Decoder (layers = 4) | 23.40 | 26.82 | 0.302 | 2 |
| LCM, LPL-Decoder (layers = 4-8) | 22.83 | 27.08 | 0.299 | 2 |
| LCM, LPL-Decoder (layers = 4-10) | 23.26 | 27.35 | 0.300 | 2 |
| LCM, LPL-Decoder | 23.35 | 27.49 | 0.298 | 2 |
| LCM, LPL-All (Encoder + Bottleneck + Decoder) | 24.00 | 25.87 | 0.302 | 2 |
| LCM, LPL-Bottleneck | 24.26 | 24.21 | 0.305 | 2 |

It can be observed that using decoder layers=1 clearly performs the best among the different decoder layer configurations. This is different from the observation in [B], where they found layers=4-11 performs best in their context.

This is because [B] aims to generate images that are similar but not identical to the reference image by replacing the outputs/features of certain layers in the diffusion network. This direct replacement operation can only leverage the outputs/features themselves, making it necessary to preserve as much detail as possible (without leaking appearance) to retain the semantics of the reference image.

In our case, the LPL optimizes $\|f(x)-f(\text{target})\|_2^2$, where the student network is trained via backpropagation through $f(x)$. This allows it to not only leverage the features themselves but also utilize the gradient information of the perceptual network $f$ (i.e., the pre-trained diffusion). As a result, the layer selection that is suitable for [B] might lead the perceptual loss to overly focus on low-level details in our case.
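In our notation (not the paper's), with $G_\theta$ the student generator, $f$ the frozen perceptual network, and $x_{\mathrm{target}}$ the target latent, the point about gradient information can be made explicit:

```latex
\mathcal{L}_{\mathrm{LPL}}(\theta)
  = \bigl\| f\bigl(G_\theta(z)\bigr) - f\bigl(x_{\mathrm{target}}\bigr) \bigr\|_2^2,
\qquad
\nabla_\theta \mathcal{L}_{\mathrm{LPL}}
  = 2\,\Bigl(\tfrac{\partial G_\theta(z)}{\partial \theta}\Bigr)^{\!\top}
    J_f\bigl(G_\theta(z)\bigr)^{\top}
    \bigl( f(G_\theta(z)) - f(x_{\mathrm{target}}) \bigr)
```

so the Jacobian $J_f$ of the pre-trained diffusion network enters the student's update, which direct feature replacement as in [B] cannot exploit.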

More importantly, regardless of which layers are used to compute LPL, it is significantly better than MSE. This highlights the effectiveness of our proposed latent perceptual loss.

Comment

Dear Reviewer Awjz,

We sincerely appreciate your thorough review and valuable feedback. We fully understand you might be quite busy. However, as the discussion deadline is approaching, would you mind checking our follow-up additional response? We would like to know if our recent efforts have addressed your remaining concerns. We also welcome any further comments or discussions you may have.

Thank you once again for your time and effort in reviewing our paper.

Many thanks,

Authors

Comment

As the discussion period is nearing its end, we kindly request Reviewer Awjz to check our additional response. If there are any further questions or matters to discuss, please don’t hesitate to let us know.

Thanks again for your time and effort in reviewing our paper.

Review (Rating: 6)

This paper, "You Only Sample Once: Taming One-Step Text-to-Image Synthesis by Self-Cooperative Diffusion GANs (YOSO)", proposes a novel generative method that merges diffusion models (DMs) with GANs to perform efficient, high-quality one-step text-to-image synthesis. YOSO leverages a self-cooperative learning framework, where a generator smooths adversarial divergence by referencing self-generated denoised samples, combined with a latent perceptual loss and a decoupled scheduler to stabilize training. The paper positions YOSO as an efficient alternative to multi-step latent diffusion methods (LDMs) with discriminator-based losses, aiming to achieve comparable quality while reducing computational overhead. Experimental results demonstrate competitive performance with state-of-the-art methods, though the model would benefit from clearer differentiation from similar single-step latent diffusion approaches.

Strengths

  1. Creative Approach to Stability in One-Step Synthesis: YOSO’s self-cooperative learning framework is an inventive attempt to stabilize adversarial training by smoothing divergence via self-generated data, a strategy distinct from standard GAN-DM hybrids. This approach may address some instability challenges faced by prior one-step latent diffusion models.
  2. Efficient Resource Utilization: YOSO’s compatibility with LoRA allows it to produce high-resolution images without the significant computational burden typically associated with GAN-based models, making it adaptable for resource-constrained settings.

Weaknesses

  1. Clarity of Key Contribution: This paper lacks clarity on whether YOSO is a new model architecture that can be trained from scratch or a step distillation methodology that should be applied on pre-trained models. Further elaboration, possibly with additional diagrams, would make the paper more accessible and clear.

  2. Experimental Scope and Robustness: This paper introduces YOSO both as a new model and as a novel distillation technique for pre-trained models. However, the former is only evaluated on the CIFAR-10 dataset; additional benchmarks are necessary to substantiate its effectiveness across more diverse and complex datasets.

  3. Novelty in Methodology: Besides self-cooperative learning, the adoption of decoupled scheduling and of perceptual and consistency losses is not novel to the field; while the rationale is sound, these sections could benefit from a more detailed comparison with previous works and a discussion of how the proposed components differ from them.

Questions

Please refer to the weakness section

Comment

Thank you for your time and effort in reviewing our paper. We very much appreciate your acknowledgment of our proposed self-cooperative learning and the efficiency of our YOSO. We hereby address the concerns below.

Clarity of Key Contribution: This paper lacks clarity on whether YOSO is a new model architecture that can be trained from scratch or a step distillation methodology that should be applied on pre-trained models. Further elaboration, possibly with additional diagrams, would make the paper more accessible and clear.

As stated in the abstract, YOSO supports both training from scratch and fine-tuning pre-trained diffusion models, since the loss for training YOSO does not depend on a pre-trained diffusion model. We have included the algorithm description of training YOSO from scratch in Algorithm 1, located in Appendix D of our original manuscript.

Experimental Scope and Robustness: This paper introduces YOSO both as a new model and as a novel distillation technique for pre-trained models. However, the former is only evaluated on the CIFAR-10 dataset; additional benchmarks are necessary to substantiate its effectiveness across more diverse and complex datasets.

To further demonstrate that YOSO can serve as a new model trained from scratch, we conducted additional experiments on ImageNet-64, adopting the EDM architecture for both the generator and the discriminator when training YOSO from scratch. The results are shown in the following table:

Table A: Results obtained by training from scratch

| Method | NFE | FID (↓) |
| --- | --- | --- |
| BigGAN-deep | 1 | 4.06 |
| StyleGAN-XL | 1 | 1.52 |
| ADM | 250 | 2.07 |
| EDM | 511 | 1.36 |
| CT | 1 | 13.0 |
| iCT | 1 | 3.25 |
| YOSO (Ours) | 1 | 2.65 |

It can be seen that YOSO has clearly better performance than CT and iCT, and achieves comparable performance to state-of-the-art diffusion models and GANs.

Novelty in Methodology: Besides self-cooperative learning, the adoption of decoupled scheduling and of perceptual and consistency losses is not novel to the field; while the rationale is sound, these sections could benefit from a more detailed comparison with previous works and a discussion of how the proposed components differ from them.

Our proposed decoupled scheduler integrates the GAN loss and the diffusion distillation loss (consistency loss) in a simple and effective way, which allows us to train in a single stage. In contrast, although some works [a, b] have combined a GAN loss with a diffusion distillation loss, without our decoupled scheduler they have to train the model in multiple stages, with the scheduler coupled between the GAN loss and the distillation loss. Moreover, the proposed latent perceptual loss provides a method to efficiently compute a perceptual loss in latent space using pre-trained diffusion models. In contrast, previous work [c] requires decoding latents through a VAE decoder and then computing an LPIPS loss in data space, which is computationally expensive.

[a] SDXL-Lightning: Progressive Adversarial Diffusion Distillation, arXiv.

[b] Hyper-SD: Trajectory Segmented Consistency Model for Efficient Image Synthesis, NeurIPS 2024.

[c] InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation, ICLR 2024.
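To make the single-stage idea above concrete, here is a schematic PyTorch sketch of a training step that draws separate timesteps for the consistency-style loss and the adversarial loss. The module signatures, the diffusers-style `add_noise` scheduler call, the specific loss forms, and the weighting are placeholders based on our reading of this discussion, not the released YOSO code.

```python
# Schematic single-stage step with decoupled timestep schedules (assumptions noted above).
import torch
import torch.nn.functional as F

def training_step(generator, discriminator, latents, text_emb, scheduler,
                  t_consistency, t_adv, lambda_adv=0.1):
    noise = torch.randn_like(latents)

    # Distillation/consistency branch on its own timestep schedule.
    noisy = scheduler.add_noise(latents, noise, t_consistency)
    x0_pred = generator(noisy, t_consistency, text_emb)
    loss_consistency = F.mse_loss(x0_pred, latents)  # stand-in for the actual consistency loss

    # Adversarial branch on a separately drawn (decoupled) timestep schedule:
    # the discriminator compares re-noised generator outputs with re-noised data.
    fake_noisy = scheduler.add_noise(x0_pred, noise, t_adv)
    real_noisy = scheduler.add_noise(latents, noise, t_adv)
    loss_g_adv = F.softplus(-discriminator(fake_noisy, t_adv, text_emb)).mean()
    loss_d = (F.softplus(discriminator(fake_noisy.detach(), t_adv, text_emb)).mean()
              + F.softplus(-discriminator(real_noisy, t_adv, text_emb)).mean())

    loss_g = loss_consistency + lambda_adv * loss_g_adv
    return loss_g, loss_d
```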

Comment

Dear authors,

Thanks for the additional experiments and follow-up explanations. I have no further questions and have raised my score.

Comment

We thank you for acknowledging our work and for raising the score. Thanks again for your time and effort in reviewing our work.

Review (Rating: 5)

This paper introduces YOSO, a novel generative model designed for rapid, scalable, and high-fidelity one-step image synthesis with high training stability and mode coverage. Specifically, the paper smooths the adversarial divergence by the denoising generator itself, performing self-cooperative learning. This method can serve as a one-step generation model training from scratch with competitive performance. Moreover, the authors extend YOSO to one-step text-to-image generation based on pre-trained models by several effective training techniques (i.e., latent perceptual loss and latent discriminator for efficient training along with the latent DMs; the informative prior initialization (IPI), and the quick adaption stage for fixing the flawed noise scheduler). Experimental results show that YOSO achieves state-of-the-art one-step generation performance even with Low-Rank Adaptation (LoRA) fine-tuning.

Strengths

  1. The article is well written and easy to understand.
  2. The visualization results look very good.

Weaknesses

  1. There is a lack of important references in the Background, such as [1,2,3], which are all about combining Diffusion models and GANs. The differences between the proposed method and these methods [1,2,3] need to be emphasized and discussed.
  2. In the experimental part, it is also necessary to compare with these methods [1,2,3].
  3. There is no comparison with current SOTA methods in terms of model complexity (such as the number of network parameters and training and inference time).
  4. In the TEXT-TO-IMAGE GENERATION experiment, there is no comparison with the GAN-based methods, such as [4]. It is also worth discussing whether the proposed method is better or more efficient than the GAN-based methods.

[1] Wang, Zhendong, Huangjie Zheng, Pengcheng He, Weizhu Chen, and Mingyuan Zhou. "Diffusion-GAN: Training GANs with Diffusion." arXiv preprint arXiv:2206.02262 (2022).

[2] Kang, Minguk, Richard Zhang, Connelly Barnes, Sylvain Paris, Suha Kwak, Jaesik Park, Eli Shechtman, Jun-Yan Zhu, and Taesung Park. "Distilling Diffusion Models into Conditional GANs." ECCV 2024.

[3] Xu, Yanwu, Yang Zhao, Zhisheng Xiao, and Tingbo Hou. "UFOGen: You Forward Once Large Scale Text-to-Image Generation via Diffusion GANs." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8196-8206. 2024.

[4] Kang, Minguk, Jun-Yan Zhu, Richard Zhang, Jaesik Park, Eli Shechtman, Sylvain Paris, and Taesung Park. "Scaling up GANs for Text-to-Image Synthesis." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10124-10134. 2023.

Questions

See Weaknesses.

Comment

Thank you for your time and effort in reviewing our paper. We hereby address the concerns below.

There is a lack of important references in the Background, such as [1,2,3], which are all about combining Diffusion models and GANs. The differences between the proposed method and these methods [1,2,3] need to be emphasized and discussed.

Thanks for mentioning these excellent papers. We discussed UFOGen [3] in our original manuscript, and we provide a discussion of [1,2] as follows. Similar to UFOGen [3], the method proposed in [1] injects noise into samples to stabilize adversarial training, which leads to subpar one-step generation learning efficiency. On the other hand, Diffusion2GAN [2] relies on using diffusion models to construct numerous noise-image pairs (3M pairs for fine-tuning SD) to form a regression loss that stabilizes training, which is computationally expensive. Specifically, constructing 3M noise-image pairs using SD 1.5 requires 15 A100 days according to the data in InstaFlow, while our computational cost for fine-tuning SD using LoRA is less than 10 A800 days. This highlights the efficiency of our proposed method.

In the experimental part, it is also necessary to compare with these methods [1,2,3].

In the original paper, we compared against UFOGen [3] on CIFAR-10, with results shown in Table 3. Below we provide an experimental comparison with [1, 2] on CIFAR-10:

| Model | FID ↓ |
| --- | --- |
| Diffusion-GAN [1] | 3.19 |
| Diffusion2GAN [2] | 3.16 |
| YOSO (Ours) | 1.81 |

It can be seen that our approach achieves a significantly lower FID score of 1.81, representing a 43% improvement over the next best method, Diffusion2GAN [2] at 3.16.

There is no comparison with current SOTA methods in terms of model complexity (such as the number of network parameters and training and inference time).

In comparisons with diffusion distillation methods, we consistently use the same backbone for the student model, ensuring that all methods have identical parameter counts and single-step inference costs. Moreover, our proposed YOSO requires less than 10 A800 GPU days to fine-tune SD 1.5, while InstaFlow requires 199 A100 GPU days, and most other works (e.g., SD-Turbo [a], PeRFlow [b], and Hyper-SD [c]) do not report their training cost.

[a] Adversarial Diffusion Distillation, ECCV 2024.

[b] PeRFlow: Piecewise Rectified Flow as Universal Plug-and-Play Accelerator, NeurIPS 2024.

[c] Hyper-SD: Trajectory Segmented Consistency Model for Efficient Image Synthesis, NeurIPS 2024.

In the TEXT-TO-IMAGE GENERATION experiment, there is no comparison with the GAN-based methods, such as [4]. It is also worth discussing whether the proposed method is better or more efficient than the GAN-based methods.

Our approach has a significant advantage in training cost compared to pure GANs, since we can efficiently fine-tune pre-trained diffusion models. Specifically, while GigaGAN [4] requires 4,783 A100 days for training, we need less than 10 A800 days to fine-tune a pre-trained diffusion model (SD v1.5 and PixArt-$\alpha$).

Comment

Thank you for your answers, but they do not address my concerns.

  1. It is not enough to compare only on CIFAR-10, which has only 32×32 resolution. Experimental comparisons should be conducted on the CelebA (64×64), LSUN-Bedroom (256×256), LSUN-Church (256×256), and FFHQ (1024×1024) datasets with the three methods I pointed out [1,2,3].
  2. Regarding model complexity, the authors should provide a comparison like Tables 1 and 2 of the UFOGen paper.
  3. A fair comparison should be conducted with GigaGAN under the same experimental conditions.
Comment

Thank you for your continued feedback. We appreciate the opportunity to clarify and address your concerns.

It is not enough to compare only on CIFAR-10, which has only 32×32 resolution. Experimental comparisons should be conducted on the CelebA (64×64), LSUN-Bedroom (256×256), LSUN-Church (256×256), and FFHQ (1024×1024) datasets with the three methods I pointed out [1,2,3].

Additional Comparisons on Zero-Shot COCO FID. We have extended our comparisons to include methods from [1] and [3] on zero-shot COCO FID, as detailed in Point 4 of our general response. We attach the table below for ease of reference.

Table A: Comparison of Zero-shot COCO FID and Clip score.

| Model | Latency ↓ | Params | FID-30k ↓ | CLIP Score-30k ↑ |
| --- | --- | --- | --- | --- |
| GANs | | | | |
| StyleGAN-T | 0.10s | 1B | 13.90 | - |
| GigaGAN | 0.13s | 1B | 9.09 | - |
| Original Diffusion | | | | |
| SD v1.5 | 2.59s | 0.9B | 13.45 | 0.32 |
| Diffusion Distillation | | | | |
| LCM (4-step) | 0.26s | 0.9B | 23.62 | 0.30 |
| Diffusion2GAN | 0.09s | 0.9B | 9.29 | - |
| UFOGen | 0.09s | 0.9B | 12.78 | - |
| InstaFlow-0.9B | 0.09s | 0.9B | 13.27 | 0.28 |
| DMD | 0.09s | 0.9B | 14.93 | 0.32 |
| SwiftBrush | 0.09s | 0.9B | 16.67 | 0.29 |
| YOSO (Ours) | 0.09s | 0.9B | 12.35 | 0.31 |
| YOSO (Ours, retrained, full fine-tuned, selected by FID) | 0.09s | 0.9B | 9.61 | 0.31 |

From Table A, we observe that our method demonstrates competitive performance compared to existing state-of-the-art models in terms of zero-shot FID and CLIP Score, further validating its effectiveness. Please note that this achievement is built upon our approach's superior performance in machine metrics that better represent human preferences.

Comparison with SIDDMs [b] on ImageNet-64. We have also compared our method with SIDDMs [b] on the ImageNet-64 dataset. It is noteworthy that SIDDMs is the predecessor of UFOGen [3], with UFOGen extending SIDDMs to text-to-image generation.

Evaluating methods on CIFAR-10 and ImageNet-64 is a widely accepted practice in the community [c,d]. Our inclusion of ImageNet-64 aligns with this standard, providing a meaningful comparison. Note that ImageNet has 1000 categories, each containing thousands of images and distinguishing itself as a large-scale and challenging dataset with diversity.

Table B: Results obtained by training from scratch on ImageNet-64.

| Method | NFE | FID (↓) |
| --- | --- | --- |
| BigGAN-deep | 1 | 4.06 |
| StyleGAN-XL | 1 | 1.52 |
| ADM | 250 | 2.07 |
| EDM | 511 | 1.36 |
| CT | 1 | 13.0 |
| iCT | 1 | 3.25 |
| SIDDMs | 4 | 3.13 |
| YOSO (Ours) | 1 | 2.65 |

As shown in Table B, our method achieves a better FID score compared to SIDDMs while using fewer NFEs, indicating both efficiency and improved performance.

Regarding the zero-shot FID metric. While we provide the comparison on zero-shot COCO FID, it is important to note that zero-shot COCO FID is not a good metric for evaluating text-to-image models. Specifically, Pick-a-Pic [a] found that zero-shot COCO FID can even be negatively correlated with human preferences in certain cases. SDXL [j], which is widely known to be a more powerful generative model than SD 1.5, has a worse zero-shot COCO FID score than SD 1.5 (see Appendix F in [j]). In this work, we adopted multiple machine metrics that correlate with human preferences, including Image Reward, HPS, and AeS. These metrics correlate better with human preferences, as detailed in their respective papers [k,l], and our work shows state-of-the-art performance on them.

Consistency with Prior Works. Diffusion2GAN [2], one of the methods you pointed out, evaluated its approach on CIFAR-10 and text-to-image synthesis in the original paper, which aligns with our initial evaluation strategy. UFOGen [3] evaluated its approach on text-to-image synthesis. Therefore, our comparative analysis remains consistent with the evaluation settings commonly adopted in prior works. Besides, it is also important to note that Diffusion2GAN should be considered concurrent work according to the ICLR reviewer guide (https://iclr.cc/Conferences/2025/ReviewerGuide), since it appeared at ECCV 2024, which was published within the last four months before the ICLR deadline.

Alignment with Community Practices. Existing works on text-to-image diffusion acceleration often focus on specific datasets or evaluation settings. Some studies conducted experiments exclusively on text-to-image tasks [3,f,g,h,i], while others included additional experiments on CIFAR-10 or ImageNet-64 [1,e]. By providing experiments on both CIFAR-10 and ImageNet-64, as well as text-to-image generation, we believe our work aligns well with the practices in the research community.

Comment

Regarding model complexity, the authors should provide a comparison like Tables 1 and 2 of the UFOGen paper.

Please refer to Table A above. It can be seen that YOSO is one of the most efficient models.

A fair comparison should be conducted with GigaGAN under the same experimental conditions.

We provide a comparison of zero-shot COCO FID with GigaGAN. Please refer to Table A above.

References.

[a] Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation, NeurIPS 2023.

[b] Semi-Implicit Denoising Diffusion Models (SIDDMs), NeurIPS 2023.

[c] Maximum Likelihood Training of Score-Based Diffusion Models, NeurIPS 2021.

[d] Improved Techniques for Training Consistency Models, ICLR 2024.

[e] One-step Diffusion with Distribution Matching Distillation, CVPR 2024.

[f] SwiftBrush: One-Step Text-to-Image Diffusion Model with Variational Score Distillation, CVPR 2024.

[g] InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation, ICLR 2024.

[h] Hyper-SD: Trajectory Segmented Consistency Model for Efficient Image Synthesis, NeurIPS 2024.

[i] Adversarial Diffusion Distillation, ECCV 2024.

[j] SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis, ICLR 2024.

[k] Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis, arXiv.

[l] ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation, NeurIPS 2023.

Comment

Thank you for your response. However, I would appreciate it if the authors could provide comparison results with state-of-the-art (SOTA) methods on high-resolution benchmarks such as FFHQ (1024×1024). Additionally, I would be interested in seeing text-conditioned 128→1024 super-resolution results on random 10K LAION samples, similar to what was presented in Table 4 of the GigaGAN paper.

In addition, Table A shows that the proposed method does not outperform GigaGAN (or Diffusion2GAN) on FID-30K.

Comment

We appreciate your follow-up comments and are happy to address them in more detail.

I would appreciate it if the authors could provide comparison results with SOTA methods on high-resolution benchmarks such as FFHQ (1024×1024).

Thank you for this suggestion. We are glad to provide additional experiments comparing against the methods from [1,2,3] on FFHQ-1024 below.

Experiment setting: Due to time limitations, we conducted comparative experiments between methods by fine-tuning pre-trained diffusion models. To the best of our knowledge, the latest publicly available pre-trained diffusion model on FFHQ-1024 is NCSNpp [a], a score-based model in data space. We found that generating 50k samples with NCSNpp to compute FID would require tens of A800 GPU days, which is too computationally expensive. Therefore, we trained a lightweight diffusion model in the latent space of DC-AE [b] for efficiency. Specifically, following DC-AE, we adopted DiT-S as the model backbone, a lightweight model with 33M parameters. For a fair comparison, we initialized the generator for YOSO as well as for the methods from [1,2,3] using the pre-trained diffusion model.

Table A: Results on FFHQ-1024.

| Method | NFE | FID ↓ |
| --- | --- | --- |
| Diffusion (results from DC-AE) | 250 | 13.65 |
| Teacher Diffusion (our reproduction) | 250 | 13.88 |
| Diffusion Distillation | | |
| Diffusion-GAN [1] | 1 | 22.16 |
| Diffusion2GAN [2] | 1 | 18.01 |
| UFOGen [3] | 1 | 24.37 |
| YOSO (Ours) | 1 | 16.23 |

As shown in Table A, YOSO achieves better performance on FFHQ-1024 compared to other distillation methods.

[a] Improved Techniques for Training Score-Based Generative Models, NeurIPS 2020.

[b] Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models, ICLR 2025 Submission.

Additionally, I would be interested in seeing text-conditioned 128→1024 super-resolution results on random 10K LAION samples, similar to what was presented in Table 4 of the GigaGAN paper.

Good suggestion. We are glad to provide additional experiments on 128→512 upscaling, because the SD upscaler was specifically trained for the 128→512 super-resolution task. We initialized YOSO with the SD upscaler and report the results below. We evaluated several metrics, including FID, PSNR, LPIPS, and CLIP Score. The FID is computed between 10k ground-truth samples and 10k generated samples, measuring image fidelity. The CLIP Score assesses text-image alignment, while PSNR and LPIPS measure the consistency between the generated samples and the ground-truth samples.

Table B: Results on Super-Resolution (128 -> 512).

| Method | NFE | FID ↓ | PSNR ↑ | LPIPS ↓ | CLIP Score ↑ |
| --- | --- | --- | --- | --- | --- |
| SD upscaler | 50 | 5.81 | 19.7 | 0.29 | 0.32 |
| YOSO (Ours) | 1 | 6.15 | 20.1 | 0.27 | 0.32 |

As we can observe, although YOSO achieves an approximately 50x speedup over the SD upscaler, it obtains better consistency as measured by PSNR and LPIPS while showing only a slightly worse FID score. This demonstrates YOSO's promising potential for accelerating pre-trained super-resolution diffusion models.
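For reference, the consistency metrics in Table B can be computed along these lines; this is a hedged sketch using the `lpips` package and the standard PSNR formula, with image loading and resizing (assumed to match the ground-truth resolution) omitted.

```python
# Hedged metric sketch: PSNR and LPIPS between super-resolved and ground-truth images.
import torch
import lpips

lpips_fn = lpips.LPIPS(net="alex")  # expects inputs in [-1, 1]

def psnr(pred, target):
    """PSNR in dB for image tensors scaled to [0, 1]."""
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(1.0 / mse)

def perceptual_distance(pred, target):
    """LPIPS for image tensors scaled to [0, 1]; rescale to [-1, 1] first."""
    with torch.no_grad():
        return lpips_fn(pred * 2 - 1, target * 2 - 1).mean()
```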

In addition, Table A shows that the proposed method does not outperform GigaGAN (or Diffusion2GAN) on FID-30K.

This does not mean that our model is inferior, since zero-shot COCO FID is not the gold standard for evaluating modern text-to-image models. Notably, even SD (CFG=7.5) does not surpass Diffusion2GAN and GigaGAN in terms of zero-shot COCO FID. SD can indeed achieve a lower zero-shot COCO FID-30k of 8.78 by setting CFG=3; however, this comes at the cost of significantly reduced visual quality, as studied in Pick-a-Pic [c].

Additionally, we found that continuing to train the YOSO (FID = 9.61) checkpoint for 10k iterations can achieve a lower zero-shot COCO FID. The updated results are presented below.

Table C: Comparison of zero-shot COCO FID-30k.

| Model | Latency ↓ | Params | FID-30k ↓ |
| --- | --- | --- | --- |
| GANs | | | |
| StyleGAN-T | 0.10s | 1B | 13.90 |
| GigaGAN | 0.13s | 1B | 9.09 |
| Original Diffusion | | | |
| SD v1.5 (CFG=7.5) | 2.59s | 0.9B | 13.45 |
| SD v1.5 (CFG=3) | 2.59s | 0.9B | 8.78 |
| Diffusion Distillation | | | |
| Diffusion2GAN | 0.09s | 0.9B | 9.29 |
| YOSO (Ours) | 0.09s | 0.9B | 12.35 |
| YOSO (Ours, retrained, full fine-tuned, selected by FID) | 0.09s | 0.9B | 9.61 |
| YOSO (Ours, longer training) | 0.09s | 0.9B | 8.90 |

This indicates that YOSO has the capability to achieve a lower zero-shot COCO FID compared to GigaGAN. However, we would like to highlight that our primary goal in this work is to develop a one-step text-to-image model with better text-image alignment and visual quality, rather than targeting the SOTA zero-shot COCO FID score. Hope this makes things clearer.

[c] Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation, NeurIPS 2023.

Comment

Dear Reviewer RjVy,

We sincerely appreciate your thorough review and valuable feedback. We fully understand you might be quite busy. However, as the discussion deadline is approaching, would you mind checking our follow-up additional experiments and response? We would like to know if our recent efforts have addressed your remaining concerns. We also welcome any further comments or discussions you may have.

Thank you once again for your time and effort in reviewing our paper.

Many thanks,

Authors

Comment

As the discussion period is nearing its end, we kindly request Reviewer RjVy to check our additional response. If there are any further questions or matters to discuss, please don’t hesitate to let us know.

Thanks again for your time and effort in reviewing our paper.

Review (Rating: 6)

This paper presents YOSO (You Only Sample Once), a novel approach for high-quality, one-step text-to-image generation that enhances training stability and efficiency by combining elements from diffusion and GAN models. YOSO’s design includes Self-Cooperative Learning, where less noisy “teacher” samples guide more noisy “student” samples to stabilize training, along with a Decoupled Scheduler that applies separate optimization schedules to Adversarial Loss and Consistency Loss, improving sample quality. Additionally, Informative Prior Initialization (IPI) provides a more data-informed starting point for faster convergence and reduced artifacts, while Quick Adaptation facilitates smooth adjustments to v-prediction and zero terminal SNR, essential for stable, one-step sampling. Unlike prior models that require multiple steps, YOSO achieves single-step synthesis with refined adversarial learning, using teacher-student distributions within the model to mitigate instability and mode collapse risks. This efficient and stable framework positions YOSO as a significant advancement for rapid, high-fidelity text-to-image generation.

Strengths

  1. The proposed self-cooperative technique to train a one-step sampler seems novel in the diffusion literature, and the authors show it to be very effective. The fact that the less noisy (teacher) samples guide more noisy (student) samples during training is quite interesting.

  2. It is interesting to see the connection between the self-cooperative objective and consistency models.

  3. The experimental results are quite strong. They outperform other one-step baseline models on CIFAR-10 and also show improvement in the large-scale text-to-image setting.

  4. The mathematical analysis is clear and robust, supporting each of the contributions. The derivations for Self-Cooperative Learning and the teacher-student dynamic seem well-structured.

Weaknesses

  1. Lack of the ablation study on training techniques. The paper introduces training techniques like Informative Prior Initialization (IPI) and v-prediction, which are valuable training methods that could also benefit other models. However, the lack of an ablation study on these components makes it difficult to assess their specific impact and to directly compare YOSO's performance with other models that could potentially leverage these techniques.

  2. Hyperparameter sensitivity. The timestep difference between the teacher and student latents, a key factor in the Decoupled Scheduler, is a hyperparameter that directly impacts training stability and eventual performance. Does it need to be tuned for different datasets? Could you show how robust the method is to this hyperparameter?

[minor comments]

  1. Table 2 caption: text-to-to -> text-to
  2. Eq (3): remove ‘x’
  3. In Eq (3): why did you use different lambda for different t?

Questions

Questions are embedded in the weakness section.

Comment

Thank you for your time and effort in reviewing our paper. We very much appreciate your insightful comments and your recognition of our work. We hereby address the concerns below.

Lack of the ablation study on training techniques. The paper introduces training techniques like Informative Prior Initialization (IPI) and v-prediction, which are valuable training methods that could also benefit other models. However, the lack of an ablation study on these components makes it difficult to assess their specific impact and to directly compare YOSO's performance with other models that could potentially leverage these techniques.

Thank you for acknowledging our technical contributions. We would like to emphasize that we have conducted comprehensive ablation studies on most of our proposed techniques (including IPI, the latent perceptual loss, the decoupled scheduler, and the annealing strategy) for text-to-image generation, with results presented in Table 4 of our paper. Here we conducted additional ablation experiments on the quick adaptation to v-prediction using PixArt-$\alpha$. We evaluate the variants by training PixArt-$\alpha$ for 5k iterations and using 1-step generation. The results are shown below:

| Method | HPS ↑ | AeS ↑ |
| --- | --- | --- |
| $\epsilon$-prediction | 18.03 | 4.93 |
| naive v-prediction | 22.37 | 5.62 |
| quick adaptation to v-prediction | 25.39 | 6.01 |

It can be seen that using $\epsilon$-prediction alone for one-step generation yields inferior performance, while naive v-prediction (directly using v-prediction in training) achieves decent results. However, quick adaptation to v-prediction performs significantly better, validating the effectiveness of our proposed quick adaptation to v-prediction.
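For readers unfamiliar with the parameterizations being ablated, the standard v-prediction target (Salimans & Ho, 2022) can be written as below; this sketch only defines the target and its inversion, not the paper's quick-adaptation schedule, and the tensor shapes are assumed to be 4D image latents.

```python
# Standard v-parameterization helpers (assumed latents of shape [B, C, H, W]).
import torch

def v_target(x0, noise, alphas_cumprod, t):
    """v = sqrt(alpha_bar_t) * eps - sqrt(1 - alpha_bar_t) * x0."""
    a = alphas_cumprod[t].sqrt().view(-1, 1, 1, 1)
    s = (1.0 - alphas_cumprod[t]).sqrt().view(-1, 1, 1, 1)
    return a * noise - s * x0

def x0_from_v(x_t, v, alphas_cumprod, t):
    """Invert a v-prediction: x0 = sqrt(alpha_bar_t) * x_t - sqrt(1 - alpha_bar_t) * v."""
    a = alphas_cumprod[t].sqrt().view(-1, 1, 1, 1)
    s = (1.0 - alphas_cumprod[t]).sqrt().view(-1, 1, 1, 1)
    return a * x_t - s * v
```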

Hyperparameter sensitivity. The timestep difference between the teacher and student latents, a key factor in the Decoupled Scheduler, is a hyperparameter that directly impacts training stability and eventual performance. Does it need to be tuned for different datasets? Could you show how robust the method is to this hyperparameter?

Good comment. In all experiments in our paper, we consistently used 250 as the timestep difference between the teacher and student latents in our cooperative adversarial loss. This shows that the hyperparameter does not need to be tuned across settings. Nevertheless, we are glad to provide experiments with varying timestep differences. Specifically, we used LoRA to fine-tune SD v1.5 with a batch size of 32 for 3000 iterations (note that we used significantly fewer computational resources than for the YOSO-SD-LoRA reported in the paper, resulting in lower absolute performance, but this does not affect the relative comparison). We report the results below:

| Model | Timestep difference | HPS ↑ | AeS ↑ |
| --- | --- | --- | --- |
| YOSO | 150 | 25.94 | 5.89 |
| YOSO | 200 | 26.37 | 6.03 |
| YOSO | 250 | 26.04 | 5.91 |
| YOSO | 300 | 25.88 | 5.87 |

The results show that YOSO can achieve consistently good performance by varying the timestep difference.
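As a small illustration of the setup being ablated here, the fixed gap could be drawn per batch as sketched below; the sampling range and the way it is clamped are our assumptions, not the exact schedule in the paper.

```python
# Hypothetical sketch: draw student timesteps and pair them with less-noisy teacher
# timesteps a fixed gap earlier (gap = 250 in the paper's default setting).
import torch

def sample_timestep_pairs(batch_size, num_train_timesteps=1000, gap=250, device="cuda"):
    t_student = torch.randint(gap, num_train_timesteps, (batch_size,), device=device)
    t_teacher = t_student - gap  # the less-noisy level used as the self-generated "teacher"
    return t_student, t_teacher
```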

In Eq (3): why did you use different lambda for different t?

This is because samples with higher noise have a lower signal-to-noise ratio, making it more difficult to reconstruct samples from them and resulting in higher variance during learning. Therefore, we need a weight that specifies the learning importance of different timesteps. It is worth noting that this approach is widely used in diffusion [a] and diffusion distillation [b,c] papers.

[a] Elucidating the Design Space of Diffusion-Based Generative Models, NeurIPS 2022.

[b] Consistency Models, ICML 2023.

[c] Improved Techniques for Training Consistency Models, ICLR 2024.
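One common instantiation of such a timestep-dependent weight (not necessarily the exact $\lambda_t$ used in Eq. (3)) is a truncated signal-to-noise-ratio weighting, sketched below.

```python
# Truncated SNR weighting: high-noise timesteps (low SNR) receive smaller weights.
import torch

def snr_weight(alphas_cumprod, t, gamma=5.0):
    snr = alphas_cumprod[t] / (1.0 - alphas_cumprod[t])  # SNR(t) = alpha_bar_t / (1 - alpha_bar_t)
    return torch.clamp(snr, max=gamma)                   # min-SNR-style truncation
```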

Comment

Thank you for your response. The authors have adequately addressed my concerns. While this work represents a solid contribution to the field, the proposed method and performance do not strike me as particularly novel or exciting, considering the many recent works that have achieved few-step inference with high-quality images. Anyway, I think this paper meets the standard for acceptance at ICLR, and I maintain my recommendation of borderline accept.

Comment

We thank you for your recognition of our work as a solid contribution to the field. Thanks again for your time and effort in reviewing our work.

Review (Rating: 6)

This paper combines diffusion models and GANs to propose the YOSO framework for one-step image generation. It employs various training strategies to enhance training stability and has been validated on both unconditional generation and text-to-image tasks.

Strengths

  • In terms of qualitative metrics, YOSO demonstrates certain advantages across various tasks and metrics, and it is capable of transferring to higher resolutions, image editing, and conditional generation tasks.

  • The simple and effective training strategies proposed for practical implementation provide a useful reference for other generative tasks.

  • The paper presents many analyses and is easy to follow.

Weaknesses

  • The presentation of the visualization results in this paper is somewhat insufficient; it should include a comparison of YOSO-PixArt-alpha with other few-step models, such as PixArt-delta, and it might be worth comparing with the original models as well.

  • The clarity of the narrative in this paper needs improvement; currently, the narrative logic resembles an experimental report. When encountering issues in a new task (unconditional and text-to-image generation / finetuning) or a new model (U-Net or DiT based), a specific strategy is proposed to address the problem. This approach may not be conducive to the coherence expected in a complete paper.

  • Regarding writing:

    • The abbreviation "NEF" is not explained.
    • Should the steps for Hyper-SD-LoRA and YOSO-LoRA in Table 2 be equal to 1?
    • Figure 5 does not specify the comparative configuration for model inference—does it keep the prompt unchanged while changing the seed?
    • The reference formatting in the paper should be kept consistent as possible.

Questions

  • Why were only SD-LoRA and PixArt-Full experiments conducted in the experimental setup, without including SD-Full and PixArt-LoRA related experiments?

  • I am curious about how the zero-shot one-step 1024 resolution is effective. As far as I know, 1024 has a different ar_size compared to 512, and the posemb scale is also different. Wouldn't these factors significantly impact the weight distribution of the 1024 model?

  • Without retraining, will increasing the sampling steps, similar to LCM, improve the results? If further enhancements are desired for better generated image quality, what would be the subsequent optimization directions based on the proposed approach?

Comment

Thank you for your time and effort in reviewing our paper. We very much appreciate your acknowledgement of our proposed training strategies and the advantages over qualitative metrics. We hereby address the concerns below.

The presentation of the visualization results in this paper is somewhat insufficient; it should include a comparison of YOSO-PixArt-alpha with other few-step models, such as PixArt-delta, and it might be worth comparing with the original models as well.

Thanks for your advice. We have included a visual comparison with PixArt-$\alpha$ (the original teacher model) and PixArt-$\delta$ at 1024 resolution in Figure 9 in the appendix. As can be observed, our method produces significantly better image quality than PixArt-$\delta$ and achieves results comparable to the multi-step teacher model.

The abbreviation "NEF" is not explained.

The "NFE" denotes Number of Function Evaluations. We have explained this abbreviation in our revision.

Should the steps for Hyper-SD-LoRA and YOSO-LoRA in Table 2 be equal to 1?

Not really. Table 2 evaluates the performance of Hyper-SD-LoRA and YOSO-LoRA in both 1-step and 4-step generation.

Figure 5 does not specify the comparative configuration for model inference—does it keep the prompt unchanged while changing the seed?

Yes. We keep the prompt unchanged while changing the random seed. The random seed is the same for each image among models.

Why were only SD-LoRA and PixArt-Full experiments conducted in the experimental setup, without including SD-Full and PixArt-LoRA related experiments?

In short, this is for fair comparison, following the same experimental setup as existing work. Specifically, we found that many works [a,b] used LoRA fine-tuning on SD 1.5, so for a fair comparison we also applied only LoRA fine-tuning for YOSO-SD. As for PixArt, previous work [c] used full fine-tuning, so we also adopted full fine-tuning to achieve the best possible results and ensure a fair comparison.

[a] Trajectory Consistency Distillation: Improved Latent Consistency Distillation by Semi-Linear Consistency Function with Trajectory Mapping

[b] Hyper-SD: Trajectory Segmented Consistency Model for Efficient Image Synthesis, NeurIPS 2024.

[c] PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation, ECCV 2024.

I am curious about how the zero-shot one-step 1024 resolution is effective. As far as I know, 1024 has a different ar_size compared to 512, and the posemb scale is also different. Wouldn't these factors significantly impact the weight distribution of the 1024 model?

This is because the weights of the 1024 model are inherited from the 512 model, and there is high similarity between their weights. We only merged the compatible model weights. In particular, the overall cosine similarity between the weights of the 512 model and those of the 1024 model is 0.9902. Based on this zero-shot adaptation capability to the 1024 model and YOSO's rapid convergence (requiring only around 20k iterations), we hypothesize that the weights learned by YOSO are primarily responsible for generation acceleration, activating an inherent capability within the pre-trained diffusion model. This might also explain why YOSO-SD-LoRA shows some compatibility with ControlNet and with unseen customized models fine-tuned from SD.

Without retraining, will increasing the sampling steps, similar to LCM, improve the results? If further enhancements are desired for better generated image quality, what would be the subsequent optimization directions based on the proposed approach?

Yes. YOSO can be used with different numbers of sampling steps. Table 2 reports the performance of YOSO for both 1-step and 4-step generation: the 4-step generation achieves a better HPS of 30.50, compared to 28.33 for 1-step generation. This indicates that YOSO benefits from more sampling steps, similar to LCM. To further improve image quality, we may need to develop objectives specifically crafted for few-step generation, or combine YOSO with other advanced techniques, such as distribution matching via score distillation [d, e] and human feedback learning [f].
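
As a reading aid, the snippet below sketches the generic LCM-style multi-step recipe (predict the clean sample, re-noise it to a lower level, and denoise again) that such few-step sampling follows. It is a schematic under a variance-preserving schedule, not YOSO's exact sampler; `model`, `alphas_cumprod`, and the timestep schedule are placeholders.

```python
# Schematic LCM-style multi-step sampling with a one-step x0-predictor.
# This illustrates the general recipe only, not YOSO's exact sampler.
import torch

@torch.no_grad()
def multistep_sample(model, shape, timesteps=(999, 749, 499, 249),
                     alphas_cumprod=None, device="cuda"):
    x = torch.randn(shape, device=device)                 # start from pure noise
    for i, t in enumerate(timesteps):
        x0_pred = model(x, t)                             # one-step prediction of the clean sample
        if i + 1 < len(timesteps):
            s = timesteps[i + 1]
            a_s = alphas_cumprod[s].sqrt()
            sigma_s = (1 - alphas_cumprod[s]).sqrt()
            # re-noise the prediction to the next (lower) noise level
            x = a_s * x0_pred + sigma_s * torch.randn_like(x0_pred)
        else:
            x = x0_pred                                    # final step: return the clean prediction
    return x
```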

[d] ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation, NeurIPS 2023.

[e] Diff-Instruct: A Universal Approach for Transferring Knowledge From Pre-trained Diffusion Models, NeurIPS 2023.

[f] ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation, NeurIPS 2023.

Comment

Thank you for your response.

  • In reference to Tab. 2, could you provide visualized comparison results of YOSO and YOSO-LoRA at different steps (1 & 4) to confirm the performance improvement mentioned in your reply?

  • Regarding the zero-shot one-step 1024 resolution, in your response, does "YOSO's rapid convergence (requiring only around 20k iterations)" refer to the learning of the merged parameters for the 1024 model? Therefore, does "zero-shot" here not imply no specific training is needed?

  • The author did not specifically address the concern raised in W2. If possible, could you please provide some clarification on this matter?

Comment

Thank you for your reply. We are glad to provide further clarification to address your concerns.

In reference to Tab. 2, could you provide visualized comparison results of YOSO and YOSO-LoRA at different steps (1 & 4) to confirm the performance improvement mentioned in your reply?

Thanks for your advice. We have included a visualization comparison in Figure 10 in the appendix. It can be seen that the 4-step samples have better visual quality than the 1-step samples.

Regarding the zero-shot one-step 1024 resolution, in your response, does "YOSO's rapid convergence (requiring only around 20k iterations)" refer to the learning of the merged parameters for the 1024 model? Therefore, does "zero-shot" here not imply no specific training is needed?

The "YOSO's rapid convergence (requiring only around 20k iterations)" refers to the training of YOSO on 512 resolution by initializing from PixArt-α\alpha-512. The merged model does not need extra training, obtaining by WYOSO1024=Wpixart1024Wpixart512+WYOSO512W_{YOSO-1024} = W_{pixart-1024}-W_{pixart-512}+W_{YOSO-512}. In particular, "zero-shot" indicates we can train YOSO on 512 resolution by initializing from PixArt-α\alpha-512, then merge with PixArt-α\alpha-1024 for one-step generation on 1024 resolution without extra training.

The clarity of the narrative in this paper needs improvement; currently, the narrative logic resembles an experimental report. When encountering issues in a new task (unconditional and text-to-image generation / finetuning) or a new model (U-Net or DiT based), a specific strategy is proposed to address the problem. This approach may not be conducive to the coherence expected in a complete paper.

We clarify that our main focus is text-to-image generation. Unlike existing works [a,b,c] that focus on adapting previous methods [d,e,f] to text-to-image generation, our YOSO is a new model. To validate the effectiveness of the YOSO framework, we first study a simpler task (i.e., unconditional generation), and then adapt YOSO to text-to-image generation with additional techniques. Our goal is to clearly showcase the fundamental design of the YOSO framework, free from the complexities introduced by the additional techniques required for text-to-image generation.

[a] One-step Diffusion with Distribution Matching Distillation, CVPR 2024.

[b] SwiftBrush: One-Step Text-to-Image Diffusion Model with Variational Score Distillation, CVPR 2024.

[c] InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation, ICLR 2024.

[d] ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation, NeurIPS 2023.

[e] Diff-Instruct: A Universal Approach for Transferring Knowledge From Pre-trained Diffusion Models, NeurIPS 2023.

[f] Learning to Generate and Transfer Data with Rectified Flow, ICLR 2023.

Comment

I appreciate the authors for the response. It has addressed most of my concerns and I will raise my rate.

Comment

We sincerely thank the reviewers for their time and thoughtful feedback on our work.

In this work, we introduce YOSO, a novel framework for combining diffusion models and GANs via the proposed self-cooperative adversarial learning. Our work demonstrates strong experimental results and broad applicability. We appreciate the reviewers' acknowledgment. Specifically: "The proposed self-cooperative technique to train a one-step sampler seems novel in the diffusion literature, and the author proved it to be very effective" (Reviewer gStt); "YOSO demonstrates certain advantages across various tasks and metrics, and it is capable of transferring to higher resolutions, image editing, and conditional generation tasks" (Reviewer tJNk); "The visualization results look very good" (Reviewer RjVy); "YOSO’s compatibility with LoRA allows it to produce high-resolution images without the significant computational burden typically associated with GAN-based models" (Reviewer SYJU); and YOSO "tackles a relevant application (i.e., ControlNet) with one-step" (Reviewer Awjz).

In what follows, following the reviewers' suggestions and comments, we summarize the additional experiments we have conducted to address some common concerns; these will be incorporated into the revision.

  1. Ablation Studies and Training Techniques:
    As suggested by the reviewers, there is a need for ablation studies to assess the impact of specific training techniques, including Informative Prior Initialization (IPI), v-prediction, the Decoupled Scheduler, and Latent Perceptual Loss. We emphasize that comprehensive ablation studies were conducted and included in the original submission (Table 4) on text-to-image generation. Additionally, we have performed new experiments to further validate the effectiveness of our proposed techniques, such as the quick adaptation to v-prediction and varying timestep differences in the cooperative adversarial loss. These results consistently demonstrate the robustness and performance improvements provided by our methods.

Table A. Ablation on the timestep difference. The setting is LoRA fine-tuning of SD 1.5 for 3k iterations.

| Model | Skip | HPS ↑ | AeS ↑ |
|-------|------|-------|-------|
| YOSO | 150 | 25.94 | 5.89 |
| YOSO | 200 | 26.37 | 6.03 |
| YOSO | 250 | 26.04 | 5.91 |
| YOSO | 300 | 25.88 | 5.87 |
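
To help readers picture what the Skip value controls, the snippet below sketches one generic way a timestep gap between a noisier level t and a cleaner partner level s = t − Skip can be drawn inside a training step. This is a schematic reading of the ablated hyperparameter, not the paper's exact cooperative adversarial loss.

```python
# Schematic only: drawing a pair of timesteps separated by a fixed gap ("Skip").
# This illustrates the hyperparameter ablated in Table A; it is NOT the exact
# cooperative adversarial objective of the paper.
import torch

def sample_timestep_pair(batch_size, num_train_timesteps=1000, skip=200, device="cuda"):
    # noisier timestep t in [skip, num_train_timesteps)
    t = torch.randint(skip, num_train_timesteps, (batch_size,), device=device)
    # cleaner partner timestep, `skip` steps below t
    s = t - skip
    return t, s

# usage sketch: t, s = sample_timestep_pair(bs, skip=200)
# the generator branch at the cleaner level s can then serve as the cooperative
# reference for the adversarial term evaluated at the noisier level t.
```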

Table B. Ablation on the model parameterization. The setting is fine-tuning PixArt-$\alpha$ for 5k iterations.

| Method | HPS ↑ | AeS ↑ |
|--------|-------|-------|
| $\epsilon$-prediction | 18.03 | 4.93 |
| naive v-prediction | 22.37 | 5.62 |
| quick adaptation to v-prediction | 25.39 | 6.01 |
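
For context on Table B, the standard v-parameterization (Salimans & Ho, 2022) relates $v$, $\epsilon$, and $x_0$ via $v_t = \alpha_t \epsilon - \sigma_t x_0$ under $x_t = \alpha_t x_0 + \sigma_t \epsilon$, so a quick adaptation stage can briefly fine-tune an $\epsilon$-prediction checkpoint toward v targets. The snippet below lists only these standard conversions; it is not the paper's training code.

```python
# Standard v-parameterization identities (Salimans & Ho, 2022), for reference only.
# Convention: x_t = alpha_t * x0 + sigma_t * eps, with alpha_t = sqrt(alphas_cumprod[t])
# and sigma_t = sqrt(1 - alphas_cumprod[t]). This is not the paper's training code.

def v_target(eps, x0, alpha_t, sigma_t):
    # regression target when fine-tuning an eps-prediction model toward v-prediction
    return alpha_t * eps - sigma_t * x0

def x0_from_v(x_t, v, alpha_t, sigma_t):
    # clean-sample prediction recovered from a v-prediction output
    return alpha_t * x_t - sigma_t * v

def eps_from_v(x_t, v, alpha_t, sigma_t):
    # noise prediction recovered from a v-prediction output
    return sigma_t * x_t + alpha_t * v
```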
  2. Comparison to advanced Diffusion-GAN hybrid models:
    We provide an additional comparison to advanced Diffusion-GAN hybrid models on CIFAR-10. The results indicate the clear advantage of our proposed YOSO.

| Model | FID ↓ |
|-------|-------|
| Diffusion-GAN | 3.19 |
| Diffusion2GAN | 3.16 |
| YOSO (Ours) | 1.81 |

  3. Training YOSO from scratch on ImageNet-64:
    To further demonstrate YOSO's ability to train from scratch, we conducted additional experiments on ImageNet-64 using the EDM architecture.

| Method | NFE | FID (↓) |
|--------|-----|---------|
| BigGAN-deep | 1 | 4.06 |
| StyleGAN-XL | 1 | 1.52 |
| ADM | 250 | 2.07 |
| EDM | 511 | 1.36 |
| CT | 1 | 13.0 |
| iCT | 1 | 3.25 |
| YOSO (Ours) | 1 | 2.65 |
  4. Zero-shot FID on COCO:
    While we argue that zero-shot FID on COCO has limitations as a metric for evaluating modern text-to-image models (Pick-a-Pic [a] found that zero-shot COCO FID can be negatively correlated with human preferences in certain cases), we provide these results for completeness. Our method demonstrates competitive performance compared to existing SOTA models in terms of FID and CLIP Score, further validating its effectiveness.

| Model | Latency ↓ | FID-30k ↓ | CLIP Score-30k ↑ |
|-------|-----------|-----------|------------------|
| GANs | | | |
| StyleGAN-T | 0.10s | 13.90 | - |
| GigaGAN | 0.13s | 9.09 | - |
| Original Diffusion | | | |
| SD v1.5 | 2.59s | 13.45 | 0.32 |
| Diffusion Distillation | | | |
| LCM (4-step) | 0.26s | 23.62 | 0.30 |
| Diffusion2GAN | 0.09s | 9.29 | - |
| UFOGen | 0.09s | 12.78 | - |
| InstaFlow-0.9B | 0.09s | 13.27 | 0.28 |
| DMD | 0.09s | 14.93 | 0.32 |
| SwiftBrush | 0.09s | 16.67 | 0.29 |
| YOSO (Ours) | 0.09s | 12.35 | 0.31 |
| YOSO (Ours, retrained, full fine-tuned, selected by FID) | 0.09s | 9.61 | 0.31 |

[a] Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation, NeurIPS 2023.
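
For completeness, zero-shot COCO FID-30k and CLIP score are typically computed along the lines sketched below. The library choices (clean-fid, torchmetrics), the CLIP backbone, and the directory paths are placeholders, and the exact protocol (image resizing, caption subset, CLIP model) may differ from the one used for the numbers above.

```python
# Rough sketch of a zero-shot COCO FID-30k / CLIP-score evaluation.
# Library choices, CLIP backbone, and directory paths are placeholders.
import torch
from cleanfid import fid
from torchmetrics.multimodal.clip_score import CLIPScore

# 1) FID between 30k generated images and 30k (resized) COCO validation images
fid_30k = fid.compute_fid("generated_images/", "coco_val_30k/")
print(fid_30k)

# 2) CLIP score between each generated image and its prompt
clip_metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
# for image_tensor, prompt in dataset:        # uint8 CHW image tensors and caption strings
#     clip_metric.update(image_tensor, prompt)
# clip_score = clip_metric.compute() / 100.0  # divide by 100 if reporting on a 0-1 scale
```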

We appreciate the reviewers' constructive comments, which have allowed us to improve the clarity and comprehensiveness of our work. We believe that our additional experiments and clarifications address the concerns raised and further demonstrate the robustness, efficiency, and effectiveness of YOSO. Thank you again for your insightful feedback!

Comment

We thank the area chair and all reviewers for your time, insightful suggestions, and valuable comments. Your suggestions have been invaluable in refining our work, and we deeply appreciate the time and effort you dedicated to reviewing our paper. We have carefully addressed all points in our response.

We are encouraged by the reviewers’ positive feedback on various aspects of our work:

  • Reviewer gStt: "The proposed self-cooperative technique to train a one-step sampler seems novel in the diffusion literature, and the author proved it to be very effective".
  • Reviewer tJNk: "YOSO demonstrates certain advantages across various tasks and metrics, and it is capable of transferring to higher resolutions, image editing, and conditional generation tasks".
  • Reviewer RjVy: "The visualization results look very good".
  • Reviewer SYJU: "YOSO’s compatibility with LoRA allows it to produce high-resolution images without the significant computational burden typically associated with GAN-based models".
  • Reviewer Awjz: YOSO "tackles a relevant application (i.e., ControlNet) with one-step".

We are also pleased that our clarifications and additional experiments have been well-received:

  • Reviewer gStt: "I think this paper meets the standard for acceptance in ICLR."
  • Reviewer SYJU: "I have no further questions and raised my score."
  • Reviewer tJNk: "It has addressed most of my concerns, and I will raise my score."

Additional Experiments and Improvements

We also sincerely thank the reviewers for their valuable suggestions, which helped us identify areas for improvement. In response to their feedback, we have made several additional experiments to further strengthen our work:

  • Ablation Studies and Training Techniques (Reviewer gStt): We performed experiments validating the quick adaptation to v-prediction and varying timestep differences in the cooperative adversarial loss. The results consistently demonstrate the robustness and performance improvements of our methods.
  • Comparison to advanced Diffusion-GAN hybrid models (Reviewer RjVy): We provide an additional comparison to advanced Diffusion-GAN hybrid models on CIFAR-10 and the high-resolution dataset FFHQ-1024. The results indicate the clear advantage of our proposed YOSO.
  • Training YOSO from scratch on ImageNet-64 (Reviewer SYJU): To further demonstrate YOSO's ability to train from scratch, we conducted additional experiments on ImageNet-64. The results show that YOSO achieves performance comparable to state-of-the-art diffusion models and GANs.
  • Super-Resolution (128→512) (Reviewer RjVy): Experiments demonstrate that YOSO achieves performance comparable to the SD upscaler with a ~50x speedup.
  • Zero-shot FID on COCO (Reviewer Awjz, RjVy): We have provided a comparison in terms of zero-shot COCO FID. The results show that YOSO achieves competitive performance compared to existing state-of-the-art models. We also observe that, with longer training, YOSO achieves a zero-shot COCO FID of 8.90, lower than that of GigaGAN.
  • Layer Selection for Latent Perceptual Loss (LPL) (Reviewer Awjz): Following the reviewer's advice, we provide a more detailed analysis and additional experiments; the ablation studies confirm that using bottleneck features for LPL consistently outperforms other layer configurations. Moreover, regardless of which layers are used to compute LPL, it is significantly better than MSE (a generic sketch of the bottleneck-feature loss follows this list).
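
As referenced in the LPL bullet above, the sketch below shows one generic way to compute a feature-space loss on the UNet bottleneck: register a forward hook on the mid-block of a frozen diffusers UNet and compare the features of generated and target latents with an MSE. It illustrates the general recipe of comparing latents in feature space rather than with raw MSE; it is not the exact LPL implementation.

```python
# Generic sketch of a feature-space ("perceptual") loss on the UNet bottleneck,
# using a forward hook on the mid-block of a frozen diffusers UNet.
# This is NOT the exact LPL code; it only illustrates the idea.
import torch
import torch.nn.functional as F
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="unet")
unet.requires_grad_(False).eval()

_features = {}
def _hook(module, inputs, output):
    _features["mid"] = output
unet.mid_block.register_forward_hook(_hook)

def bottleneck_perceptual_loss(fake_latents, real_latents, t, text_emb):
    # run both latents through the frozen UNet and compare mid-block features
    unet(fake_latents, t, encoder_hidden_states=text_emb)
    feat_fake = _features["mid"]
    with torch.no_grad():
        unet(real_latents, t, encoder_hidden_states=text_emb)
        feat_real = _features["mid"]
    return F.mse_loss(feat_fake, feat_real)
```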

We believe our additional experiments and clarifications comprehensively address the remaining concerns raised by Reviewer RjVy (regarding the additional high-resolution comparison and zero-shot COCO FID) and Reviewer Awjz (regarding layer selection in LPL). Additionally, we explored accelerating the SD upscaler with YOSO, addressing Reviewer RjVy's interest in this aspect.

We thank all the reviewers again for their time and effort in reviewing our paper and are committed to addressing every issue raised.

Sincerely,

Authors

AC Meta-Review

The paper introduces a framework for efficient and high-quality one-step image generation by combining diffusion models with GANs. The authors propose several techniques, including self-cooperative learning, a decoupled scheduler, and informative prior initialization. The paper demonstrates promising results on multiple benchmark datasets.

Following the author-reviewer discussion, two reviewers increased their ratings. However, the other reviewers leaned toward rejection, with concerns about: 1) high-resolution generation on certain datasets, 2) inferior FID results, and 3) insufficient ablations to understand the optimal layers/features used for optimization. Subsequently, the authors provided additional responses addressing these three concerns.

Given the promising results, the paper is recommended for acceptance. However, the authors should include the discussions with reviewers in the final version.

Additional Comments from the Reviewer Discussion

All reviewers actively participated in the authors-reviewers discussion. The ACs carefully reviewed all responses. While the reviewers raised some concerns, the authors addressed these by providing sufficient experimental results. Consequently, the paper is recommended for acceptance.

Final Decision

Accept (Poster)