Applying Guidance in a Limited Interval Improves Sample and Distribution Quality in Diffusion Models
We improve quantitative and qualitative results of image-generating diffusion models by applying classifier-free guidance in a limited interval.
Abstract
Reviews and Discussion
This paper discovers that the effect of classifier-free guidance in image-generating diffusion models varies with noise levels. It proposes a very simple method of performing guidance only at intermediate noise levels, demonstrating that this method can improve sample quality, diversity, and sampling cost both quantitatively and qualitatively. The quantitative effects are shown using EDM2 and DiT models trained on ImageNet, while the qualitative effects are demonstrated using the pre-trained Stable Diffusion XL.
Strengths
Originality
- The toy examples in Figures 1 and 2 are very intuitive and help in understanding the behavior of guidance at high, middle, and low noise levels. This interpretation of classifier-free guidance is novel.
Significance
- The method is very simple and can be easily applied to other diffusion models.
- The FID and FD_{DINOv2} in Table 1 are strong.
Quality
- The ablation studies in Figures 3 and 4 effectively demonstrate the efficacy of the proposed method.
- The qualitative results in Figures 5 through 8 clearly show the behavior of guidance at high, middle, and low noise levels in Stable Diffusion XL.
Weaknesses
While the paper shows good intuition and strong results, several questions remain.
- Does the optimal interval vary with the model's performance? The results in Table 1 imply this, as the optimal intervals for EDM2 and DiT differ, and the optimal guidance weight for DiT is higher. If the optimal interval varies with model performance, the proposed method will require an exhaustive hyperparameter search for each diffusion model, which is inconvenient.
- Will guidance at high noise levels be detrimental for conditions denser than the class and text conditions considered in this paper? Examples include InstructPix2Pix [A] using image conditions, ControlNet [B] using various spatial conditions, and Stable Video Diffusion [C] generating videos from images. Additionally, a text-to-image model based on the T5 encoder (DeepFloyd IF [D]), which provides richer text representation than the CLIP text encoder of Stable Diffusion XL, is also an example.
[A] Brooks et al., InstructPix2Pix: Learning to Follow Image Editing Instructions, CVPR 2023.
[B] Zhang et al., Adding Conditional Control to Text-to-Image Diffusion Models, ICCV 2023.
[C] Blattmann et al., Scaling Latent Video Diffusion Models to Large Datasets.
[D] DeepFloyd, https://huggingface.co/DeepFloyd
Questions
Looking at Figures 1 and 2, the guidance scale at each interval significantly impacts the ODE trajectory. Do the authors think this tendency would also be observed in flow matching generative models [E] (or rectified flows [F]) that learn straight trajectories, compared to previous VP and VE diffusion models?
[E] Lipman et al., Flow Matching for Generative Modeling, ICLR 2023.
[F] Liu et al., Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow, ICLR 2023.
Limitations
The authors mention in the paper checklist that they discussed limitations in section 4, but the limitations are not clearly presented in that section. It is recommended to address the limitations in the conclusion section as well.
Thank you for the review. We will next address the explicit questions:
“Does the optimal interval vary with the model's performance? The results in Table 1 imply this, as the optimal intervals for EDM2 and DiT differ, and the optimal guidance weight for DiT is higher. If the optimal interval varies with model performance, the proposed method will require an exhaustive hyperparameter search for each diffusion model, which is inconvenient.”
Yes, based on our experiments with EDM2 and DiT the optimal guidance interval appears to be model specific, and it probably depends on the model’s performance as well. Finding the optimal sampling hyperparameters requires only model evaluations, not re-training, but it can indeed become somewhat expensive if done in a naive fashion. However, there are several ways to reduce the cost.
First, the upper and lower guidance limits can be determined separately, without the need for a two-dimensional search. This happens by first establishing the optimal upper limit by keeping the lower limit as zero, which is very close to optimal except for the computational cost (see Figure 4, right). Once the optimal upper limit is known, the lower limit can be determined afterwards. Second, the lower limit seems to affect the output in a monotonic fashion, making binary search applicable there. Finally, one can reduce the sample size of FID evaluation from 50k to, say, 5k, at least for an initial run, which accelerates the process by 10x.
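The two-stage procedure described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' code: `evaluate_fid` is a hypothetical callable that samples a reduced set of images (for example 5k) with the given interval and returns the FID, the candidate grids are supplied by the user, and the stopping rule for the lower limit (the largest value whose FID stays within a tolerance `tol` of the FID at lower limit zero) is one possible reading of the monotonicity remark.

```python
def search_interval(evaluate_fid, hi_candidates, lo_candidates, tol=0.01):
    """Two-stage search for a guidance interval (sigma_lo, sigma_hi].

    evaluate_fid(sigma_lo, sigma_hi) is a hypothetical, user-supplied function.
    """
    # Stage 1: sweep the upper limit while keeping the lower limit at zero.
    best_hi = min(hi_candidates, key=lambda hi: evaluate_fid(0.0, hi))
    baseline = evaluate_fid(0.0, best_hi)

    # Stage 2: binary search over the sorted lower-limit candidates.
    # Assumption: FID does not improve as sigma_lo increases, so the set of
    # acceptable candidates forms a prefix of the sorted list.
    lo_sorted = sorted(c for c in lo_candidates if c < best_hi)
    left, right = 0, len(lo_sorted) - 1
    best_lo = 0.0
    while left <= right:
        mid = (left + right) // 2
        if evaluate_fid(lo_sorted[mid], best_hi) <= baseline + tol:
            best_lo = lo_sorted[mid]   # still acceptable, try a larger limit
            left = mid + 1
        else:
            right = mid - 1            # too much degradation, go smaller
    return best_lo, best_hi
```

With H upper-limit candidates and L lower-limit candidates, this needs about H + log2(L) evaluations instead of the H x L evaluations of a full grid.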
“Will guidance at high noise levels be detrimental for conditions denser than the class and text conditions considered in this paper? Examples include InstructPix2Pix [A] using image conditions, ControlNet [B] using various spatial conditions, and Stable Video Diffusion [C] generating videos from images. Additionally, a text-to-image model based on the T5 encoder (DeepFloyd IF [D]), which provides richer text representation than the CLIP text encoder of Stable Diffusion XL, is also an example.”
Thus far, we have not experimented with conditioning signals other than class and text conditioning using CLIP. It is likely that major changes to the conditioning scheme will impact the optimal guidance interval.
“Looking at Figures 1 and 2, the guidance scale at each interval significantly impacts the ODE trajectory. Do the authors think this tendency would also be observed in flow matching generative models [E] (or rectified flows [F]) that learn straight trajectories, compared to previous VP and VE diffusion models?”
As discussed in [1], the trajectories generated by flow matching (FM) for a white noise latent follow the score function of progressively less noisy distributions (Eq. 7 in [1]), similar to diffusion. Thus, guidance has a similar statistical interpretation and behavior in both diffusion and flow matching.
The trajectories in FM are straight in the sense that they don't take entirely unnecessary detours, but they are still necessarily curved for any non-trivial data. This can be seen, e.g., in the first rectified flows of Figs. 3 and 4 of [2]. The EDM sampler used in our experiments is similarly designed to avoid unnecessary trajectory curvature to allow efficient sampling. We therefore expect that our findings translate to flow matching.
That said, FM allows for curious variants where the latent distribution is, e.g., another set of images instead of the typical Gaussian noise. To our knowledge, all state of the art applications of FM use noise latents, but the behavior of CFG in these more exotic settings seems to be a largely unexplored topic, and it is unclear if it (or guidance interval) would be applicable at all. Similarly, iterated reflows [2] would potentially behave differently.
[1] Zheng et al.: Guided Flows for Generative Modeling and Decision Making. Arxiv preprint, 2023.
[2] Liu et al.: Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow. In Proc. ICLR, 2023.
Thank you for the detailed response. It would be helpful if the paper included a discussion on reducing the cost of hyperparameter search.
The paper provides valuable intuitions about the guidance of diffusion models and proposes a simple yet effective solution, which I believe makes it suitable for publication in NeurIPS.
However, since the proposed method still requires hyperparameter search depending on the model and conditioning scheme, it doesn't seem to be an optimal solution. Therefore, I will keep my rating.
This paper investigates the behavior of classifier-free guidance and proposes an adjustment in its application during sampling. Instead of applying a constant weight for the guidance scale across all sampling steps, the authors suggest that the guidance should be deactivated at high and low noise scales and applied only at intermediate levels. Experimental results with EDM2 and SDXL models indicate that guidance at high noise scales is detrimental, while at lower noise levels, its contribution is negligible. Thus, the guidance interval should be incorporated as part of the sampling hyperparameters for conditional diffusion models.
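As an illustration of the scheme summarized above, a minimal Python sketch of interval-limited guidance is given below. The names (`guided_denoise`, `sigma_lo`, `sigma_hi`), the half-open interval, and the convention that weight 1 means no guidance are assumptions made for this sketch; the denoiser outputs are represented as plain lists of floats.

```python
def guided_denoise(d_cond, d_uncond, sigma, w, sigma_lo, sigma_hi):
    """Combine conditional and unconditional denoiser outputs at noise level sigma.

    Inside (sigma_lo, sigma_hi] the classifier-free guidance extrapolation
    with weight w is applied. Outside the interval the conditional output is
    returned unchanged, which is the same as using weight 1.
    """
    if sigma_lo < sigma <= sigma_hi:
        return [u + w * (c - u) for c, u in zip(d_cond, d_uncond)]
    return list(d_cond)
```

Outside the interval the unconditional output is not needed at all, so a sampler built on this function can skip that network evaluation, which is where the reduction in sampling cost comes from.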
Strengths
- As classifier-free guidance is an essential part of all diffusion models, the paper takes an important step to reduce the harmful effects of high guidance scales.
- The proposed method is straightforward to implement and can be integrated into any model, making it valuable to the broader NeurIPS community.
- The authors provide thorough experiments analyzing the impact of applying CFG over an interval.
- The effectiveness of the method is also shown on toy datasets.
- The paper is well-written and easy to understand.
Weaknesses
The main weakness of the work is a lack of discussion of existing works that address the issues of high guidance scales. For example, the concept of using a non-constant guidance weight for CFG is also introduced in [1] (referred to as dynamic-CFG), and the current method of the paper is dynamic-CFG with a specific weight schedule (namely, the guidance weight w inside the guidance interval and 1 otherwise). In addition, similar ideas have been explored in [1, 2, 3] in the form of using a linear/cosine schedule for the guidance weight or annealing the conditioning vector to improve the diversity of generations at higher guidance scales. Moreover, half of the conclusion of the paper is already stated in [4] (i.e., guidance does nothing at lower noise levels). Accordingly, the paper would benefit from a more comprehensive related work section.
[1] Sadat, S., Buhmann, J., Bradley, D., Hilliges, O. and Weber, R.M., CADS: Unleashing the Diversity of Diffusion Models through Condition-Annealed Sampling. In The Twelfth International Conference on Learning Representations (2024).
[2] Chang H, Zhang H, Barber J, Maschinot AJ, Lezama J, Jiang L, Yang MH, Murphy K, Freeman WT, Rubinstein M, Li Y. Muse: Text-to-image generation via masked generative transformers. arXiv preprint arXiv:2301.00704. 2023 Jan 2.
[3] Gao, Shanghua, et al. "Masked diffusion transformer is a strong image synthesizer." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
[4] Castillo A, Kohler J, Pérez JC, Pérez JP, Pumarola A, Ghanem B, Arbeláez P, Thabet A. Adaptive guidance: Training-free acceleration of conditional diffusion models. arXiv preprint arXiv:2312.12487. 2023 Dec 19.
Questions
- Is Figure 3 compiled with a fixed EMA profile for the EDM2 model or is the EMA profile also optimized along with the guidance parameters?
- Can you explain lines 112-118 in more detail? Does that mean you always use the low and high guidance scales based on the sampling discretization? This part was not very clear to me.
- Can you also plot Precision/Recall curves for the method to show the impact of interval guidance on quality and diversity in isolation?
- One major advantage of high classifier-free guidance is image-text alignment. Does the proposed method have a noticeable impact in this area? For example, in Figure 7, it seems that some parts of the prompt are more strongly emphasized by the normal CFG (e.g., the sky is orange).
- Does the optimal guidance interval change when modifying the sampling algorithm, or are the thresholds mainly model dependent?
Limitations
The authors have addressed limitations and social impact of the work.
Thank you for the review. We will next address concerns and the explicit questions:
“The main weakness of the work is a lack of discussion of existing works that address the issues of high-guidance scales.”
Thank you for providing pointers to additional relevant previous work. We are glad to cite them. All of these papers propose guidance schedules that are continuous at high noise levels to enhance the results of their (unrelated) main methods, without evaluating the effects of these in isolation (apart from [1]). While our guidance interval can be seen as a step function schedule, this schedule was not advocated in any of these works. The closest to our method is Dynamic CFG [1] that modulates guidance weight linearly in an interval that mostly covers high noise levels. In contrast, we found guidance to be strictly detrimental in these regions, and disabling it there leads to our state-of-the-art FIDs.
In our initial experiments, we tried a linear schedule before finding it to be clearly worse than the simpler guidance interval. With EDM2-S, we obtained FID 2.02 at best with a linear schedule, whereas guidance interval achieves FID 1.68. Therefore, we focused the paper specifically on studying the guidance interval and its effects on the resulting image distribution. We also provide novel data confirming that guidance is harmful at high noise levels through our toy example and experiments with EDM2 and SD-XL.
[1] Sadat et al.: CADS: Unleashing the Diversity of Diffusion Models through Condition-Annealed Sampling. In Proc. ICLR 2024.
“1. Is Figure 3 compiled with a fixed EMA profile for the EDM2 model or is the EMA profile also optimized along with the guidance parameters?”
We use a fixed EMA profile in Figure 3. For EDM2-S, we use EMA lengths 2.5% and 8.5% respectively for FID and FD_DINOv2, and for EDM2-XXL, we use EMA lengths 1.5% for both FID and FD_DINOv2. To determine these, we ran a sweep over the EMA lengths and found that the EMA length that works best with our guidance interval also works best with CFG.
“2. Can you explain lines 112-118 in more detail? Does that mean you always use the low and high guidance scales based on the sampling discretization? This part was not very clear to me.”
This is correct. This part means that we either enable or disable CFG at the noise levels given by the noise level discretization of the sampler.
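To make this concrete, the sketch below lists which sampler steps would have guidance enabled for a given interval. It assumes the noise level discretization of the EDM sampler with its commonly used defaults (`sigma_min=0.002`, `sigma_max=80`, `rho=7`); these values, the function names, and the interval used in the usage lines are illustrative assumptions and are not taken from this discussion.

```python
def edm_sigmas(num_steps=32, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Noise levels sigma_0 > ... > sigma_{N-1} of the EDM time discretization."""
    a = sigma_max ** (1.0 / rho)
    b = sigma_min ** (1.0 / rho)
    return [(a + i / (num_steps - 1) * (b - a)) ** rho for i in range(num_steps)]


def guidance_mask(sigmas, sigma_lo, sigma_hi):
    """True for every step whose noise level lies in (sigma_lo, sigma_hi]."""
    return [sigma_lo < s <= sigma_hi for s in sigmas]


# Hypothetical interval, chosen only to show the on/off pattern.
sigmas = edm_sigmas()
mask = guidance_mask(sigmas, sigma_lo=0.2, sigma_hi=2.0)
guided_steps = [i for i, on in enumerate(mask) if on]
```

Because the interval is only tested at the sampler's discrete noise levels, the effective limits snap to those levels, and `guided_steps` is a contiguous run of step indices.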
“3. Can you also plot Precision/Recall curves for the method to show the impact of interval guidance on quality and diversity in isolation?”
We are happy to add a plot with precision and recall curves. They clearly show that our approach allows increasing precision with a smaller penalty in recall in comparison to CFG.
“4. One major advantage of high classifier-free guidance is image-text alignment. Does the proposed method have a noticeable impact in this area? For example, in Figure 7, it seems that some parts of the prompt are more strongly emphasized by the normal CFG (e.g., the sky is orange).”
We have not performed a quantitative experiment using text-image alignment metrics, but generally, good text-image alignment seems to require roughly similar (i.e., quite high) guidance weights for both traditional CFG and our method. Prompts that benefit from added saturation and bright colors may work better with traditional CFG, as the artifacts that our method aims to remove may end up strengthening the alignment.
“5. Does the optimal guidance interval change when modifying the sampling algorithm, or are the thresholds mainly model dependent?”
The guidance interval is not overly sensitive to the sampler parameters, at least as long as the sampling algorithm remains the same (see 3rd paragraph of 4.2). We did not experiment with different sampling algorithms (DDIM, Euler, etc.) because the current state-of-the-art results are obtained with a 2nd order Heun sampler with 32 deterministic steps. We expect that the optimal guidance interval depends mainly on the model.
I would like to thank the authors for taking the time to respond to my comments. Since my questions have been addressed in the rebuttal and I believe the paper represents a valuable step toward better usage of CFG for the community, I would like to increase my score from 6 to 7.
This paper explains that applying guidance in a limited interval improves sample and distribution quality in diffusion models, as stated in its title.
Strengths
The authors provide intuitive figures (Figs. 1 and 2) that make the core idea of the paper easy to understand. They also provide many experiments and results to support the idea.
Weaknesses
This paper provides an optimization approach for improving diffusion with guidance. The authors propose that applying guidance in the middle stage works well. However, this work does not present a novel idea or address a fundamental issue in diffusion with guidance. There are more fundamental problems that need to be solved. Although the authors show good experiments and good writing, the contribution of this paper is unfortunately minor.
Questions
No questions
Limitations
N/A
Respectfully, this terse review appears to present a personal preference, but makes no factually supported arguments to rebut. CFG is a crucial but poorly understood component of diffusion image generators, and despite its apparent simplicity, our technique extracts its key benefits while significantly suppressing its downsides.
This paper received split recommendations after the post-rebuttal discussion. The rebuttal addressed most reviewer concerns about missing references, experimental details, and technical clarity. After an engaged author-reviewer discussion, most remaining doubts were clarified, and only Reviewer 2mKR recommended rejection on the grounds of insufficient technical contribution. Although the paper does not offer a theoretical breakthrough in diffusion with guidance, it presents a simple, generic, and effective solution that can practically impact the widely studied topic of diffusion model inference. Reviewer comments and corrections should be carefully included in the final version.