Guiding a Diffusion Model with a Bad Version of Itself
Guiding a diffusion model with a smaller, less-trained version of itself leads to significantly improved sample and distribution quality.
Abstract
Reviews and Discussion
The paper introduces a novel conditioning method as an alternative to Classifier-Free Guidance (CFG), which allows for better control over image quality generation without compromising data diversity. This approach involves guiding the diffusion model generation with a lower-quality model (either less trained or with reduced capacity) instead of an unconditional model. The authors compare CFG with their guidance approach on ImageNet.
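For reference, a minimal sketch of the guided prediction this summary describes (illustrative only, not the authors' code; function and variable names are placeholders):

```python
# Illustrative sketch (not the authors' code) of the guided denoiser prediction.
# In autoguidance, the "bad" model replaces CFG's unconditional model; both models
# receive the same condition c, and a weight w > 1 extrapolates away from the
# weaker prediction.

def autoguidance_denoise(D_main, D_guide, x, sigma, c, w):
    d_main = D_main(x, sigma, c)    # high-quality model (larger / fully trained)
    d_guide = D_guide(x, sigma, c)  # low-quality model (smaller / less trained)
    return d_guide + w * (d_main - d_guide)

def cfg_denoise(D_model, x, sigma, c, w):
    # Classifier-free guidance, for contrast: the weak prediction comes from the
    # same model with the condition dropped.
    d_cond = D_model(x, sigma, c)
    d_uncond = D_model(x, sigma, None)
    return d_uncond + w * (d_cond - d_uncond)
```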
Strengths
- The paper is well-written and clear.
- There is a thorough analysis of CFG behavior and its limitations. The toy example comparing CFG with the authors' method is particularly insightful.
Weaknesses
- The results on images do not clearly demonstrate the distribution coverage shown in the toy example. It appears that the low-quality model provides low-frequency guidance during generation, while the high-quality model focuses more on details. This approach results in sacrificing diversity for quality, contrary to what is depicted in Fig. 1. Additionally, Table 1 should include the Inception Score in addition to the FID. The choice of omegas in Table 1 is also peculiar; I wonder how the authors chose these values. A fair comparison between CFG and their method with omega = {1, 1.5, 2, 2.5, ...} would be sufficient.
- In CFG, only one model needs to be trained to achieve conditioning. In this new approach, achieving greater diversity requires training two distinct models. In line 174, the authors mention: "such as low capacity and/or under-training." I would expect that the low-capacity model could function in this auto-guiding setup, but an under-trained approach might face significant generalization issues. The key question is: "How can we determine the training data size needed for the lower quality version to enhance a well-trained model?" Considering model degradation, model quantization could be a viable solution, potentially eliminating the need for an extra model. Demonstrating how an f16 model could enhance the diversity of an f32 model using this auto-guiding method would be an interesting experiment.
Questions
- In the CFG paper, a complete table of FID and IS values at various omega settings is provided (omega={1,1.5,2,2.5,3,3.5,4...}). I would like to see a similar comparison between the auto-guidance approach and CFG.
- I am interested in seeing the use of a quantized model as the low-quality model for auto-guidance.
- Additional examples of generated images or access to the code would be beneficial for comparing the auto-guidance method with CFG.
Limitations
The method requires training two diffusion models as opposed to just one with CFG. This difference is critical when scaling up the training of foundation models.
Thank you for the review. Regarding the concerns and questions:
The results on images do not clearly demonstrate the distribution coverage shown in the toy example. It appears that the low-quality model provides low-frequency guidance during generation, while the high-quality model focuses more on details. This approach results in sacrificing diversity for quality, contrary to what is depicted in Fig. 1. Additionally, Table 1 should include the Inception Score in addition to the FID. The choice of omegas in Table 1 is also peculiar; I wonder how the authors chose these values. A fair comparison between CFG and their method with omega = {1, 1.5, 2, 2.5, ...} would be sufficient.
As far as we know, there is no simple relationship between the frequency bands of the sampled image and the roles of the main and guiding models. Given that FID is very sensitive to diversity [1], our lower FIDs are a strong indication that diversity is not lost. We did not measure IS separately, as it is known to be largely consistent with FID (see Figs. 5a, 11b of the EDM2 paper [2]), at least with the EDM2 models that we use in the quantitative measurements.
Guidance weights in Table 1 are the ones that gave the best results according to the hyperparameter search outlined in Appendix B.1. We are happy to add a table or a plot with a range of guidance weights in the appendix.
[1] Kynkäänniemi et al.: Improved precision and recall metric for assessing generative models. In Proc. NeurIPS, 2019.
[2] Karras et al.: Analyzing and improving the training dynamics of diffusion models. In Proc. CVPR, 2024.
In CFG, only one model needs to be trained to achieve conditioning. In this new approach, achieving greater diversity requires training two distinct models. In line 174, the authors mention: "such as low capacity and/or under-training." I would expect that the low-capacity model could function in this auto-guiding setup, but an under-trained approach might face significant generalization issues. ...
With autoguidance, the majority of the benefits can be obtained by using an earlier training snapshot of the main model as the guiding model (Table 1, row “reduce training only”, also Section 5.1), in which case no additional training is required. Under-training is therefore a practical approach for creating effective guiding models.
In contrast, reducing the amount of training data for the guiding model did not seem to yield a benefit (see end of Section 5.1). We did not consider reducing the amount of training data for the guiding model as a goal in itself, as the full dataset is used for training the main high-quality model in any case. That said, it may be possible to reduce the data at least somewhat when training the low-quality model without ill effects.
Q1. In the CFG paper, a complete table of FID and IS values at various omega settings is provided (omega={1,1.5,2,2.5,3,3.5,4...}). I would like to see a similar comparison between the auto-guidance approach and CFG.
We are happy to add a table or a plot comparing the FID of autoguidance and CFG across a range of guidance weights in the final version.
Q2. I am interested in seeing the use of a quantized model as the low-quality model for auto-guidance.
According to our initial tests, increased quantization does not yield a model that could be used as the low-quality guiding model (see end of Section 5.1).
Q3. Additional examples of generated images or access to the code would be beneficial for comparing the auto-guidance method with CFG.
The code will be released after the review cycle.
The method requires training two diffusion models as opposed to just one with CFG. This difference is critical when scaling up the training of foundation models.
Having to train a separate guiding model in order to obtain the full benefits of our method is indeed a limitation, but when using a smaller model and a shorter training time for the guiding model, the additional training cost is modest. For example, the EDM2-M model trains approximately 2.7x as fast as EDM2-XXL per iteration, and we train it for 1/3.5 of the iterations, so the additional cost is around +11% of training the main model. For the EDM2-S/XS pair used in most of our experiments, the added training cost is only +3.6%. We shall clarify this in the paper.
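For reference, the arithmetic behind the +11% figure, using the numbers quoted above:

```latex
\frac{\text{guiding-model training cost}}{\text{main-model training cost}}
  \approx \frac{1}{2.7} \times \frac{1}{3.5} \approx 0.106 \;\Rightarrow\; \text{about } +11\%
```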
Also, as discussed above, using an earlier training snapshot of the main model as the guiding model yields most of the benefits of autoguidance without requiring any additional training.
I would like to thank the authors for their effort in the rebuttal. With the expectation that the authors will include the evaluation with omega = {1, 1.5, 2, 2.5, 3, 3.5, 4...} in the final manuscript and that the code will be made publicly available, I am raising my score to 7.
This paper presents a novel perspective on classifier-free guidance (CFG). It improves the generation quality by directing the generative model towards high-probability regions. The authors identify that this improvement stems from the quality difference between the conditional and unconditional components in CFG. Building on this insight, the paper introduces autoguidance, a new sampling algorithm that utilizes both the diffusion model and a bad version of it. Experimental results show the superiority of this method.
Strengths
- The paper employs an intuitive toy model to support its empirical findings. The specific long-tail, tree-shaped mixture of Gaussian distributions used in this model could be beneficial for future research on generative models.
- The empirical observations and proposed methods are coherent. The paper uncovers a new mechanism within CFG and enhances it through the proposed method.
- The proposed method is both simple and powerful, significantly improving the SOTA generation quality on the ImageNet dataset. This new approach has the potential to inspire further related research.
Weaknesses
- The experiments lack a quantitative comparison of the proposed method on text-to-image diffusion models.
Questions
- I am still not quite clear about the necessity of the main and guiding models degrading in a similar way empirically (I am convinced by your synthetic experiment). Even if the two models suffer from different kinds of degradation, empirically the ratio of their densities may still pull the sampling trajectory towards the high-likelihood region. If possible, could you give some toy mathematical examples about this?
- For Figure 1(e), the authors apply autoguidance to the toy model. Could you also visualize the two degraded densities and their ratio in the autoguidance setting? This will help us better visualize what the similar degradation looks like.
Limitations
The paper covers limitations and societal impact.
Thank you for the review. Regarding the questions:
Q1. I am still not quite clear about the necessity of the main and guiding models degrading in a similar way empirically (I am convinced by your synthetic experiment). Even if the two models suffer from different kinds of degradation, empirically the ratio of their densities may still pull the sampling trajectory towards the high-likelihood region. If possible, could you give some toy mathematical examples about this?
Let us construct a toy scenario where the ratio of two differently degraded densities would yield a misleading guidance signal. Assume that the true distribution is a unit 2D Gaussian with diagonal covariance diag(1, 1), and the two degraded versions have diagonal covariances diag(1+ε, 1) and diag(1, 1+ε) for some relatively small ε > 0. Now, guidance between these densities would push the samples inward along one axis and outward (towards lower likelihood) along the other, despite both being centered around the correct distribution. Similarly, offsetting the means in unrelated directions would induce an overall force towards some global direction, rather than consistently towards the origin.
In a more abstract sense, the beneficial degradations appear to push and spread the densities along locally consistent directions as a function of the degradation strength, but this is ultimately an empirical finding. On the other hand, entirely different types of degradations have a lot of room for mutually inconsistent behavior.
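To make the construction above concrete, here is a minimal numerical sketch (illustrative only; the degradation strength ε = 0.2 and the query point are arbitrary choices):

```python
# Numerical sketch of the counterexample above: two models that are each a
# *differently* degraded version of a unit 2D Gaussian. The resulting guidance
# offset points toward higher likelihood along one axis and toward lower
# likelihood along the other.
import numpy as np

eps = 0.2  # degradation strength (assumed value for illustration)

def score_diag_gauss(x, var):
    """Score (gradient of log density) of a zero-mean Gaussian with diagonal covariance."""
    return -x / var

x = np.array([1.0, 1.0])                                          # query point away from the mean
s_true  = score_diag_gauss(x, np.array([1.0,       1.0      ]))   # true N(0, I)
s_main  = score_diag_gauss(x, np.array([1.0 + eps, 1.0      ]))   # degraded along axis 0
s_guide = score_diag_gauss(x, np.array([1.0,       1.0 + eps]))   # degraded along axis 1

offset = s_main - s_guide  # direction that guidance adds on top of the main model
print("guidance offset:", offset)   # approx. [ 0.167, -0.167]
print("true score:     ", s_true)   # [-1., -1.]
# The offset opposes the true score along axis 0 (pushes outward, toward lower
# likelihood) while agreeing with it along axis 1 (pushes inward).
```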
Q2. For Figure 1(e), the authors apply autoguidance to the toy model. Could you also visualize the two degraded densities and their ratio in the autoguidance setting? This will help us better visualize what the similar degradation looks like.
The probability ratio in the region shown in Figure 2 looks fairly similar with CFG and autoguidance, but we could try to construct a visualization that focuses on the regions with visible differences.
Thanks for the authors' response; I will keep my score.
The paper proposes autoguidance, a new method that simulates the behavior of classifier-free guidance by using a worse version of the model itself instead of an unconditional module. The authors demonstrate that inconsistencies between the predictions from the conditional and unconditional parts of CFG are responsible for some of its shortcomings such as lower variation in generated results. By using a worse version of the same conditional model, the authors show that such inconsistencies will be reduced, and sampling trajectories will converge toward samples that are closer in distribution to the data. Therefore, the paper concludes that compared to CFG, autoguidance improves quality without sacrificing diversity.
Strengths
- The paper studies an important topic. Since CFG is widely used in current diffusion models, overcoming its shortcomings will have a noticeable impact in the future.
- The method is well-motivated through controlled experiments that shed light on the behavior of CFG and how autoguidance improves it in those aspects.
- The experiments are well-organized and clearly demonstrate the impact of different components in autoguidance.
- The paper is well-written and enjoyable to read.
Weaknesses
- More visual examples are needed to show how the diversity of generations changes as the guidance scale increases. In the final version, please include a batch of examples with a fixed condition and compare the sampling with CFG and autoguidance to better demonstrate the disentanglement between image quality and diversity in autoguidance.
- The method is not readily applicable to pretrained diffusion models such as Stable Diffusion. This might limit the current use cases of autoguidance. However, this issue will likely not persist in the long run, as we may see the release of pretrained models compatible with autoguidance. Therefore, this weakness does not affect the long-term impact of the paper.
Questions
- Can you provide precision/recall (PR) curves for your method vs. CFG? While FID considers both aspects, the PR curve shows the impact of guidance on quality and diversity more directly.
- In addition to improving quality, CFG is also heavily used to improve text-image alignment. Can you provide a more detailed experiment on how autoguidance affects this aspect? From the images, it seems that some degree of CFG is still needed for optimal text-image alignment.
- Can you provide some intuition on how to choose the guiding model besides grid search? It seems very costly to train multiple different models just to see which one works better as the guidance model, especially if the same method is applied to text-to-image models trained on massive datasets.
- How does the method compare to algorithms designed for increasing the diversity of CFG, e.g., [1, 2, 3]? In other words, how much of the improvement comes from increased diversity and how much comes from the fact that autoguidance provides better image quality overall (due to fewer inconsistencies between the updates)? It would be great if the authors could provide a section studying this in the final version. Currently, it is only mentioned that autoguidance is beneficial at all noise scales compared to CFG, but I believe a more detailed comparison with visual examples would strengthen the paper.
[1] Kynkäänniemi T, Aittala M, Karras T, Laine S, Aila T, Lehtinen J. Applying guidance in a limited interval improves sample and distribution quality in diffusion models. arXiv preprint arXiv:2404.07724. 2024 Apr 11.
[2] Wang X, Dufour N, Andreou N, Cani MP, Abrevaya VF, Picard D, Kalogeiton V. Analysis of Classifier-Free Guidance Weight Schedulers. arXiv preprint arXiv:2404.13040. 2024 Apr 19.
[3] Sadat S, Buhmann J, Bradley D, Hilliges O, Weber RM. CADS: Unleashing the Diversity of Diffusion Models through Condition-Annealed Sampling. In The Twelfth International Conference on Learning Representations, 2024.
Limitations
The submission has properly discussed the limitations and social impacts.
Thank you for the review. Regarding the concerns and questions:
More visual examples are needed to show how the diversity of generations changes as the guidance scale increases. In the final version, please include a batch of examples with a fixed condition and compare the sampling with CFG and autoguidance to better demonstrate the disentanglement between image quality and diversity in autoguidance.
We will include a grid of fixed-condition examples between CFG and autoguidance in the final revision.
Q1. Can you provide precision/recall (PR) curves for your method vs. CFG? While FID considers both aspects, the PR curve shows the impact of guidance on quality and diversity more directly.
We did not measure precision and recall, so we unfortunately don’t have the data necessary to produce such curves.
Q2. In addition to improvement in quality, CFG is also heavily used to improve text-image alignment. Can you provide a more detailed experiment on how autoguidance affects this aspect? From the images, it seems that some degree of CFG is still needed for optimal text-image alignment.
We did not run any prompt alignment metrics, so we have no quantitative data about this. Intuitively, it seems probable that the effect of autoguidance on prompt alignment is smaller than with CFG, because both models are being conditioned with the text prompt. However, as the guiding model is smaller and/or less trained, it probably responds to the condition less strongly than the main model, and thus the prompt is probably emphasized to some degree as the guidance weight is increased.
In the paper, we advocate mixing autoguidance with CFG for further creative control and provide a simple method for doing so (Appendix B.2).
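For illustration, such a combination can be sketched as follows (this is only an assumed form with independent weights for the two offsets; see Appendix B.2 for the exact formulation we use):

```python
# Illustrative sketch only (not necessarily the formulation in Appendix B.2):
# an autoguidance offset and a CFG offset applied with independent weights.
def combined_denoise(D_main, D_guide, x, sigma, c, w_auto, w_cfg):
    d_cond   = D_main(x, sigma, c)      # main model, conditioned on the prompt/class
    d_uncond = D_main(x, sigma, None)   # main model, unconditional (CFG term)
    d_guide  = D_guide(x, sigma, c)     # weaker model, same condition (autoguidance term)
    return (d_cond
            + (w_auto - 1.0) * (d_cond - d_guide)
            + (w_cfg  - 1.0) * (d_cond - d_uncond))
```

Setting w_cfg = 1 recovers plain autoguidance, and setting w_auto = 1 recovers plain CFG.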
Q3. Can you provide some intuition on how to choose the guiding model besides grid-search? It seems very costly to train multiple different models just to see which one works better as the guidance model, especially if the same method is applied to text-to-image models trained on massive datasets.
Based on our experiments, a model of a third to a half the size of the main model is a good starting point, and the evaluations should begin around 1/16 of training iterations, or perhaps even earlier for very small models. As seen in Figure 3(a, b), doubling or halving the capacity or training time doesn’t result in any sort of catastrophic quality drop, so these parameters are not overly sensitive. That said, we do not have enough data at this point to establish proper scaling laws.
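If a small sweep around these starting points is still desired, it can be structured roughly as follows (a hypothetical sketch; the candidate dictionary and the evaluation callable are placeholders, not part of our code):

```python
def search_guiding_config(D_main, guide_candidates, weights, evaluate_fid):
    """Hypothetical sweep over candidate guiding models and guidance weights.

    guide_candidates: dict mapping a label (e.g. capacity or training fraction)
                      to a guiding-model callable.
    evaluate_fid:     placeholder callable(D_main, D_guide, w) -> FID.
    """
    best = None
    for label, D_guide in guide_candidates.items():
        for w in weights:
            fid = evaluate_fid(D_main, D_guide, w)
            if best is None or fid < best[-1]:
                best = (label, w, fid)
    return best  # (best guiding-model label, best weight, best FID)
```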
Q4. How does the method compare to algorithms designed for increasing the diversity of CFG, e.g., [1, 2, 3]? ...
So far we have compared autoguidance only with the interval method [1], which we did not find beneficial in combination. A key benefit from these schedules appears to be the suppression of CFG at high noise levels, where its image quality benefit is overshadowed by the undesirable reduction in variation caused by large differences in the content of the differently conditioned distributions. In contrast, autoguidance is not expected to suffer from this problem at high noise levels, as both models target the same distribution. Nevertheless, exploring further options would be a natural topic for a follow-up paper; we shall include this in the future work section.
I thank the authors for providing detailed answers to my questions. I believe this is a strong paper that would benefit several applications of diffusion models. Therefore, I would like to keep my score.
All reviewers agreed that this work presents a novel perspective on classifier-free guidance in diffusion models, a simple and powerful new method (autoguidance), and an extremely clear presentation as well as insightful set of experiments. We believe this paper is a valuable contribution to understanding and improving guidance in diffusion models.
Congratulations on your best paper nomination! This is an excellent and innovative contribution that tackles the CFG issue by aligning both models to the same task. It’s great to see growing attention in this direction.
The authors note that the CFG-guided density at nonzero noise levels does not form a valid heat diffusion of the corresponding density at zero noise, which can cause issues like distorted trajectories and color over-saturation. Their proposed auto-guidance method elegantly addresses these problems at high noise levels.
While auto-guidance is a strong approach, there are additional methods to address these challenges beyond the noise level-dependent guidance cited. Another approach—‘Characteristic Guidance’ [1]—directly modifies the CFG formulation to fix such irregularities, from mode dropping to over-saturation.
We kindly suggest acknowledging this complementary work. Doing so will help place auto-guidance more accurately within the broader literature and highlight the full range of solutions to CFG’s known drawbacks.
[1] Zheng, C. & Lan, Y.. (2024). Characteristic Guidance: Non-linear Correction for Diffusion Model at Large Guidance Scale. Proceedings of the 41st International Conference on Machine Learning, in Proceedings of Machine Learning Research 235:61386-61412 Available from https://proceedings.mlr.press/v235/zheng24f.html.
Thank you for the pointer. We'll take this into account when we update the final camera-ready revision.
Apologies for missing your earlier reply—thank you for considering the suggestion. Appreciate your team’s openness to including related work!