PaperHub
8.2/10
Spotlight, 3 reviewers
Min 5, Max 5, Std. dev. 0.0
Individual ratings: 5, 5, 5
Confidence: 3.7
Originality: 3.0
Quality: 3.0
Clarity: 2.7
Significance: 2.7
NeurIPS 2025

Towards Understanding the Mechanisms of Classifier-Free Guidance

Submitted: 2025-04-18, Updated: 2025-10-29

Abstract

Keywords
diffusion model, classifier-free guidance, mechanism interpretation

Reviews and Discussion

Review
Rating: 5

This paper investigates the phenomenon of classifier-free guidance via the lens of linear diffusion models. The authors first show empirically that a linear diffusion model can approximate the results of diffusion models, then start the analysis from the linear diffusion model. The main proposition from the authors is that the CFG enables the sampling process towards the following three terms: (1) mean-shift towards the class mean; (2) a positive Contrastive Principal Components (CPC) term amplifying the class-specific features; and (3) a negative CPC term that suppresses generic features prevalent in unconditional data. Finally, the authors also discuss the extension to non-linear cases.

Strengths and Weaknesses

I have worked on some practical aspects of diffusion models, but not the theoretical side. I tried my best to follow the maths and compared them with my observations in actual experiments. Based on these, I have the following comments.

Strengths

  • It is intriguing to see the connection between diffusion models and a simplified linear diffusion formulation (Fig. 1).
  • The observations and statistics on class specificity are reasonable and make the paper well-motivated (Sec. 3.1).
  • The authors proposed the three terms explaining the intra-class and inter-class relationships, which aligns well with the intuition of how CFG assists generation.
  • The experiments conducted by the authors, ruling out the assumptions and factors gradually, further reveal the insights of understanding the effects of the three terms.

Weaknesses

I do not work on the theoretical side of diffusion models, so my question might not apply to theory researchers. Specifically, I have reservations about Equation 12, the ODE formula that splits into the three terms proposed by the authors.

My concern is this: under the assumptions of a linear model and a Gaussian distribution, it seems intuitive that a CFG formula SHOULD correspond to the three terms: $p(x|c)$ --> mean shift, and $p(x|c) - p(x|\phi)$ --> the components between class-specific features and unconditional features. It is true that calculating the weights of the terms is challenging, but I am curious how theory researchers judge whether a conclusion is non-trivial.
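The intuition in this comment can be made concrete with a short derivation. The block below is only a sketch: it assumes Gaussian (linear) scores of the standard form at noise level $\sigma$, writes the unconditional statistics as $\mu_{uc}, \Sigma_{uc}$, and is not claimed to reproduce the paper's Eq. (12) term by term.

```latex
% Assumption: Gaussian scores s_k(x) = -(\Sigma_k + \sigma^2 I)^{-1}(x - \mu_k), k \in \{c, uc\}.
% Uses the identity A_{uc} - A_c = A_{uc}(\Sigma_c - \Sigma_{uc})A_c.
\begin{align*}
A_k &:= (\Sigma_k + \sigma^2 I)^{-1}, \\
s_c(x) - s_{uc}(x)
  &= A_{uc}(x - \mu_{uc}) - A_c(x - \mu_c) \\
  &= \underbrace{A_{uc}(\mu_c - \mu_{uc})}_{\text{mean shift}}
   + \underbrace{A_{uc}(\Sigma_c - \Sigma_{uc})A_c\,(x - \mu_c)}_{\text{covariance-difference term}}
\end{align*}
```

Under these assumptions, splitting $\Sigma_c - \Sigma_{uc}$ into its positive- and negative-eigenvalue parts divides the second term in two, which is presumably where a positive and a negative CPC term would come from.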

Questions

See weaknesses above

Limitations

Yes, the authors have discussed limitations.

Final Justification

The authors' rebuttal addresses my concerns and better clarifies the contributions. Therefore, I will increase my score.

Formatting Concerns

No formatting concerns

Author Response

Thanks for the thoughtful feedback!

We agree that, once one adopts the optimal linear assumption, the three‑term decomposition in Eq. (12) follows naturally. Our contribution, however, does not rest on the algebraic split alone but on the framework and the empirical validation that follow from it:

(i) We introduce the optimal linear score model as a controlled test-bed for guidance. Our first key finding is that CFG yields a similar perceptual improvement on linear diffusion models, making them a meaningful stand-alone playground for analysis.

(ii) We show that the linear model accurately mimics real diffusion models at high-to-moderate noise levels. We believe a good model need not be complex, but it must match practice in certain aspects. The linear model does so at high-to-moderate noise levels, yet remains analytically transparent.

(iii) We map linear-world insights back to nonlinear models and delineate the failure cases. We identify where the linear view breaks down (low noise levels). However, by showing that the proposed heuristic guidance in Eq. (17) leads to effects similar to those of the actual CFG, we believe the insights from the linear model can still shed light on practical settings.

(iv) The insights generate new, actionable research. We believe our findings suggest several potential avenues for future research, on both the application side and the theoretical side:

  • Our observation that applying the mean-shift guidance to the diffusion trajectory yields improved conditional samples implies the existence of local class-specific clusters in the noisy data manifold learned by diffusion models. Specifically, in our extra experiments after the submission of the original draft, we find that simply initializing noise $x_T$ from a mean-shifted Gaussian distribution $\mathcal{N}(\gamma(\mu_c-\mu_{uc}), \sigma^2(t) I)$, with no additional guidance applied, can improve the overall generation quality. The existence of the class-specific clusters has not been widely discussed or known in the literature, and we hope our results will motivate further theoretical investigation into the geometry of conditional diffusion models.

  • The main insight is that CFG improves the sample quality by (i) enhancing class-specific information and (ii) suppressing generic features prevalent in the unconditional data. We hypothesize that explicitly incorporating this dual objective into the diffusion model’s training process may lead to models capable of generating high-quality images without the need for guidance. Indeed, we have made progress in this direction in our ongoing project.

  • Our findings also reveal that PCA may not be robust in capturing class-specific patterns. As mentioned in the conclusion section of our paper, many recent works (e.g., [2], [3]) apply PCA in an embedding space to extract class-related concepts or structural information. Given that our results suggest CPCA is more effective for uncovering class-specific patterns, we believe investigating CPCA’s utility in these existing methods could be a promising research direction.
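The mean-shifted initialization mentioned in the first bullet can be sketched in a few lines. This is an illustration only, not the authors' code: the function name, the `sigma_max` argument (standing in for the noise level at the start of sampling), and the flattened-vector layout are assumptions.

```python
import numpy as np

def mean_shifted_init(mu_c, mu_uc, sigma_max, gamma, n_samples, seed=0):
    """Draw x_T ~ N(gamma * (mu_c - mu_uc), sigma_max^2 * I).

    mu_c, mu_uc: class-conditional and unconditional data means (1-D arrays).
    sigma_max:   assumed noise level at the start of sampling.
    gamma:       shift strength; gamma = 0 recovers the usual zero-mean start.
    """
    rng = np.random.default_rng(seed)
    shift = gamma * (np.asarray(mu_c, dtype=float) - np.asarray(mu_uc, dtype=float))
    noise = rng.standard_normal((n_samples, shift.shape[0]))
    return shift + sigma_max * noise

# Hypothetical usage: 16 starting points for a 3-dimensional toy problem.
x_T = mean_shifted_init(np.ones(3), np.zeros(3), sigma_max=80.0, gamma=2.0, n_samples=16)
```

The samples would then be passed to an ordinary conditional sampler with no further guidance, which is the setting the bullet describes.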

In summary, our contribution is the systematic hypothesis, test and refine loop that bridges a tractable model and real-world diffusion models, clarifying what transfers, what doesn't and why. We therefore see the work less as pure theory and more as theory‑guided experimentation that advances practical understanding of CFG.

Comment

Thanks for your clarification! I think your explanation makes sense and will increase my score. Basically, I think CFG is a very mysterious thing for generative models, and this paper shows a unique way to study this. I will increase my score.

Comment

Thank you very much for revisiting our rebuttal and for the encouraging feedback. We are glad the additional clarification addressed your concerns. Your recognition of the paper’s contribution—and your decision to revise the score—are greatly appreciated!

Review
Rating: 5

The paper provides an in-depth analysis of Classifier-Free Guidance (CFG) in diffusion models, uncovering a bias caused by mean shift when the zero-mean assumption in data samples is violated. It proposes tuning the CFG coefficient to compensate for this bias. Empirical experiments support these findings, demonstrating the mean-shift phenomenon and offering a simple, practical adjustment to correct it.

Strengths and Weaknesses

Strengths

The analysis of CFG is thorough. The authors first demonstrate that naive conditional sampling tends to yield low-quality samples that lack distinct class-specific features, which they attribute to the overlap in class covariance matrices.

Weaknesses

  • The setup for linear Gaussian diffusion is not clearly defined—how well does it approximate real-world diffusion processes?

  • The rationale for applying Contrastive Principal Component Analysis (CPCA) is unclear; further justification for its suitability would strengthen the analysis.

  • The study evaluates only EDM as the diffusion model. It's uncertain whether the observed effects generalize to other types of diffusion models.

  • The paper lacks actionable insights or design implications for improving existing Classifier-Free Guidance (CFG) based on its findings.

  • There is insufficient quantitative evaluation of how varying CFG configurations affect generation quality.

Questions

Questions:

  • In line 171, as discussed in Section 2.1: For high to moderate noise levels, deep diffusion models can be well approximated by linear models, particularly when network capacity is limited and training is insufficient. Could you provide both an intuitive and a quantitative justification for why the linear approximation is considered sufficiently accurate in this regime?

  • In line 178, it is stated that "CFG significantly increases the inter-class separation." Given that FID is computed from the overall sample mean and covariance, how does increased inter-class separation affect the overall FID score? Could this separation inadvertently inflate FID despite visual improvements?

  • Based on the insights presented in this paper, do you envision any new directions or improvements to the design of existing CFG methods?

Limitations

Limitations:

  • The study evaluates only EDM as the diffusion model; it remains unclear whether the observed effects generalize to other diffusion architectures.

  • The paper does not offer actionable insights or design improvements for existing Classifier-Free Guidance (CFG) methods based on its analysis.

  • There is a lack of quantitative evaluation assessing how different CFG configurations influence generation quality.

Final Justification

The paper presents an in-depth analysis of Classifier-Free Guidance (CFG) in diffusion models, revealing a bias from mean shift when the zero-mean assumption is violated. It proposes tuning the CFG coefficient to correct this bias, supported by empirical evidence demonstrating the phenomenon and a practical adjustment.

In the rebuttal, the authors justify the use of linear approximation and provide additional experiments with the EDM-2 model to reinforce their claims.

Formatting Concerns

No

Author Response

We really appreciate your thoughtful feedback and are encouraged that you found our analysis thorough. We address each question in turn.


Q.1: The setup for the linear model is not clear. How well does it approximate real-world diffusion models?

A.1:

Setup: For a given dataset, we analytically construct the optimal linear model from the empirical data covariance and mean based on Equation 8. Once constructed, this linear model is used in exactly the same manner as a deep diffusion model for sampling and guidance. More details can be found in Appendix C.

How well the linear model approximates deep diffusion models: The similarity between linear and nonlinear diffusion models has been extensively studied in several recent works [1-3]. In summary, linear diffusion models approximate the actual diffusion models well at high to moderate noise levels but differ at low noise levels.
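To make the setup concrete, here is a minimal sketch of a linear denoiser built from an empirical mean and covariance. It assumes the standard linear minimum-mean-squared-error (Wiener) form for additive Gaussian noise; the paper's Equation 8 is not reproduced in this thread, so the exact parameterization and the function names below are assumptions.

```python
import numpy as np

def fit_linear_denoiser(data):
    """Estimate the mean and covariance from data of shape (n_samples, dim)."""
    mu = data.mean(axis=0)
    cov = np.atleast_2d(np.cov(data, rowvar=False))
    return mu, cov

def linear_denoise(x, mu, cov, sigma):
    """Linear MMSE estimate of clean data from x = x0 + sigma * noise.

    D(x) = mu + cov (cov + sigma^2 I)^{-1} (x - mu), applied to rows of x.
    """
    dim = mu.shape[0]
    gain = cov @ np.linalg.inv(cov + sigma**2 * np.eye(dim))
    return mu + (x - mu) @ gain.T

def linear_score(x, mu, cov, sigma):
    """Score implied by the denoiser via Tweedie's formula: (D(x) - x) / sigma^2."""
    return (linear_denoise(x, mu, cov, sigma) - x) / sigma**2
```

Fitting one such model on class-conditional data and one on the full dataset gives the pair of scores needed to apply guidance in the linear setting.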


Q.2: The rationale of applying Contrastive Principal Component Analysis (CPCA) is unclear.

A.2: The CPCA framework was not an a priori choice for our study. Rather, the mathematical structure of CPCA emerged naturally from our analysis of classifier-free guidance in a linear model.

  • Inspired by the CFG’s similar effects on both linear and nonlinear models, as illustrated in Figure 2, we adopted the linear model as a prototype for understanding CFG.

  • When we derived the guidance term in this linear setting, it naturally contained the CPC guidance $g_\text{cpc}$ in Eq. (12), which involves $\Sigma_c-\Sigma_{uc}$, the difference between the conditional covariance and the unconditional covariance.

  • Given this matrix, the most direct way to understand its effect is through its eigendecomposition. The eigenvectors with positive eigenvalues span the subspace that best fits the conditional data while being as far as possible from the unconditional dataset, hence representing the class-specific features, while the eigenvectors with negative eigenvalues represent directions irrelevant to the given condition.

This specific analysis—examining the eigenvectors of a covariance difference matrix—is what has been formalized in the literature as CPCA. Therefore, our citation is intended to acknowledge the established terminology for this mathematical technique, not to indicate that we are simply applying a pre-existing method. Our contribution is the demonstration that this structure is fundamental to the mechanism of CFG.
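The eigendecomposition described above can be sketched as follows. The function name and the tolerance are illustrative assumptions; the only operation taken from the text is splitting the eigenvectors of the covariance difference by the sign of their eigenvalues.

```python
import numpy as np

def contrastive_pcs(cov_c, cov_uc, tol=1e-10):
    """Split the eigenvectors of (cov_c - cov_uc) by eigenvalue sign.

    Returns ((pos_vals, pos_vecs), (neg_vals, neg_vecs)); eigenvectors are columns.
    Positive directions have more conditional than unconditional variance.
    """
    diff = cov_c - cov_uc
    diff = 0.5 * (diff + diff.T)            # symmetrize against round-off
    eigvals, eigvecs = np.linalg.eigh(diff)  # ascending eigenvalues
    pos = eigvals > tol
    neg = eigvals < -tol
    return (eigvals[pos], eigvecs[:, pos]), (eigvals[neg], eigvecs[:, neg])

# Toy example: axis 0 is class-specific, axis 2 is more prominent unconditionally.
(pos_vals, pos_vecs), (neg_vals, neg_vecs) = contrastive_pcs(
    np.diag([3.0, 1.0, 1.0]), np.diag([1.0, 1.0, 2.0])
)
```

In the toy example the single positive direction is the first axis and the single negative direction is the third; the second axis, where both covariances agree, belongs to neither group.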


Q.3: It is unclear how the findings generalize to other diffusion models.

A.3: During the rebuttal, we repeated all experiments in Sections 3-4 on EDM-2 [5], a latent-space diffusion model with latent dimension 4x64x64 that produces images of size 3x512x512. Moreover, its network architecture is more advanced than that of the EDM-1 model used in the original submission. The results remain consistent with those presented in the original submission, and we summarize them as follows:

  • CFG improves the generation quality of the linear diffusion models constructed with covariances and means of the conditional and unconditional latent variables.

  • In the high-noise regime of the EDM-2 latent diffusion model, linear CFG achieves qualitatively and quantitatively similar effects to the actual nonlinear CFG. Below we present the FD_DinoV2 scores for the various types of guidance applied to the noise range [11.05, 80.0], for class 'golden retriever':

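| Method | $\gamma$ = 0 | $\gamma$ = 5 | $\gamma$ = 10 | $\gamma$ = 15 | $\gamma$ = 20 |
| --- | --- | --- | --- | --- | --- |
| Nonlinear-CFG | 631.89 | 421.147 | 434.49 | 540.767 | 653.437 |
| Linear-CFG | 631.89 | 444.712 | 420.143 | 495.995 | 658.213 |
| Mean-shift Guidance | 631.89 | 489.578 | 470.133 | 553.23 | 728.914 |
| Positive CPC Guidance | 631.89 | 588.354 | 554.723 | 537.513 | 532.699 |
| Negative CPC Guidance | 631.89 | 597.438 | 559.37 | 546.788 | 528.656 |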

The trends match Fig. 5 of the submission.

  • In low noise regime, the heuristic CPC guidance in eq.17 is capable of refining the finer details of the unguided samples.

Q.4: The paper lacks actionable insights on how to improve CFG based on the findings. Do you envision any new directions or improvements to the design of existing CFG methods?

A.4: We'd like to emphasize that our work primarily aims to provide a theoretical understanding of CFG (i.e., to offer insights into an observed phenomenon) rather than to propose practical improvements for model performance. We believe that an investigation of the mechanisms of CFG carries significant value on its own. Nevertheless, our findings suggest several potential avenues for future research, on both the application side and the theoretical side:

  • Our observation that applying the mean-shift guidance to the diffusion trajectory yields improved conditional samples implies the existence of local class-specific clusters in the noisy data manifold learned by diffusion models. Specifically, in our extra experiments after the submission of the original draft, we find that simply initializing noise $x_T$ from a mean-shifted Gaussian distribution $\mathcal{N}(\gamma(\mu_c-\mu_{uc}), \sigma^2(t) I)$, with no additional guidance applied, can improve the overall generation quality. The existence of the class-specific clusters has not been widely discussed or known in the literature, and we hope our results will motivate further theoretical investigation into the geometry of conditional diffusion models.

  • The main insight is that CFG improves the sample quality by (i) enhancing class-specific information and (ii) suppressing generic features prevalent in the unconditional data. We hypothesize that explicitly incorporating this dual objective into the diffusion model’s training process may lead to models capable of generating high-quality images without the need for guidance. Indeed, we have made progress in this direction in our ongoing project.

  • Our findings also reveal that PCA may not be robust in capturing class-specific patterns. As mentioned in the conclusion section of our paper, many recent works (e.g., [2], [3]) apply PCA in an embedding space to extract class-related concepts or structural information. Given that our results suggest CPCA is more effective for uncovering class-specific patterns, we believe investigating CPCA’s utility in these existing methods could be a promising research direction.


Q.5: There is insufficient quantitative evaluation of how varying CFG configurations affect generation quality.

A.5: The main hyperparameter of CFG is the guidance strength $\gamma$, and most of our experiments are conducted over a wide range of $\gamma$ values. If “configuration” refers to something else, please clarify and we will be happy to provide additional results.


Q.6: Could you provide both an intuitive and a quantitative justification for why the linear approximation is considered sufficiently accurate in this regime?

A.6:

  • Quantitatively, the similarity between linear and nonlinear diffusion models has been extensively studied in previous works [1-3]. In particular, it is shown in [3] that this similarity is most pronounced when the model capacity is low or the training is insufficient.

  • Theoretically, given a finite number of training samples, which matches the practical setting where the model is trained on a fixed dataset, the optimal diffusion denoiser admits a closed-form expression as a kernel density estimator, which is approximately linear at high noise levels (please refer to Section D.8 of [2] for a proof).

  • However, in practice, linear and nonlinear diffusion models are similar not only at very high noise levels but also at moderate noise levels. Why this happens is not yet clear, but intuitively, deep networks, especially when they are small and insufficiently trained, can favor learning simple structures of the data, such as linear structure and the data covariance, over more complex structures such as higher-order moments. This bias is known as the simplicity bias of deep networks [4]; it has been observed in many different settings, such as classification and regression, and is well documented.
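The closed-form denoiser mentioned in the second bullet can be illustrated with a small sketch. It assumes the standard setting in which the clean sample is drawn uniformly from a finite training set and corrupted by isotropic Gaussian noise, so the posterior mean is a softmax-weighted average of the training points; the function name is hypothetical and this is not taken from [2].

```python
import numpy as np

def empirical_denoiser(x, train, sigma):
    """Posterior mean E[x0 | x] for x = x0 + sigma * noise, x0 uniform over `train`.

    x:     noisy point, shape (dim,)
    train: training set, shape (n, dim)
    """
    sq_dist = np.sum((train - x) ** 2, axis=1)
    logits = -sq_dist / (2.0 * sigma**2)
    logits -= logits.max()                 # stabilize the softmax
    weights = np.exp(logits)
    weights /= weights.sum()
    return weights @ train

train = np.array([[0.0, 0.0], [10.0, 0.0]])
x = np.array([0.1, 0.0])
low_noise = empirical_denoiser(x, train, sigma=0.1)    # snaps to the nearest sample
high_noise = empirical_denoiser(x, train, sigma=1e4)   # weights become nearly uniform
```

At low noise the weights concentrate on the nearest training point, which is strongly nonlinear in `x`; at high noise they flatten and the output approaches the data mean, consistent with the near-linear behavior the bullet refers to.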


Q.7: Could this separation inadvertently inflate FID despite visual improvements?

A.7: Yes, you are absolutely correct. A very high class-separation can reduce sample diversity and inflate FD metrics. However, since the naive conditional sampling leads to samples that lack separation compared to training data (the ground truth), if we choose the guidance strength carefully to improve inter-class separation by the right amount, the FID can be improved.


References:

[1] Wang, Binxu, and John J. Vastola. "The hidden linear structure in score-based models and its application." arXiv preprint arXiv:2311.10892 (2023).

[2] Wang, Binxu, and John J. Vastola. "The unreasonable effectiveness of gaussian score approximation for diffusion models and its applications." arXiv preprint arXiv:2412.09726 (2024).

[3] Li, Xiang, Yixiang Dai, and Qing Qu. "Understanding generalizability of diffusion models requires rethinking the hidden gaussian structure." Advances in neural information processing systems 37 (2024): 57499-57538.

[4] Kalimeris, Dimitris, et al. "Sgd on neural networks learns functions of increasing complexity." Advances in neural information processing systems 32 (2019).

[5] Karras, Tero, et al. "Analyzing and improving the training dynamics of diffusion models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

Comment

Thanks for your responses. I think this paper is thoughtful and well written. I will increase my score.

Comment

Thank you for taking the time to revisit our submission and for your positive assessment. We appreciate your kind words regarding the paper’s thoughtfulness and writing, and we are grateful for the improved score. Your feedback motivates us to refine the presentation further for the revised version.

Review
Rating: 5

The paper aims to understand classifier-free guidance (CFG) from a theoretical perspective. To simplify the analysis, the authors first study a linear model and then explore how the insights might extend to real-world diffusion models. They find that linear CFG improves generation quality through three key components: (i) a mean-shift term that roughly guides samples toward the class means, (ii) a positive contrastive principal components (CPC) term that enhances class-specific features, and (iii) a negative CPC term that suppresses generic features commonly present in unconditional data. The experiments are conducted on simple linear models.

Strengths and Weaknesses

Strengths:

  1. Analyzing CFG theoretically to understand its mechanism is interesting and motivating.

  2. The theoretical formulation would be helpful to propose new guidance strategies.

Weaknesses:

  1. While I appreciate the analysis in linear models, the extension to the non-linear regime and real-world diffusion models seems weak. E.g., the argument in L349, “empirically ..”, is not convincing, since I don’t see much improvement in the visual results in Fig. 7 with PCs or CPC. Guiding with PCs even decreases the quality compared to CFG; why is that the case?

  2. In Fig. 7, sample 3, guiding with CPC or PC changes the color of the car; can you explain this? Do PCs capture different visual attributes? If so, it is not clear to me how the prompt alignment is performed.

  3. The qualitative results are not convincing. E.g., in Fig. 5, non-linear CFG performs better than CPCs. The analysis can be useful for understanding CFG, but how it helps to improve CFG in terms of generation quality and prompt fidelity remains unclear.

  4. CFG is an inference-time strategy that does not involve any training, so how are class statistics affected? Please clarify.

  5. Experiments need to be done on recent diffusion models, with higher quality and resolutions to actually see any meaningful improvement.

  6. Can this analysis be extended or applied to classifier guidance?

  7. In L294, what is $\mathrm{FD}_{\mathrm{DINOv2}}$?

  8. The flow and format of the paper need to be improved. The figures should be improved overall, both in quality and formatting (labels etc.).

Questions

Please address the weaknesses.

Limitations

Yes.

Final Justification

My concerns are mostly addressed!

Formatting Concerns

No.

Author Response

We really appreciate that you find our work interesting and motivating. We are happy to have the chance to address your questions.


Q.1: I don’t see much improvement in Fig. 7; guiding with PCs even decreases the quality. Why is that the case?

A.1: We thank the reviewer for this insightful question, which gets to the heart of our contribution. The reviewer is correct that naively guiding with Principal Components (PCs) can degrade image quality, and explaining this phenomenon is the central motivation for our proposed CPC guidance.

Question 1: Not much improvement. Fig. 7 focuses on the low-noise regime, where guidance mostly refines finer details rather than producing dramatic changes. Comparing row 1 (CFG) and row 2 (Eq. 17), both forms of guidance sharpen image details: e.g., in sample 2 the texture of the French fries becomes finer, and in sample 3 the car separates more clearly from the background with guidance, whereas the unguided one blends into the background.

Question 2: Why does naive PC guidance fail? Guiding with PCs indeed harms quality, which is precisely the point we highlight (in lines 355-357):

  • PC guidance fails because it is a mix of both (1) generic, coarse structures and (2) the desired class specific, finer details. From Figure 6 (b), it can be seen that by projecting the noisy image onto the conditional posterior PCs, the noisy images can be denoised. To achieve this denoising result, the PCs not only need to capture the class-specific information or finer image details, but also need to capture generic information such as the overall coarse structure of the sample. Guiding with PCs simultaneously amplifies coarse and class‑specific patterns, adding redundant information and causing oversaturation.

  • Actual CFG only enhances class-specific, finer details. In contrast, CFG in the nonlinear (low noise) regime mainly makes the image sharper, implying that it selectively enhances class‑specific details, leaving coarse structure largely unchanged. To do so, CFG must distinguish those desired structures from the generic, coarser ones.

  • How CPC selectively enhances class-specific finer details. Our heuristic CPC guidance (Eq. 17) formalises this selectivity: by contrasting conditional and unconditional PCs, it retains the fine, class-unique directions while suppressing generic ones, yielding higher fidelity than naively boosting all PCs. Note from Figure 6 (c) that the unconditional PCs also lead to decent denoising results, implying that the coarse structure is captured, but they lack some finer details; for example, the structures of the cheeseburger and fries are blurrier. The guidance in Eq. 17 essentially contrasts the conditional and unconditional PCs, removing the redundant coarse structures and enhancing the finer ones, hence achieving better quality than naively enhancing PCs.


Q.2: In Fig.7., why is the color of the car changed? What visual attributes are captured by PCs? How is the prompt alignment performed?

A.2: The guiding effects of eq.17 result from the combined effects of all the positive CPCs and some might correlate with color, others with shape, texture, etc. While the color change is visible, the more important effect is that the car’s silhouette becomes sharper and better separated from the background.

Although we don’t know the exact attribute of a particular positive CPC, mathematically, they are the directions where conditional variance exceeds unconditional variance, i.e., they represent the unique class patterns (see appendix A). Ordinary PCs, in contrast, capture the largest overall variance, often coarse foreground–background structure. Prompt alignment therefore arises by enhancing directions unique to the prompt and attenuating generic ones.


Q.3: How linear CFG improves CFG remains unclear in Fig.5.

A.3: In this work, we didn’t propose the linear CFG as a method that improves upon the standard CFG, but rather to use it as an analytical tool to better understand the underlying mechanism of CFG.

In Figure 5 (b), although the linear CFG is slightly better for relatively large guidance strength $\gamma$, it performs worse than the actual CFG for $\gamma<5$. Whether linear CFG performs better depends on the specific class. For example, in Figure 29, for the classes ‘coffee mug’ and ‘cheeseburger’, the actual CFG performs better than the linear CFG.

The key observation is the close quantitative match between the linear and actual CFG curves across various guidance strengths. This supports our claim that the linear CFG can serve as a good approximation of the actual CFG in the high‑noise regime.

Regarding the slight outperformance of linear CFG in Figure 5, our conjecture is that for high noise levels, it is possible that the network fails to learn the PCs perfectly. As can be seen in Figure 1, compared to the linear model, where we compute the PCs exactly, at $\sigma$ = 80.0 and 42.415, the generated samples of deep diffusion models tend to be noisier. This imperfect learning might be a factor explaining where the improvement comes from.


Q.4: How can sample statistics be influenced by CFG?

A.4: CFG steers the diffusion trajectory; the resulting samples differ substantially from unguided ones, so their aggregate statistics (means, covariances, FID, etc.) also shift.


Q.5: Experiments need to be done in more advanced diffusion models

A.5: During the rebuttal, we repeated all experiments in Sections 3-4 on EDM-2 [1], a latent-space diffusion model with latent dimension 4x64x64 that produces images of size 3x512x512. Moreover, its network architecture is more advanced than that of the EDM-1 model used in the original submission. The results remain consistent with those presented in the original submission, and we summarize them as follows:

  • CFG improves the generation quality of the linear diffusion models constructed with covariances and means of the conditional and unconditional latent variables.

  • In the high-noise regime of the EDM-2 latent diffusion model, linear CFG achieves qualitatively and quantitatively similar effects to the actual nonlinear CFG. Below we present the FD_DinoV2 scores for the various types of guidance applied to the noise range [11.05, 80.0], for class 'golden retriever':

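| Method | $\gamma$ = 0 | $\gamma$ = 5 | $\gamma$ = 10 | $\gamma$ = 15 | $\gamma$ = 20 |
| --- | --- | --- | --- | --- | --- |
| Nonlinear-CFG | 631.89 | 421.147 | 434.49 | 540.767 | 653.437 |
| Linear-CFG | 631.89 | 444.712 | 420.143 | 495.995 | 658.213 |
| Mean-shift Guidance | 631.89 | 489.578 | 470.133 | 553.23 | 728.914 |
| Positive CPC Guidance | 631.89 | 588.354 | 554.723 | 537.513 | 532.699 |
| Negative CPC Guidance | 631.89 | 597.438 | 559.37 | 546.788 | 528.656 |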

The trends match Fig. 5 of the submission.

  • In low noise regime, the heuristic CPC guidance in eq.17 is capable of refining the finer details of the unguided samples.

Q.6: Can the analysis be extended to classifier guidance?

A.6: Yes. Classifier guidance is mathematically equivalent to classifier-free guidance; the two methods differ only in their implementation:

  • Classifier Guidance uses an external, separately trained classifier to compute the guidance term.

  • Classifier-Free Guidance (CFG) derives the guidance term from the diffusion models themselves (conditional and unconditional scores) with Bayes rule.
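The equivalence stated here follows from Bayes' rule applied to the noisy distribution. The block below is a sketch under one assumed convention, in which $\gamma = 0$ recovers plain conditional sampling (consistent with the $\gamma = 0$ column of the tables in this thread); other papers parameterize the guidance weight differently.

```latex
% Bayes' rule: p(c | x) = p(x | c) p(c) / p(x), and p(c) does not depend on x.
\begin{align*}
\nabla_x \log p(c \mid x)
  &= \nabla_x \log p(x \mid c) - \nabla_x \log p(x), \\
\underbrace{\nabla_x \log p(x \mid c) + \gamma\,\nabla_x \log p(c \mid x)}_{\text{classifier guidance}}
  &= \underbrace{\nabla_x \log p(x \mid c)
     + \gamma\left[\nabla_x \log p(x \mid c) - \nabla_x \log p(x)\right]}_{\text{classifier-free guidance}}
\end{align*}
```

The identity holds for the exact scores; in practice a separately trained classifier and a learned pair of conditional and unconditional scores are different approximations of the same quantity.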


Q.7: What is FD_Dinov2?

A.7: FD_Dinov2 [2] is similar to FID, which measures the similarity between two datasets using the Frechet distance (FD). The lower FD is, the more similar two datasets are. FID computes FD using the inception features while FD_Dinov2 computes FD with Dinov2 features. The work [2] has shown that FD_Dinov2 more accurately reflects human-perceptual quality compared to FID, and advocates the use of FD_Dinov2. In our experiments, we also tested with FID score and the trend still holds. We can include these extra results in the final version.
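For readers unfamiliar with the metric, the Fréchet distance between two feature sets can be sketched as below. The feature extractor (Inception for FID, DINOv2 for FD_Dinov2) is assumed to have been applied already and is not shown; the function name is illustrative and this is not the evaluation code used by the authors or by [2].

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets.

    FD = ||mu_a - mu_b||^2 + Tr(cov_a + cov_b - 2 (cov_a cov_b)^{1/2}).
    feats_a, feats_b: arrays of shape (n_samples, feature_dim).
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(feats_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(feats_b, rowvar=False))
    # Tr((cov_a cov_b)^{1/2}) equals the sum of square roots of the eigenvalues
    # of cov_a @ cov_b, which are real and non-negative for PSD inputs.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)

# Toy check with random stand-in features: a pure shift of [1, 2, 0] gives FD close to 5.
rng = np.random.default_rng(0)
feats = rng.standard_normal((50, 3))
fd_shift = frechet_distance(feats, feats + np.array([1.0, 2.0, 0.0]))
```

Because the distance depends only on means and covariances, a guidance strength that over-separates classes changes both terms, which is how strong guidance can raise FD even when individual samples look sharper.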


Q.8: The flow and figures should be improved.

A.8: Due to the page limit, we had to condense some text and figures. We will definitely improve them with the extra page allowed in the final version.


References:

[1] Karras, Tero, et al. "Analyzing and improving the training dynamics of diffusion models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

[2] Stein, George, et al. "Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models." Advances in Neural Information Processing Systems 36 (2023): 3732-3784.

Comment

Thanks for the clarification! My concerns are mostly addressed and I'm happy to increase the score!

Comment

Thank you for taking the time to revisit our submission and for letting us know that the clarifications addressed your concerns. Your feedback has been invaluable in refining both the technical exposition and the presentation, and we will incorporate your suggestions into our revision!

Final Decision

(a) summary: The paper revisits classifier-free guidance (CFG) in diffusion models with the aim of providing some insights into, and justification for, its good performance. To do so, it relies on a simplified linear denoising model, which allows for analysis of the mean and covariance components of the (de-noised) data. It attributes the low quality of pure conditional sampling to the lack of class-specificity in the class-conditional covariance and argues that this is mitigated in CFG by amplifying class-specific features (positive contrastive principal components) and suppressing generic features (negative contrastive PCs). It supports this argument analytically by decomposing the CFG guidance in the denoising step into the positive and negative CPC terms and a mean-shift term. Though analytical study of nonlinear denoisers is challenging, the authors argue that nonlinear CFG guidance shares a similar underlying mechanism. The study is supported by both proofs of the analytical derivations and experimental exploration of the guidance effects.

(b) strengths: Theoretical analysis of the CFG helps to improve the understanding of the underlying mechanism. The decomposition into the three terms (mean shift, positive CPC and negative CPC) is in retrospect natural and intuitive. The theoretical analysis of the CFG decomposition is supported by empirical exploration of both linear and non-linear CFG denoising regimes.

(c) weaknesses: Though the provided theoretical results are interesting, it is unclear how they can be utilized in practice. Experimental exploration is limited to EDM-type diffusion models, and it is unclear if or how it translates to other, more recent or advanced types of models.

(d) reasons to accept: The paper explores an important method (CFG) whose functioning is so far not well understood. The theoretical analysis is solid and is supported by relevant empirical evidence, and the outcomes are well formulated and presented in a manner that supports natural intuitions. The provided insights can trigger further research into conditional diffusion generation, leading to improved models. I propose a spotlight presentation since CFG is a well-known method in the diffusion modelling area, and hence the results may be of interest to a substantial part of the audience. They are, however, not so groundbreaking as to deserve an oral.

(e) review discussions: Initial reviews were cautiously positive, recognizing the technical quality and the potential of the paper in terms of interest for the community. In addition to a few clarifying technical questions, the main concerns were about applicability to modern, more complex diffusion models and the practical usability of the provided insights. The authors actively interacted during the rebuttal, clarified the technical questions, provided additional results on more complex diffusion models, and argued that the primary purpose of the paper is not to propose practical improvements but to provide a better understanding of CFG, with the potential to be elaborated upon in future research. All reviewers increased their scores to 5 following the discussion. In their final remark, the authors summarized the main concerns and how they will address them in the camera-ready version of the paper. I find these appropriate, realistically doable, and improving the paper in the right direction. Overall, a solid paper worthy of the interest of the community.