Information Theoretic Learning for Diffusion Models with Warm Start
Score matching can be connected to relative entropy under arbitrary noise perturbations.
Abstract
Reviews and Discussion
The paper introduces a novel lower bound for the log-likelihood on individual datapoints, aiming to improve the training and evaluation of generative models. To address the high variance issue often found in Monte Carlo-based estimations, the authors incorporate importance sampling techniques, leading to a more stable and potentially more accurate optimization process. The proposed formulation is grounded in solid mathematical derivations and presents a theoretically sound contribution to the field of probabilistic modeling.
Strengths and Weaknesses
Strengths
- The paper presents a mathematically rigorous approach, with clearly defined objectives and derivations. The proposed lower bound is both theoretically interesting and potentially useful in improving likelihood-based training of generative models.
- The incorporation of importance sampling to reduce variance is a well-motivated and technically appropriate choice.
Weaknesses
- A significant limitation is the lack of comprehensive empirical validation. The experiments are limited to relatively small-scale datasets such as CIFAR-10 and ImageNet-32. This raises concerns about the generalizability and scalability of the method to larger, more complex datasets.
- While improvements in negative log-likelihood (NLL) are observed, the degradation in Fréchet Inception Distance (FID) suggests that the method may prioritize likelihood at the expense of sample quality. This trade-off is not fully explored or explained in the paper.
Questions
- Could you provide experimental results on larger datasets, such as LSUN or ImageNet-256, to evaluate how the method scales to higher-resolution or more diverse image distributions?
- Please provide qualitative comparisons. The degradation of the FID score raises concerns about practical usage.
Limitations
The primary limitation of the paper is the weak experimental support. Without evaluation on more challenging datasets or broader comparisons to baseline methods, it is difficult to assess the practical relevance and robustness of the proposed method. Additional experiments would significantly strengthen the paper.
Final Justification
As the experiments support the authors' argument, and I now understand the likelihood literature, I raise my score.
Formatting Issues
No
We are grateful to the reviewer for the valuable feedback and the opportunity to clarify the objectives, evaluation criteria, and contributions of our work.
General perspective of evaluation of generative models
From a broad standpoint, we first seek to dispel any confusion regarding the purpose and evaluation criteria of our work before addressing the specific questions in turn. It is essential, for the reviewers’ benefit, to clarify our precise objective, the standards by which we judge success, and the context in which this study is framed. Just as selecting appropriate training and optimisation strategies is necessary to achieve strong performance in a given application, so too is the choice of evaluation metric pivotal for drawing valid conclusions. We must stress that the primary focus of this paper is on maximising the likelihood learning metric, specifically, negative log‑likelihood measured in bits‑per‑dimension (BPD; lower is better), rather than on optimising Fréchet Inception Distance (FID) or Inception Score (IS), for the following reasons.
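For concreteness: bits-per-dimension is simply the negative log-likelihood expressed in base 2 and averaged over the $d$ data dimensions,
$$\mathrm{BPD}(\mathbf{x}) = -\frac{\log_2 p_\theta(\mathbf{x})}{d} = \frac{-\ln p_\theta(\mathbf{x})}{d \ln 2},$$
so a trivial model describes an 8-bit image at exactly 8 BPD, and each 0.01 BPD reduction saves about $0.01\,d$ bits per image.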
Model samples undoubtedly serve as a valuable diagnostic tool, often enabling us to form an intuition about why a model may underperform and how it might be improved. From this standpoint, a generative model ought to produce samples that are indistinguishable from those in the training set, whilst encompassing its full variability. To quantify these properties, a variety of metrics, such as the IS and the FID, have been proposed. However, both qualitative and quantitative assessments based on model samples can be misleading with respect to a model’s density-estimation capabilities, as well as its effectiveness in probabilistic modelling tasks beyond image synthesis [1]. Consequently, average log‑likelihood remains the de facto standard for quantifying generative image‑modelling performance. For many sophisticated models, the average log‑likelihood is challenging to compute or even approximate. Indeed, it is possible for a model with sub‑optimal log‑likelihood to generate visually impressive samples, or conversely, for a model with excellent log‑likelihood to produce poor samples [1], an observation that underlines the lack of a direct relationship between FID and negative log‑likelihood (NLL).
From an information‑theoretic standpoint, it is well known that maximising the log‑likelihood of a probabilistic model is equivalent to minimising the KL divergence from the data distribution to the model distribution. By contrast, FID operates by fitting multivariate Gaussian distributions to the embeddings of real and generated images and then measuring their discrepancy via the Fréchet distance (equivalently, the 2‑Wasserstein or Earth Mover’s distance) [2]. Clearly, the mathematical formulations of these two metrics diverge fundamentally: one corresponds to a mismatched estimation problem under a KL‑based criterion, while the other embodies an optimal‑transport task.
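To make this contrast concrete, here is a minimal sketch of the Gaussian Fréchet distance that FID computes on Inception embeddings. The random "features" below stand in for the Inception network's activations; this is illustrative, not the official FID implementation:

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, cov1, mu2, cov2):
    """Squared 2-Wasserstein distance between N(mu1, cov1) and N(mu2, cov2)."""
    covmean = linalg.sqrtm(cov1 @ cov2).real  # matrix square root; drop tiny imaginary parts
    return float(np.sum((mu1 - mu2) ** 2)
                 + np.trace(cov1 + cov2 - 2.0 * covmean))

# Stand-in "embeddings" of real and generated images (normally Inception features).
rng = np.random.default_rng(0)
real = rng.normal(size=(5000, 64))
fake = rng.normal(loc=0.1, size=(5000, 64))
fid = frechet_distance(real.mean(0), np.cov(real, rowvar=False),
                       fake.mean(0), np.cov(fake, rowvar=False))
print(f"FID (toy features): {fid:.3f}")
```

Nothing in this optimal-transport quantity bounds or is bounded by the KL divergence that likelihood training minimises, which is the formal sense in which the two metrics diverge.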
Moreover, FID conflates both fidelity to the real data distribution and the diversity of generated samples into a single score, and its absolute value is highly sensitive to myriad factors, ranging from the number of samples and the particular checkpoint of the feature extractor network to low‑level image‑processing choices. Consequently, the visual appeal of generated images, as quantified by FID, correlates only imperfectly with a model’s log‑likelihood performance. In our work, we concentrate on advancing the state of the art in likelihood estimation; although we report FID scores for completeness, we leave the optimisation of sample quality to future research.
W1 & Q1
We thank the reviewer for emphasising the need for broader empirical validation. To the best of our knowledge, maximum-likelihood learning in the generative-model setting is extremely demanding; after all, we optimise for every pixel value, which is why nearly all prior likelihood-based studies have been confined to low resolutions (e.g. 32×32) [4][5]. In fact, most density-estimation papers cap out at 128×128 owing to the quadratic growth in pixel count and the resulting computational burden, and to date no likelihood evaluations (bits/dim, BPD) for higher-resolution datasets (e.g. 256×256) have been systematically reported in the literature.
To address this gap, we have now conducted additional experiments at 64×64 and 128×128 on ImageNet, achieving state-of-the-art bits-per-dimension. For completeness, a comparison between our proposed model and other competitive models in the literature, in terms of expected negative log-likelihood on the test set computed as bits per dimension (BPD), is summarised below:
| Models | ImageNet‑64 | ImageNet‑128 | LSUN-128 |
|---|---|---|---|
| Sparse Transformer [Child et al., 2019] | 3.44 | - | - |
| Flow++ [Ho et al., 2019] | 3.69 | - | - |
| Very Deep VAE [Child, 2020] | 3.52 | - | - |
| Improved DDPM [Nichol & Dhariwal, 2021] | 3.54 | - | - |
| VDM [Kingma et al., 2021] | 3.40 | - | 1.44 |
| Flow Matching [Lipman et al., 2022] | 3.31 | 2.90 | - |
| W-PCDM [Li et al., 2024] | 2.95 | 2.64 | - |
| ISIT (Ours) | 2.91 | 2.59 | 2.28 |
Note: LSUN 128×128 is not yet an established benchmark for density estimation, and downsampling protocols in the literature are inconsistent. Accordingly, we do not attempt direct comparisons to previous methods on this dataset; our results simply provide a ballpark estimate of how well our proposed method scales to higher resolutions.
Additional qualitative comparisons and ablation studies on CIFAR-10 can be found in Tables 6-11 in the supplementary material.
W2 & Q2
Thank you for the detailed review and thoughtful feedback. Below we address specific questions.
Our work is principally concerned with enhancing the likelihood of diffusion models for applications such as lossless compression (for details, see our reply to Reviewer BC5c), rather than targeting improvements in FID. Nevertheless, our reported FID metrics remain superior to those of other likelihood-focused methods (see Tables 12–14 in the supplementary material, and Section C: Sample Quality and FID). As noted in [3–6], objectives that favour likelihood often incur poorer FID performance (comparisons are provided in Table 1).
For example, among models with similar likelihood performance, such as i‑DODE, MuLAN and DiffEnc, our method not only achieves the best negative log‑likelihood but also retains the lowest FID despite requiring substantially fewer training iterations. Conversely, methods like W‑PCDM that explicitly optimise for FID exhibit a marked reduction in likelihood performance.
Additional FID results are presented in Table 6. We would be pleased to include FID scores for additional models should the reviewers and Area Chair consider this beneficial. We emphasise that our variance‑aware likelihood bound markedly enhances diffusion models’ likelihood while inducing only a marginal effect on FID, as evidenced by Tables 12–14.
Moreover, our work makes a novel contribution by analysing how different noise schedules influence FID. In the supplementary material (lines 916–942), we provide a theoretical account of FID degradation. Briefly, from an autoencoder perspective, the quality of generated samples, as reflected by FID, depends critically on the conditional distribution $p(\mathbf{x} \mid \mathbf{y}_{t_0})$, which functions as a decoder reconstructing the data. When the signal-to-noise ratio (SNR) at the initial time $t_0$ is sufficiently high, the posterior becomes sharply concentrated around the data point $\mathbf{x}$, thereby imposing stringent constraints on the reconstruction loss. This induces high sensitivity to noise: even small deviations in the initial noise level can substantially degrade reconstruction fidelity and thus worsen FID. However, while increasing noise makes signal recovery more challenging, leading to FID fluctuations, the associated trade-off with negative log-likelihood is effectively resolved by our Theorem 1 and Proposition 2, as consolidated in Theorem 2.
It is worth emphasising that our current approach does not employ several techniques known to improve FID, such as truncation, specific weighted score matching/ELBO or other empirical heuristics, and thus our reported FID scores could be further reduced by incorporating these methods. Nonetheless, our primary contribution lies in achieving state‑of‑the‑art log‑likelihood performance. We regard the integration of such sample‑quality‑oriented strategies, in particular explicit optimisation of the Wasserstein distance through tailored noise schedules, as a compelling direction for future work.
[1] L. Theis, A. van den Oord, and M. Bethge. A note on the evaluation of generative models. International Conference on Learning Representations, 2016.
[2] T. Kynkäänniemi, T. Karras, M. Aittala, T. Aila, and J. Lehtinen. The role of ImageNet classes in Fréchet Inception Distance. International Conference on Learning Representations, 2023.
[3] J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
[4] D. Kingma, T. Salimans, B. Poole, and J. Ho. Variational diffusion models. Advances in Neural Information Processing Systems, 34:21696–21707, 2021.
[5] Y. Song, C. Durkan, I. Murray, and S. Ermon. Maximum likelihood training of score-based diffusion models. Advances in Neural Information Processing Systems, 34:1415–1428, 2021.
[6] A. Q. Nichol and P. Dhariwal. Improved denoising diffusion probabilistic models. International Conference on Machine Learning, PMLR, 2021: 8162–8171.
Thank you for the clarification. This has addressed my concerns.
I thank the authors for their thorough response and explanation of the literature on likelihood. This has greatly clarified the scope of the paper for me.
Thank you very much for your kind follow-up. We’re truly glad to hear that our response helped clarify the scope and that your concerns have been addressed. We sincerely appreciate your thoughtful engagement throughout the discussion.
If the clarifications have meaningfully improved your impression of the work or helped resolve the concerns that initially informed your score, we would be sincerely grateful if you would consider reflecting that in your final evaluation.
Either way, we thank you again for your time and constructive feedback; it has been extremely valuable to us.
This paper introduces a new likelihood bound for noise-driven generative models, improving both the accuracy and efficiency of maximum likelihood estimation by extending the KL divergence–Fisher information connection to arbitrary noise perturbations. The method demonstrates competitive results on CIFAR-10 and ImageNet-32 without data augmentation.
Strengths and Weaknesses
Strengths: 1: The paper addresses an important limitation of perturbation-based likelihood models by providing a more efficient and theoretically grounded estimation approach.
2: It achieves competitive or state-of-the-art results on standard benchmarks like CIFAR-10 and ImageNet-32 without relying on data augmentation.
3: The proposed theoretical framework has promising implications for modelling both continuous and discrete data, broadening its practical impact.
Weaknesses: 1: The practical implementation details for selecting appropriate noise distributions could be elaborated more clearly for readers who do not have a strong background in this area.
2: The paper would benefit from more empirical comparisons to alternative state-of-the-art likelihood-estimation techniques beyond diffusion models, where possible.
3: The results are only on image datasets; additional experiments on other modalities would help strengthen claims about broader applicability.
4: The paper’s overall formatting is not very good, and small errors like these should have been avoided. For instance, at the end of some paragraphs, there are blank boxes.
Questions
FID is generally regarded as the primary metric for generation quality, yet Table 2 shows that only a few baseline methods are evaluated with this metric on ImageNet, which seems unusual. Could you clarify the reason for this?
Limitations
Authors already list their limitations in the paper.
Final Justification
The author's thoughtful rebuttal (including experiments and questions I have asked) has addressed my concerns.
Formatting Issues
N.A.
We thank the reviewer for their careful reading and constructive comments. Below, we clarify the main focus of our work and the role of our experiments.
1. Emphasis on Theoretical Contributions
We would like to stress that the primary contribution of our paper is theoretical. In particular:
- We develop a maximum‑likelihood learning framework for diffusion models under general noise conditions with a novel variance-aware NLL bound.
- Our analysis yields strong results for likelihood estimation, extending classical score-matching theory well beyond the isotropic Gaussian setting.
While previous generative-modeling research has predominantly emphasised practical heuristics, whether in network architectures or sample‐quality optimisation, our aim is to establish a unified, principled learning theory that rigorously addresses the shortcomings of score matching.
2. Role of Experiments
While we fully acknowledge the importance of empirical validation, our experiments are primarily intended to verify the correctness and predictive strength of our likelihood bounds, particularly in relation to noise variance. Although these experiments achieve state-of-the-art performance, they function chiefly as a sanity check. We contend that demonstrating concordance between our theoretical analysis and empirical results both reinforces our theoretical claims and attests to their practical significance.
We hope this clarification helps to underscore our paper’s focus on general learning theory and parameter‐estimation guarantees, with experiments playing a supportive, validating role. Thank you again for your insightful feedback.
General Perspective and Aims Regarding Score Matching
Score matching has become a powerful tool in machine learning for fitting unnormalized statistical models, most notably in score‑based generative modeling [1-3]. However, it still relies on computing the trace of the Hessian matrix, which becomes a major bottleneck in high‑dimensional problems. To address this, researchers have proposed:
- Approximate backpropagation [4]
- Curvature propagation [5]
- Reformulating score matching as a denoising objective [6]
These approaches either reduce the cost of Hessian-trace computations or bypass second-order derivatives entirely. Moreover, [7] established a theoretical link between score matching and maximum likelihood estimation under isotropic Gaussian noise, highlighting the method's robustness to noisy data. Yet the isotropic Gaussian noise assumption remains a limiting factor for real-world applications. In practice, degradations in computer vision and signal processing are far more complex:
- Real‑camera noise arises from multiple sources, shot noise, thermal noise, dark current, etc., and is further processed by in‑camera pipelines, resulting in spatio‑chromatic correlations and both signal‑dependent and signal‑independent components [8][9].
- Simplistic noise models used in most score‑matching studies bear little resemblance to these real conditions.
- Blur kernels in tasks like face super-resolution cannot be estimated explicitly, which further limits the applicability of the Gaussian assumption.
Motivated by these challenges, we ask:
Can the theoretical relationship between likelihood estimation and score matching be extended beyond the Gaussian to arbitrary noise, even without an explicit form?
We also note that recent work on score matching with missing data [10] has focused on truncated, denoising, and sliced variants. In contrast, our framework can be viewed as a highly general form of denoising score matching, obtained by treating a subset of corrupted observations as "missing." We believe this perspective not only broadens the applicability of score matching to realistic noise environments but also lays the groundwork for future exploration of its connections to missing-data methodologies.
Weaknesses
W1: We thank the reviewer for this suggestion. In practice, we chose our three canonical noise distributions for the following reasons:
- Uniform noise for dequantisation: uniform noise is the de-facto standard for dequantising discrete image data, hence its widespread use as a benchmark in likelihood-based models.
- Laplace noise as a heavy-tailed exemplar: the Laplace distribution is the prototypical heavy-tailed noise, frequently employed in machine-learning regularisation and robustness studies.
- Gaussian noise for its optimal information-theoretic properties (see lines 333–335).
In response to this suggestion, we have augmented our experiments with logistic noise, a bell-shaped symmetric distribution which, at the same fixed variance, delivers performance nearly identical to that of Laplace noise. These additional results further validate the robustness of our variance-aware likelihood bound across a wide range of noise distributions. We will incorporate these findings into the relevant tables and include full implementation details in the revised manuscript. Crucially, our use of uniform, Laplace, Gaussian and logistic noise is intended solely to illustrate the generality of our theoretical framework; a thorough exploration of application-specific noise schedules and their optimal tuning will be pursued in future work.
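As an illustration of how these noise families are put on an equal footing, the sketch below draws zero-mean, unit-variance samples from each, with the scale parameters chosen so the variances match. This is a toy example, not our training code:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

noises = {
    # Var[U(-a, a)] = a²/3  →  a = √3 for unit variance
    "uniform":  rng.uniform(-np.sqrt(3), np.sqrt(3), n),
    # Var[Laplace(0, b)] = 2b²  →  b = 1/√2
    "laplace":  rng.laplace(0.0, 1.0 / np.sqrt(2), n),
    "gaussian": rng.normal(0.0, 1.0, n),
    # Var[Logistic(0, s)] = π²s²/3  →  s = √3/π
    "logistic": rng.logistic(0.0, np.sqrt(3) / np.pi, n),
}
for name, u in noises.items():
    print(f"{name:9s} mean={u.mean():+.4f} var={u.var():.4f}")
```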
W2: Thank you for this suggestion. We agree that a broader empirical comparison would strengthen our claims. As shown in Table 1 of the main text, we have already compared our method against several state-of-the-art non-diffusion models on CIFAR-10 and ImageNet-32; however, we did not explicitly indicate each model's category. Below, we present a summary of the results on ImageNet-32 and ImageNet-64, now including the model types:
| Model | Type | ImageNet-32 | ImageNet‑64 |
|---|---|---|---|
| Our Method (ISIT) | Diffusion | 3.01 | 2.91 |
| Very Deep VAE [Child, 2020] | VAE | 3.80 | 3.52 |
| Flow Matching [Lipman et al., 2022] | Flow | 3.53 | 3.31 |
| Sparse Transformer [Child et al., 2019] | Autoregressive | - | 3.44 |
| W-PCDM [Li et al., 2024] | Diffusion | 3.32 | 2.95 |
By adding a “Type” column, readers can immediately see which paradigm each baseline represents. We will update and include the fully annotated table in the revised manuscript.
W3: To date, likelihood estimation with generative models on truly discrete data (e.g., text) remains under‑explored. This is because most generative frameworks define continuous densities, which become singular when evaluated on a discrete support. As a result, researchers commonly employ dequantisation to bound discrete likelihoods, see our response to Reviewer 1sQW for a full discussion. Existing dequantisation methods, however, introduce extra training overhead and suffer from a train–test mismatch.
Our Theorem 1 and Proposition 2 show how to resolve both of these issues, yielding tighter likelihood bounds without incurring additional training cost. For this reason, we have not included extra experiments on textual data in the current manuscript.
Instead, to demonstrate the real‑world utility of our likelihood bounds, we applied the bit‑back coding scheme for lossless compression on image data. Concretely, we compressed CIFAR‑10 using our variance‑aware bounds and obtained the following results:
| Model | Compression Rate (bits/dim) |
|---|---|
| LBB [Ho et al., 2019] | 3.12 |
| IDF [Hoogeboom et al., 2019] | 3.26 |
| VDM [Kingma et al., 2021] | 2.72 |
| W-PCDM [Li et al., 2024] | 2.37 |
| ISIT (Ours) | 2.57 |
(We leave the improvements of this avenue of research for further work.)
These experiments confirm that our theoretical improvements to likelihood estimation translate directly into practical gains in lossless image compression.
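As a back-of-the-envelope reading of these rates: a CIFAR-10 image has $d = 32 \times 32 \times 3 = 3072$ dimensions, so at our 2.57 bits/dim,
$$3072 \times 2.57 \approx 7895 \ \text{bits} \approx 987 \ \text{bytes},$$
versus 3072 bytes for the raw 8-bit image, i.e. roughly a $3.1\times$ lossless compression ratio.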
W4: Thank you for pointing out the formatting issues. We will carefully revise the manuscript to eliminate unintended artifacts and improve overall presentation.
Regarding the blank boxes appearing at the ends of paragraphs (e.g. lines 153, 187, 212), these were deliberately included to mark the end of a proof, an established convention (a “□” or QED symbol) in theoretical work. However, we appreciate that this may be unfamiliar to some readers and can be mistaken for a formatting error. In the revised version, we will include a brief note clarifying their purpose.
Question
Thank you for raising this insightful question. In maximum likelihood learning research, the community’s standard comparison metric is negative log‑likelihood (measured in bits‑per‑dimension), rather than FID. FID is generally reported only as a secondary measure for completeness, not as a benchmark for contribution claims. As we explain in our response to Reviewers 1sQW and trPf, this is precisely why so few prior works compute FID on ImageNet, most simply include it to round out their experimental section. We appreciate your attention to this detail and hope this clarification is helpful.
[1] Song, Yang, and Stefano Ermon. "Generative modeling by estimating gradients of the data distribution."
[2] Song, Yang, et al. "Sliced score matching: A scalable approach to density and score estimation."
[3] Song, Yang, et al. "Score-Based Generative Modeling through Stochastic Differential Equations."
[4] Kingma, Durk P., and Yann LeCun. "Regularized estimation of image statistics by score matching."
[5] Martens, James, Ilya Sutskever, and Kevin Swersky. "Estimating the Hessian by back-propagating curvature."
[6] Vincent, Pascal. "A connection between score matching and denoising autoencoders."
[7] Lyu, Siwei. "Interpretation and generalization of score matching."
[8] Zamir, Syed Waqas, et al. "CycleISP: Real image restoration via improved data synthesis."
[9] Yue, Zongsheng, et al. "Variational denoising network: Toward blind noise modeling and removal."
[10] Givens, Josh, Song Liu, and Henry Reeve. "Score Matching with Missing Data."
Thanks to the authors for their thoughtful rebuttal, especially the additional experiment and the detailed explanation of my question. I am quite satisfied with the rebuttal. I shall update my score.
Thank you very much for your kind feedback and for taking the time to engage with our rebuttal. We’re very glad to hear that our response and the additional experiment addressed your concerns, and we truly appreciate your willingness to update your score. We hope the revised evaluation reflects the contribution more accurately, and we’re grateful for your support of the paper.
This paper presents a novel likelihood bound for noise-driven generative models, aiming to improve both the accuracy and efficiency of maximum likelihood estimation. The main contribution is an extension of the KL divergence–Fisher information connection to general noise perturbations, transforming a theoretical insight into a practical design tool. This allows for the use of realistic noise distributions, such as those reflecting sensor noise or heavy-tailed behavior, while maintaining the standard diffusion training frameworks. By interpreting diffusion as a Gaussian channel, the authors derive an exact expression for the mismatched entropy between data and model. Empirically, their method achieves impressive likelihoods on CIFAR-10 and ImageNet-32.
Strengths and Weaknesses
Strengths
This paper intends to extend the training framework of regular diffusion to a much more general data-corruption process. Several theoretical results seem interesting. For example, in Theorem 1, they show that the score-matching-style loss extends to a general additive disturbance.
Weaknesses
- While the authors present a sequence of interesting theoretical results, it is still not very clear to me how they used those results to implement their algorithm. It would be helpful if they could provide a pseudocode block to present the entire pipeline.
- While the authors claim that the proposed method can handle arbitrary noise, this appears to be somewhat overclaimed. The authors should provide a better discussion of what properties they assume the noise to have.
Questions
I have mentioned several points in the Strengths and Weaknesses section. In addition:
- Does the framework require the conditional score function of the noise to have a closed-form expression?
- In Prop. 1 and Thm. 2, how do we obtain $\theta_1$ in real implementations, and how is it different from $\theta$?
- I am a bit confused about the notation in Prop. 1 (and Thm. 2). Do you intend to say something like $q(x; \theta_1)$ instead?
- In (9), $\hat{n}$ is trained to match $n$; is $n$ here Gaussian?
- The work you present here is similar to (Pavon 1990, Lemma 3.8). Could you comment on their relationship?
  - Michele Pavon and Anton Wakolbinger. On Free Energy, Stochastic Control, and Schrödinger Processes. Progress in Systems and Control Theory, vol. 10. Birkhäuser, Boston, MA, 1990.
Limitations
yes
Final Justification
Most of my initial concerns and questions have been addressed in the rebuttal. I believe much of the original confusion stemmed from the unclear notation and writing in the current version.
While I now have a better understanding of how the algorithm works, I do not feel I have sufficient background to fully assess its significance or the comprehensiveness of the empirical study. To reflect this, I have raised my score to 4 while lowering my confidence to 2.
Formatting Issues
No.
We are grateful for the reviewer's appreciation of the clarity and relevance of our work, and for raising insightful concerns. Below we respond to each comment.
W1: Thank you for the suggestion. Our approach can be viewed as a concatenation of two distinct noise regimes, each catering to a different range of noise variance:
- Low-variance regime ($0 \le \sigma < \sigma(0)$): We make no assumption on the noise distribution and allow arbitrary perturbations.
- High-variance regime ($\sigma(0) \le \sigma \le \sigma(1)$): We adopt a classic Gaussian noise model. Leveraging Gaussian noise in this range keeps the standard diffusion training methodology and optimisation process intact.
By concatenating these two regimes, our method combines the flexibility to handle diverse noise types at low intensities with the tighter theoretical guarantees of Gaussian noise at higher intensities. We agree that a pseudocode block would help present the entire pipeline; we appreciate the reviewer's suggestion and will add the following to the manuscript:
— Pre‑processing: Fisher‑consistent arbitrary noise —
for each x in X do
u ~ Ψ # Ψ is the ARBITRARY isotropic noise
v ← x + σ(0)·u # 0 < σ(0) ≪ 1
store v in X # no training needed
end for
— Algorithm 1: Training—
repeat
v ~ X
t ~ U(0,1)
α,σ ← Sched(t)
n ~ N(0,I) # N is the Gaussian noise
y ← α·v + σ·n # Gaussian channel
n̂(θ) ← f(y,t;θ)
take a gradient descent step on ∇θ ‖n − n̂(θ)‖²
until converged
Theorem 1: As σ → 0, minimising this loss is asymptotically equivalent to maximising the log-likelihood.
— Algorithm 2: Likelihood evaluation —
# 1. identical dequantisation at test time
u ~ Ψ; v ← x + σ(0)·u
# 2. Monte-Carlo estimation of the DSM integral (Thm. 2)
η ~ p(η) # log-SNR parameterisation, importance sampling
α, σ ← Sched(η)
n ~ N(0, I)
y ← α·v + σ·n # Gaussian channel
n̂(θ) ← f(y, η; θ)
L(θ) ← 0.5·Z·‖n − n̂(θ)‖² # Z is a normalising constant
# 3. Point-wise log-likelihood lower bound (Thm. 2)
ℓ̂(x; θ) ← H(p(y₁|x), π) + L(θ) # H(p(y₁|x), π) is pre-computed
return ℓ̂(x; θ)
Theorem 2: A novel, tighter, noise-variance-aware log-likelihood bound for standard Gaussian diffusion models.
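For readers who prefer something executable, here is a minimal PyTorch sketch of the pre-processing step and Algorithm 1. The toy denoiser, the cosine schedule, the random stand-in data, and the choice of Laplace noise for Ψ are illustrative assumptions, not our actual implementation:

```python
import torch

d = 32 * 32 * 3                                  # flattened image dimension
f = torch.nn.Sequential(                         # toy denoiser standing in for f(y, t; θ)
    torch.nn.Linear(d + 1, 256), torch.nn.SiLU(), torch.nn.Linear(256, d))
opt = torch.optim.Adam(f.parameters(), lr=1e-4)

def sched(t):                                    # placeholder variance-preserving schedule
    return torch.cos(0.5 * torch.pi * t), torch.sin(0.5 * torch.pi * t)

# Pre-processing: dequantise with ARBITRARY isotropic noise Ψ (Laplace here), σ(0) ≪ 1
x = torch.rand(128, d)                           # stand-in data batch in [0, 1]
u = torch.distributions.Laplace(0.0, 2 ** -0.5).sample(x.shape)
v = x + 1e-3 * u

# Algorithm 1: denoising score matching through the Gaussian channel
for step in range(100):
    t = torch.rand(v.shape[0], 1)
    alpha, sigma = sched(t)
    n = torch.randn_like(v)                      # Gaussian channel noise
    y = alpha * v + sigma * n
    n_hat = f(torch.cat([y, t], dim=1))          # n̂(θ) ← f(y, t; θ)
    loss = ((n - n_hat) ** 2).sum(dim=1).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```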
W2: Thank you for raising this important point. We agree that "arbitrary noise" sounds strong without further clarification, but we have made explicit in the manuscript that our theorems require only mild, broadly-applicable assumptions on the noise distribution (see, for example, line 142): we impose only zero mean and isotropy, with no additional requirements such as symmetry (unlike [1]), a closed-form score function (cf. Q1), or detailed noise statistics. In practice, these two conditions are satisfied by the vast majority of noise types encountered in imaging, signal processing, and real-world measurements.
Under these minimal assumptions, our method delivers five clear benefits:
1. A new theoretical paradigm for dequantising discrete data.
2. Robustness and stability at $t = 0$, resolving the train–test mismatch.
3. Optimal efficiency and performance in likelihood estimation: injecting an additional noise block at the start of diffusion improves the efficiency and tightness of our likelihood bound, yielding state-of-the-art likelihood estimates on ImageNet-32, -64, and -128 without extra computational overhead.
4. A foundation for future extensions to more complex or uncontrollable noise.
5. Exceptionally mild requirements that hold in the wild.
We hope this clarification demonstrates that our use of “arbitrary noise” is not an overclaim, but rather a reflection of the very limited and realistic assumptions underpinning our theoretical and empirical results.
Questions: We thank the reviewer for these thoughtful questions. As described in our Algorithms section, we partition the diffusion process into two noise regimes:
- Arbitrary-perturbation regime for $0 \le \sigma < \sigma(0)$: We impose no assumptions on the noise distribution here, and thus do not require any closed-form expression for the conditional score of the noise $\Psi$.
- Gaussian channel regime for $\sigma(0) \le \sigma \le \sigma(1)$: In this range, the noise is explicitly modelled as Gaussian. The conditional score admits the well-known analytical form
$$\nabla_{\mathbf{y}} \log p(\mathbf{y} \mid \mathbf{v}) = -\frac{\mathbf{y} - \alpha \mathbf{v}}{\sigma^2},$$
which our network approximates in the usual fashion.
Accordingly:
- Q1: We do not require a closed-form score function for arbitrary noise, only the standard Gaussian score in the high-noise regime.
- Q4: In (9), $\mathbf{n}$ is indeed drawn from $\mathcal{N}(\mathbf{0}, \mathbf{I})$; $\hat{\mathbf{n}}$ is trained specifically to match this Gaussian noise within the Gaussian channel.
We hope this clarifies that our framework leverages analytical tractability where available, without over‑constraining the low‑noise, arbitrary‑perturbation stage.
Mismatched channel
Below is an intuitive statement of our theoretical logic:
In information theory, a mismatched Gaussian channel refers to the scenario where two different inputs:
- the true data $\mathbf{x}$, and
- the assumed input produced by our network
are both passed through the same Gaussian channel. In the diffusion framework, any input to this channel will, over infinite time, converge to its stationary distribution. However, in practical diffusion models, the true data's output never exactly matches an isotropic Gaussian, whereas our generative sampling procedure does start from a perfect $\mathcal{N}(\mathbf{0}, \mathbf{I})$. To reconcile this mismatch, we fix the channel's assumed output to be its true stationary law, $\pi = \mathcal{N}(\mathbf{0}, \mathbf{I})$.
Concretely, we pick $\theta_1$ so that the score network at $t = 1$ recovers exactly the isotropic Gaussian equilibrium. Meanwhile, our main parameters $\theta$ are trained across the entire noise schedule via denoising score matching. This ensures that, although the real data's output never lies exactly in $\mathcal{N}(\mathbf{0}, \mathbf{I})$, our theoretical decomposition (Prop. 1, Thm. 2) remains valid by anchoring the high-noise endpoint to the true Gaussian stationary distribution.
Accordingly, below we address Q2 and Q3 in the context of the “mismatched Gaussian channel” logic just outlined.
Q2 – How do we obtain $\theta_1$ in practice, and how does it differ from $\theta$?
| Role | How it is obtained | Whether it is updated | Purpose |
|---|---|---|---|
| $\theta_1$ (reference parameters) | Fixed to ensure $q(\mathbf{y}_1; \theta_1) = \pi$. | Never updated during the main training loop. | To "anchor" the channel's output at $t = 1$ to its true stationary distribution $\pi = \mathcal{N}(\mathbf{0}, \mathbf{I})$. This is needed for the decomposition in Prop. 1 / Thm. 2. |
| $\theta$ (main parameters) | Trained with the denoising score-matching loss across the full noise schedule $t \in [0, 1]$. | Continuously updated throughout training. | To predict the noise $\hat{\mathbf{n}}$ at every intermediate noise level and drive the likelihood bound. |
Hence, $\theta_1$ is a fixed reference that guarantees $q(\mathbf{y}_1; \theta_1) = \pi$, whereas $\theta$ is the parameter vector we actually optimise.
Q3
- $p(\mathbf{y}_1 \mid \mathbf{x})$ denotes the distribution of the channel output when the true input is $\mathbf{x}$.
- $\pi$ is the stationary (isotropic-Gaussian) output distribution of the same channel under the assumed input dictated by $\theta_1$.
- The mismatched entropy term $H(p(\mathbf{y}_1 \mid \mathbf{x}), \pi)$ therefore measures the mismatch between these two output distributions after they pass through the identical channel.
So the notation is intentional: $p(\mathbf{y}_1 \mid \mathbf{x})$ is not $\pi$; rather, it is the channel's response to real data, while $\pi$ serves as the fixed Gaussian reference, i.e., the stationary distribution of the Markov chain.
Q5
Thank you very much for drawing our attention to [2]; we will cite this result in the revised manuscript. Lemma 3.8 in [2] is the classical KL–Fisher identity for a Gaussian channel, and it underpins several recent studies in deep learning [3], [4]. Prop. 1 of our paper indeed builds on the same identity (see Supplementary Material, line 757), so Pavon's work is certainly relevant.
Our contribution, however, diverges from Pavon's in two key respects:
1. Theorem 1 extends the KL–Fisher connection from the Gaussian case to arbitrary noise; Pavon's lemma therefore emerges as a special case of our result (line 153).
2. In diffusion models, the pathological behaviour at $t = 0$ creates a severe train–test mismatch. This issue is outside the scope of [2], but is resolved in our framework by combining Theorem 1 with Proposition 2, which in turn yields the likelihood bound of Theorem 2 and, empirically, state-of-the-art performance.
[1] K. R. Narayanan and A. R. Srinivasa. "On the thermodynamic temperature of a general distribution."
[2] Michele Pavon and Anton Wakolbinger. "On Free Energy, Stochastic Control, and Schrödinger Processes."
[3] Siwei Lyu. "Interpretation and generalization of score matching."
[4] Yang Song, Conor Durkan, Iain Murray, and Stefano Ermon. "Maximum likelihood training of score-based diffusion models."
Thank you for the detailed responses. I now have a clearer understanding of the work, and most of my concerns have been addressed. I do have a few follow-up points:
- Regarding the statement on arbitrary noise: I was aware of the assumptions mentioned in the statement, but my point is that I don't believe they are sufficient. For example, in Theorem 1, it is important to ensure that $\Psi$ is chosen appropriately so that the relevant score exists and the Fisher divergence in (4) converges. I understand that extreme rigor is not always required in machine learning papers, and personally, I would be satisfied with a statement like: "$\Psi$ is assumed to behave sufficiently well so that all quantities involved in the analysis are well-defined." This is a minor issue, but the authors may want to reconsider their wording to make the claim more convincing.
- On the notation: I believe the current notation system should be significantly simplified, as it introduces unnecessary confusion. For instance, in the expression $\pi(x) := q(y_1; \theta_1)$, it seems you are referring to distributions via their density functions. A clearer and more standard notation might be $q_1(x; \theta_1)$. The current formulation lacks mathematical clarity.
I will update my score accordingly.
Thank you for your thoughtful comments and for engaging deeply with our paper.
- Regarding the assumptions on the noise distribution $\Psi$ in Theorem 1, we would like to clarify that the score function of $\Psi$ is not required. This is because the KL divergence is computed over the perturbed variable $\mathbf{Z}_t = \alpha_t \mathbf{X} + \sigma_t \Psi$, and the limiting expression depends only on the score functions of the data distribution $p(\mathbf{x})$ and the model $q(\mathbf{x}; \theta)$, as shown in Equation (4), which defines the Fisher divergence (i.e., score matching) between $p$ and $q$, evaluated over the original input space $\mathbf{x}$ and therefore independent of the noise. The noise $\Psi$ does not appear in this final expression. Moreover, our proof is based on characteristic functions, which do not require the noise distribution to have a differentiable density or well-defined score.
Our analysis assumes only that $\Psi$ has zero mean and identity covariance, and that the perturbed distributions $p_{\sigma_t^2}$ and $q_{\sigma_t^2}$ admit well-defined densities such that the KL divergence is finite and differentiable with respect to $\sigma_t^2$. We do not rely on any score-related property of $\Psi$. The noise simply serves to define the perturbation and does not directly enter into the divergence computation. This stands in contrast to denoising score matching (DSM) [1], where the score of the noise distribution is explicitly required, as evident from Theorem 2 and Proposition 1. That said, we fully agree that the current wording could benefit from improved clarity, and we are happy to revise the statement to something like: “Ψ is assumed to behave sufficiently well so that all quantities involved in the analysis are well-defined.”
- Regarding the notation, we fully agree with your suggestion. We acknowledge that expressions such as $\pi(x) := q(y_1; \theta_1)$ may be unnecessarily confusing, and we will revise them to use clearer and more standard notation such as $q_1(x; \theta^*)$ in the final version to improve clarity and mathematical precision.
We’re very grateful for your feedback and your willingness to reconsider your score. If there are still any specific concerns or remaining issues that you feel are holding the work back from a stronger rating, we would sincerely appreciate it if you could let us know. We are truly eager to improve the paper as much as possible, and your input would be invaluable in helping us reach that goal.
[1] Vincent, Pascal. "A connection between score matching and denoising autoencoders." Neural computation 23.7 (2011): 1661-1674.
The paper addresses the problem of likelihood estimation in diffusion models. The authors generalize a previously known relationship between the KL divergence and Fisher information to non-Gaussian perturbations. They perform numerical experiments showing likelihood bounds on ImageNet 32×32 and CIFAR-10.
Strengths and Weaknesses
I'll start this review by saying that I am not very familiar with the subject of likelihood estimation for diffusion models. Therefore, it is hard for me to judge novelty and assess some of the claims made in the paper.
Strengths
The authors explain the goal and motivation of the paper very well. The result of Theorem 1 seems novel. Although the experimental results show that Gaussian perturbations still lead to better results, I think this generalization is theoretically interesting and may provide a useful foundation for future analysis. The authors also acknowledge the limitations of their work.
Weaknesses
I found the paper a bit dense and hard to follow. The proposed method also seems to be computationally expensive, although the authors do not discuss this aspect or compare it to the cost of existing estimation approaches. Also, as mentioned by the authors, the method is limited to likelihood estimation and does not offer a training objective or sampling scheme for diffusion models with more general noise schedules. And finally, the bound does not seem to yield much tighter bounds than existing works.
Questions
At the beginning of the paper, you claim that IT bounds may be looser than the ELBO, but it seems like the bound of [30] is slightly tighter than the ELBO. Can you elaborate on that?
Limitations
yes
Final Justification
I read the rebuttal and the authors' explanations. I will keep my score.
Formatting Issues
N.A.
We are grateful for the reviewer’s thoughtful comments, incisive critiques and detailed questions. We recognise that our work builds on concepts from information theory, statistics and signal processing, domains whose terminology can be dense for readers unfamiliar with them. We hope that our point‑by‑point response has clarified how these principles support our focus on maximum‑likelihood learning at the foundations of machine learning.
Overall
Average log-likelihood is widely recognised as the default metric for evaluating generative image models. Previous work has largely prioritised perceptual quality, emphasising coarse-scale patterns and global consistency of generated images, with common metrics such as the Fréchet Inception Distance (FID) and the Inception Score (IS). In contrast, we optimise for likelihood, a criterion that is inherently sensitive to fine-scale details and the exact values of individual pixels. Mathematically, these objectives differ markedly in both form and focus (for details, see lines 20–24 of the main text) [1], [2], [3]. The visual appeal of generated images (as measured by, e.g., FID) correlates imperfectly with log-likelihood [4]. We focus here on pushing the state of the art in log-likelihood estimation, and while we report FID for completeness, we defer sample-quality optimisation to future work (for details, see supplementary material, Section C: Sample Quality and FID).
Meanwhile, modern generative models operate in a continuous domain, yet many real-world datasets are inherently discrete. When a continuous-density model is fitted directly to such data, the likelihood evaluation becomes singular and severely degrades performance. The conventional remedies, uniform or variational dequantisation [3], inject auxiliary noise but suffer two drawbacks: they require an additional training phase, which is difficult to train to optimality, and, in general, they introduce a pronounced train–test gap that inflates the negative log-likelihood (NLL). Moreover, another significant issue occurs at the very start of the diffusion process ($t = 0$), where numerical instabilities plague both training and sampling. To avoid this, practitioners typically substitute $t = 0$ with a small constant $\epsilon > 0$, but this workaround again introduces a discrepancy between training and evaluation.
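For concreteness, the conventional uniform-dequantisation recipe being discussed looks as follows. This is an illustrative sketch; our method instead allows arbitrary low-variance continuous noise here, without a separate dequantisation network:

```python
import numpy as np

rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(4, 32, 32, 3))   # discrete 8-bit image data

# Uniform dequantisation: add U[0,1) noise, rescale to a continuous density on [0,1).
# By Jensen's inequality the continuous likelihood lower-bounds the discrete one:
#   log P(pixels) >= E_u[ log p_continuous((pixels + u) / 256) ] + const
u = rng.uniform(0.0, 1.0, size=pixels.shape)
x_continuous = (pixels + u) / 256.0
```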
To eliminate both sources of discrepancy and the numerical instabilities, we prove Theorem 1 and Proposition 2, establishing that, for any i.i.d. continuous noise distribution with sufficiently small variance, the Fisher divergence is asymptotically equivalent to the KL divergence. Consequently, we may dequantise the discrete data with arbitrary low-variance continuous noise and feed the resulting continuous samples into the incremental-SNR Gaussian channel, that is, the standard diffusion process, without retraining the model or altering its architecture. Building on this insight, Theorem 2 and Proposition 1 yield a tightened analytic bound on the NLL of classical diffusion models (recovering Theorem 1 as the Gaussian special case). Empirically, our method removes the train–test gap, stabilises optimisation near $t = 0$, and delivers state-of-the-art likelihood estimates across ImageNet-32, -64, and -128, together with competitive performance on CIFAR-10.
Weakness
Thank you for highlighting the importance of computational cost.
As you point out, maximum-likelihood training in the generative-model setting can be extremely demanding; after all, we optimise for every pixel value, which is why most prior likelihood-based studies have been confined to low resolutions (e.g. 32×32). However, in response to another reviewer's request, we have extended our experiments to 64×64 and 128×128 resolutions (see our reply to Reviewer trPf).
Typically, achieving benchmark likelihood-estimation performance requires several million training iterations, and the training process usually takes a week, or even a month or longer. However, our likelihood bounds, together with additional methods, largely mitigate this issue: we require only 300,000 iterations to achieve optimal performance. In the main text (lines 302–305) and in greater detail in the supplementary material (lines 976–984), we present comparative tables that report training-iteration counts for different likelihood-estimation methods. Although absolute iteration times vary with hardware and network architecture, iteration count is a robust proxy for relative compute cost (see Table 1 in the main text):
Models with information-theoretic methods
| Model | CIFAR-10 NLL ↓ | Iter. (million) |
|---|---|---|
| ScoreODE (second order) | 3.44 | 1.3 |
| ScoreODE (third order) | 3.38 | 1.3 |
| ScoreFlow | 2.90 | 0.3 |
| Flow Matching | 2.99 | 0.391 |
| Stoch. Interp | 2.99 | 0.5 |
| i‑DODE | 2.56 | 6.2 |
| ISIT (SP with IS, ours) | 2.49 | 0.3 |
| ISIT (VP with IS, ours) | 2.50 | 0.3 |
Models with variational bounds
| Model | CIFAR-10 NLL ↓ | Iter. (million) |
|---|---|---|
| VDM | 2.65 | 10 |
| DiffEnc | 2.62 | 8 |
| MuLAN | 2.55 | 8 |
| BSI | 2.64 | 10 |
| W‑PCDM (VDM weight) | 2.35 | 2 |
| W‑PCDM | 10.31 | 2 |
- Previous information-theoretic bounds typically require on the order of 1 million training iterations merely to reach competitive likelihoods, yet still do not attain state-of-the-art performance (cf. main text, lines 39–43).
- ELBO-based methods converge to SOTA likelihood in 8–10 million iterations.
- Our information-theoretic bound converges in only ≈ 300,000 iterations to the second-best likelihood (in BPD) on CIFAR-10 and SOTA on ImageNet.
Moreover, we discuss our training iteration speed at line 304 of the main text. It is worth noting that in the likelihood-estimation domain, even a 0.01 bits-per-dimension (BPD) improvement represents a substantial advance (e.g. the improvement of MuLAN over i-DODE). While the ELBO-based W-PCDM remains the strongest existing approach on CIFAR-10, Theorem 2 allows us to nearly close the performance gap at a fraction of the computational effort.
Claims of limitation
We also thank the reviewer for the observation regarding our claims of limitations. Let us clarify the intended scope and contributions of our work:
- Focus on likelihood rather than perceptual metrics. When we state that our method is "limited to likelihood estimation," we mean precisely that our training objective, and hence our strongest empirical gains, are measured in terms of negative log-likelihood (NLL) rather than FID. A fuller discussion of this point, and comparative FID results, can be found in our response to Reviewer trPf and in lines 985–990 of the supplementary material.
- Sampling schemes for general noise schedules. It is true that we do not propose a new sampling algorithm optimised for arbitrary noise. Our primary goal was to develop a principled training objective, namely the novel likelihood bound for diffusion models, rather than to accelerate sampling or improve sample quality and diversity under arbitrary noise. Details of our algorithm pipeline appear in our response to Reviewer Nedb. We regard the design of specialised sampling methods as an important avenue for future work and acknowledge this as a current limitation.
- Tighter bounds through Theorem 2. Although our bound is framed as a likelihood objective, it is indeed tighter in the sense that lower NLL corresponds to a strictly better bound on the data density. In practice, we achieve noticeably improved NLLs over prior methods without increasing computational cost on ImageNet-32, -64, and -128.
- Relation to multiscale approaches. We note that W-PCDM attains superior CIFAR-10 NLL by augmenting its architecture with multiscale modelling components, specifically Laplacian pyramids and wavelet transforms. We fully expect that integrating analogous multiscale designs into our framework would similarly elevate our performance, and we intend to explore this in future work.
We hope these clarifications demonstrate both the precise remit of our contribution and the potential for further enhancements.
Questions
We appreciate the reviewer’s insightful question.
Reference [5] was indeed one of the key works that motivated our use of information-theoretic tools in diffusion modelling. In that paper, the authors employ the classical I-MMSE identity from information theory to derive a novel diffusion bound and report a CIFAR-10 NLL of 2.90, certainly a promising result. However, this value remains above the best ELBO-based bound at the time (approximately 2.65 BPD), so in practice their bound does not surpass the ELBO in tightness (recall that lower NLL indicates a tighter log-likelihood bound). Building on this insight, our work is instead guided by Song et al.'s [3] exposition of the KL–Fisher connection for score matching, which yields substantially tighter likelihood bounds. We discuss these developments in greater detail in the Related Works section (line 281).
[1] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
[2] Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. Advances in Neural Information Processing Systems, 34:21696–21707, 2021.
[3] Yang Song, Conor Durkan, Iain Murray, and Stefano Ermon. Maximum likelihood training of score-based diffusion models. Advances in Neural Information Processing Systems, 34:1415–1428, 2021.
[4] Subham Sahoo, Aaron Gokaslan, Christopher M. De Sa, and Volodymyr Kuleshov. Diffusion models with learned adaptive noise. Advances in Neural Information Processing Systems, 37:105730–105779, 2024.
[5] Xianghao Kong, Rob Brekelmans, and Greg Ver Steeg. Information-theoretic diffusion. In The Eleventh International Conference on Learning Representations, 2023.
I thank the authors for their detailed response. I appreciate the additional explanations on the computational cost and on the tightness of IT bounds versus the ELBO. I will keep my score.
Dear Reviewer,
Thank you again for your thoughtful feedback and engagement.
We noticed that your score remains unchanged. Could you perhaps elaborate on what your remaining criticism is, i.e. why you did not give a higher rating? Is there something we could do to further improve your rating?
If there are specific concerns that prevented a higher rating, we'd be grateful if you could share them; we're keen to understand and improve.
To reiterate: our work establishes a principled and scalable likelihood framework that generalises score matching to arbitrary noise and achieves state-of-the-art density estimation on ImageNet-32, -64, and -128. We believe this makes a solid contribution to the foundations of generative modelling.
Thank you again for your consideration.
Thank you for the follow-up and for clarifying the intended scope of the work. I understand that the main focus of the paper is likelihood estimation as opposed to improving all aspects of generative modelling such as sampling or visual metrics such as the FID.
I will explain why I decided to keep my score. I found the main result to be theoretically interesting and the experiments show that the paper's bound is slightly tighter than the previous ones. On the other hand, I believe the scope of the paper is quite limited, and that it would have been more significant with additional contributions. That being said, I fully acknowledge my lack of familiarity with the literature of likelihood estimation, so it is hard for me to judge the significance of the results properly. I am therefore open to discussing with the other reviewers and the AC during the discussion period.
Thank you very much for your follow-up and for your openness to discussion. We truly appreciate your thoughtful engagement with the paper.
Regarding your concern about scope: while we fully understand that the paper does not aim to improve all aspects of generative modelling, we would like to respectfully clarify that making progress in likelihood estimation is particularly challenging. Unlike visual metrics such as FID, which can often be improved substantially through sampling heuristics or architectural tricks, NLL improvements are fundamentally harder, as they require genuine improvements in how the model fits the underlying data distribution. In recent years, even top-tier models from leading research groups [1–6] have only managed to reduce NLL on datasets like ImageNet by small fractions (e.g., 0.02–0.1 bits per dimension), and these improvements are considered highly nontrivial.
In this context, our work provides a principled and tractable likelihood objective that generalises score matching under arbitrary noise and achieves state-of-the-art density estimation performance across multiple ImageNet resolutions. We believe this contribution can serve as a strong foundation for future advances in likelihood-based generative modelling, especially for those aiming to move beyond sampling-focused heuristics.
Once again, thank you for your time and your willingness to coordinate with the other reviewers and the AC. If there are any remaining aspects that we could clarify or strengthen to better convey the significance of our contribution, we would be more than happy to do so.
[1] Kingma, D., Salimans, T., Poole, B., & Ho, J. (2021). Variational diffusion models. Advances in neural information processing systems, 34, 21696-21707.
[2] Song, Y., Durkan, C., Murray, I., & Ermon, S. (2021). Maximum likelihood training of score-based diffusion models. Advances in neural information processing systems, 34, 1415-1428.
[3] Lu, C., Zheng, K., Bao, F., Chen, J., Li, C., & Zhu, J. (2022, June). Maximum likelihood training for score-based diffusion odes by high order denoising score matching. In International conference on machine learning (pp. 14429-14460). PMLR.
[4] Zheng, K., Lu, C., Chen, J., & Zhu, J. (2023, July). Improved techniques for maximum likelihood estimation for diffusion odes. In International Conference on Machine Learning (pp. 42363-42389). PMLR.
[5] Bao, F., Li, C., Zhu, J., & Zhang, B. (2022). Analytic-dpm: an analytic estimate of the optimal reverse variance in diffusion probabilistic models. ICLR.
[6] Nichol, A. Q., & Dhariwal, P. (2021, July). Improved denoising diffusion probabilistic models. In International conference on machine learning (pp. 8162-8171). PMLR.
This paper derives a connection between KL divergence and Fisher divergence for diffusions with arbitrary noise distributions. While these results are asymptotic and hold in the small-noise regime, they nevertheless address two key issues in likelihood training for diffusion models: (a) handling the degeneracy at $t = 0$, and (b) handling the train–test gap that arises when dequantising. These new results are used to design a method that obtains SoTA or nearly SoTA NLL, using significantly less training time, on both CIFAR-10 and ImageNet datasets.
Ultimately after the rebuttal/discussion period, the reviews on this paper were overall positive, recognizing both the theoretical and practical merits of this work. For the camera-ready, I ask that the authors incorporate the rebuttal to Reviewer Nedb regarding both the pseudo-code for the ISIT algorithms and the notational improvements. Furthermore, some exposition about how the algorithms arise as a consequence of both Theorem 1 and Proposition 1 would substantially improve the clarity of the manuscript.