Learning normalized image densities via dual score matching
We train an energy-based model on image datasets through a dual score matching objective and analyze the local geometry of the learned energy landscape.
Summary
Reviews and Discussion
The paper introduces a method for learning normalized energy-based models by extending score matching to account for both spatial and temporal gradients. Leveraging the structure of diffusion models, the authors propose a novel dual score matching objective that enables direct estimation of energy values rather than just their gradients. The method is demonstrated on a synthetic toy example and ImageNet64, showing consistent energy estimates and competitive negative log-likelihood performance. The paper also explores the geometry of the learned energy landscape.
Strengths and Weaknesses
Strengths:
- The paper tackles a key limitation of score-based models (their inability to estimate normalized densities) and proposes a direct way to recover normalized energies using both space and time gradients in diffusion.
- Learning a normalized energy opens the door to likelihood-based evaluation, OOD detection, and deeper understanding of image statistics — things that score-based models can’t directly do.
- On ImageNet64, the proposed method achieves negative log-likelihood (NLL) scores that are competitive with several strong baselines, including flow-based models.
- The paper includes a detailed empirical study of the learned energy geometry, revealing properties like variation in local dimensionality and log-probability distributions across image content.
Weaknesses:
- The paper lacks theoretical guarantees that minimizing the proposed objective leads to an accurate approximation of the true log-density, which makes some of the empirical findings (e.g., observed geometric properties) harder to fully trust.
- The paper evaluates only on ImageNet64 (and on a synthetic example). Including additional datasets would help demonstrate the generality and robustness of the proposed method.
- While the paper focuses on likelihood and geometric analysis, it would be valuable to demonstrate downstream applications (e.g., out-of-distribution or outlier detection) where normalized energy estimates are especially useful.
Questions
- One of the main motivations for learning normalized energy is enabling tasks like OOD detection or uncertainty estimation. Have you tried applying your model in one of these settings? Even a small experiment could help illustrate the practical value beyond likelihood evaluation.
- The method is only tested on ImageNet64 (and a toy example). It would be helpful to know whether it generalizes to other datasets. Even a single additional benchmark could make the results feel more robust.
Limitations
Yes
Final Justification
While some concerns remain, such as the lack of comparison to another baseline on the additional datasets, the rebuttal provides clarifications and new experiments, which helped me better understand the strengths and limitations of the work. I therefore keep the borderline accept recommendation.
Formatting Issues
None
Thank you for your review, which will help us to improve the paper. We ran several experiments to address your questions, which we answer below.
Other datasets: We applied our model to CelebA images at 80x80 resolution. Our model achieves a NLL of 1.94 bits/dim. We are not aware of reported NLLs on CelebA (ImageNet64 has been a de facto standard), but it is reasonable that the value is lower than for ImageNet64 due to reduced diversity in the images. Additionally, we verified that the histogram of log probabilities (as in Figure 3) is well-approximated by a Gumbel distribution (apart from a few outliers on the low-probability edge, due to corrupted images). Over the 20k test images, the range from highest to lowest probabilities is 17dB/dim. This is half that of ImageNet64, which is also reasonable given the lower diversity in image content. We also ran our method on CIFAR-10, obtaining an NLL of 3.16 bits/dim, and a similarly-skewed log p distribution (with a range of values spanning 23 dB/dim). We hope that these additional results alleviate the reviewer’s concerns. We believe that most photographic image datasets should lead to qualitatively similar results regarding the experiments of Section 3.
Downstream applications of energy-based models: OOD detection based on probability values is notoriously difficult, due to the counterintuitive behavior of probability densities in high-dimensional spaces. As we show in Figure 3, the highest-probability images are not typical, and could be considered outliers (this is particularly apparent in the highest-probability image, which is blank). The following paper demonstrates that generative models can assign higher probability density values to images from other datasets (e.g., MNIST) than to those in their training set (e.g., FashionMNIST): Nalisnick, E., Matsukawa, A., Teh, Y. W., Gorur, D. & Lakshminarayanan, B. Do Deep Generative Models Know What They Don’t Know? (2019).
Nonetheless, probability values can be used to detect certain degradations such as additive noise. We verified that adding noise to the reference images in Figure 3 decreases their probability (this experiment is similar to the third panel of Figure 5, except that the noise level given to the model is 0). In addition, one application of our model is that it can be used to estimate the variance of the noise added to an image when it is unknown. Indeed, the noise level that maximizes the model probability provides an accurate estimate of the true noise variance. We will incorporate figures demonstrating these two experiments in the paper. Finally, we highlight that our analysis of the learned energy model revealed novel statistical and geometric properties of image densities learned from data, which we believe to be of general interest to the NeurIPS community.
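The noise-variance estimation described above can be illustrated with a toy stand-in for the learned model: if both the clean signal and the noise are Gaussian, the noise level that maximizes the likelihood of the noisy data recovers the true noise standard deviation. A minimal sketch (the Gaussian model, grid, and all names are illustrative assumptions, not the paper's actual energy network):

```python
import math, random

random.seed(0)
SIGNAL_VAR, TRUE_SIGMA, N = 1.0, 0.5, 20000

# Noisy observations: clean N(0, SIGNAL_VAR) samples plus N(0, TRUE_SIGMA^2) noise.
noisy = [random.gauss(0, math.sqrt(SIGNAL_VAR)) + random.gauss(0, TRUE_SIGMA)
         for _ in range(N)]

def avg_log_lik(sigma):
    # Log-density of the noisy data under the model at candidate noise level sigma.
    var = SIGNAL_VAR + sigma ** 2
    return sum(-0.5 * (y * y / var + math.log(2 * math.pi * var)) for y in noisy) / N

# Grid search: the sigma that maximizes model likelihood estimates the true one.
grid = [i * 0.05 for i in range(21)]
best = max(grid, key=avg_log_lik)
assert abs(best - TRUE_SIGMA) <= 0.1
```

With enough samples, the maximizing noise level lands within one grid step of the true value, which is the mechanism the rebuttal appeals to.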
On theoretical guarantees of denoising score matching: We briefly touch on this subtle point in Appendix A.3. Obtaining theoretical guarantees requires establishing a Poincaré inequality on the joint distribution, which Figure 1 suggests could be true, but such inequalities are notoriously challenging to prove and have only been obtained in restricted settings. However, even if this inequality were shown to hold, in practice there is no guarantee that the trained network generalizes. Proving that this is the case is an even bigger open problem. So one always needs to resort to empirical verification, which is complicated by the fact that the “true probability density” is not known for images. This is the goal of the strong generalization test of Figure 2, which shows that energy models trained on distinct images compute nearly identical probabilities. Our energy-based model is the first for which such a stringent test has been performed. (We also conduct a more direct verification on synthetic data in Figure 1.) We believe these tests give as much credibility as can realistically be obtained to the other empirical findings of Figures 3 to 5.
Thank you for your thoughtful rebuttal. While some concerns remain, such as the lack of comparison to another baseline on the additional datasets, I appreciate the clarifications and new experiments, which helped me better understand the strengths and limitations of your work. I therefore lean toward maintaining my borderline accept recommendation.
The paper has two related but distinct contributions.
Firstly, the paper proposes a framework called “Dual Score Matching”, which extends the original denoising score matching (DSM) objective. On top of traditional DSM, the authors add matching of the “time score”, a quantity similar to the score but with the gradient taken with respect to time.
Secondly, the authors do not model the score directly; instead they model the “normalised energy”, which is almost equivalent to the log-probability. They came up with a new architecture that models the energy as a conservative field.
In the experiment section, denoising performance and log-likelihood comparisons have been made. Additionally, a significant part of the experiment section focuses on analysing the image distribution using the learned model.
Strengths and Weaknesses
The authors have investigated a more accurate score matching by introducing a time score component. The authors claim that this modification leads to better denoising and log-likelihoods. While improving the fundamentals of score matching is commendable, the paper has several flaws in terms of novelty, presentation, focus and evaluations.
- It is quite unclear to me why the time score matching is necessary. There is not much reasoning or justification given to support usage of the time score. Is it only to improve the estimated values at the network output? The time score does not seem to have any explicit usage in the paper (e.g. in log-likelihood calculation) — am I wrong in this assessment? The only piece of reasoning I found in the paper is (L42) “[it] .. ensure consistency of the energy estimates across noise levels”. What does this even mean? There is no explanation of this sentence. Why would a score network without the time matching part be NOT consistent across noise levels?
- The authors claim the second objective (i.e. the time score matching) is novel (L42; “second is novel”), but clearly there is a 2022 paper by Kristy Choi et al. (which is indeed cited) that proposes the time score and the matching objective. There is no comparison or discussion of this paper.
- I do not quite understand how section 3 of the paper is related to dual score matching, which is the central theme of the paper. Computing entropy, log density distribution, intensity range, dimensionality — these seem to be related more to the dataset and do not have much to do with the model or the algorithm presented in the paper. I'm sure any other model that computes log-likelihood and is trained properly would behave the same way.
- I'm quite doubtful about the usage of conservative energy models. In the Diffusion literature, it is quite an accepted fact that conservative energy models are NOT superior to unconstrained score models. The authors accepted that their finding is (L162) “contrary to” the results of Salimans & Ho [2021], where they attributed this to the specific choice of architecture. However, there is no ablation regarding the choice of architecture. Also, it is unfair to compare with Salimans & Ho [2021] as they investigated generation/sampling performance but this paper did not.
- The authors also claimed that the (L178) “.. unique advantage of our method is that .. it is fast [for NLL computation] ..”, which seems to be an over-claim. The reason it is fast has not much to do with the specific parameterisation or the matching algorithm, but with the specific NLL calculation algorithm that they adopted from Karczewski et al. [33]. These important details only appear in the appendix rather than being mentioned clearly in the main paper.
- There are a lot of approximations made in the theoretical calculations and justifications. On one hand you argue that your method is better due to the “homogeneity” property, but on the other hand you say that the network is only approximately (L508 appendix) homogeneous. On one hand, you argue about your energy functions being “normalised”, only to reveal later that it is only an approximation. Also, the explanation in appendix A.3 is quite hand-wavy, non-rigorous and does not conclude anything concrete. In general, it seems that the authors are quite uncertain about their rationales/claims.
Overall, I do not see much novelty (or proper justification for it), compounded by the fact that many of the rationales are quite uncertain and not concrete. Also, the paper lacks proper experimentation to validate the claims.
Questions
Major questions:
- Since you are modelling the energy using a neural network but still matching the “score”, you require a “double backpropagation” (once with respect to space and then parameters). It is quite evident from Eq. 43 of appendix C.2. This is clearly a computational drawback but not discussed in the paper (except one small line in the limitation section).
- With regard to the “normalisation” in Eq. 9, do you need to invoke the network twice for every forward pass (i.e. once for the given input and once for the normalisation reference)? If so, it is also a computational drawback. Please clarify.
- I feel like the two parts of the paper are sort of unrelated. Is it necessary to consider energy models in order to use dual score matching? Could you have done dual score matching with score models (one model for space and another for time)? There is no ablation or justification for these questions.
Minor questions:
- In several places in the paper it says "gradient of the energy is the score”, which is technically untrue. The score is the negative of the gradient of the energy. Isn't that true?
- A similar issue is with the term “normalised energy (log prob)”, which is also technically not true. Log-probability is the “normalised exponentiated negative energy”. It seems that the authors used certain terms in a very loose sense.
- Equations 2 & 3 may have a typo: shouldn't the sign of the energy term be flipped?
Limitations
Limitations are mentioned, but they are rather big drawbacks of the paper and should be discussed in the main paper.
Final Justification
The rebuttal clarified some core concerns. But the paper needs to be polished quite a bit. There are still downsides about the paper, regarding practicality and overall impact. I am increasing the score one point, but still slightly lean on the negative side.
See my rebuttal response for details: https://openreview.net/forum?id=wtYcS4kxpF&noteId=xYq2B0yBeF
Formatting Issues
NA
Thank you for the detailed review, which will help us significantly improve the paper.
Clarifying contributions: We believe that the primary contributions of the paper are twofold:
1. A combination of architecture and objective that allows efficient training of an accurate normalized energy model. We model the energy directly (rather than the score) so that energy computation is done in a single forward pass. The purpose of the time score objective is to ensure that the energy values are accurate and normalized (see below). They go hand in hand: in our setting, time score matching is meaningless for score models (see below).
2. Analysis of the model (trained on ImageNet) to elucidate properties of natural image densities. We believe these observations to be novel and of value to the community (as mentioned by Reviewer hMws).
These two contributions are not disconnected. While most of the experiments in 2. could in principle be done with any generative model that can compute probabilities, they are much more easily done when the log probability is explicitly available with a single forward pass. Furthermore, the dimensionality measurements require access to the distribution of noisy images. Our energy model is the only one that provides access to the density of clean or noisy images in a single forward pass. We will incorporate these points into the introduction.
Novelty: Beyond our combination of energy architecture and dual score matching objective which is novel (see below for time score matching), we are not aware that any of the results in Section 3 have appeared in the literature. Could the reviewer provide references that establish similar results?
Purpose of time score matching: While the time score is not used at test time to evaluate the energy, it plays a crucial role during training to ensure that the learned energy (rather than the “space” score) is accurate. This is demonstrated in Figure 1: the energy model trained with single score matching does not learn the ground truth energy, while dual score matching succeeds. The reason is that single score matching does not sufficiently constrain the values of the energy when the data is multimodal: one can add different constants to the energy in disconnected modes without affecting the score. Matching time scores fixes these unwanted degrees of freedom. This is because the joint distribution over noisy images and noise levels is unimodal, contrary to the distribution at any fixed noise level, as visualized in the right panel of Figure 1. Matching its score, which contains both the “space” and time components, thus constrains the energy up to a single additive constant, as opposed to one constant per mode. We will clarify this central point in the paper, which was not sufficiently explained.
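The degeneracy can be checked numerically in one dimension: shifting the energy by a different constant in each well-separated mode leaves the spatial score untouched wherever there is data, so score matching alone cannot distinguish the two energies. A stdlib sketch of this (the two-Gaussian mixture and the offset of 5 are arbitrary illustrative choices, not taken from the paper):

```python
import math

# Toy 1-D mixture of two well-separated Gaussians (modes at -4 and +4).
def energy(x):
    p = 0.5 * math.exp(-0.5 * (x + 4) ** 2) + 0.5 * math.exp(-0.5 * (x - 4) ** 2)
    return -math.log(p)

# Same energy, but with a different additive constant in the right-hand mode.
def energy_shifted(x):
    return energy(x) + (5.0 if x > 0 else 0.0)

def score(f, x, h=1e-5):
    # Finite-difference spatial derivative: the quantity single score matching fits.
    return (f(x + h) - f(x - h)) / (2 * h)

# Inside each mode, the two energies have identical spatial scores...
for x in (-4.2, -3.8, 3.8, 4.2):
    assert abs(score(energy, x) - score(energy_shifted, x)) < 1e-6
# ...yet they assign very different relative probabilities to the two modes.
delta = (energy_shifted(4.0) - energy_shifted(-4.0)) - (energy(4.0) - energy(-4.0))
assert abs(delta) > 4.0
```

The "stitching" between the constants happens near x = 0, where the mixture has essentially no mass, so the spurious score contribution there is never penalized by the data.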
Novelty of time score matching: While time score matching has been proposed before, our particular objective is novel. In Choi et al., the time score matching objective uses a second derivative in time, which is more computationally expensive than our regression-based objective. The conditional expectation identity (4), and the resulting objective (5), are novel compared to this prior work. However, after submission of the paper we became aware of the concurrent work “Density Ratio Estimation with Conditional Probability Paths” (Hanlin et al., 2025) that derives the same identity and objective. We will incorporate citation and discussion of this into our paper.
NLL calculation: The reviewer is mistaken. The fact that our NLL calculation is fast (a single forward pass of the model) is specifically due to our choice of parameterization and objective (this was the motivation for these choices). Our energy model can compute the probability of 50k images in 12s on an A100 GPU, whereas a diffusion model requires upwards of 3h20m, as it requires at least 1000 forward (+ backward) passes per image (assuming 100 time steps and 10 noise samples for computing the divergence with the Hutchinson trick, or the MSE as explained in Appendix A.2). Appendix A.2 provides a literature review of other methods of estimating energies, not a description of our method. We do not rely on any algorithm from Karczewski et al. (who do not consider EBMs).
Normalization: Normalization is computed only once post-training, by calculating the normalization constant. Test-time normalized energies are then obtained with a single forward pass, subtracting this precomputed constant.
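In a toy one-dimensional setting, this two-step recipe (estimate the constant once, subtract it at test time) looks as follows; the quartic energy and the quadrature grid are illustrative assumptions, not the paper's procedure for images:

```python
import math

# Un-normalized toy energy; training only pins E down up to an additive constant.
def E(x):
    return x ** 4 / 4.0

# One-time post-training step: estimate log Z = log ∫ exp(-E(x)) dx by quadrature.
xs = [i * 0.01 for i in range(-1000, 1001)]
Z = sum(math.exp(-E(x)) * 0.01 for x in xs)
log_Z = math.log(Z)

# Test time: a normalized log-probability costs one energy evaluation plus
# subtraction of the precomputed constant.
def log_p(x):
    return -E(x) - log_Z

# Sanity check: the resulting density integrates to (approximately) one.
total = sum(math.exp(log_p(x)) * 0.01 for x in xs)
assert abs(total - 1.0) < 1e-6
```

For images the integral is of course intractable by quadrature, which is why the constant must be estimated differently, but the single-forward-pass structure at test time is the same.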
Conservative diffusion models: The goal of the paper is not to obtain better diffusion models by constraining them to be conservative. Rather, our primary goal is to learn the energy, a byproduct of which is a conservative score model. We merely point out that due to our choice of architecture, there are virtually no differences between the original score network and the gradient of the learned energy.
Computational drawback: We do not understand the question. We mentioned the limitation of the added computational cost of the double backpropagation in the main text. To be more precise, the training time for a score model for 1M steps on ImageNet with an A100 GPU was 1d8h, while that of our energy model was 5d. We will mention this in the paper and apologize for the imprecise statement.
Approximations: As in any empirical work, there are approximations involved. We will update the text to make these more explicit. It is common practice in energy-based models to resort to approximate normalization as exact normalization is typically impossible (continuous integrals can only be calculated by discretization unless a closed form is available). The purpose of Appendix A.3 is not to prove theoretical guarantees (we say this explicitly) but to provide a sketch of what would be needed to establish them for future research.
Terminology: We apologize for the confusion. As an abuse of language, we called the gradient of the energy the score, rather than the negative score. Thus, there is no typo in eqs (2) and (3): the energy is the negative log probability, not the log probability (see equation (1)). However, we disagree with the second remark on terminology. The log probability is equal to the negative energy up to an additive constant, not to the normalized exponentiated negative energy. We used “normalized energy” to refer to the negative log probability, a term that is used for instance in the review by Song and Kingma (2021).
Thank you for the timely rebuttal. It was overall quite helpful, I have to say. There were several things unclear in the original paper, many of which were clarified, and I now have a better idea of the proposal.
The good parts:
- The clarification regarding the “purpose of time score matching” was the most helpful. And, as you realised too, it wasn't adequately explained.
- The clarification on “NLL calculation” was also quite helpful. I think you should consider re-explaining this properly in the paper. Specifically, the comparison with the “score model” style NLL computation. In your case, it is almost trivial to compute NLL because what you are modelling is pretty much an approximation of NLL itself.
- Thank you for clarifying the terminology. This was specifically confusing to me because, throughout the paper, you used both terms “normalised energy” and NLL in different places. I could not comprehend that they are the same thing (as you clarified in the rebuttal). I say more about this below.
The still not-so-good parts:
- Using the term “normalised energy”, IMO, is quite confusing, especially when you are simply referring to the NLL — why not just say NLL! Because there can be multiple ways to normalise an energy, and only one specific normalisation leads to the NLL. I can come up with an arbitrary additive constant for which the shifted energy refers to the same underlying probability and score.
- The statement that you are “learning a normalised energy model” is also confusing. You are not learning a normalised model directly. You are simply learning an un-normalised energy model and then deriving a normalisation constant. Moreover, the constant is not theoretically guaranteed to be exact — it's an approximation too. By your logic, one can claim that score-based methods also learn a “normalised energy model”, because the NLL can be derived from a learned score model too (yes, it's expensive to compute, but possible).
- Finally, it is clear that the model training takes a lot longer. Typically, in practice, such double-backpropagation is not very popular as a solution. So, what you are getting isn't really for free.
Overall, I feel this paper may be a good analytical tool. But, considering the computational disadvantage, I am not sure how much actual impact it can make. Also, the paper lacks significant quantitative evaluation on downstream applications; instead it focuses on low-level analysis.
We are happy that we could clarify some points, and thank you again for helping us improve their presentation in the paper. Your points about terminology are taken, we will clarify those in the text.
We would like to point out that our "low-level analysis" has potential for impact in itself beyond more direct downstream applications. Observations about properties of natural image densities can motivate better architectures or generative models. Besides, some applications may benefit from the large inference-time savings in computing energy/NLL even at the cost of a longer training time.
The authors propose a framework to learn a time-dependent energy function for a distribution given a set of images. This paper presents a dual score matching framework that enables direct learning of normalized energies using spatial score matching, based on the standard denoising target, and temporal score matching, a novel target, based on the gradient with respect to the noise level.
Strengths and Weaknesses
The submission is written very clearly, especially the introduction, which does a great job of motivating the problem and explaining the core idea. In my opinion, this paper is a very good fit for NeurIPS. One of the main strengths is the deep analysis of different aspects of the learned image distribution, where the authors go beyond just reporting likelihood scores. These insights are really interesting for the community.
Small weaknesses:
- The example in Figure 1 needs a bit more explanation. It's not immediately clear why single score matching fails there; maybe adding a short sentence in the main text or figure caption would help.
- In Figure 3, it would be helpful to mention directly that Figures 6 and 7 (from the supplement) show the images corresponding to the marked arrows.
Questions
- Will you publish code of your experiments?
- Could you please expand on why single score matching fails in the toy Gaussian mixture example in Figure 1? A short 1-2 sentence explanation would be sufficient.
- Can you comment on generalization to larger-scale images?
Limitations
yes
Final Justification
The authors have answered all my questions. I appreciate that the code is going to be published and the explanation of the figures. I'm raising my rating from accept to strong accept.
Formatting Issues
no formatting issues.
Thank you for the kind comments! We especially appreciate your positive evaluation of our analysis of the learned energy-based model. We answer your questions below.
Clarification of Figure 1: Single score matching fails on multimodal distributions because the objective does not constrain the absolute energy levels of separated modes. Indeed, one can add an arbitrary constant to the energy that is different in each mode without changing the score in each mode. Of course, this requires “stitching” together the different constants, which adds a term to the score, but the transition can occur in-between the modes where there is no data to constrain the scores. We hope this clarifies the issue: we will update the text to be more explicit.
Clarification of Figure 3: The purple-to-yellow-colored arrows in Figure 3 correspond to the images on the right of Figure 3 (as mentioned in the caption), not the figures in Appendix. The brown and green arrows respectively correspond to a uniform noise image in [0, 1] and a constant image of intensity 0.5 (not shown).
Code: We will publish code to reproduce all experiments, as well as trained model checkpoints.
Scaling to higher-resolution images: The main bottleneck in increasing image resolution is the memory cost of double backpropagation. This is shared by all energy models that are trained with a score-based objective. It can be alleviated with standard techniques such as sliced score matching (Song et al., 2019), which here corresponds to projecting the DSM residual on a random direction and minimizing its square. This enables the use of forward-mode auto-differentiation to compute the scores, as only inner products are required. This can be further combined with auto-differentiation software optimizations. For instance, the ICLR blog post “How to compute Hessian-vector products?” (Dagréou et al., 2024) reports up to 3x memory reduction with the use of optimized implementations (JIT compilation, etc) in the related setting of Hessian-vector products.
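The slicing idea above can be illustrated without any autodiff machinery: a directional derivative of the energy needs only inner products (approximated here with two energy evaluations), and averaging the projected gradient over random Gaussian directions recovers the full gradient, since E[vvᵀ] = I. A stdlib sketch with an illustrative quadratic energy (not the paper's model; forward-mode autodiff would replace the finite differences in practice):

```python
import math, random

random.seed(1)

# Toy quadratic energy in 3-D; its true gradient at x is just x itself.
def E(x):
    return 0.5 * sum(xi * xi for xi in x)

def directional_deriv(x, v, h=1e-4):
    # <grad E(x), v> from two energy evaluations: no full backward pass needed.
    xp = [xi + h * vi for xi, vi in zip(x, v)]
    xm = [xi - h * vi for xi, vi in zip(x, v)]
    return (E(xp) - E(xm)) / (2 * h)

x = [1.0, -2.0, 0.5]
# Averaging (grad·v) v over random Gaussian v recovers the full gradient,
# which is why projecting the residual on random directions loses nothing
# in expectation.
n = 20000
est = [0.0, 0.0, 0.0]
for _ in range(n):
    v = [random.gauss(0, 1) for _ in range(3)]
    d = directional_deriv(x, v)
    for i in range(3):
        est[i] += d * v[i] / n

for g, xi in zip(est, x):
    assert abs(g - xi) < 0.15
```

The sliced objective only ever needs the scalar projections, which is what makes forward-mode differentiation (a single jvp per sample) applicable.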
This paper presents a framework for learning normalized energy-based models directly from image data, leveraging techniques from score-based diffusion models. Particularly, a dual score matching objective is proposed to match gradients of the energy network with respect to both space and noise level. The authors further propose a novel architecture for energy estimation by computing the inner product of an input image with the output of a score network. The resulting model is evaluated on ImageNet64 dataset and achieves comparable results to other methods on NLL.
Strengths and Weaknesses
Strengths:
- The proposed dual score matching objective is interesting and looks effective for energy function learning.
- The proposed model achieves a good result on ImageNet64 in terms of NLL. In addition, extensive experiments and analyses are provided to illustrate the comprehensive ability of the proposed method.
- The overall paper is well-written and organized.
Weaknesses:
- Compared to the common diffusion objective, the proposed dual score matching requires learning two objectives. The training process might be slower. Can you provide a convergence speed comparison between the DSM-only loss and the DSM-TSM joint loss?
- In the experiments, it would be great to also apply the method to higher resolution images such as ImageNet or CelebA. In addition, can you also report FID to evaluate the sample quality?
- In Table 2, it would be better to add at least one more energy-based model (EBM). And adding a figure comparing the training stability of different EBMs would further increase the reliability of the proposed model.
Questions
- In the combined loss (equation (8)), why do you normalize the two losses by the dimensionality? Especially since there is no such factor in the original DSM loss.
- Can the proposed method be applied to conditional generative models?
Limitations
yes
Final Justification
Thank you for the detailed and thoughtful rebuttal. The authors addressed my questions clearly and provided new results and clarifications. While I acknowledge that the method is technically sound and well-presented, my main concerns remain such as limited evaluation; so I maintain my original score.
Formatting Issues
no
Thank you for your review. We have run several experiments to answer your questions, which will help us to improve the paper.
Computational complexity of the training process: Training an energy model with DSM+TSM losses converges in the same number of iterations as with the DSM loss alone. The additional time complexity of the TSM loss is negligible: real-world training time is 4d16h for DSM and 5d for DSM-TSM, for 1M steps on an A100 GPU.
Higher-resolution images: We have now applied our model to CelebA images at 80x80 resolution, and find that it achieves a NLL of 1.94 bits/dim. We are not aware of reported NLLs on CelebA nor other higher-resolution datasets (ImageNet64 has been a de facto standard), but it is reasonable that the value is lower than for ImageNet64 due to reduced diversity in the images. Additionally, we verified that the histogram of log probabilities (as in Figure 3) is well-approximated by a Gumbel distribution (apart from a few outliers on the low-probability edge, due to corrupted images). Over the 20k test images, the range from highest to lowest probabilities is 17dB/dim. This is half that of ImageNet64, which is also reasonable given the lower diversity in image content. Finally, we mention that scaling dual score matching to higher-resolution images is bottlenecked by the memory cost of double backpropagation. This can be alleviated with standard techniques such as sliced score matching (Song et al., 2019), which is beyond the scope of this work.
Comparison with other EBMs: We will add a comparison with the Routing Transformer (Roy et al., 2021), which achieves a NLL of 3.43 bits/dim (compared to our value of 3.36). But we were unable to find an energy-based model (i.e., a model which directly learns the energy) that reports NLL on ImageNet64 - all models we examined were auto-regressive, flow-based, or compute variational lower bounds, while older models consider lower-resolution datasets. We were also unsure what the reviewer meant by “training stability”. Our training procedure is quite stable, as it relies on two regression objectives. In addition, these objectives have the same minimum, so that there are no adversarial instabilities (as in e.g. GANs).
Normalization of losses with dimensionality: The dimensionality factors in equation (8) set the relative scales of the two objectives; this is equivalent to minimizing a weighted sum of the unnormalized losses with a specific relative weight.
Extension to conditional models: The proposed method is straightforward to extend to conditional generative models. Given conditioning information, the network computes a conditional energy function. The expressions of the DSM and TSM losses are unchanged, as is the energy architecture. We will mention this in the paper, and we thank the reviewer for the suggestion.
FID: The generative abilities of the score derived from our energy model are equal to those of the base score network. FID thus measures the performance of the base diffusion generative model, which is orthogonal to the focus of this work (energy-based models).
Contributions: We would like to note that the reviewer summary omits several aspects of the paper that we believe are important contributions, regarding analyzed properties of the learned density model. We show that images with highest vs lowest probability have starkly different visual features, and that the high-probability images are not the most typical images, as intuition might suggest (Figure 3). We also show that the support of some (but not all) natural images can be locally described as a low-dimensional manifold. We believe these observations to be novel and of value to the NeurIPS community (as mentioned by Reviewer hMws).
This work proposes a new method for learning normalized energy-based models. One of the weaknesses of this work, raised by reviewers, is the much longer training time, which might prevent its practical use. The authors discussed this limitation in the main paper and in the rebuttal, and suggest the method could still be useful due to its faster inference. Overall, the AC agrees with the merit of this work for the various analyses and insights provided by the proposed approach in the paper. My recommendation is acceptance. Please incorporate the feedback from the reviewers on clarity and terminology in the final copy.