Energy-Based Modelling for Discrete and Mixed Data via Heat Equations on Structured Spaces
An MCMC-free method based on discrete heat diffusion for training energy-based models on discrete and mixed data, with applications to generative modelling of tabular data.
Abstract
Reviews and Discussion
This article, "Energy-based modelling for discrete and mixed data via heat equations on structured spaces", proposes to perform the training on EBM, using the Energy Discrepancy (ED) loss, in the case where having multi-modal dataset mixing eventually continuous inputs but also discrete (categorical) ones. The work describes into details how to parametrize in different setting the inclusion of discrete variables, and they apply it to various datasets. The main contributions are the design of the continuous time Markov chain transition probability that lies at the heart of the ED approach and the application to tabular dataset for which generative approach is usually hard.
Strengths
The authors show how their method can be efficiently used on tabular datasets. In particular, they apply it to several datasets and show that, on average, the EBM trained with Energy Discrepancy using a discrete implementation of the Markov chain transition probability outperforms the competing approach of [Xu et al., 2019].
The authors also show experimental results on image modelling.
Weaknesses
The authors extend the formalism of Energy Discrepancy to the case of including discrete states in addition to continuous features. Whether or not this justifies an entire publication can be debated, although it should be emphasized that the datasets under consideration are quite original.
It might be because I'm not an expert on ED, but while traditional EBM training relies on MCMC to compute the gradient, ED does not. However, it is not clear to me whether sampling from an EBM trained in this way still needs MCMC to be generative. If so, the article should provide more details on the implementation. The authors should also check that the trained model is at equilibrium (that the generated samples correspond to the equilibrium properties of the model).
More importantly, the comparison on the tabular datasets is only done at the level of the AUC curves. Could the authors at least compare the frequencies and correlations of the generated samples against the true ones?
Questions
The authors say that their work is one of the first dealing with tabular data. I could at least find this one dealing with EBMs and tabular data: https://arxiv.org/abs/1912.09382; the authors might check whether it is relevant to their work. Also, this article https://arxiv.org/abs/2405.15376 deals with generative modelling of categorical variables with EBMs.
Limitations
The limitations are correctly discussed in the article.
Thank you for your review. We first want to comment on the weaknesses of our paper mentioned in your review:
The authors extend the formalism of Energy Discrepancy to the case of including discrete states in addition to continuous features. Whether or not this justifies an entire publication can be debated
Here is why we hope that this work is interesting to the community: Currently, discrete EBM training methods are hard to tune and don’t translate to mixed state spaces. We want to introduce a simple performant training method to address these limitations.
To do so, we introduce a graphically informed noising process which may also be of great interest in other domains, e.g. discrete diffusion modelling. The theoretical understanding of the diffusion allows automated tuning of the time parameter as a function of the state-space size (Theorem 2), which is a critical new contribution, substantially different from prior work and important for tabular data, where state-space sizes typically vary across features.
Our theoretical analysis is supported by the experimental results that show good performance, thus unlocking a new tool for density estimation on non-continuous domains.
[...] it is not clear to me whether sampling from an EBM trained in this way still needs MCMC to be generative?
To use an EBM as a generative model, one needs to solve two challenges: training the model and sampling from the model. This paper addresses only the training part. As you say, sampling methods have to be used to obtain synthetic samples from the model. We employ a fairly standard method for sampling with interleaved updates of the numerical and discrete states. The implementation can be found in the appendix (Algorithm 2).
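For illustration, one interleaved update could look like the following generic sketch. This is not the paper's Algorithm 2; `energy` and `grad_num` are placeholder callables for the mixed-state energy and its gradient with respect to the numerical block, and the step size is arbitrary.

```python
import numpy as np

def interleaved_step(energy, grad_num, x_num, x_cat, n_cats, step=1e-2, rng=None):
    """One generic interleaved update targeting exp(-energy(x_num, x_cat)):
    an (unadjusted) Langevin move on the numerical block followed by a
    single-site Gibbs update on the categorical block."""
    rng = np.random.default_rng() if rng is None else rng

    # Langevin update of the continuous coordinates
    g = grad_num(x_num, x_cat)  # gradient of the energy w.r.t. the numerical block
    x_num = x_num - step * g + np.sqrt(2.0 * step) * rng.standard_normal(x_num.shape)

    # Gibbs update of one randomly chosen categorical coordinate
    i = rng.integers(len(x_cat))
    logits = np.empty(n_cats[i])
    for k in range(n_cats[i]):
        proposal = x_cat.copy()
        proposal[i] = k
        logits[k] = -energy(x_num, proposal)
    probs = np.exp(logits - logits.max())
    x_cat = x_cat.copy()
    x_cat[i] = rng.choice(n_cats[i], p=probs / probs.sum())
    return x_num, x_cat
```

Repeating such a step many times, starting from independent noise initialisations, corresponds to the interleaved discrete/continuous updates described above.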
Importantly, tuning a sampler at inference time is much easier than tuning it for a stable training algorithm. Indeed, we were unable to implement the MCMC-based training for tabular datasets, while sampling from our model with MCMC was unproblematic.
It should be noted that EBMs are particularly useful in non-generative downstream tasks like out-of-distribution detection and calibration of classifiers because, unlike other generative models, they explicitly output the unnormalised data density. For this reason we focussed on contributing to better EBM training.
They should also check that the trained model is at equilibrium (that the generated samples correspond to the equilibrium properties of the model).
As stated in lines 78-81, ED offers the theoretical guarantee that, once training has converged, the learned energy recovers the data distribution (up to an additive constant). Hence, if we run a long-run MCMC sampler on our learned model with a Metropolis-Hastings acceptance step, the produced samples are at equilibrium by construction (the MH acceptance guarantees detailed balance, which is equivalent to the samples being at equilibrium).
The same guarantees do not exist for MCMC-based training approaches because short-run MCMC has to be used in practice, introducing biases into the optimisation, see e.g. [1]. For empirical evidence of these favourable properties, see Figures 2, 7, and 8, which contrast ED and CD results.
More importantly, the comparison on the tabular datasets is only done at the level of the AUC curves. Could the authors at least compare the frequencies and correlations of the generated samples against the true ones?
Thanks for your suggestion. We include two new metrics here: single-column density similarity and pair-wise correlation similarity. These metrics measure how well the generated table matches the true table in terms of single-column frequencies and pair-wise column correlations, respectively. They can be computed with the open-source SDMetrics API. The results show that the proposed ED approaches outperform the baselines or achieve comparable performance on most datasets.
Single-column density similarity score
| | Adult ↑ | Bank ↑ | Cardio ↑ | Churn ↑ | Mushroom ↑ | Beijing ↑ |
|---|---|---|---|---|---|---|
| CTGAN | .814 | .866 | .906 | .901 | .951 | .799 |
| TVAE | .783 | .824 | .892 | .899 | .965 | .711 |
| TabCD | .719 | .790 | .824 | .845 | .618 | .799 |
| TabED-Un | .791 | .760 | .948 | .905 | .918 | .843 |
| TabED-Grid | .751 | .766 | .945 | .846 | .951 | .951 |
| TabED-Cyc | .778 | .826 | .937 | .834 | .969 | .751 |
| TabED-Ord | .828 | .894 | .933 | .887 | .943 | .747 |
Pair-wise correlation similarity score
| | Adult ↑ | Bank ↑ | Cardio ↑ | Churn ↑ | Mushroom ↑ | Beijing ↑ |
|---|---|---|---|---|---|---|
| CTGAN | .744 | .769 | .717 | .826 | .828 | .761 |
| TVAE | .669 | .772 | .687 | .808 | .919 | .618 |
| TabCD | .522 | .600 | .629 | .710 | .428 | .761 |
| TabED-Un | .653 | .661 | .828 | .851 | .825 | .735 |
| TabED-Grid | .583 | .768 | .829 | .764 | .842 | .842 |
| TabED-Cyc | .636 | .703 | .810 | .755 | .860 | .685 |
| TabED-Ord | .702 | .796 | .811 | .791 | .826 | .662 |
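Both scores can be computed with SDMetrics. A minimal sketch, assuming its single-table `QualityReport` interface and pre-prepared `real_df`, `synth_df`, and `metadata` objects (hypothetical variable names):

```python
from sdmetrics.reports.single_table import QualityReport

# real_df / synth_df: pandas DataFrames with identical columns;
# metadata: SDMetrics single-table metadata dict describing the column types.
report = QualityReport()
report.generate(real_df, synth_df, metadata)

# The report's "Column Shapes" and "Column Pair Trends" properties roughly
# correspond to single-column density similarity and pair-wise correlation similarity.
print(report.get_properties())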
The authors said that their work is one of the first dealing with tabular data, I at least find this one dealing with EBM and tabular dataset: [...] Also, this article https://arxiv.org/abs/2405.15376 deals with generative and categorical variables for EBM.
Thank you for these interesting references! We missed the first reference in our literature review. Unfortunately, we are using different benchmark datasets, so a direct comparison is difficult. Furthermore, RBM architectures tend to be more constrained than deep NN architectures, which has the advantage of stabilised training at the cost of flexibility. This means that ED and the referenced work have slightly different advantages and drawbacks.
The second article is very interesting. It is concurrent work to ours and was published after the NeurIPS deadline, so we were not aware of it.
[1] Nijkamp et al. Learning non-convergent non-persistent short-run MCMC toward energy-based model
I thank the author for the clarifications.
The only thing that remains unclear to me is how the authors ensure the convergence of the MCMC chains during sampling. Many systems, in particular EBMs, can have metastable states or long mixing times.
I understand that the authors use Algorithm 2 for sampling, but I am wondering how they ensure that the samples correspond to the learned distribution.
Thank you for this question. In this work, we did not employ quantitative MCMC diagnostics to assess the convergence of Markov chains as we were mostly concerned with learning the data distribution. We will work on additional MCMC diagnostics for the revision of our work.
I'm wondering how the authors ensure that the samples correspond to the learned distribution.
Some energy-based models are trained with short-run MCMC, i.e. the samples during training are not at equilibrium. As a consequence, one also needs to employ non-convergent short-run MCMC at inference time to produce good samples [1], and in these settings one may be concerned about a mismatch between model and samples.
Our setting is different: Since we do not require MCMC at training time, the estimated energies are generally more accurate representations of the data distribution (with theoretical guarantees). In principle, running MCMC for longer will also produce more accurate samples from the model. For improved mode coverage, each sample is initialised from an independent noise sample and the Markov chain is simulated until we observe high-quality samples. We then evaluate the quality of the produced samples indirectly in various metrics:
- For tabular data, we generate synthetic data with 200 rounds of interleaved discrete/continuous updates and assess the sample quality indirectly by showing that a random forest classifier trained on our synthetic samples and one trained on the actual training data display comparable performance (Table 1). A minimal sketch of this protocol is given after this list.
- For samples generated from lower-dimensional discrete densities, we compare the training data with the MCMC samples via the maximum mean discrepancy and use the MCMC samples to compute the negative log-likelihood of the data under the model (Tables 3 and 4).
- The negative log-likelihood for discrete image data is computed with 300,000 steps of annealed importance sampling (AIS), which helps with the convergence of the MCMC chains. This is the same setup as in earlier work [2, 3], where it was noted that AIS for discrete image data converges after 30,000 steps. According to this heuristic, we assume that the sampler has converged in our case as well. Similarly, we use the same number of sampling steps for the experiments on Ising models as [2, 3].
- In all experiments, we assume that a non-convergent sampler would produce sub-optimal results due to the assumed accuracy of the learned energy. In addition, the samples (and in some cases the learned distributions) can be visually assessed (Figures 2, 3, 10, 12), demonstrating good agreement between data, learned distribution, and MCMC samples. While we cannot rule out metastable samples that match the data distribution better than the model, we consider this unlikely across multiple baseline datasets, performance metrics, and dimensionalities.
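As a minimal sketch of the train-on-synthetic, test-on-real protocol used in the first bullet (generic scikit-learn code; the exact preprocessing and hyperparameters in the paper may differ):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def tstr_auc(X_synth, y_synth, X_real_test, y_real_test):
    # Train a classifier on synthetic samples and evaluate it on held-out real data;
    # an AUC comparable to a classifier trained on real data indicates faithful synthesis.
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_synth, y_synth)
    return roc_auc_score(y_real_test, clf.predict_proba(X_real_test)[:, 1])
```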
We will work on providing additional, more rigorous MCMC diagnostics for the revision of our work. We agree that this may be important when applying our method to generative downstream tasks. Please let us know if we left your question unanswered.
[1] Nijkamp et al. On the Anatomy of MCMC-Based Maximum Likelihood Learning of Energy-Based Models, AAAI 2020
[2] Zhang et al. A Langevin-like Sampler for Discrete Distributions, ICML 2022
[3] Grathwohl et al. Oops I Took a Gradient: Scalable Sampling for Discrete Distributions, ICML 2021
This paper extends the Energy Discrepancy framework introduced by Schroeder et al. to the setting of discrete data. To do this, the authors first describe ways to perturb discrete data by modelling the perturbation process as a CTMC. They describe suitable perturbation choices for different types of discrete data (e.g. binary, categorical, ordinal, cyclical) and discuss considerations for the time parameter of the CTMC. They then propose an approach that performs a Monte Carlo estimate of the contrastive potential needed for the Energy Discrepancy loss and compare their method to existing methods for training discrete EBMs.
Strengths
Originality: Energy discrepancy is a relatively new approach. While the original paper proposed some extensions to discrete data, this paper extends energy discrepancy to discrete data in much more depth and includes new mathematical and experimental analyses.
Clarity: Overall, the paper is well written.
Quality: I believe the paper is technically sound.
Significance: The authors show on toy examples that their method outcompetes contrastive divergence. The authors appear to generally outperform two methods proposed in 2019 along with contrastive divergence. While I have some minor concerns about these baselines, outperforming these baselines is at least demonstrating some empirical benefit of this approach.
Weaknesses
Clarity: The clarity can be improved a bit (see my questions below).
Significance: Despite demonstrating that the method can work empirically, I have some concerns with the overall significance. It seems that while the method works well on toy examples, the results are less impressive on real-world image modeling tasks. I am unfamiliar with the field of tabular data modeling and therefore cannot properly assess the significance of the results. Beyond contrastive divergence, the main baseline is a method from 2019 with 2,000 citations. Are there better baselines to compare against among those 2,000 citing papers?
Questions
In Section 3, on lines 93-95, the authors describe two key criteria for defining useful perturbation processes. The first criterion is described as "the negative samples obtained through q are informative for training the EBM when only finite amounts of data are available." I struggled to understand precisely what this statement means. What are examples of processes that are more and less informative?
I am confused about the connections to the heat equation, which is likely due to my own lack of understanding but may also indicate that the clarity could be better. My understanding is that we need to define a process that perturbs our data. Such processes have been described in the ML literature, which the authors cite, and can be solved through the matrix exponential. While this matrix exponential may normally be hard to solve, since the noise process is applied independently across the dimensions, it should scale with O(S^3). It was unclear why small timesteps were introduced and why Euler integration was needed. Was the point that for some problems S^3 is too big, and so for these problems we restrict ourselves to small timesteps in order to avoid computing the matrix exponential? Overall, I was left confused about why we need to talk about heat equations at all and why we don't just describe this as a CTMC with initial conditions. Is there something that the heat equation view is really buying us?
Lines 176-181 describe the subsection "Localization to random grid". Related to my comment above about the lack of clarity regarding when the negative samples are "informative for training", the authors say that adding uniform noise to categorical variables "introduce too much noise to inform the EBM about the correlations in the data set." Can the authors make this statement more precise in the text? I think I can intuitively see that if you make random uniform perturbations at each dimension then you are sampling from a uniform distribution, and this will be uninformative in some sense. However, I think this notion needs to be explicitly connected to the optimization objective / loss functional in order to make this clear. Furthermore, it is not clear why this isn't solved by taking smaller timesteps so that only a few dimensions will, on average, be changed. Can the authors please clarify this?
Overall, I am confused about the choice of time parameter, and I think this needs to be better written in the manuscript. Section 3 establishes that as t goes to infinity, "the energy discrepancy loss converges to maximum likelihood estimation." The authors describe why maximum likelihood may have statistical advantages for finite data sets but then immediately move on in Section 3.1 to small timescales. This seemed like an odd transition and immediately made me ask "why not just use large times since this is maximum likelihood?" I suspect the answer lies in Section 4.2, where it becomes apparent that the contrastive potential must be estimated with Monte Carlo sampling. It seems that larger timescales induce higher-entropy distributions that would require more MC samples to approximate the expectation on line 192?
For the related works, I felt the "Contrastive loss functions" paragraph needs more discussion. Energy discrepancy seems very closely related, if not identical, to contrastive loss functions for training energy-based models. Can the authors please provide a more thorough comparison of these different methods?
Similarly, I did not see a discussion of pseudolikelihood loss functions. For small timesteps, the loss function seems very closely related to pseudolikelihood estimation, and when MC sampling is used in this method to approximate an otherwise intractable integral, it seems that the MC sampling can be seen as an MC approximation to pseudolikelihood?
The authors make a point of saying that ED is much more efficient than CD and point to timing experiments in the appendix. However, it appears that the authors only report timing for M=4 samples, while M=32 is used in the paper. If I extrapolate and multiply the ED time by 8 (since 32 = 8*4), then ED is more expensive than all of these methods. Can the authors please clarify this? I suggest changing Table 6 to M=32 if that is what is used in the paper.
The biggest experimental win seems to come from the tabular datasets. I am not very familiar with this area, so I have a limited ability to evaluate the significance here. While the results seem reasonable, I have two questions: 1) since the baseline methods were published in 2019, are there more sensible baselines to compare with? I again emphasize that I am not requiring that the authors' method is SOTA – it is okay if other methods beat their method. 2) Since these methods handle mixed continuous and discrete data, can the authors do a separate benchmark that only models the discrete columns? I think it would be helpful to tease apart whether the strength of this method lies in modeling mixed continuous-discrete data or also in purely discrete data.
I was confused by the statement that the method is sensitive to the assumption that the data distribution is positive on the whole space. Why is this more of an issue for ED than for other EBM training techniques? Intuitively, it seems that you can always avoid the assumption that the data distribution is positive by just assuming that the energy in these regions is so high that the probability of sampling them is vanishingly low. Either way, can the authors point me to where this assumption of a low-dimensional manifold is investigated in the paper / SI?
Limitations
I would like more honest discussion throughout the paper of the relative strengths and drawbacks of the method. For example, while it is true that this method is "MCMC-free", it still requires Monte Carlo estimation, and the relative tradeoffs here are not adequately discussed. I would also like to see more comparisons of CD with large numbers of MCMC steps vs. ED with large numbers of samples.
[...]For small timesteps, the loss function seems very closely related to pseudolikelihood estimation [...]
Thank you for raising this point. It seems that pseudo-likelihood can be derived from energy discrepancy as follows: define a perturbation that masks exactly one entry of the data vector; for a sampled perturbation, the energy discrepancy term then reduces to the conditional likelihood of the masked entry given the remaining ones.
Hence, this specific ED loss function is indeed an MC approximation of pseudo-likelihood. Energy discrepancy is appealing because it is more general and more tunable through the choice of the perturbation and the time parameter. We will discuss this connection in our revised version.
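For reference, here is the pseudo-likelihood objective and the single-coordinate Monte Carlo view in generic notation (standard material, not necessarily the paper's notation):

```latex
\log \operatorname{PL}_\theta(x) \;=\; \sum_{i=1}^{d} \log p_\theta\!\left(x_i \mid x_{-i}\right),
\qquad
p_\theta\!\left(x_i \mid x_{-i}\right) \;=\;
\frac{\exp\!\bigl(-E_\theta(x)\bigr)}{\sum_{x_i'} \exp\!\bigl(-E_\theta(x_i', x_{-i})\bigr)} .
```

Sampling the masked coordinate i uniformly and evaluating only the corresponding conditional term gives an unbiased one-sample estimate of (1/d) times the log pseudo-likelihood, which is the Monte Carlo approximation referred to above.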
Since these methods have mixed continuous and discrete data can the authors do a separate benchmark that only models the discrete columns?
The performance on continuous data is compared in prior work [9]. In this work, we benchmark ED on various discrete data sets. For a separate benchmark on discrete/continuous variables in tabular data we are missing baselines that perform the same experiment, and an implementation of comparative experiments was not possible in the given time frame.
I was confused by the statement that method is sensitive to the assumption that the data distribution is positive on the whole space. Why is this more of an issue for ED than other EBM training techniques? [...]
Other training techniques for EBMs are adaptive, i.e. the negative samples are created by attempting to self-sample from the model. If the sampler is well-tuned, this produces smoothed estimates of the data distribution (see e.g. Figures 2 and 7 for the smoothing effect compared to ED). Thus, these methods produce biased estimates, which can be advantageous when training on image data, where the data support is small compared to the ambient space. ED produces more accurate estimates but tends to oversaturate quickly when the diffusion produces uninformative negative samples.
I would also like to see more comparisons of CD with large numbers of MCMC steps v.s. ED with large numbers of samples.
- Table 2 reports results for CD with 40 MCMC steps. CD with larger numbers of MCMC steps is rarely used due to the cost of training.
- Similarly, ED with a large number of samples would lose the key advantage of ED of being cheaper to compute than CD or than MLE with an MC-approximated normalisation constant.
A major difficulty with increasing the number of MCMC steps in CD is that the sampler likely needs to be retuned. For all these reasons, we chose to compare CD with a typical number of MCMC steps (e.g. 40 steps) and ED with M = 32, i.e. maximising the parallelisation capabilities of our GPU.
Finally, here are the references mentioned in the rebuttal and in the following comments.
[1] Du et al. Compositional Visual Generation with Energy Based Models, NeurIPS 2020
[2] Glaser et al. Maximum Likelihood Learning of Unnormalized Models for Simulation-Based Inference
[3] Grathwohl et al. Your Classifier is Secretly an Energy Based Model and You Should Treat it Like One, ICLR 2020
[4] Grathwohl et al. Oops I Took a Gradient: Scalable Sampling for Discrete Distributions, ICML 2021
[5] Xu et al. Modeling Tabular Data using Conditional GAN, NeurIPS 2019
[6] Kotelnikov et al. TabDDPM: Modelling Tabular Data with Diffusion Models, ICML 2023
[7] Campbell et al. A Continuous Time Framework for Discrete Denoising Models, NeurIPS 2022
[8] Lou et al. Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution, ICML 2024
[9] Schroeder et al. Energy Discrepancies: A Score independent Loss for Energy-Based Models, NeurIPS 2023
[10] Lyu, S. Unifying Non-Maximum Likelihood Learning Objectives with Minimum KL Contraction, NeurIPS 2011
[11] Gao et al. Learning Energy-Based Models by Diffusion Recovery Likelihood, ICLR 2021
[12] Luo et al. Training Energy-Based Models with Diffusion Contrastive Divergences
[13] Lee et al. Guiding Energy-Based Models via Contrastive Latent Variables, ICLR 2023
[14] Foster et al. A Unified Stochastic Gradient Approach to Designing Bayesian-Optimal Experiments, AISTATS 2020
[15] van den Oord et al. Representation Learning with Contrastive Predictive Coding
[16] Le Cun et al. A Tutorial on Energy-Based Learning
Thank you to the authors for their thorough reply. I have read through it and the replies answer my questions and satisfies my main critiques. At this time I do not have any further questions.
Thank you for your review. Let us know if any further questions about our responses come up.
Thank you for your thorough review and apologies that we could not fit all responses in the rebuttal.
We now want to respond to the questions left out in the rebuttal. We follow the chronological order of your review and the rebuttal.
It was unclear why small timesteps were introduced and why Euler integration was needed.
We don't use the Euler discretisation in practice, and our method is simulation-free. For the uniform perturbation, the exact solution is available in closed form (see below) and is equivalent to equation (7) via a time rescaling. We are going to replace equation (7) with the closed-form solution.
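For reference, in a generic convention where the rate matrix is the negative Laplacian of the complete graph on S states, the closed form is the standard one below (the paper's equation (7) and its time scaling may differ):

```latex
q_t(y \mid x) \;=\; \bigl[e^{-tL}\bigr]_{xy}
\;=\; e^{-St}\,\delta_{xy} \;+\; \frac{1 - e^{-St}}{S},
\qquad L = S\,I - \mathbf{1}\mathbf{1}^{\top}.
```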
[...] since the noise process is applied independently across the dimensions, it should scale with O(S^3). [...] Was the point that for some problems S^3 is too big [...]?
Thank you for this comment! We missed that one could diagonalise any graph Laplacian in O(S^3) time before running the training, which extends the generality of our suggested method significantly. We will consider this for the revision of this work.
[...] the authors say that adding uniform noise [...] “introduce too much noise to inform the EBM about the correlations in the data set.” Can the authors make this statement more precise in the text? [...] it is not clear why this isn’t solved by taking smaller timesteps [...]
The intuition is the same as in our earlier response on informative perturbations. We will revise the statement in the paper. The grid perturbation always changes exactly one entry of the data vector, which is the main difference from the uniform perturbation with a small time step. We observed empirically that this is sometimes beneficial, e.g. for the binary image data the grid perturbation outperforms the heat perturbation (a Bernoulli distribution) with a small time step. For tabular data, one can also condition the negative samples on the column that was perturbed, so that the parameter update is only informed by the change of a single column. We have not quantified the benefit of this analytically.
For the related works, I felt the “Contrastive loss functions” paragraph needs more discussion. [...]
We will revise our discussion of related contrastive loss functions. Without giving proofs, we want to give a brief summary of the connections we see:
The following papers are based on the same functional as energy discrepancy:
- ED is equivalent to KL contractions [10]. However, [10] does not turn the KL contraction into a training loss, and does not explore the influence of the perturbing distribution.
- ED for Gaussian perturbation is equivalent to the diffusion recovery likelihood functional [11]. However, [11] uses a large number of noise scales, an MCMC scheme, no stabilisation, and only one negative sample to produce a training algorithm, so methodologically it sits between ED and CD training approaches.
- ED is equivalent to the concurrent work Noise Contrastive Divergence [12]. However, [12] approximates the loss with a variant of score matching, thus turning it into a different loss in practice.
- ED is strongly related to the concurrent work Contrastive Latent Energy Learning [13]. Again, the loss function implemented in practice by CLEL is very different as it only considers a single negative sample obtained from a latent code.
- If the perturbation satisfies the detailed balance relation, ED becomes equivalent to the contrastive divergence loss [9].
- As remarked by your review, pseudo-likelihood methods can be seen as a special case of ED.
The following works discuss a similar training objective as ED:
- The stabilised ED loss is equivalent to the prior contrastive estimation (PCE) bound [14] because both losses are based on approximations of KL divergences. However, PCE is not used to learn a probabilistic model but to find an optimal Bayesian experimental design, a very different learning task.
- Similarly, the stabilised ED is structurally similar to InfoNCE [15], a loss used in representation learning.
- The stabilised ED is related to the log loss [16]. However, the log-loss was introduced without connection to learning the EBM as a probabilistic model, introducing meaningful ways to obtain contrastive samples, or providing theoretical guarantees, and we could not find experimental results for this loss.
It should be noted that none of these references apply to discrete or mixed data like our work.
Thank you for your review and your questions! Due to lack of space we focus on questions of most interest for all reviewers. Citations and responses to the rest can be found in the comment.
Significance: [...] Are there better baselines to compare against [...]?
EBMs are appealing because they output the unnormalised data density explicitly, and allow more flexible modelling than e.g. normalising flows. This unlocks a unique set of downstream applications, see [1, 2, 3].
Given our interest in using EBMs as a tool in downstream tasks, we are not trying to claim SOTA compared to all generative models. For binary discrete image data, our method is competitive with other EBM training approaches like [4]. For tabular data we compare with [5] due to lack of EBM baselines. Prompted by your comment we found that a tabular DDPM gives a stronger baseline [6] and we will add this reference to our table.
Questions
[...] What are examples of processes [q] that are more and less informative?
In principle, the negative samples can be drawn from an arbitrary noise distribution. If, for example, the perturbation q(y|x) = q(y) is independent of the data, the contrastive term reduces to the log-normalisation constant up to an additive constant,
which is an example of a non-informative distribution. In this case, ED becomes the log-likelihood in theory, but the normalisation constant needs to be approximated from thousands of samples to be low-variance.
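Schematically, writing the contrastive term as a sum over candidate states x' weighted by the perturbation (generic notation, not necessarily the paper's), the data-independent case gives:

```latex
\log \sum_{x'} q(y \mid x')\, e^{-E_\theta(x')}
\;=\; \log q(y) \;+\; \log \underbrace{\sum_{x'} e^{-E_\theta(x')}}_{Z_\theta}
\qquad \text{when } q(y \mid x') = q(y),
```

so only the θ-independent shift log q(y) distinguishes it from the log-normalisation constant.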
An informative noising distribution achieves two things.
- The contrastive term can be computed from a small number of samples with lower variance than the full normalisation constant because the integrand is localised around the data point.
- Similar to GAN training, the model converges prematurely if the negative samples are trivial to distinguish from the data.
I am confused about the connections to the heat equation [...] It was unclear why small timesteps were introduced and why Euler integration was needed.
We don't use the Euler discretisation in practice, and our method is simulation-free, i.e. for any time t we can sample exactly from the transition density using the closed-form solution in Proposition 1. For the uniform perturbation, the exact solution is equivalent to equation (7) via a time rescaling (see comments). We are going to replace equation (7) with the closed-form solution.
[...] Is there something that the heat equation view is really buying us? [...] “why not just use large times since this is maximum likelihood?”
Any CTMC could be used to define a noising process, but this leaves us with many choices for the rate function. In previous work using CTMCs [7, 8], the structure of the discrete space is ignored and only the uniform and absorbing diffusions are discussed.
We choose the rate function as the graph Laplacian, which defines the smallest possible transitions in the graph. For small t, the corrupted data points are then with highest probability the nearest neighbours of the data point in the graph.
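In the generic convention L = D − A for the graph Laplacian (the paper's sign and normalisation may differ), the first-order expansion makes this explicit:

```latex
q_t \;=\; e^{-tL} \;=\; I - tL + \mathcal{O}(t^2)
\quad\Longrightarrow\quad
q_t(y \mid x) \;=\; t\,A_{xy} + \mathcal{O}(t^2) \quad (y \neq x),
```

so to first order, probability mass only leaks to graph neighbours of the data point.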
However, a small corruption may miss global distribution properties like mixture weights [9]. Theorem 1 tells us that for large t, ED performs MLE, which captures these global properties at the cost of higher variance of the estimated loss. The heat equation allows us to trade off the variance against the better statistical properties of the loss.
To support our intuition empirically, we include new experimental results in the attached PDF (Figure 1), showing that a large time parameter t requires increasing the number of negative samples M to achieve accurate energy estimates.
The authors make a point of saying that ED is much more efficient that CD and point to timing experiments in the appendix. However, it appears that authors are only reporting timing for M=4 samples when in the paper M=32 is used. [...]
M = 4 is often sufficient to obtain good results, see Table 8. For the main text of this paper we used M = 32 to maximise our GPU usage. In fact, ED can be computed in parallel, while CD requires sequential sampling algorithms that cannot be parallelised. We provide the training time for M = 32 below:
| | CD-1 | CD-5 | CD-10 | ED-Bern (M=32) | ED-Grid (M=32) |
|---|---|---|---|---|---|
| Per Epoch (s) | 29.1660 | 95.2178 | 167.5718 | 46.4317 | 44.0621 |
This shows that ED with M = 32 is still faster than CD-5. We will change the corresponding table in the revision.
[...] 1) since the baseline methods were published in 2019 are there more sensible baselines to compare with? [...] 2)
As mentioned earlier, there is a better baseline that uses a diffusion model [6], which outperforms our approach on the tabular data benchmark. We provide the experimental results for a detailed comparison:
| | Adult ↑ | Bank ↑ | Cardio ↑ | Churn ↑ | Mushroom ↑ | Beijing ↓ |
|---|---|---|---|---|---|---|
| CTGAN | .861 | .774 | .787 | .792 | .781 | 1.01 |
| TabDDPM | .910 | .924 | .802 | .793 | 1.0 | .570 |
| TabED (Ours) | .853 | .845 | .796 | .814 | .985 | .978 |
TabED refers to the best results from our proposed methods. We remark that, while the DDPM outperforms the GAN and EBM, we are interested in tasks that cannot be addressed by diffusion models.
I would like more honest discussion throughout the paper of the relative strength and drawbacks of their method. [...]
We will revise the paper keeping your comment in mind. In brief, the biggest strengths of ED are improved theoretical guarantees over CD and robust behaviour when tuning an MCMC sampler is difficult (e.g. tabular data). The biggest drawback of ED is currently its scalability to data of higher dimension than MNIST.
The paper introduces a novel method for training energy-based models (EBMs) on discrete and mixed data using heat equations on structured spaces. This method employs the Energy Discrepancy (ED) loss function, which eliminates the need for Markov chain Monte Carlo (MCMC) by using graph-structured perturbations. The proposed approach is evaluated on several applications, including discrete density estimation, synthetic data generation, and calibrated classification, demonstrating significant improvements in training efficiency and model performance.
Strengths
The paper successfully extends the Energy Discrepancy method to discrete and mixed data with solid theoretical analysis, addressing the challenge of robust and fast sampling in these spaces. The designed experiments demonstrate the method's ability to accurately capture complex data structures and generate high-quality synthetic data, highlighting its practical applicability.
Weaknesses
- Despite the method's solid contributions and experimental design, the motivations behind each step and their presentation are not very clear, making it hard to follow. For instance, in Section 3.1, the paper discusses different structured and unstructured categorical values, introducing the four types {cyc, ord, unif, abs}. However, it is not clear why these specific structures are chosen. Are they meant to cover all categorical values comprehensively, or are they the most common in tabular data? Providing a clearer rationale would help readers understand the choices made.
- The scalability of the proposed method in such scenarios is a significant concern. An analysis or discussion of how the method handles large categorical values would be beneficial. This could include potential modifications or considerations to ensure that the method remains efficient and practical when applied to datasets with large categorical values. What's more, I strongly recommend moving the algorithms from the appendix into the main body of the paper. This would make the paper easier to follow and more accessible to readers who need to understand the detailed workings of the method.
Questions
See weakness
Limitations
See weakness
Thank you for your review and your helpful comments.
Despite the method's solid contributions and experimental design, the motivations behind each step and their presentations are not very clear, making it hard to follow. For instance, in Section 3.1, the paper discusses different structured and unstructured categorical values, introducing the four types {cyc, ord, unif, abs}. However, it is not clear why these specific structures are chosen. Are they meant to cover all categorical values comprehensively, or are they the most common in tabular data? Providing a clearer rationale would help readers understand the choices made.
Thank you for letting us know that the presentation is not clear on this point; we would like to provide additional background. To create the energy discrepancy loss, we have to introduce a noising process that produces contrastive negative samples from which the model can learn. In principle, the noising processes {unif, abs} already cover all categorical values, and this is the approach taken for discrete diffusion models, see e.g.
[1] Campbell et al. A Continuous Time Framework for Discrete Denoising Models, NeurIPS 2022
[2] Kotelnikov et al. TabDDPM: Modelling Tabular Data with Diffusion Models, ICML 2023
[3] Lou et al. Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution, ICML 2024
In this work, we observe that many discrete types of data have additional structure such as seasonality/periodicity (e.g. month of year) or ordering/grading (e.g. age, RGB pixel values, salary) that gives us more fine-grained control over the noising process than the diffusions in [1, 2, 3]. This generalises previous diffusions while remaining simulation-free, i.e. we can compute and sample from the noising distribution in closed form for any time t.
The intuition for respecting the structure of the underlying discrete space is that we seek negative samples that are close to the data support to provide a good training signal (This is similar to the discriminator training in GANs).
If the discrete space of interest has a more complicated graph structure, the process either needs to be simulated from an eigendecomposition of the graph Laplacian or one can resort to the simpler uniform or absorbing perturbation. Since none of our benchmark datasets required an involved graphical structure, we stick to the more common ordinal or cyclical structure, which generalises previous noising processes [1, 2, 3].
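As an illustration of the ordinal vs. cyclical structures, here is a generic sketch of the corresponding graph Laplacians and heat kernels (the paper may normalise the Laplacian or rescale time differently):

```python
import numpy as np
from scipy.linalg import expm

def laplacian(S, cyclic=False):
    # Path graph for ordinal categories (0 < 1 < ... < S-1); adding the edge
    # (S-1, 0) turns it into a cycle graph for cyclical categories (e.g. months).
    A = np.zeros((S, S))
    idx = np.arange(S - 1)
    A[idx, idx + 1] = 1.0
    A[idx + 1, idx] = 1.0
    if cyclic:
        A[0, S - 1] = 1.0
        A[S - 1, 0] = 1.0
    return np.diag(A.sum(axis=1)) - A

def heat_kernel(S, t, cyclic=False):
    # q_t(y | x) = [exp(-t L)]_{x, y}: row x is the perturbation distribution of state x.
    return expm(-t * laplacian(S, cyclic))

print(heat_kernel(12, t=0.3, cyclic=True)[0].round(3))  # noise distribution of "January"
```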
The scalability of the proposed method in such scenarios is a significant concern. An analysis or discussion on how the method handles large categorical values would be beneficial. This could include potential modifications or considerations to ensure that the method remains efficient and practical when applied to datasets with large categorical variables.
Thank you for your suggestions. Could you elaborate what you mean by large categorical values in case we misunderstand your question?
Scalability is indeed a valid concern for energy-based models in general. Most scalable continuous EBMs, such as [4], which trains EBMs on ImageNet 32x32, have other problems, e.g. over-smoothing the estimated densities due to missing theoretical guarantees, and are limited to the continuous domain. This limits their applicability in some key downstream tasks like reliable out-of-distribution detection.
For discrete data sets, alternative training methods for EBMs produce comparable results to our method. Compared to MCMC based training methods like contrastive divergence [5], the energy discrepancy approach requires less tuning and runs faster since the loss can be computed in a single parallel pass through the network and does not rely on sequential steps. Furthermore, these works have yet to be extended to mixed state spaces, and many tabular data sets of interest have a comparably small number of features.
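To illustrate the "single parallel pass" point, here is a schematic PyTorch sketch of a stabilised, sample-based contrastive loss of this general family; the exact form, stabilisation, and weighting of the paper's ED loss may differ, and `energy` is a placeholder for any batched energy network.

```python
import math
import torch

def contrastive_loss_sketch(energy, x, neg, w=1.0):
    # Generic sketch of a stabilised, sample-based contrastive loss in the
    # energy-discrepancy family (the paper's exact stabilisation may differ).
    # x: (B, D) data batch; neg: (B, M, D) negative samples from the perturbation.
    B, M, D = neg.shape
    e_pos = energy(x)                                    # (B,)
    e_neg = energy(neg.reshape(B * M, D)).view(B, M)     # one parallel pass over all negatives
    diff = e_pos.unsqueeze(1) - e_neg                    # E(x) - E(x'_m), shape (B, M)
    # log( w/M + (1/M) * sum_m exp(diff_m) ), computed stably via logsumexp
    stab = torch.full((B, 1), math.log(w), device=diff.device, dtype=diff.dtype)
    loss = torch.logsumexp(torch.cat([diff, stab], dim=1), dim=1) - math.log(M)
    return loss.mean()
```

All energy evaluations are batched, so no sequential Markov chain steps are needed during training.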
We are also interested in better scalability in future work. However, this will likely require combining EBMs with other generative modelling approaches, such as the variational decoder model of [6], for which energy discrepancy training has already been established. Secondly, the energy function may be learned with the energy discrepancy loss at various noise scales as in [7], thus improving training stability and learning a sampler alongside the EBM. Finally, EBMs may be an interesting way to modify a pre-trained base model as in [8]. While ED can likely be used to modify [6, 7, 8], this is not the main contribution of our work.
[4] Du et al. Implicit Generation and Generalization in Energy-Based Models, NeurIPS 2019
[5] Grathwohl et al. Oops I Took a Gradient: Scalable Sampling for Discrete Distributions, ICML 2021
[6] Pang et al. Learning Latent Space Energy-Based Prior Model, NeurIPS 2020
[7] Gao et al. Learning Energy-Based Models by Diffusion Recovery Likelihood, ICLR 2021
[8] Deng et al. Residual Energy-Based Models for Text Generation, ICLR 2020
What’s more, I strongly recommend moving these algorithms from the appendix into the main body of the paper. This would make the paper easier to follow and more accessible to readers who need to understand the detailed workings of the method.
Thank you for this suggestion. We will include the algorithms in the main text for the revision of this paper.
I appreciate the authors' efforts in rebuttal, and I decided to keep my score.
Thank you for your review and feedback for our work!
The paper proposes a suite of methods for training energy-based models for discrete and mixed data using the Energy Discrepancy loss, a recently proposed method for training EBMs. Compared to contrastive divergence, it does not require MCMC sampling from the model distribution during training, improving training efficiency. This is done by simulating a diffusion process on the discrete states and effectively using those noisier samples as the negative samples. The paper introduces a connection between the new method and maximum likelihood estimation, showing that energy discrepancy as applied to discrete state spaces can converge to the negative log-likelihood. In experiments, the new method behaves favourably compared to contrastive divergence-based methods on synthetic data sets, performs on average better than baselines on real-world tabular data sets, and performs comparably to many competing methods on discrete image modelling. An application of the trained EBM to classification, improving uncertainty quantification compared to a direct classifier, is also shown.
Strengths
- The paper proposes a relevant extension to recently published work. Especially Theorem 1 does not seem obvious, and the paper may open up the use of the Energy Discrepancy loss to a much wider variety of use-cases.
- The method is also quite simple, and seems simple to implement.
- The paper makes connections to recent work on discrete diffusion models and proposes a variety of methods to estimate the energy discrepancy loss.
- The results are good compared to standard contrastive divergence based methods
- The paper is well written, and I found it easy enough to understand even without prior knowledge on the Energy Discrepancy method.
Weaknesses
- As noted in the limitations, the application to data such as images seems to be challenging as the noisy negative samples may not give very useful training signal in this case.
- Although the energy discrepancy method has already been proposed and published in previous work, I found the justification for the method slightly confusing while reading this paper. What is Theorem 1 exactly saying? (see questions) The loss is also, in practice, approximated with something slightly different from the proposed loss, which seems conceptually a bit confusing. However, this is not a major concern given that the base method has been proposed and published in previous work.
Questions
- How should I interpret the left side of the equation in Theorem 1, and the fact that the right side approaches zero with large enough t? How does this link ED to maximum likelihood, exactly?
- What is Avg. Rank in Table 1?
Overall, the paper seems like a solid contribution in advancing the training of this branch of energy-based generative models. However, I was not aware of ED before reading this paper, and am not very up-to-date on the most recent work on energy-based models. As such, I give tentatively a weak accept.
Limitations
Addressed adequately.
Thank you for your helpful comments and questions.
Although the energy discrepancy method has already been proposed and published in previous work, I found the justification for the method slightly confusing while reading this paper. What is Theorem 1 exactly saying? (see questions)
How should I interpret the left side of the equation in Theorem 1, and the fact that the right side approaches zero with large enough t? How does this link ED to maximum likelihood, exactly?
We want to address both your question and the weaknesses mentioned in your review. Theorem 1 is meant to guide the choice of the hyperparameter t. Another way of stating Theorem 1 is that, up to an additive constant z_t, the energy discrepancy loss converges to the negative log-likelihood at an exponential rate as t grows.
Here, z_t is a constant independent of θ, so the optimisation landscapes of energy discrepancy estimation and maximum likelihood estimation in θ align at an exponential rate, except for a shift which does not affect the optimisation.
Practically speaking, this property implies that large time parameters yield losses with favourable statistical properties as the MLE has the lowest variance among all unbiased estimators. The most striking example of this is global mode awareness, i.e. as shown in prior work [1], energy discrepancy losses are capable of estimating the mixture weight distributions in multi-modal data sets while being more tractable than maximum likelihood estimation. However, this comes at the cost of higher variance of the loss function due to the sample approximation used to compute the loss function in practice.
Since arbitrarily large t are impractical, the time parameter needs to be tuned. For this reason we study the opposite limit t → 0 in the following sections. Similar to the late stages of GAN training, our goal is that negative samples obtained from the noising process remain close to the data support to provide a good training signal. The infinitesimal analysis relates the properties of ED for small t to the geometry of the underlying discrete space. The perturbed samples produced are then with highest probability the nearest neighbours of the data point in the graph. This makes it possible to get fine-grained control over the properties of the ED loss with a single time parameter.
While these results guide our understanding of the influence and tuning of the time parameter in the algorithm, Theorem 1 is not needed to justify ED as a loss function. Instead, the justification was given in prior work and is reiterated in lines 78-81: one can show that, for a sufficiently rich family of neural networks and arbitrary amounts of data, the energy learned at convergence recovers the data distribution up to an additive constant, and this holds for all time parameters t.
The loss also is, in practice, approximated with something slightly different than the proposed loss, which seems conceptually a bit confusing. However, this is not a major concern given that the base method has been proposed and published in previous work.
The loss used in practice can be justified by observing that, for any fixed choice of the hyperparameters, the approximated loss converges to the exact energy discrepancy as the number of negative samples M grows.
Thus, the estimated loss function is asymptotically unbiased, and empirically we observe that the loss performs similarly to our expectations from the theory. It should be noted that most EBM training approaches need some form of approximation and stabilisation. MCMC-based training of deep EBMs usually uses short-run MCMC, introducing biases into the derived parameter update, and uses a regulariser which alters the desired MLE loss. In comparison, ED is easier to justify, and indeed we observe better performance in most applications.
To further support our theoretical analysis empirically, we conduct discrete density estimation experiments with different hyperparameter values (see Figure 2 in the attached PDF for an illustration). Despite the variation, our approach still learns a faithful energy landscape when a sufficiently large number of negative samples M is used, verifying the theoretical analysis that the approximation becomes accurate as M grows.
What is Avg. Rank in Table 1?
We rank each of the 7 methods in Table 1 according to their AUC scores from 1 to 7. The average rank is the average ranking of the method across the datasets used.
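For concreteness, the computation can be reproduced with pandas; the small table below uses the AUC values from the TabDDPM comparison earlier in this thread (Adult and Bank columns only) purely as an illustration.

```python
import pandas as pd

# Rows = methods, columns = datasets; values are AUC scores reported above.
auc = pd.DataFrame(
    {"Adult": [0.861, 0.910, 0.853], "Bank": [0.774, 0.924, 0.845]},
    index=["CTGAN", "TabDDPM", "TabED"],
)
# Rank 1 = best AUC within each dataset, then average the ranks across datasets.
avg_rank = auc.rank(ascending=False, axis=0).mean(axis=1)
print(avg_rank)  # CTGAN 2.5, TabDDPM 1.0, TabED 2.5
```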
[1] Schroeder et al. Energy Discrepancies: A Score independent Loss for Energy-Based Models, NeurIPS 2023
I thank the authors for the comprehensive explanations of my questions! I have no further concerns and will keep my accepting score.
Thank you for your review and feedback for our work!
We thank all reviewers for their constructive comments. First, we would like to summarise the strengths of the paper according to the reviewers.
The reviewers agree that our paper is a successful extension of energy discrepancy to discrete and mixed data where energy-based modelling is challenging.
9ZAB: The paper successfully extends the Energy Discrepancy method to discrete and mixed data with solid theoretical analysis, addressing the challenge of robust and fast sampling in these space. qwnm: The paper proposes a relevant extension to recently published work. Especially Theorem 1 does not seem obvious, and the paper may open up the use of the Energy Discrepancy loss to a much wider variety of use-cases. FxWc: While the original paper proposed some extensions to discrete data, this paper goes into extending energy discrepancy to discrete data in much more depth and includes new mathematical and experimental analyses.
The reviewers generally consider our paper to be well-written and well-supported by experiments.
9ZAB: The designed experiments demonstrate the method's ability to accurately capture complex data structures and generate high-quality synthetic data, highlighting its practical applicability. qwnm: The results are good compared to standard contrastive divergence based methods. The paper is well written, and I found it easy enough to understand even without prior knowledge on the Energy Discrepancy method. FxWc: Overall, the paper is well written. The authors appear to generally outperform two methods proposed in 2019 along with contrastive divergence. While I have some minor concerns about these baselines, outperforming these baselines is at least demonstrating some empirical benefit of this approach. t2RA: The authors show how their method can be efficiently used on Tabular dataset. In particular, they apply to several dataset and show that in average the EBM trained with Energy-Discrepancy [...] outperform the concurrent approach.
We now want to address the biggest concerns and questions of the reviewers.
- The reviewers 9ZAB, qwnm, FxWc, and t2RA were concerned about the scalability of our proposed method, particularly for image data. To address this concern, we want to put our work into the broader perspective of the research field.
Scalability is a valid concern for EBMs in general. Even in continuous spaces, EBMs struggle to scale up and outperform the sample quality of GANs and diffusion models. However, EBMs are appealing because they explicitly output an approximate data density, which is advantageous for downstream tasks such as out-of-distribution detection and calibration.
Most existing EBM training methods rely on contrastive divergence which over-smoothes the estimated densities. Our method is more reliable due to theoretical guarantees, requires little tuning, and displays competitive performance on binary image data. ED can be computed in a single parallel pass, unlike MCMC-based methods that need many sequential steps. Thus, ED is significantly cheaper and more parallelisable. Furthermore, MCMC-based approaches have yet to be extended to mixed-state spaces, and many tabular data sets of interest have a comparably small number of features.
Enhancing the scalability of EBMs will likely require a combination of EBMs with other tools such as a variational decoder model used to scale energy discrepancy in the continuous domain in prior work. The energy function could also be learned in a diffusion model style with the energy discrepancy loss at various noise scales, thus improving training stability and learning a sampler alongside the EBM. However, this was not the main concern of our work.
- The reviewers 9ZAB, qwnm, and FxWc had questions about the importance of the noising process which we define as a heat equation on a graph. We want to summarise what guides the choices we are making.
In principle, the negative samples can be drawn from an arbitrary noise distribution. If, for example, the perturbation is independent of the data, the contrastive term reduces to the log-normalisation constant up to an additive constant,
which is an example of a non-informative distribution. In this case, ED becomes the log-likelihood loss in theory, but the normalisation constant needs to be approximated from thousands of samples to be low-variance.
An informative noising distribution achieves two things.
- The contrastive term can be computed from a small number of samples with lower variance than the full normalisation constant because the integrand is localised around the data point.
- Similar to GAN training, the model converges prematurely if the negative samples are trivial to distinguish from the data.
If we select the perturbation via an arbitrary CTMC, we have no guidance about which rate function to choose. In previous work using CTMCs, the structure of the discrete space is ignored and only the uniform and absorbing chains are discussed. We choose the rate function as the graph Laplacian, which defines the smallest possible transitions in the graph. For small t, the corrupted data points are then with highest probability the nearest neighbours of the data point in the graph. This guarantees that negative samples are different from the data while staying as informative as possible.
However, a small corruption may miss global distribution properties like mixture weights. Theorem 1 states that for large t, the difference between the optimisation landscapes of ED and MLE goes to zero. MLE has preferable statistical properties, but taking the large-t limit comes at the cost of higher variance of the estimated loss. Therefore, the heat equation provides a unified framework and allows us to tune the loss properties in terms of a single hyperparameter t.
Energy discrepancy is a recently proposed loss function similar in spirit to contrastive divergence, but with a clearer loss (as opposed to algorithmic) formulation that is more tractable for analysis. The paper extends energy discrepancy to models involving discrete states, with some convergence analysis. (The analysis is intuitive, e.g. if the mixing rate is faster, the method provides a better approximation.) There is also some analysis for particular state spaces. Reviewers appreciated the new method but had some concerns about the experiments (in particular scalability).
Small point: Regarding theorem 1, the authors might wish to stress in the paper that z_t is independent of θ, meaning that when the right hand side is small, the two losses are nearly equivalent regardless of the value of z_t.