PaperHub
Score: 6.1/10 · Poster · 4 reviewers
Ratings: 4, 4, 3, 2 (min 2, max 4, std 0.8)
ICML 2025

Joint Learning of Energy-based Models and their Partition Function

OpenReview · PDF
Submitted: 2025-01-23 · Updated: 2025-08-14
TL;DR

We propose a method to jointly learn energy-based models and their partition function.

Abstract

Keywords

energy-based models · maximum likelihood estimation · stochastic optimization · structured prediction · MCMC-free

Reviews and Discussion

Official Review
Rating: 4

This paper proposes a min-min training paradigm for simultaneously training energy-based models and their log-normalization constants over a combinatorially large space. It demonstrates that this approach can effectively approximate the maximum likelihood estimation (MLE) objective and extends it to Fenchel-Young losses. Notably, training the objective does not rely on traditional MCMC sampling; instead, it only requires sampling from the prior distribution and stochastic gradient descent. Experiments are conducted on two tasks: multi-label classification and label ranking.

Strengths

  • The insight that the Lagrange multiplier exactly coincides with the log-partition function in the MLE (logistic loss) case is a good one. Based on this insight, the proposed min-min formulation for MLE of EBMs appears to be new.

  • The proposed min-min training method is supported via theoretical development.

  • Experiments investigate variations in model architecture, regularization functions, and the number of sampled points, and show the effectiveness of the proposed method.

Weaknesses

  • In addition to learning EBM model parameters, learning partition functions simultaneously is interesting. But the experiments conducted in this paper do not show the benefit of such joint learning.

  • Missing relevant references and comparisons

Learning probabilistic EBMs in combinatorially large discrete spaces has been studied with substantial progress in recent years, particularly in language modeling [a-d]. The discrete space in language modeling is V^k, where V is the vocabulary size and k is the sentence length. A variety of methods have been proposed for learning discrete EBMs, including augmented stochastic approximation (AugSA) [b] and noise contrastive estimation (NCE) [c]. In particular, the AugSA method has been developed to jointly estimate the model parameters and partition functions (up to a constant). The idea in this work of formulating equations with partition functions and then estimating them is very similar to AugSA.

Connecting and comparing to these previous studies is needed.

  • Limitation in experiment evaluation

The experiment comparisons are only conducted between different versions of the proposed method, by varying settings such as unary vs pairwise, different network architectures, and different losses.

Presenting at least one experiment comparing the proposed method and one of the SOTA methods for learning discrete EBMs is necessary; otherwise, the effectiveness of the proposed method is not convincing.

Regarding the results for multilabel classification, only the learning curves in Figure 1 and the learned log-partition functions in Figure 2 are shown in the main text. It is not clear why the numerical metrics, which should be major results, are not reported in the main text.

[a] Energy-Based Models with Applications to Speech and Language Processing. Foundations and Trends® in Signal Processing, 2024.

[b] Learning Trans-dimensional Random Fields with Applications to Language Modeling. TPAMI, 2018.

[c] Learning neural trans-dimensional random field language models with noise-contrastive estimation. ICASSP, 2018.

[d] Residual Energy-Based Models for Text Generation. ICLR 2020.

Questions

  • The right hand sides of Eq 2 and 5 are the same, which is confusing.

  • Line 243: "We have therefore obtained convergence rates for learning EBMs in arbitrary combinatorial spaces." Not clear.

  • In section 3.3, where does the min-max formulation come from?

  • Figure 3: what are plots (a) and (b)?

Questions for the Authors

see above

Claims and Evidence

see above

Methods and Evaluation Criteria

see above

Theoretical Claims

see above

Experimental Design and Analysis

see above

Supplementary Material

see above

Relation to Prior Work

see above

Missing Important References

see above

Other Strengths and Weaknesses

see above

Other Comments or Suggestions

see above

Author Response

We thank the reviewer for the constructive feedback and relevant references.

Presenting at least one experiment comparing the proposed method and one of the SOTA methods for learning discrete EBMs is necessary

The original Table 3 contained comparisons with generalized Fenchel-Young (GFY) losses, and Figures 1 and 3 contained comparisons to logistic regression, which serves as a relevant baseline for structured prediction tasks learned without explicit partition function estimation. Following the reviewers' feedback, we have now added comparisons with MCMC and adversarial (min-max) training baselines.

Link to updated Table 1 and Table 3: https://anonymous.4open.science/r/ebms/tables.pdf

If the paper is accepted, we are going to include these tables in the main text, thanks to the extra page allowed by the camera-ready.

For convenience, we also show the new results below:

Table 1 (label ranking)

| Loss | Polytope C | Model g | authorship | glass | iris | vehicle | vowel | wine |
|---|---|---|---|---|---|---|---|---|
| Logistic (MCMC) | P | Linear | 84.29 | 80.67 | 55.56 | 79.15 | 49.97 | 91.36 |
| | P | MLP | 87.90 | 75.50 | 97.04 | 86.34 | 57.90 | 91.98 |
| | B | Linear | 88.10 | 79.33 | 97.78 | 82.22 | 55.99 | 96.91 |
| | B | MLP | 89.68 | 80.16 | 97.78 | 85.88 | 48.90 | 91.98 |
| Logistic (min-max) | P | Linear | 61.01 | 79.22 | 61.48 | 70.20 | 45.68 | 65.43 |
| | P | MLP | 70.02 | 67.03 | 55.56 | 68.56 | 40.22 | 70.99 |
| | B | Linear | 82.58 | 78.40 | 61.48 | 74.90 | 52.86 | 93.21 |
| | B | MLP | 78.50 | 71.68 | 42.96 | 68.76 | 48.99 | 83.33 |

Table 3 (multilabel)

| Loss | Model g | yeast | scene | birds | emotions | cal500 |
|---|---|---|---|---|---|---|
| Logistic (exact MLE) | Linear (unary) | 61.46 | 70.64 | 43.97 | 63.35 | 37.31 |
| | MLP (unary) | 60.71 | 70.81 | 40.88 | 62.99 | 36.24 |
| | Resnet (unary) | 62.27 | 71.02 | 40.78 | 61.50 | 37.45 |
| Logistic (MCMC) | Linear (unary) | 61.98 | 70.53 | 44.93 | 61.76 | 44.61 |
| | MLP (unary) | 60.34 | 71.57 | 33.13 | 62.23 | 45.70 |
| | Resnet (unary) | 56.57 | 71.24 | 27.03 | 58.98 | 44.34 |
| | Linear (pairwise) | 61.96 | 70.50 | 44.93 | 61.76 | 44.61 |
| | MLP (pairwise) | 60.83 | 71.92 | 34.61 | 63.15 | 45.95 |
| | Resnet (pairwise) | 56.68 | 70.46 | 27.08 | 64.05 | 44.35 |
| Logistic (min-max) | Linear (unary) | 52.80 | 52.29 | 27.59 | 61.72 | 35.08 |
| | MLP (unary) | 58.58 | 52.54 | 24.95 | 60.53 | 46.08 |
| | Resnet (unary) | 57.53 | 42.82 | 25.32 | 55.85 | 47.59 |
| | Linear (pairwise) | 52.80 | 52.29 | 27.59 | 61.72 | 35.08 |
| | MLP (pairwise) | 58.76 | 51.59 | 26.71 | 58.68 | 43.31 |
| | Resnet (pairwise) | 57.53 | 42.95 | 22.35 | 46.72 | 47.67 |

Technical details: For the min-max approach, we use optimistic Adam as the solver, an MLP as the generator, and REINFORCE (the score-function estimator) for gradient estimation. For MCMC, we use the standard Metropolis–Hastings algorithm. As with the min-min approach, we tune the learning rate and L2 regularization using grid search.
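For reference, a minimal sketch of the kind of Metropolis–Hastings sampler such an MCMC baseline could use for multilabel outputs; the bit-flip proposal, the function name, and the `energy` callable are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def mh_sample_labels(energy, x, k, n_steps=100, rng=None):
    """Metropolis-Hastings over binary label vectors y in {0,1}^k.

    `energy(x, y)` is assumed to return the score g(x, y), so the target
    distribution is p(y|x) proportional to exp(g(x, y)). A single bit flip
    is used as a symmetric proposal.
    """
    rng = np.random.default_rng() if rng is None else rng
    y = rng.integers(0, 2, size=k)        # random initial label vector
    g_y = energy(x, y)
    for _ in range(n_steps):
        y_prop = y.copy()
        j = rng.integers(k)
        y_prop[j] = 1 - y_prop[j]         # flip one coordinate
        g_prop = energy(x, y_prop)
        # symmetric proposal => acceptance probability min(1, exp(g_prop - g_y))
        if np.log(rng.random()) < g_prop - g_y:
            y, g_y = y_prop, g_prop
    return y
```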

References [a-d]

Thank you for these references. We will add them.

Differences with AugSA [b]

Thank you for this very relevant reference. We will cite it and clarify the distinctions. First, that paper studies linear conditional random fields for whole-sentence LMs, while we study more general energy-based models. In addition, our method supports the broader class of Fenchel-Young losses (including sparsemax), not just MLE. They indeed propose the idea of optimizing the normalization constant, but in their case it is a constant, while in our case it is a function, parameterized as a neural network that approximates the input-dependent log-partition function. Last, we empirically demonstrate (Figure 2) that our learned $\tau$ generalizes to unseen inputs $x$.

Not clear why the numerical metrics are not reported in the main text, which should be major results.

This was due to space constraints in the initial submission. If our paper is accepted, we plan to use the extra page allowed for the camera-ready version to move the expanded Table 3 (which now includes MCMC and a min-max approach) into the main text.

The right hand sides of Eq 2 and 5 are the same, which is confusing.

Please note that Eq. (2) is using an argmax while Eq. (5) is using a max.

Line 243: "We have therefore obtained convergence rates for learning EBMs in arbitrary combinatorial spaces." Not clear.

Our claim refers to the fact that our proposed objective can be optimized using standard stochastic gradient methods, with the convergence rate discussed after Proposition 2. This is achieved without requiring an oracle for the exact partition function (needed for traditional MLE gradient computation) or k-best solutions (needed for traditional sparsemax loss optimization), which are often intractable in arbitrary combinatorial spaces. We will clarify this context in the paper.

In section 3.3, where does the min-max formulation come from?

We derived the min-max formulation presented in Proposition 3 and Appendix C.4 ourselves. It serves as a comparison point to our proposed min-min formulation and draws parallels to adversarial learning approaches.

Figure 3: what are plots (a) and (b)?

Thank you for catching that typo. We were referring to the left and center columns of the Figure.

Official Review
Rating: 4

It is well known that computing the normalising constants of energy-based models is intractable, which is a significant and important problem. The authors propose a simple and elegant technique for discrete models: instead of computing the normalising constant, consider a constrained maximum likelihood optimisation problem with the density constrained to sum to 1, and use Lagrange multipliers to enforce the constraint. Two neural networks are used to jointly learn the energy-based model and the normalising constant, as the title suggests. Experiments are conducted on multilabel classification and ranking.

Update after rebuttal

As I indicated in my original review, I found and still find the empirical results a little bit underwhelming. However, I believe the other results in this paper are interesting, and are enough to warrant publication even without very strong empirical results. I'm maintaining my current score of accept, noting that the overall average still seems to be erring on accept.

Questions for the Authors

See above.

Claims and Evidence

Claims are adequately supported by evidence.

Methods and Evaluation Criteria

Evaluation criteria are appropriate.

Theoretical Claims

Theoretical claims appear sound.

Experimental Design and Analysis

The experiments appear sound, although only two datasets are used (see below).

Supplementary Material

I briefly checked the proofs for correctness. They use standard tools from optimisation and I could not spot any errors.

Relation to Prior Work

It is very well placed in broader scientific literature, tackling a central problem in machine learning.

Missing Important References

None that I am aware of.

Other Strengths and Weaknesses

Strengths:

  • The paper is very well written, provides a good amount of background that is accessible even to those who are not familiar with the area, and cites appropriate canonical background texts. The notation seems sensible and easy to read. I was not able to spot typos or errors. Easily the best-written paper in my pile this year.
  • The idea is relatively simple --- use the constraint that the normalising constant equals 1 as a separate function multiplied by the Lagrange multiplier, then just use neural nets for both the constraint and the original log loss. A simple idea seems to give good results, without convoluted hacks.

Weaknesses:

  • The approach is limited to a discrete target space $\mathcal{Y}$.
  • Figure 2 is perhaps misleading and I am also not sure what I should take from it. Firstly, the horizontal axis shows randomly selected x values (i.e. there is no "metric" between the x values and no "continuity" in y), but the plot makes it look like a time series or something similar. Secondly, the errors are entirely qualitative, relying on the reader to eyeball the difference between the dark and light plots. All I can say is "emotions" looks good and "cal500" looks relatively poor. Finally, there is no "control", i.e. it shows the proposed method but doesn't compare with any baseline. I think a single scalar number for each of the datasets would be more informative, together with a comparison with other techniques.
  • In the empirical comparison with other techniques, I am not seeing comparisons with other (perhaps intractable?) techniques, like standard (non-probabilistic) deep net classifiers, logistic regression, etc. Could you provide these, or discuss why they are inappropriate?

Other Comments or Suggestions

Detailed comments:

  • The variational perspective in equation (2) is generalised on page 6. NLL loss and KL divergence gives a standard "Bayes posterior" (which is not actually a posterior in this context). Arbitrary loss and KL divergence gives a "Gibbs posterior", which is equation (1). Arbitrary loss and arbitrary divergence (or, more specifically, f divergence) gives a generalised posterior. Maybe you would like to mention this (I guess "posterior" can be replaced by "measure")? --- see Knoblauch
  • As far as I understand, the problem of sampling from the probabilistic EBM is not studied. Is that right? Would you expect any particular sampling algorithm to be especially well suited to this formulation of EBM, or are we no better off than generic samplers?
  • The constraint is on the normalising constant equal to 1. Why not constrain the log normalising constant to be equal to zero? Wouldn't this then be of the same "scale" as the likelihood term in the loss?
Author Response

Thank you very much for your positive review and constructive feedback.

The approach is limited to a discrete target space

Our approach can in principle be applied to continuous output spaces $\mathcal{Y}$. However, we believe standard regression tasks are less common applications of conditional EBMs, compared to structured prediction over discrete spaces. This motivated our focus on discrete, combinatorially large spaces like sets and permutations.

Figure 2: a single scalar number for each of the datasets would be more informative and a comparison with other techniques

Indeed, Figure 2 plots the learned vs. true log-partition for 100 individual test samples, rather than a time series. We chose this visualization to show the correlation across diverse samples. While scalar metrics such as correlation coefficients could summarize this, the plot provides a better view of the approximation quality. We agree that summarizing with a scalar metric in the caption would be beneficial and will add this. A comparison with other techniques for learning the log-partition function is difficult because, to our knowledge, learning $\tau$ as a parameterized function is novel to our work.
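As an illustration of the scalar summary mentioned above, a small sketch computing a Pearson correlation and a mean absolute error between learned and exact log-partition values; the arrays below are placeholders standing in for values produced by the trained $\tau$ network and exact enumeration.

```python
import numpy as np

# Placeholder values: tau_learned[i] = tau(x_i), tau_true[i] = log Z(x_i).
tau_learned = np.array([2.1, 3.4, 1.7, 2.9])
tau_true = np.array([2.0, 3.5, 1.9, 3.0])

pearson_r = np.corrcoef(tau_learned, tau_true)[0, 1]   # correlation across test inputs
mae = np.mean(np.abs(tau_learned - tau_true))          # average absolute gap
print(f"Pearson r = {pearson_r:.3f}, MAE = {mae:.3f}")
```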

Empirical comparison with other techniques

For multilabel classification, in Table 3, logistic (unary) corresponds exactly to using sigmoids as output layers and binary cross-entropy as loss. In addition, Table 3 includes the results when using the generalized Fenchel-Young loss, which is state-of-the-art for structured prediction tasks.

For label ranking, we use full rankings as supervision, which is naturally cast as a structured prediction task. We could potentially reduce full rankings to many pairwise rankings and use pairwise ranking losses instead, but we believe such a comparison is out of the scope of our paper.

However, we acknowledge that the paper did lack comparisons with MCMC and min-max baselines. We therefore added these comparisons to Table 1 and Table 3.

Link to updated Table 1 and Table 3: https://anonymous.4open.science/r/ebms/tables.pdf

If the paper is accepted, these tables are going to be included in the main text, thanks to the extra page allowed by the camera-ready.

Technical details: For the min-max approach, we use optimistic Adam as the solver, an MLP as the generator, and REINFORCE (the score-function estimator) for gradient estimation. For MCMC, we use the standard Metropolis–Hastings algorithm. As with the min-min approach, we tune the learning rate and L2 regularization using grid search.

The variational perspective in equation (2) is generalised on page 6. NLL loss and KL divergence gives a standard "Bayes posterior" (which is not actually a posterior in this context). Arbitrary loss and KL divergence gives a "Gibbs posterior", which is equation (1). Arbitrary loss and arbitrary divergence (or, more specifically, f divergence) gives a generalised posterior. Maybe you would like to mention this (I guess "posterior" can be replaced by "measure")? --- see Knoblauch

This is a good remark. If $g(x, y) = \log p(x|y)$ and $q(y) = p(y)$, then Eq. (1) indeed gives the posterior $p(y|x)$ (in the sense of Bayes' rule).
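Spelling out the step (assuming, in line with the reviewer's Gibbs-posterior reading, that Eq. (1) normalizes $q(y)\exp(g(x,y))$ over $y$; this is our paraphrase rather than a quote of the paper's equation):

```latex
% With g(x,y) = \log p(x \mid y) and q(y) = p(y), the Gibbs measure becomes
p_{g,q}(y \mid x)
  \;\propto\; q(y)\, e^{g(x,y)}
  \;=\; p(y)\, p(x \mid y)
  \;\propto\; p(y \mid x),
% i.e. Bayes' rule is recovered as a special case.
```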

As far as I understand, the problem of sampling from the probabilistic EBM is not studied. Is that right? Would you expect any particular sampling algorithm to be especially well suited to this formulation of EBM, or are we no better off than generic samplers?

Indeed, we do not address the issue of sampling in this paper. We wonder whether the learned log-partition function can help design a better rejection sampling scheme. We leave this to future work.

The constraint is on the normalising constant equal to 1. Why not constrain the log normalising constant to be equal to zero? Wouldn't this then be of the same "scale" as the likelihood term in the loss?

We haven't considered this. Our proof is based on the more general Fenchel-Young losses, where the Lagrange multiplier $\tau$ naturally appears as a consequence of Lemma 1.

Official Review
Rating: 3

The paper proposes a method for learning discrete energy-based models and their partition function. Specifically, by parameterizing the partition function as $\gamma(x)$, the paper introduces the constraint $\sum_y p(y|x) - \gamma(x) = 0$, which can be incorporated into the maximum likelihood estimation using the Lagrange multiplier method. Experimental results on multilabel classification and label ranking demonstrate the effectiveness of the proposed method.
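As a rough sketch of how such a joint formulation can arise (our own illustration via the standard variational bound on the log-partition function; the paper's exact equations may differ in detail):

```latex
% For any Z > 0 and any \tau \in \mathbb{R}, concavity of the logarithm gives
\log Z \;\le\; \tau + e^{-\tau} Z - 1,
\qquad \text{with equality iff } \tau = \log Z .
% Applying this with Z(x) = \sum_{y' \in \mathcal{Y}} e^{g(x,y')} turns the NLL into
% a joint minimization over the energy g and a multiplier \tau(x):
\min_{g,\,\tau}\;
\mathbb{E}_{(x,y)}\Big[\, -g(x,y) + \tau(x) + e^{-\tau(x)} \textstyle\sum_{y' \in \mathcal{Y}} e^{g(x,y')} - 1 \,\Big],
% whose inner minimizer satisfies \tau^\star(x) = \log Z(x): the multiplier
% coincides with the log-partition function.
```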

Questions for the Authors

None

Claims and Evidence

The paper is well written and most claims are well supported. However, there is one point on which I hold a different opinion.

In line 295, it states: “However, min-max formulations are notoriously hard to optimize. In contrast, our approach is based on a min-min formulation, which is easier to optimize.”

There are no experimental results in the paper to support this claim. I am not convinced that the proposed min-min formulation is indeed easier to optimize than the min-max formulation, which parameterizes a variational distribution to estimate the partition function. In fact, in the continuous domain, variational methods have demonstrated scalability in modeling high-dimensional data [1].

The proposed min-min formulation requires Monte Carlo estimation of the intractable sum, as seen in Equation 13. This approach generally suffers from high variance, which can make optimization unstable. In this regard, I am curious about the scalability of the proposed method.

[1] Duvenaud, David, et al. "No MCMC for me: Amortized samplers for fast and stable training of energy-based models." International Conference on Learning Representations (ICLR). 2021.

Methods and Evaluation Criteria

The proposed method is straightforward and easy to follow. However, the evaluation criteria are insufficient. In the experiments, the paper only presents the results of the proposed method, lacking a baseline comparison with other approaches, such as the variational-based min-max method.

The paper focuses solely on the non-probabilistic EBM setting, specifically computing the mode argmax_y p(y, x). It would be beneficial to explore whether the proposed min-min formulation could also perform well in a probabilistic EBM setting, where the goal is to learn an EBM to model the data distribution rather than merely finding the mode.

Moreover, [2] addresses a similar problem—learning set functions—which can be viewed as a type of mode-finding problem (i.e., argmax). [2] explores broader applications, including product recommendation, set anomaly detection, and compound selection, making it a valuable reference. Additionally, [2] proposes a method for training discrete EBMs. Although it is based on a biased objective, it serves as a competitive baseline for performing the mode-seeking task of argmax.

[2] Ou, Zijing, et al. "Learning neural set functions under the optimal subset oracle." Advances in Neural Information Processing Systems 35 (2022): 35021-35034.

Theoretical Claims

The theoretical results appear to be correct.

Experimental Design and Analysis

As mentioned before, the experimental design is insufficient. More comparisons to baselines and additional settings, such as learning probabilistic EBMs, should be included. The experiments in [2] can also serve as a good reference.

Supplementary Material

None

Relation to Prior Work

None

Missing Important References

There are some missing references regarding the training of discrete EBMs with non-MLE approaches, which could be cited in Section 2.5.

  • Energy discrepancy [3] is one such non-MLE approach that has an interesting connection to MLE.
  • Ratio matching and concrete score matching are also applicable for training discrete EBMs.

[3] Schröder, Tobias, et al. "Energy-Based Modelling for Discrete and Mixed Data via Heat Equations on Structured Spaces." Advances in Neural Information Processing Systems 37 (2024): 79246-79281.

[4] Hyvärinen, Aapo. "Some extensions of score matching." Computational statistics & data analysis 51.5 (2007): 2499-2512.

[5] Meng, Chenlin, et al. "Concrete score matching: Generalized score matching for discrete data." Advances in Neural Information Processing Systems 35 (2022): 34532-34545.

Other Strengths and Weaknesses

None

Other Comments or Suggestions

None

Author Response

We thank the reviewer for the constructive feedback.

Challenges of min-max implementation

To avoid any controversial claim, we will remove "notoriously hard to optimize". Instead, we will explain in more detail the technical challenges that arise when using this formulation.

The min-max objective is $\mathbb{E}_x \mathbb{E}_{y \sim p(\cdot|x)} [g(x, y)] - \Omega(p(\cdot|x)) - \mathbb{E}_{(x,y)} [g(x,y)]$, where $\Omega$ is the negative Shannon entropy. The variable $p$ appears both in the sampling (expectation) and in the entropy. In practice, we parameterize $p$ as a neural network. If $p$ is an EBM, sampling from $p$ is difficult and the min-max formulation does not bring any benefit. Therefore, it is common to use a parameterization of $p$ that is easy to sample from instead.

Another challenge comes from the gradient computation. Since $p$ appears in the sampling distribution, one typically uses the score-function estimator (REINFORCE) to estimate the gradient with respect to the parameters of $p$. However, this estimator suffers from high variance. In our min-min formulation, the variable $\tau$ is a function, not a distribution, and its gradients are easy to compute.
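For completeness, a minimal sketch of the score-function (REINFORCE) surrogate referred to above, for a generator distribution $p_\phi(\cdot|x)$ that is easy to sample from; the callables are placeholders and this illustrates the estimator only, not the exact training code behind the reported baselines.

```python
import torch

def reinforce_surrogate(reward_fn, log_prob_fn, sample_fn, x, m=16):
    """Surrogate whose gradient w.r.t. the generator parameters approximates
    d/dphi E_{y ~ p_phi(.|x)}[ r(x, y) ] via the score-function identity
    E[(r - baseline) * d/dphi log p_phi(y|x)].

    `sample_fn(x, m)` draws m discrete samples y ~ p_phi(.|x) (no gradient),
    `log_prob_fn(x, y)` returns log p_phi(y|x) (differentiable in phi),
    `reward_fn(x, y)` returns the reward, e.g. the energy g(x, y).
    A mean baseline is subtracted to reduce variance (a leave-one-out
    baseline would keep the estimator exactly unbiased).
    """
    y = sample_fn(x, m)                            # (B, m, ...), treated as constants
    with torch.no_grad():
        r = reward_fn(x, y)                        # (B, m) rewards
        baseline = r.mean(dim=1, keepdim=True)     # per-input mean baseline
    log_p = log_prob_fn(x, y)                      # (B, m), differentiable in phi
    return ((r - baseline) * log_p).mean()         # maximize (or minimize its negation)
```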

High variance of Monte Carlo estimation of expectations

Our experiments (e.g., Table 1, Table 3) were designed to show scalability with respect to the output space size, governed by the number of classes $k$ ($|Y| = 2^k$ or $k!$). Scalability with the number of training samples $n$ is addressed through standard SGD optimization. Regarding the MC estimation in Eq. (13), while MC estimates can introduce variance, our doubly stochastic approach (Alg. 1) proved stable and effective in practice across our experiments. Fig. 1 illustrates that even with a small number of prior samples $y'$, the method converges effectively, although more samples accelerate convergence towards the exact MLE objective.
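For concreteness, a minimal sketch of one doubly stochastic update consistent with the description above, written against the bound $\log Z \le \tau + e^{-\tau} Z - 1$; the networks, prior sampler, and objective details are illustrative assumptions and may differ from the paper's Eq. (13) and Alg. 1.

```python
import torch

def min_min_step(energy_net, tau_net, optimizer, x, y, prior_sample, m=16):
    """One doubly stochastic SGD step on a joint upper bound of the NLL.

    The intractable sum Z(x) = sum_{y'} exp(g(x, y')) is replaced by an
    importance-weighted Monte Carlo estimate from m prior samples y' ~ q,
    and the bound log Z(x) <= tau(x) + exp(-tau(x)) * Z(x) - 1 couples the
    energy network g and the log-partition network tau.
    (Sketch only; `energy_net`, `tau_net`, `prior_sample` are placeholders.)
    """
    g_data = energy_net(x, y)                       # g(x, y) on observed pairs, shape (B,)
    tau = tau_net(x)                                # learned log-partition estimate, shape (B,)

    y_prior, log_q = prior_sample(x, m)             # y': (B, m, ...), log q(y'): (B, m)
    g_prior = energy_net(x, y_prior)                # g(x, y'): (B, m)
    z_hat = torch.exp(g_prior - log_q).mean(dim=1)  # Monte Carlo estimate of Z(x)

    loss = (-g_data + tau + torch.exp(-tau) * z_hat - 1.0).mean()

    optimizer.zero_grad()
    loss.backward()                                 # gradients flow to both g and tau
    optimizer.step()
    return loss.item()
```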

Lack of baseline comparison with other approaches, such as the variational-based min-max method.

Our paper included logistic regression (“Exact MLE” in Fig. 1 & 3) and generalized Fenchel-Young losses (GFY) (Table 3) as baselines. However, our paper did lack comparisons with MCMC and min-max approaches.

Link to updated Table 1 and Table 3: https://anonymous.4open.science/r/ebms/tables.pdf

If the paper is accepted, we are going to include these tables in the main text, thanks to the extra page allowed by the camera-ready.

For convenience, we also show the new results below:

Table 1

| Loss | Polytope C | Model g | authorship | glass | iris | vehicle | vowel | wine |
|---|---|---|---|---|---|---|---|---|
| Logistic (MCMC) | P | Linear | 84.29 | 80.67 | 55.56 | 79.15 | 49.97 | 91.36 |
| | P | MLP | 87.90 | 75.50 | 97.04 | 86.34 | 57.90 | 91.98 |
| | B | Linear | 88.10 | 79.33 | 97.78 | 82.22 | 55.99 | 96.91 |
| | B | MLP | 89.68 | 80.16 | 97.78 | 85.88 | 48.90 | 91.98 |
| Logistic (min-max) | P | Linear | 61.01 | 79.22 | 61.48 | 70.20 | 45.68 | 65.43 |
| | P | MLP | 70.02 | 67.03 | 55.56 | 68.56 | 40.22 | 70.99 |
| | B | Linear | 82.58 | 78.40 | 61.48 | 74.90 | 52.86 | 93.21 |
| | B | MLP | 78.50 | 71.68 | 42.96 | 68.76 | 48.99 | 83.33 |

Table 3

| Loss | Model g | yeast | scene | birds | emotions | cal500 |
|---|---|---|---|---|---|---|
| Logistic (exact MLE) | Linear (unary) | 61.46 | 70.64 | 43.97 | 63.35 | 37.31 |
| | MLP (unary) | 60.71 | 70.81 | 40.88 | 62.99 | 36.24 |
| | Resnet (unary) | 62.27 | 71.02 | 40.78 | 61.50 | 37.45 |
| Logistic (MCMC) | Linear (unary) | 61.98 | 70.53 | 44.93 | 61.76 | 44.61 |
| | MLP (unary) | 60.34 | 71.57 | 33.13 | 62.23 | 45.70 |
| | Resnet (unary) | 56.57 | 71.24 | 27.03 | 58.98 | 44.34 |
| | Linear (pairwise) | 61.96 | 70.50 | 44.93 | 61.76 | 44.61 |
| | MLP (pairwise) | 60.83 | 71.92 | 34.61 | 63.15 | 45.95 |
| | Resnet (pairwise) | 56.68 | 70.46 | 27.08 | 64.05 | 44.35 |
| Logistic (min-max) | Linear (unary) | 52.80 | 52.29 | 27.59 | 61.72 | 35.08 |
| | MLP (unary) | 58.58 | 52.54 | 24.95 | 60.53 | 46.08 |
| | Resnet (unary) | 57.53 | 42.82 | 25.32 | 55.85 | 47.59 |
| | Linear (pairwise) | 52.80 | 52.29 | 27.59 | 61.72 | 35.08 |
| | MLP (pairwise) | 58.76 | 51.59 | 26.71 | 58.68 | 43.31 |
| | Resnet (pairwise) | 57.53 | 42.95 | 22.35 | 46.72 | 47.67 |

Technical details: For the min-max approach, we use optimistic Adam as the solver, an MLP as the generator, and REINFORCE for gradient estimation. For MCMC, we use the Metropolis–Hastings algorithm. We tune the learning rate and L2 regularization using grid search.

Whether the proposed min-min formulation could also perform well in a probabilistic EBM setting

We indeed focus primarily on the prediction setting (i.e., finding the mode, $\mathrm{argmax}_y\, p(y|x)$), because our approach, while learning the model $p(y|x)$ and approximating its normalization constant, does not directly address the challenge of sampling from this distribution, which would be central to evaluating performance in a generative/probabilistic setting. Sampling in structured discrete spaces can be challenging. Thank you for pointing out the work of Ou et al., which we will cite.

Missing references

Thank you, we will add these relevant citations.

Reviewer Comment

Thank you for the response and for sharing the updated results.

Overall, the paper is well-written, and I find the narrative clear and easy to follow. It would be interesting to see whether the min-min algorithm performs well in the probabilistic EBM setting as well. My intuition is that it might face challenges in that context compared to MCMC-based methods. If that turns out to be the case, a discussion on why the min-min approach is more effective in the prediction setting than in the probabilistic EBM setting would be a valuable addition.

Author Comment

Dear reviewer,

Thank you for your comment and your positive feedback.

We agree that evaluating the quality of $p(y|x)$, and not just $\mathrm{argmax}_y\, p(y|x)$, would be valuable. However, we also think that this evaluation would be challenging and could warrant a separate paper.

To our knowledge, evaluating $p(y|x)$ could either mean evaluating whether it is well calibrated or whether generations $y \sim p(\cdot|x)$ are good. Both evaluations are challenging in the structured setting (sets, permutations), as they highly depend on the quality of the MCMC sampler. In our opinion, sampling $y \sim p(\cdot|x)$ would be most interesting if $y$ is continuous, for instance for generating an image $y$ from a prompt $x$. However, our paper focuses on discrete output spaces.

In addition, we believe that the learned log-partition (for which we empirically demonstrate the generalization properties in Figure 2) could help design sampling algorithms (such as rejection sampling) with faster convergence than vanilla MCMC. We leave such an investigation for future work.

To address your concern, we propose to add a new paragraph “limitations and future work”, that will recap the discussion above.

Thank you,

The authors

Official Review
Rating: 2

The authors formulate the MLE of energy-based models as a constrained problem, with the equality constraint defining the partition function. They then suggest performing stochastic optimization by jointly learning the energy model and its associated dual variable. Amortization is suggested by parameterizing the dual as a neural network.

Questions for the Authors

  • When using f-divergences like square loss instead of KL, thus leading to sparsity, can the authors provide details about the associated challenges in terms of smoothness of the optimization? Is the method as stable as in the KL "dense" case?
  • Which other f-divergences could be interesting beyond the square loss and KL?

Thanks!

Claims and Evidence

In the introduction, the authors motivate their work by emphasizing the limitations of MCMC-based methods. However, the experimental section lacks a thorough benchmarking analysis and comparisons with currently established methods, leaving the practical relevance of the proposed approach unverified.

Furthermore, while the authors employ an amortized optimization technique to handle a polytope constraint in the dual problem, a very common method used extensively in, for instance, the literature on neural OT [1], the paper does not clearly explain why the structure of the EBM problem is particularly suited to this approximation over traditional Monte Carlo-based methods. This omission underscores the need for thorough empirical ablation studies, which are notably absent.

Based on this important shortcoming, I believe the paper does not meet the acceptance threshold.

Additionally, the discussion on generalization to Fenchel-Young losses appears to offer limited practical utility. For future revisions, I recommend relocating the section on Fenchel-Young losses to an appendix, thereby allowing more space for extensive large-scale experiments and detailed comparisons with current benchmarks in EBMs.

[1] Korotin, A., Li, L., Genevay, A., Solomon, J. M., Filippov, A., & Burnaev, E. (2021). Do neural optimal transport solvers work? a continuous wasserstein-2 benchmark. Advances in neural information processing systems, 34, 14593-14605.

Methods and Evaluation Criteria

Figures 1 and 2 aim to validate the presented theoretical results. However, the experiments appear to be limited in scale and provide minimal insight into performance in real-world settings.

For the experimental part, it would be beneficial to include a thorough comparison with current MCMC-based approaches in terms of runtime, memory consumption, and final performance. A comparison with the adversarial approaches described in Section 3.3 could also be valuable.

Remark: I believe Table 1 is not discussed enough and it is hard to understand what conclusions to draw there.

Theoretical Claims

The theoretical claims seem sound and rigorous.

Experimental Design and Analysis

This is the weakest part of the paper as discussed above.

Supplementary Material

Yes I reviewed the proofs which seemed rigorous.

Relation to Prior Work

This work applies the very well known technique of amortized optimization on the dual problem of a convex optimization problem to the problem of learning EBMs.

Missing Important References

None

Other Strengths and Weaknesses

/

Other Comments or Suggestions

I think equations 12 and 13 should be better introduced. It would be nice to be guided in this critical part without having to look at other sections.

Author Response

Thank you for your constructive feedback.

Lack of baselines, comparison with MCMC and adversarial approaches

While our paper includes GFY and exact MLE (logistic loss) baselines, it did lack comparisons with MCMC and min-max approaches. We have run additional experiments with these baselines.

Link to updated Table 1 and Table 3: https://anonymous.4open.science/r/ebms/tables.pdf

If the paper is accepted, these tables are going to be included in the main text, thanks to the extra page allowed by the camera-ready.

Technical details: For the min-max approach, we use optimistic Adam as the solver, an MLP as the generator, and REINFORCE (the score-function estimator) for gradient estimation. For MCMC, we use the standard Metropolis–Hastings algorithm. As with the min-min approach, we tune the learning rate and L2 regularization using grid search.

Amortized optimization in the dual problem, neural OT [1]

We agree that parameterizing dual variables as neural networks is a commonly used technique in the OT literature. We already cited Seguy et al. (2018) and will gladly add the suggested citation [1]. Technically, the dual variable in [1] is a convex function, hence the use of ICNNs. In our work, the Lagrange multiplier $\tau$ is a continuous function.

That said, we believe our application of that idea to EBMs remains novel and significant. To our knowledge, we are the first to parameterize the log-partition function of conditional EBMs as a neural network. We believe that parameterizing dual variables with neural networks is a powerful concept with potential applications beyond OT and EBMs.

Regarding amortized optimization, its goal is typically to accelerate the process of repeatedly solving similar optimization problems. In our paper, the primary goal of parameterizing the log-partition function $\tau$ as a neural network is different: it is to enable the approximation of the log-partition function to generalize to unseen data points $x$.

Why EBMs are particularly suited to this approximation over Monte Carlo-based methods?

As discussed in Section 2.4, sampling from EBMs, especially in discrete spaces, is often challenging and typically requires MCMC methods like Langevin-MCMC (for continuous spaces). For discrete output spaces, constructing an MCMC sampler is less explored and can be more challenging. Typically each particular output space (such as sets or permutations in our work) will require a specifically-designed sampler. In contrast, our proposed approach offers a simpler, more generic alternative that avoids MCMC sampling during training by jointly learning the energy and the log-partition function.

Relocating section on Fenchel-Young losses

We chose to include the Fenchel-Young losses because they encompass the sparsemax loss. Our work presents, to our knowledge, the first tractable method for optimizing the sparsemax loss for EBMs in general combinatorial spaces without needing a k-best oracle. Thanks to the extra page in the camera-ready, we believe we do not need to move this.

Experiments [in Figures 1 and 2] appear to be limited in scale and provide minimal insight

Our primary goal in these experiments was to demonstrate scalability with respect to the number of classes $k$, which determines the size of the output space ($|Y| = 2^k$ for multilabel classification, $|Y| = k!$ for label ranking). Scalability with respect to the number of training samples $n$ is handled via standard SGD.

Table 1 not discussed enough

The key takeaway from Table 1 is that the Birkhoff polytope representation tends to perform better with linear models for label ranking, likely due to its ability to capture more interactions. However, with more expressive MLP models, both the Birkhoff and Permutahedron representations achieve strong performance.

Eq 12 and 13 should be better introduced

Thank you for the suggestion. By minimizing the overall objective with respect to both $g$ and $\tau$, the optimization process encourages $\tau$ to approximate the true log-partition function while simultaneously learning an energy function $g$ that fits the data under the EBM framework. We will add a brief, self-contained explanation to the paper.

When using f-divergences what are the challenges in terms of optimization?

When $f$ is strictly convex, which is the case for the sparsemax loss, the loss is smooth, leading to a well-behaved optimization landscape. The primary challenge associated with the sparsemax loss in structured prediction is the need for a k-best oracle, which our approach alleviates (similarly, we do not need an LSE oracle in the logistic case).

Which other f-divergences could be interesting ?

We may consider alpha-divergences, which interpolate between the logistic loss ($\alpha = 1$) and the sparsemax loss ($\alpha = 2$). Our approach can readily be applied with these divergences by selecting the appropriate $f$ function.
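For background, one common parameterization of this family uses Tsallis $\alpha$-entropies (as in Fenchel-Young and entmax losses); this is general context and an assumption about notation, not a statement of the paper's specific construction.

```latex
% Tsallis \alpha-entropy over the probability simplex:
H_\alpha(p) =
\begin{cases}
  \dfrac{1}{\alpha(\alpha - 1)} \Big( 1 - \sum_{y} p_y^{\alpha} \Big), & \alpha \neq 1,\\[1.5ex]
  -\sum_{y} p_y \log p_y, & \alpha = 1 \ \text{(Shannon limit)} .
\end{cases}
% The Fenchel-Young loss generated by -H_\alpha recovers the logistic/softmax
% loss at \alpha = 1 and the sparsemax loss at \alpha = 2.
```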

Final Decision

This paper proposes a method for learning discrete energy-based models by formulating MLE as a constrained optimization problem, jointly learning the energy function and its partition function using Lagrange multipliers. The approach avoids costly MCMC sampling and shows strong performance on multi-label classification and label ranking tasks.

The method is surely an interesting contribution. The main concern of the reviewers, which was raised again in the reviewers' discussion, is the experimental evaluation, both in terms of setup and results.