PaperHub
5.5 / 10
Poster · 4 reviewers
Lowest: 2 · Highest: 4 · Std. dev.: 1.0
Individual ratings: 2, 2, 4, 4
ICML 2025

A Theoretical Framework For Overfitting In Energy-based Modeling

OpenReview · PDF
Submitted: 2025-01-24 · Updated: 2025-07-24
TL;DR

We analyze the impact of limited data on the training of energy-based models, focusing on the dynamics of the eigenmodes, and give a theoretical perspective on early-stopping and data-correction protocols that improve the quality of the inferred model

Abstract

Keywords

energy-based models, overfitting, random-matrix theory, inverse problems, early-stopping, training dynamics, Boltzmann Machine

Reviews and Discussion

Review (Rating: 2)

This paper analyses the training dynamics of learning a multi-dimensional Gaussian distribution from data. The training dynamics considered here is a continuous-time gradient ascent optimizing the maximum likelihood objective. Under some assumptions on the starting point of the learning dynamics, the authors use techniques from random matrix theory to study the effect of finite training samples on the learning. They use this analysis to study the effects of regularization and early stopping. The analysis is also extended to Ising-type models, with the learning done in the high-temperature/mean-field setting.

Questions for Authors

Could the authors explain here, or add a note to the Supp. Mat. showing, the intermediate steps of how Eq. (5) is derived? There is some clarity lacking here. For instance, the LHS of the second equation should be the inner product between $dv^\beta/dt$ and $v^\alpha$.

Do the authors assume that $J(t)$ and $C^M$ are aligned for all $t$ in the training dynamics? If these are aligned at $t = 0$, does this imply alignment at all times?

Is the overfitting reported in this case really overfitting in the traditional sense? In my view, the hypothesis model is not over-parametrized here. The fact that the test metrics achieve their optimum at a different time than the training metric is just an artifact of the different noise instantiations in the test and train datasets. That is, they are literally different metrics whose difference decreases in the infinite-sample limit. This does not seem to be an example of a complex model essentially memorizing the training samples and then failing to generalize on the test dataset.

Claims and Evidence

The claims made in the paper are not clearly supported by the analysis. The paper does not give a theoretical framework to analyse the training of EBMs in general. The analysis only applies to Gaussian fitting, where the explicit training dynamics can be analyzed. The general case is much more involved, with non-convex optimization dynamics. Also, applying the proposed methodology to the visible Boltzmann machine case will just give results that are not very interesting. In that case, the mean-field approach fails in the low-temperature regime, where the interesting multi-modality of the model emerges.

Methods and Evaluation Criteria

The use of synthetic data is problematic. Fig. 1 says that the synthetic model closely mimics the real datasets. This does not seem to be true. The range of eigenvalues is completely different. Also, the $M \rightarrow \infty$ limit of the synthetic model shows non-smooth behavior. It is unclear if the results of the synthetic data model actually mimic those of actual datasets. This could be addressed by a numerical comparison of the training dynamics in both cases.

Theoretical Claims

(See questions)

Experimental Designs or Analyses

(see questions)

Supplementary Material

I did not review the supplementary material closely.

Relation to Broader Scientific Literature

The authors have not clarified how their work connects to the existing literature on analyzing gradient descent for convex problems under stochastic noise. Also, for models on discrete variables there exist other works showing that the learning problem can be solved sample-optimally using methods based on learning conditionals. (See questions)

Essential References Not Discussed

(see questions)

Other Strengths and Weaknesses

(see questions)

Other Comments or Suggestions

The paper can be improved if the initial claims are toned down. As mentioned before, the analysis here does not extend to general EBMs. If the authors think that practical training protocols for general EBMs can be studied using the Gaussian approximation, then they should support that claim strongly with numerical or theoretical evidence in the paper.

The discrete model analysis has to be removed or reworked into a more useful approach analyzing algorithms like interaction screening [VMLC16] or logistic regression [WSD19]. The review by Nguyen et al. [CZB17] cited in the paper is unfortunately outdated for the inverse Ising problem and does not consider the substantial advances in developing efficient algorithms for this problem over the last 7-8 years.

[VMLC16] Vuffray, Marc, et al. "Interaction screening: Efficient and sample-optimal learning of Ising models." Advances in Neural Information Processing Systems 29 (2016).

[WSD19] Wu, Shanshan, Sujay Sanghavi, and Alexandros G. Dimakis. "Sparse logistic regression learns all discrete pairwise graphical models." Advances in Neural Information Processing Systems 32 (2019).

[CZB17] Nguyen, H. Chau, Riccardo Zecchina, and Johannes Berg. "Inverse statistical problems: from the inverse Ising problem to data science." Advances in Physics 66.3 (2017): 197-261.

Author Response

Methods and evaluation criteria

We answer this comment about the choice of the spectrum in the answer to Rev. MN9R.

Relation To Broader Scientific Literature: Our work focuses on the full-batch case, since it is the typical setting for the BM: the covariance matrix only needs to be computed once. We will add a few comments on the effect of considering minibatches in our framework. The second point is answered below.

Other Comments Or Suggestions: We refer to the general reply (posted in Reviewer SJPf's section) about extensions to general EBMs. We are unsure why the Reviewer suggests removing or altering the purpose of the discrete-variable analysis. The aim of this section is to demonstrate that the main features and insights obtained from the GEBM analysis, particularly the interpretation of the early stopping point, also apply to the case of discrete variables within an otherwise identical setup. In both cases, the analysis is carried out on the standard gradient ascent algorithm for log-likelihood maximization, which is the most generic way to train EBMs of arbitrary complexity. We do not claim that the proposed "cleaning approach" is sample-optimal or superior to other existing methods in the specific case of the inverse Ising problem; for instance, we are not discussing any comparison with other training schemes (e.g. pseudo-likelihood maximization or the interaction screening method), nor a comparison with other mean-field-like techniques suited for the Ising BM. However, we will incorporate a discussion of the suggested references in the revised manuscript, and we would be glad to refer to a more recent review if the Reviewer could kindly provide a reference.

Questions for authors: About the alignment of $J(t)$ with $C^M$ at $t=0$: yes, the dynamics remains aligned for all times in that case. If the initial condition $\boldsymbol{J}(0)$ is aligned with $\boldsymbol{C}^{M}$, the two matrices commute, namely $[\boldsymbol{C}^{M}, \boldsymbol{J}(0)]=0$. From there, any gradient ascent step adds a term proportional to $-\boldsymbol{C}^{M}+\boldsymbol{J}^{-1}(t)$: since both terms commute with $\boldsymbol{C}^{M}$, the two matrices keep commuting with each other. Alternatively, looking at Eq. 5 (right) one immediately sees that if the initial condition is aligned with the eigenvector basis of $\boldsymbol{C}^M$, then $c_{\alpha\beta}=0$ and the two matrices remain aligned at all times. Additional numerical details about the alignment of the eigenvectors in the first stage of the training are discussed in Appendix B; it would be interesting to have a characterization of this transient time, but we do not have it yet; it should follow from the more general theory mentioned in the common answer, which we plan to only sketch in this paper.
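A quick numerical check of this commutation argument (a minimal sketch, not taken from the paper; matrix names follow the notation above):

```python
# Check that gradient ascent dJ/dt = gamma * (-C^M + J^{-1}) preserves
# commutation with C^M when the initial J(0) commutes with C^M.
import numpy as np

rng = np.random.default_rng(0)
N = 20
A = rng.standard_normal((N, 5 * N))
C_M = A @ A.T / (5 * N)                      # empirical covariance matrix
eigval, V = np.linalg.eigh(C_M)
J = V @ np.diag(1.0 / (eigval + 1.0)) @ V.T  # aligned J(0): commutes with C_M by construction

gamma = 1e-2
for _ in range(2000):
    J += gamma * (-C_M + np.linalg.inv(J))   # gradient ascent on the log-likelihood

print(np.abs(C_M @ J - J @ C_M).max())       # stays at numerical-noise level
```

Since $J(0)$ commutes with $C^M$, so does $J^{-1}(0)$; hence each update keeps $[C^M, J(t)] = 0$ by induction.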

About overfitting: We thank the Reviewer for this question; it is indeed a point that might be worth discussing in the core of the paper. Our interpretation is the following. In the context of the GEBM, the weak modes of the empirical covariance that are eventually learned by the model correspond to directions of variance poorly estimated by the data, and lead to overestimated coupling eigenvalues. As noticed by the Reviewer, we are in the under-parameterized regime in this case: even though the ratio #samples/#parameters ($=M/N^2$) is typically below 1 in our experiments, what matters is the ratio $\rho = M/N$, which becomes critical when equal to one, because the covariance matrix then ceases to be invertible. Nevertheless, when approaching the interpolation threshold from above ($\rho>1$) there is a departure of the test LL from the train LL, corresponding to overfitting in this proportional scaling limit. Looking at the more general formulation mentioned above, corresponding to a kernel regime of score matching, this appears as a special case where the regression factorizes into $N$ independent problems (once the coupling matrix is aligned with the covariance matrix), which leads one to consider the $M/N$ scaling instead of $M/N^2$. If we now consider the kernel setting with a general EBM, there is no such factorization in general and we recover the standard picture of overfitting with $\rho = M/P$, where $P$ is the number of parameters. The over-parameterized regime then corresponds to learning the weak modes of the Gram matrix of the score features, i.e. typically high-frequency modes able to define a localized energy function on each sample point.
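To illustrate this departure of the test LL from the train LL, a minimal self-contained sketch (our construction with an arbitrary synthetic population spectrum, not the paper's exact setup): gradient ascent on the train log-likelihood built from $C^M$, while monitoring the log-likelihood under the population covariance.

```python
# Train LL rises monotonically; test LL (computed with the population
# covariance) peaks and then degrades as weak empirical modes are learned.
import numpy as np

rng = np.random.default_rng(1)
N, M = 50, 100                                   # rho = M/N = 2
C_true = np.diag(np.linspace(0.1, 3.0, N))       # population covariance, true J = C_true^{-1}
X = rng.multivariate_normal(np.zeros(N), C_true, size=M)
C_M = X.T @ X / M                                # empirical covariance

def log_lik(J, C):
    # Per-sample Gaussian log-likelihood, up to an additive constant.
    return 0.5 * np.linalg.slogdet(J)[1] - 0.5 * np.trace(J @ C)

J = np.eye(N)
gamma, steps = 1e-2, 20000
train, test = [], []
for _ in range(steps):
    J += gamma * (-C_M + np.linalg.inv(J))       # ascent on the train LL
    train.append(log_lik(J, C_M))
    test.append(log_lik(J, C_true))

print("test LL peaks at step", int(np.argmax(test)), "of", steps)  # interior maximum
```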

On the derivation of Eq. 5: We will add the details of the derivation in the appendix. The derivation first considers the decomposition $J_{ij} = \sum_\alpha v_i^\alpha J_\alpha v_j^\alpha$ before projecting the gradient onto this new basis. We then identify the diagonal terms, leading to the dynamics of the eigenvalues, and the off-diagonal ones, leading to rotations in the space of the eigenvectors.
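For concreteness, a plausible reconstruction of the projection step (our derivation from the gradient $\dot{J} = \gamma(-C^M + J^{-1})$ discussed in this thread; conventions may differ from the paper's Eq. 5):

```latex
% Insert J(t) = \sum_\alpha J_\alpha(t)\, v^\alpha (v^\alpha)^\top into
% \dot{J} = \gamma(-C^M + J^{-1}), use orthonormality (so \langle v^\alpha, \dot{v}^\alpha \rangle = 0),
% and set c_{\alpha\beta} = (v^\alpha)^\top C^M v^\beta:
\begin{align}
  (v^\alpha)^\top \dot{J}\, v^\alpha &= \dot{J}_\alpha
    = \gamma\Big(-c_{\alpha\alpha} + \frac{1}{J_\alpha}\Big)
    && \text{(diagonal: eigenvalue dynamics)} \\
  (v^\alpha)^\top \dot{J}\, v^\beta &= (J_\beta - J_\alpha)
    \Big\langle v^\alpha, \frac{dv^\beta}{dt} \Big\rangle
    = -\gamma\, c_{\alpha\beta}
    && \alpha \neq \beta \quad \text{(off-diagonal: rotation of the basis)}
\end{align}
```

The second line is consistent with the Reviewer's remark that the LHS should be the inner product between $dv^\beta/dt$ and $v^\alpha$.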

Review (Rating: 2)

This paper provides a theoretical analysis of overfitting in energy-based models. In particular, the scope of the paper is restricted to the analysis of the Gaussian Energy-Based Model (GEBM), wherein the authors show that the maximum likelihood (ML) training dynamics of the GEBM decompose into different timescales. Specifically, the dominant mode features (corresponding to higher eigenvalues) are learnt early on, whereas non-dominant mode features (corresponding to lower eigenvalues) are learnt later during training.

The provided analysis shows that for a finite number of training samples, the test log-likelihood (LL) improves up to a certain time and then starts deteriorating. The authors provide an analytical expression for this optimal stopping time ($t_{opt}$) using methods from RMT. Furthermore, the authors also provide several other results, e.g. on the optimal scaling factor, using RMT for a regularized training of the GEBM. The authors conclude that the overfitting observed in the GEBM is mainly caused by the variation/noise in non-dominant modes in the finite-training-sample setting, while this problem can be solved exactly when one has access to the true model parameters. To this end, the authors propose to overcome the phenomenon of overfitting by using RMT to predict the true model parameters via asymptotic results. However, most of the proposed solutions rely on quantities that depend on the true model parameters. Lastly, the authors propose to fit a linear model to the eigenvalues of the data covariance, extrapolate with it, and use the extrapolated eigenvalues to determine the quantities of interest and avoid overfitting in finite-sample cases.

Questions for Authors

  1. $\hat{C}$ has been defined twice: after Eq. 1 and after Eq. 7. Are both definitions the same due to the LLN?

Claims and Evidence

The authors have provided empirical and experimental support for most of their claims. However, I would like to see the following results as well:

  1. Can you provide plots of $\{c_\alpha^m\}$ against different $m$ so that one can verify that a linear fit would be good for extrapolation?
  2. A similar plot for the Boltzmann Machine learning would be helpful.

Methods and Evaluation Criteria

Method: The authors haven't provided any new method as such. Rather, they give rigorous theoretical insights into the overfitting of EBMs (although this is limited to the GEBM).

Evaluation: I understand that the provided analysis is limited to the GEBM. However, the authors have shown spectral densities for several datasets such as MNIST and CIFAR10. In that case, can the authors comment on how their analysis applies to these datasets?

Theoretical Claims

I checked the theoretical correctness of the provided proofs and cannot find any obvious mistake. However, I might have missed a tricky mistake, as I am not well versed in RMT.

Experimental Designs or Analyses

  1. The authors show the spectral density of complex datasets like MNIST and CIFAR10; however, they don't show visual results of sampled datapoints obtained after training.
  2. The authors should sample datapoints from MNIST and the datasets mentioned in Fig. 7, then compare the quality of these sampled datapoints against a few baselines like IGEBM (Du & Mordatch, 2019).
  3. One should use negative log-likelihood (NLL) and FID for the above comparison.

Supplementary Material

I have reviewed the supplementary material except Section D.1.

Relation to Broader Scientific Literature

N/A

Essential References Not Discussed

N/A

Other Strengths and Weaknesses

Strengths

  1. The paper is fairly well written and presented.
  2. To my knowledge, this is one of the first works that discusses the phenomenon of overfitting in EBMs. However, the scope of the provided analysis is limited to the GEBM.
  3. The conclusion drawn from the paper seems fair: the training dynamics first learns the dominant modes (which should correspond to low-frequency components) and then learns non-dominant modes (which should correspond to high-frequency components). This is in line with empirical observations as well.

Weaknesses

  1. The paper seems weak empirically/experiment-wise. Although the paper provides a 'theoretical framework', the claims need to be verified on real-world datasets.
  2. The authors should show the result of sampling after training the model to verify the correctness of the method, e.g., a visual example of samples obtained using the overfitted model against samples obtained using the regularized model.
  3. Identification of a few key quantities, like $\lambda_{opt}$ and $t_{opt}$, is not feasible since the true model parameters are unknown in practice.
  4. The scope of the proposed analysis is limited to the GEBM. However, the observations are somehow consistent with non-GEBMs like the BM and datasets like CIFAR10.

Other Comments or Suggestions

  1. Line 40, second col: generative -> generative modelling
  2. Line 316, second col: circunvent -> circumvent
  3. Line 373, second col: Ref
  4. Line 396, second col: exit -> exist
Author Response

Claims and evidence: We thank the Reviewer for the suggestion; we will add to the appendix additional plots highlighting the behavior in $m$ of the down-sampled eigenvalues for both the GEBM and the BM.
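As an illustration of the extrapolation idea, a hypothetical sketch (we assume, for illustration only, a linear fit in $1/m$ extrapolated to $1/m \to 0$; the fitting variable actually used in the paper may differ):

```python
# Down-sample the dataset at several sizes m, fit each eigenvalue rank
# linearly in 1/m, and read off the intercept as a proxy for the
# population eigenvalue.
import numpy as np

rng = np.random.default_rng(2)
N, M_full = 30, 3000
C_true = np.diag(np.linspace(0.2, 2.0, N))
X = rng.multivariate_normal(np.zeros(N), C_true, size=M_full)

ms = np.array([300, 500, 1000, 2000])
spectra = []
for m in ms:
    idx = rng.choice(M_full, size=m, replace=False)
    Xm = X[idx]
    spectra.append(np.linalg.eigvalsh(Xm.T @ Xm / m))  # ascending eigenvalues
spectra = np.array(spectra)                            # shape (len(ms), N)

coeffs = np.polyfit(1.0 / ms, spectra, deg=1)          # one linear fit per rank
lambda_extrapolated = coeffs[1]                        # intercepts at 1/m = 0

print(np.round(lambda_extrapolated[:5], 3))            # compare with np.diag(C_true)[:5]
```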

Methods and evaluation criteria: The eigenvalue spectra of the covariance matrices of real datasets such as MNIST/CIFAR10 are depicted to guide us in a specific choice of a functional form of the continuous spectral density used to derive the asymptotic results through RMT. The analysis of the GEBM can however be done directly using the eigenvalue spectrum of a real dataset. The only critical thing to consider is the addition of a threshold on the eigenvalues, to avoid the existence of very small ones that would make the convergence extremely slow, and that have no effect on the timescales where the early stopping point occurs. We will add some results in this regard in the appendix (see also the last reply to Reviewer SJPf).

Experimental Designs Or Analyses: We would like to stress that experiments on real datasets are not very meaningful here, as the only thing that matters is their covariance matrix, i.e. the highest-order sufficient statistics of both the GEBM and the BM. Our setting is not meant to reproduce, at any level of quality, samples from real datasets (e.g. MNIST/CIFAR10 images). In other terms, the GEBM is not used here to learn a "good" EBM that generates good-looking samples: that would be basically impossible, because the GEBM only encodes the first two moments of the data with a single multivariate Gaussian, so that any sample generated according to the learnt GEBM would be far from a realistic image from the datasets cited above. The same holds for the BM: the binary BM does perform well as a generative model, in the sense that it generates good Ising model samples, but this model, without hidden nodes, is known to perform very badly on images, even at the level of the MNIST dataset, as discussed e.g. in [Decelle et al., SciPost Physics, 2024], because high-order interactions are important. The reason why we mention real datasets in the paper is to justify our choice of the synthetic spectrum, but this procedure only looks at the eigenvalue spectrum of the real datasets' covariance matrix. Still, as mentioned in the reply to Reviewer MN9R, even such reduced information about the spectral structure of real datasets is important to determine generalization properties, even in deep networks (Yang, Mao, Chaudhari, ICML 2022).

Strengths and weaknesses In the manuscript, we discuss both GEBMs and BMs. For the latter, the analytical treatment is restricted to the high-temperature phase and remains approximate; nonetheless, we demonstrate that it successfully captures the qualitative behavior observed in real experiments. In particular, we show that early stopping effects in BMs are analogous to those identified in GEBMs, and that they can be mitigated by applying the same regularization recipe introduced in the GEBM setting.

The models considered in this work serve as a controlled playground to analyze overfitting in a highly tractable setting. As discussed in our general response, we believe that the insights obtained here can be extended to more complex and realistic setups; however, such generalizations require dedicated studies, which we plan to pursue in future work. Our current goal is to establish a solid theoretical foundation upon which these future developments can be built.

About point 2: As explained before, it is not meaningful here to compare single generated samples as one would with datasets of images, but we will compare e.g. the generation error (i.e. the error between the covariances of generated samples and the population one) or other distance measures suggested by Reviewer MN9R in the regularized and the un-regularized case. As stressed in the paper, it is true that in the GEBM the most dominant contribution to the error in the coupling matrix comes from the weak PCA directions, which are the least dominant in measures of generation quality, so we do not expect an improvement as clear-cut as in Fig. 5(b).

Questions For Authors: This is correct: from Eq. 1 the population covariance matrix is defined as the inverse of the GEBM's coupling matrix; that would also correspond to the empirical covariance matrix computed with an infinite set of independent samples from Eq. 1.

Review (Rating: 4)

This paper proposes a theoretical framework for analyzing overfitting in energy-based models (EBMs). The framework is built upon two special cases of EBMs (namely, Gaussian EBM and Boltzmann Machine for inverse Ising model), which admit analytical (or partially/asymptotically analytical) solutions for the learning dynamics and stationary points (optimal solutions). This framework is later used to analyze the driving factors behind overfitting, suggest new overfitting mitigation strategies and reinterpret existing approaches.

It is claimed that the main reason behind the overfitting in the EBMs considered is the interplay between initialization and the different learning timescales associated with the eigendecomposition of the weight matrix. It is also claimed that the analysis provided is relevant to more complex models.

Update after rebuttal

During the rebuttal, most of my questions have been addressed. The remaining ones are:

  • Minor experimental analysis concern regarding the confidence intervals.
  • Ablation on spectra lacks details (e.g., the expression used to define the spectra for Figure R2).
  • I am still not convinced that one should mimic the spectral properties of the empirical datasets considered in the paper, because this data has a non-linear structure, and the covariance matrix is a poor statistic to describe the properties of such datasets.

Although complex NN-based models are not directly targeted, the analysis is thorough and rigorous, which is a decent start. I have increased my score to 4. For more information, please refer to my last rebuttal reply.

Questions for Authors

  1. How can the proposed framework be extended to complex EBMs?
  2. Have you tried other spectral densities instead of (10)?

Claims and Evidence

Overall, I find the claims related to GEBMs and BMs for the inverse Ising model to be well-supported by the theoretical and empirical evidence. The paper provides a comprehensive analysis of these models and corresponding overfitting mitigation strategies.

The only major claim I find problematic is the connection to non-toy-ish EBMs. Although the manuscript can provide some intuition regarding the general EBM case, the whole analysis revolves around the spectral properties of the coupling/correlation matrices. The same goes for the protocols to mitigate overfitting. It is unclear how to extend these excellent results to a general, non-linear case.

Methods and Evaluation Criteria

I am mostly satisfied with the proposed method and evaluation criteria. However, there are several concerns which I would like to raise.

  1. On lines 262-263 col.2 and in Figure 3, $\mathcal{E}_C = \Vert \hat{C} - C\Vert_F$ is referred to as "generation quality". However, to my knowledge, this quantity is not connected to any of the widely accepted divergences used to assess generation quality: neither $f$-divergences [3a] nor Wasserstein distances [3b]. Perhaps the authors should refer to $\mathcal{E}_C$ as just "covariance matrix reconstruction error". I also suggest adding the closed-form expressions from [3a,3b] to track how well the learned $J$ reproduces the original distribution; a minimal sketch of one such closed form is given after the references below. This would enable overfitting analysis from the distribution-matching perspective, which is more interesting in the context of generative learning. These additional metrics might also provide new information, thus complementing the analysis based on log-likelihood, energies and matrix errors.
  2. I feel uncertain about using MNIST, CIFAR10, HGD and Ising 2D datasets for correlation matrix spectral analysis. These datasets are of highly non-linear structure, which makes me question the relevance of the spectra reported in Figures 1,7 for the research provided. It is unclear why one should try to reproduce the spectral properties of covariance matrices of datasets which are far from being Gaussian.
  3. Still, taking Figures 1,7 into consideration, it is unclear why it was decided to introduce non-smoothness at $x=1$ in (10). From Figures 1,7 it is clear that the real spectrum is smooth. Perhaps the authors should provide additional experiments showing that the results (at least the key results) are robust to the choice of the spectral density.

[3a] Frank Nielsen, Kazuki Okamura. "On the $f$-divergences between densities of a multivariate location or scale family". arXiv:2204.10952

[3b] Salmona et al. "Gromov-Wasserstein Distances between Gaussian Distributions". Journal of Applied Probability, 2022, 59 (4). hal-03197398v2
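For concreteness, a minimal sketch of one such closed form, the KL divergence between two zero-mean Gaussians, written with the paper's population covariance $\hat C$ and coupling matrix $J$ (illustrative code, not from the paper):

```python
# KL( N(0, C_hat) || N(0, J^{-1}) ) = 0.5 * [ tr(J C_hat) - N - log det(J C_hat) ],
# a closed-form distribution-matching diagnostic to track along training.
import numpy as np

def kl_true_vs_model(C_hat, J):
    N = C_hat.shape[0]
    JC = J @ C_hat
    return 0.5 * (np.trace(JC) - N - np.linalg.slogdet(JC)[1])

C_hat = np.diag(np.linspace(0.5, 2.0, 4))
print(kl_true_vs_model(C_hat, np.linalg.inv(C_hat)))  # 0.0: exact match
print(kl_true_vs_model(C_hat, np.eye(4)))             # > 0: mismatch
```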

Theoretical Claims

I am satisfied with the theoretical part of the work. I find the corresponding theoretical claims convincing and backed up not only by proofs but also by experimental results. Below I provide my (mostly minor) concerns regarding this part.

I cannot follow the derivation on lines 93-109 col.2. In particular, I am unable to reproduce the gradient provided in (3). My own calculations yield $[-C^M_{ij} + (J^{-1})_{ij}]$. Additionally, if we assume $C^M=0$ (which, of course, does not correspond to any real case, but can be used for the sake of analysis), we should get $\partial \mathcal{L} / \partial J_{ij} = \partial \log\det J / \partial J_{ij} = (J^{-1})_{ij}$, which does not hold when using (3). Finally, in (4) the $\Lambda$ matrix seems to disappear. Please clarify.

Experimental Designs or Analyses

I have minor concerns regarding the experimental setups and analyses.

  1. Perhaps using several seeds and reporting mean values and error bounds in Figures 1,7 would make the subsequent analysis more convincing (it is hard to verify the under-/overestimation observations judging by several samples only).
  2. From the text, I understand that Figure 2 is provided to highlight the difference in the $J_\alpha(t)$ dynamics when (a) the ground-truth or (b) the empirical covariance matrix is used. However, it is then unclear why a different learning rate $\gamma$ was used.

Supplementary Material

I have superficially read the Appendix. The authors do not provide other supplementary material.

Relation to Broader Scientific Literature

I cannot pinpoint any specific prior work that targets the problem of overfitting in EBMs.

Essential References Not Discussed

I cannot recall any specific work which is essential for understanding the context of the submission and is not cited.

Other Strengths and Weaknesses

The paper is comprehensive, well-written and was enjoyable to read.

Other Comments or Suggestions

I understand that the topic is deeply rooted in statistical mechanics. I also acknowledge the freedom of choosing the notation which is more convenient for the authors. Still, I find the latter to be a little confusing.

  1. From my experience, in papers on machine learning or statistics, $\langle \dots \rangle$ is rarely used to denote expectation; rather, it can be used for empirical averaging, whereas the expectation is denoted by $\mathbb{E}$.
  2. Additionally, denoting the ground-truth covariance matrix by $\hat{C}$ and the empirical estimate by $C^M$ is also very unusual. It is usually expected that a hat is used for estimates, and no accent marks are used for ground-truth values.

Typos:

  1. Lines 57, 60 col.1: inconsistent usage of spaces before and after "—".
  2. Similar for lines 636-638.
  3. Lines 74-75 col.2: "-" is used instead of "—".
  4. Line 101 col.2: "the gradient of (2)" - perhaps "the gradient in (2)" (as it is the gradient of $\mathcal{L}$) was intended.
Author Response

Methods and evaluation

  1. It is true that the choice of the generation quality measure based on the Frobenius norm does not in general reflect a distance between two probability distributions, but it is inspired by real experiments where such a metric is widely used, for instance in inverse Ising/Potts approaches, in order to compare e.g. correlation matrices between the original dataset and generated configurations. In the GEBM it is also true that the metrics suggested by the Reviewer are better measures of the discrepancy between distributions. In particular, we have computed the Wasserstein distance between the true GEBM and the inferred one along the training dynamics (see the sketch after this list): this quantity also turns out to display a non-monotonic behavior w.r.t. the training time, and moreover the optimal time computed using this metric seems to be the closest to the optimal time obtained by maximizing the test LL. We will discuss these other measures in the final version of the manuscript.
  2. Choice of the spectrum. For the GEBM analysis, we do not employ real datasets directly. Instead, we construct synthetic covariance matrices designed to mimic the spectral properties typically observed in empirical data, with the aim of explaining early stopping effects reported in real experiments. Specifically, we chose a synthetic spectrum with two distinct branches to visually and analytically distinguish between dominant (strong) and subdominant (weak) modes in the population covariance matrix. We would like to emphasize that this two-regime spectral structure is supported by previous works identifying similar behavior in real datasets (see, e.g., [Yang, Mao, Chaudhari, ICML 2022]). In the final version of the manuscript we will display the real and synthetic eigenvalue spectra on a log-log scale, where this common trait can be more easily visualized. That said, the theoretical framework we develop does not rely on the specific details of the eigenvalue spectrum, provided the population covariance matrix is non-degenerate. We tested a variety of synthetic spectra and observed no qualitative differences in the results. For example, modifying the parameters in Eq. (10) leads to the same learning dynamics as discussed in the manuscript. The choice of the synthetic spectrum is guided by numerical constraints: the RMT equations require discretization of the spectral density, and very small eigenvalues significantly increase the computational cost due to the need for extremely fine resolution in the integration. We understand the concern and will include a dedicated appendix in the revised version of the manuscript to explicitly illustrate the robustness of our results with respect to changes in the eigenvalue spectrum.
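A minimal sketch of the closed-form 2-Wasserstein distance between two zero-mean Gaussians used for such a comparison (our illustrative implementation of the standard formula, cf. [3b]; for the GEBM one would pass $\Sigma_1 = \hat C$ and $\Sigma_2 = J(t)^{-1}$):

```python
# W_2^2( N(0,S1), N(0,S2) ) = tr( S1 + S2 - 2 (S1^{1/2} S2 S1^{1/2})^{1/2} )
import numpy as np
from scipy.linalg import sqrtm

def w2_squared_gaussians(Sigma1, Sigma2):
    root1 = sqrtm(Sigma1)
    cross = sqrtm(root1 @ Sigma2 @ root1)
    return np.trace(Sigma1 + Sigma2 - 2.0 * np.real(cross))

Sigma = np.diag([0.5, 1.0, 2.0])
print(w2_squared_gaussians(Sigma, Sigma))      # ~0: identical distributions
print(w2_squared_gaussians(Sigma, np.eye(3)))  # > 0
```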

Theoretical claims: The presence/absence of the term $\Lambda_{ij}$ depends on whether one assumes the perturbation of the log-likelihood w.r.t. one interaction $J_{ij}$ to be symmetric or not. Of course this holds only when $i\neq j$, which is why the term $\Lambda_{ij}$ appears. It is actually true that in the following steps of the derivation we did not use this additional term when projecting the gradient, nor in the numerical gradient ascent procedure: this is basically equivalent to assuming that perturbations are non-symmetric (a more detailed discussion of the two different ways of taking the likelihood's derivatives is given e.g. in [Magnus, J. R., Neudecker, H. (1999)]). A practical solution to this problem is to absorb it into the learning rate, meaning that the diagonal terms $J_{ii}$ evolve in time with a learning rate doubled w.r.t. the off-diagonal terms to compensate for that factor.
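The standard result alluded to here, in our summary (cf. Magnus & Neudecker, 1999): for a symmetric matrix $J$, constrained (symmetric) perturbations double the off-diagonal sensitivities,

```latex
\begin{align}
  \text{unconstrained perturbations:} \quad
    \frac{\partial \log\det J}{\partial J_{ij}} &= (J^{-1})_{ij}, \\
  \text{symmetric perturbations } (J_{ij} \equiv J_{ji}): \quad
    \frac{\partial \log\det J}{\partial J_{ij}} &= (2-\delta_{ij})\,(J^{-1})_{ij},
\end{align}
```

which is consistent with compensating for the factor by doubling the learning rate on the diagonal terms.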

Experimental Designs Or Analyses

  1. About the mean values in Figures 1,7: both figures show the mean eigenvalue spectra of the empirical covariance matrix of a dataset at a given number of samples $M$; here the average is taken over a certain number $n$ of different downsamplings of $M$ samples from the original full dataset with $M^*$ samples. We have chosen a large enough $n$ (specifically $n=1000$) such that the standard error of any mean eigenvalue is not appreciable at the plot's scale. Having said this, we will add error bars or shading to the figure to highlight the standard deviation between realizations, and reduce $n$ in order to have disjoint subsets of samples to compute the covariance matrix.
  2. This is correct. In the left panel, we wish to compare the theoretical behavior (continuous-time limit) with the one obtained by maximizing the likelihood using GD. The continuous limit is correct for very small learning rates, which is why $\gamma$ is very small. In the right panel, we only consider the resolution of the evolution equations, hence the learning rate does not matter here, as it only renormalizes the time.

Relevance for more complex non-linear models: See the common answer written in the rebuttal of Rev. SJPf.

Reviewer Comment

I would like to thank the authors for their thorough response.

I would appreciate it if you could upload the plots and other materials referenced in your reply using an anonymous service (e.g., 4open.science), so that I and the other reviewers can get acquainted with the additional results. After that, I will be able to fully assess the response.

Updated response after the additional materials have been provided

Thank you for providing supplementary results and clarifications. The work is now more appealing to me. Although complex NN-based models are not targeted, the analysis is thorough and rigorous, which is a decent start.

Below I also list my (minor) remaining concerns, which I hope to be addressed in the next revision:

  • It seems that the confidence intervals are plotted for the mean values, which explains their size. However, perhaps the standard deviation of the eigenvalues themselves should be plotted (not that of the mean value) to better illustrate the variance of individual observations.
  • Please, provide more details for the ablation on spectra (e.g., the expression used to define the spectra for Figure R2).
  • I am still not convinced that one should mimic the spectral properties of the empirical datasets considered in the paper, because this data has a non-linear structure, and the covariance matrix is a poor statistic to describe the properties of such datasets. Perhaps more effort should be put into finding real data admitting a linear or close-to-linear structure?

I will increase my score.

Author Comment

We have created an anonymous repository that includes a .pdf file with the requested plots:

https://anonymous.4open.science/r/ICML_reply-6020/ReplyMN9R.pdf

Additional reply

At the following link we have added another .pdf file with additional details about the first answer in "1. Common reply about model relevance and applicability to more complex EBMs":

https://anonymous.4open.science/r/ICML_reply-CB48/

Review (Rating: 4)

The authors present an analysis of training dynamics and overfitting in different settings (infinite data, limited data, continuous domain, binary domain) for a specific class of EBMs. The basic idea is to project the training dynamics onto the principal components of the coupling matrix, which allows (in the class of models the authors study) for strong analytic claims. The authors also study common methods for mitigating overfitting (like regularization) in their framework.

UPDATE AFTER REBUTTAL:

The paper is limited in that it studies the GEBM, which restricts its immediate applicability, but the work is nonetheless interesting, novel and thoroughly conducted. The authors also addressed my concerns. I therefore change my rating to 'accept'.

Questions for Authors

Could you argue that this type of analysis could be extended to more realistic settings? For now my rating is "Weak Accept" but I am happy to increase the rating if you can argue for this.

Claims and Evidence

The claims are very well supported: The authors base them on theoretical analysis and show excellent agreement with experiment. The projection of the training dynamics into eigenspace is well done and interesting.

Methods and Evaluation Criteria

The methods and evaluation criteria show quite exactly what the authors want to show. However, for the paper to be relevant to the wider field, the evaluations are not sufficient: the experiments are small-scale and the data very far from what ML methods are typically applied to. It is not clear how this analysis would work in a more realistic scenario.

Theoretical Claims

I did not do the math, but it seems to be relatively straightforward and the results are sensible.

Experimental Designs or Analyses

I am quite confident that the experiments are valid. They coincide with the theoretical predictions and the setting is relatively small-scale and controlled.

Supplementary Material

No.

Relation to Broader Scientific Literature

The work relates to random matrix theory and inverse Ising problems. The paper is well-embedded in this field.

Essential References Not Discussed

Not that I know of, but some of the topics touched have a vast literature spanning several decades, so I am not entirely sure.

Other Strengths and Weaknesses

Strengths:

  1. Setup & Execution: The paper is beautifully written, the analysis is clean and done well, and the visualizations and experiments are convincing.
  2. Novel results: The results are non-trivial and the agreement with experiment is very good.
  3. Interesting to some groups: The paper is very interesting to anyone working in the field of statistical methods applied to ML (and related fields like inverse Ising/Potts etc.). However (read below under "weaknesses"), these groups do not correspond to the core audience of ICML.

Weaknesses:

  1. Model relevance: The types of models analyzed are toy models at best. I am not aware of anyone using these kinds of models in modern ML (please correct me if I'm wrong). The most relevant case I know of where this analysis might apply is inverse methods applied to protein sequences, but even that seems to have been largely abandoned. The applied papers that the authors cite are either old or quite niche.
  2. Data relevance: The authors use either synthetic data or MNIST.

I think the paper is worth publishing if the authors argue that their analysis could be extended to a more realistic setting. I am quite confident that if pursued consistently this type of analysis could be valuable for the wider community.

Other Comments or Suggestions

Very minor: It's "Frobenius", without the umlaut.

Author Response

1. Common reply about model relevance and applicability to more complex EBMs

We thank all the reviewers for their comments.

All Reviewers have raised important concerns regarding the generality of the results presented in the paper, particularly their applicability to more complex or realistic generative models. While we fully acknowledge that our analysis is far from the architectures typically employed in contemporary machine learning, we would like to emphasize that, to the best of our knowledge, this is the first work to tackle the problem of overfitting in energy-based generative models. To start with something concrete and tractable, we consider here Gaussian multivariate models, which might indeed look over-simplistic and irrelevant to modern ML. The logic here is similar to that of considering linear regression for understanding deep learning [Belkin2018, Hastie2022]. We actually found a concrete parallel and plan to add a new section in the main text and supplementary material discussing it in detail. The argument goes as follows: if we consider score matching [Hyvarinen2005] as a proxy for studying theoretically the learning of EBMs, we arrive at a description of the learning dynamics in terms of a neural-tangent-kernel dynamics [Jacot2018] of the score function ($\psi(x,\theta) = -\nabla_x E(x,\theta)$, i.e. the gradient of the energy function of the EBM) of the form

$$\frac{d\psi(x \mid \theta_t)}{dt} = -\hat{\mathbb{E}}_{x'}\Bigl[K_t(x,x')\,\psi(x' \mid \theta_t)\Bigr] + \hat\phi_t(x)$$

where the kernel $K$ and the source $\hat\phi$ are built on a tangent space corresponding to the derivative of the score function w.r.t. the parameters. Considering then the dynamics corresponding to a lazy training regime [Chizat2019], where the NTK (and also $\hat\phi$) can be assumed deterministic, we end up with a dynamics similar to that of the Gaussian EBM: the empirical covariance matrix of the input features is now replaced by a covariance of tangent score functions, which is likewise expected to be in the random-matrix regime, and overfitting will likewise occur when the weak modes of this covariance start to be learned. So in the end we have a potentially much broader theory, which encompasses exponential models and, more generally, EBMs in the kernel regime.
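Our reading of the implied construction (the standard NTK definition applied to the score; the authors only sketch this, so the precise definition is an assumption):

```latex
% Tangent-feature (NTK) kernel of the score, assumed fixed in the lazy regime:
K_t(x, x') \;=\; \nabla_\theta\, \psi(x \mid \theta_t)\;
               \nabla_\theta\, \psi(x' \mid \theta_t)^{\top},
\qquad K_t \approx K_0 .
```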

[Hastie2022] Hastie, Montanari, Rosset, Tibshirani. Annals of Statistics (2022).

[Hyvarinen2005] Hyvärinen. JMLR (2005).

[Jacot2018] Jacot, Gabriel, Hongler, NeurIPS (2018)

[Chizat2019] Chizat, Oyallon, Bach. NeurIPS (2019).

2. Specific answer to the reviewer:

Methods and evaluation criteria: The experiments we have conducted on the GEBM, although performed at relatively small system sizes (most results in the main text correspond to $N = 100$), exhibit excellent agreement with the asymptotic predictions derived from Random Matrix Theory (RMT). As shown in Fig. 4 (notably panels a and c), the empirical results obtained at $N = 100$ already display perfect overlap with the theoretical curves expected in the limit $N, M \to \infty$.

Regarding the Boltzmann Machine (BM), it is true that the system size employed ($N = 64$ spins) is smaller than what is typically used in standard machine learning applications. Nevertheless, the qualitative behavior we observe is robust with respect to the system size. These BM experiments are intended to demonstrate that the key phenomenology observed in the GEBM carries over to the BM case. In particular, we highlight the emergence of negative eigenvalues in the coupling matrix and the role of finite-sampling noise, which primarily affects the learning dynamics associated with the smallest principal components of the data. To strengthen the robustness of these findings, we plan to perform additional numerical experiments at larger system sizes for the final version of the manuscript.

Using real datasets: Our results do not depend on the specific dataset used, as they are based on a given arbitrary covariance matrix, which can be derived from any dataset. The use of synthetic data is motivated by the need to control the asymptotic limit required for the Random Matrix Theory (RMT) analysis, and to ensure that the eigenvalues are not too small in this limit, a condition that would otherwise significantly increase the computational cost, since small eigenvalues necessitate extremely fine discretizations for solving the numerical RMT equations. Nevertheless, we can always consider a cutoff, which should only have effects far away from the early stopping point. In the final version of the paper, we will include an additional section in the Appendix where we reproduce the results using different eigenvalue spectra taken from real datasets and different parameters for the synthetic model, explicitly showing that the qualitative behavior remains unchanged.

Final Decision

This paper presents a theoretical framework for analyzing overfitting in energy-based models (EBMs). The framework is grounded in two specific cases—Gaussian EBMs and Boltzmann Machines applied to the inverse Ising model—which allow for analytical solutions to the learning dynamics and stationary points. This foundation is then used to investigate the underlying causes of overfitting, propose new mitigation strategies, and offer reinterpretations of existing approaches.

The paper was reviewed by four reviewers. After the rebuttal, it continues to receive mixed ratings. The primary point of disagreement centers on whether the paper retains value despite its limited generalizability and applicability.

Reviewer BiKA (Weak Reject) expressed concern that the analysis is limited to Gaussian EBMs, where explicit training dynamics can be analyzed. The reviewer notes that extending this to more general cases—characterized by non-convex optimization dynamics—is significantly more complex, which may limit the broader applicability of the findings. Reviewer JaLg (Weak Reject) acknowledged that the paper is among the first to explore overfitting in EBMs, but noted that the analysis is confined to GEBMs, further limiting its scope.

In contrast, the other two reviewers were more favorable. Reviewer MN9R (Accept) felt that most concerns were addressed during the rebuttal and described the paper as comprehensive, well-written, and enjoyable to read. Reviewer SJPf (Accept) agreed that the focus on GEBMs limits the immediate applicability of the results, but found the work to be novel, interesting, and thoroughly conducted. The reviewer also confirmed that the concerns were adequately addressed in the rebuttal.

The AC facilitated a discussion among the reviewers. Overall, while there is a shared concern about the limited applicability of the work—particularly that the EBMs studied are somewhat “toy” models and not well-suited for modeling complex, real-world data—there is also recognition of the paper’s strengths. From both theoretical and empirical perspectives, the submission is solid.

Reviewers MN9R and SJPf actively championed the paper during the discussion. Although the scope is limited to specific instances of EBMs, the AC believes the theoretical insights offered by the authors represent a meaningful contribution. Such theoretical advancements are as valuable as empirical breakthroughs. The AC also finds that the rebuttal adequately addressed the technical concerns raised, and that the paper is well-written, clearly organized, and makes a worthwhile contribution to the field, thus recommending an acceptance.

The AC encourages the authors to revise the paper by incorporating the reviewers' suggestions to further strengthen the work.