Cascade of phase transitions in the training of energy-based models
We show theoretically and numerically that the training of energy-based models undergoes several phase transitions.
Abstract
Reviews and Discussion
This paper analytically demonstrates a cascade of second-order phase transitions in the training of a simple RBM model and numerically validates this theory in the training on real datasets.
Strengths
- This paper theoretically analyzes the learning dynamics of weight parameters in the Binary (or Bernoulli)-Gaussian RBM (BG-RBM) through the lens of statistical physics.
- The authors demonstrate that the RBM undergoes a series of second-order phase transitions during learning, both theoretically in the BG-RBM and numerically in the Binary-Bernoulli RBM (BB-RBM).
Weaknesses
- There is a gap between the theoretical and numerical analysis in this paper. Specifically, the BG-RBM with a single hidden node is utilized in the theoretical analysis in Section 4, while the BB-RBM is used in the numerical analysis in Section 5.
- Although the main results of this paper may merit acceptance into NeurIPS, I believe the authors should revise the manuscript to improve readability and clarity. Some suggestions are provided in the Questions section below. If the authors address these issues and answer my questions, I can strongly recommend accepting this paper for NeurIPS; however, I cannot recommend it for acceptance in its current form.
Questions
Major
- While the authors consider a single Gaussian hidden node, they do not provide any statistical details of the Gaussian distribution, such as the mean and variance. I assume the authors consider a Gaussian distribution with zero mean and variance \sigma_h^2, but I am uncertain about the meaning of \sigma_h (I guess \sigma_h^2 = 1/N), why the authors choose this setup, and how \sigma_h relates to the statistics of the hidden variable.
- The meaning of the equation between lines 161 and 162 is unclear, especially when the authors mention that the equation “is analogous to the one studied in the previous section.”
- While the authors analytically investigate the BG-RBM with a single hidden node in Section 4, the numerical analysis in Section 5 is conducted on the BB-RBM with multiple hidden nodes. The authors should discuss the expected differences due to this gap and adequately clarify the limitations of their analysis.
- I am not sure why “the projections along all subsequent directions are Gaussian” as stated in line 231 on page 6.
- It is unclear how the authors obtained Figure 4 E and what they aim to demonstrate. Does the h in Fig. 4 E indicate the hidden variable?
- The direct relationship between Section 4 and Section 5 is hard to capture. As I understand it, the main part of this paper introduces an underlying preferred direction \xi, but I am not sure which vectors correspond to \xi in Sec. 5. Please describe the correspondence between Section 4 and Section 5 and clarify the differences from the results in Ref. [8].
Minor
- While the authors mention the Mattis model to describe the data, the explanation of \xi is insufficient. It is unclear whether \xi is frozen, what the distribution of \xi is, etc.
- In Fig. 1, the x-labels of the left and right panels differ. The term "susceptibility" is not defined, the lines are indistinguishable from one another, and the meanings of the left and right vertical axes in the inset of the left panel are not indicated. The black solid lines in the insets of the right panel are not explained.
- In lines 127 and 439, the condition for the divergence of the correlations appears to contain a typo.
- Please introduce the learning rate in Eq. (3) because the authors provide results with respect to the learning rate in Fig. 1.
- Please provide references and detailed descriptions for the BG-RBM primarily considered in Section 4.
- The authors should correct numerous typos in the main text and the appendix. For instance, the authors often use expressions that are not defined in this paper. In line 107 on page 3 and line 416 on page 12, there are typos in the definition of the magnetization m. In line 414 on page 12, there is a typo in another definition. In Eq. (3), the authors do not clarify what each symbol indicates.
- The authors should consistently use expressions throughout the paper. In lines 136 and 194, the word "appendix" is used, but in other cases, the term "SI" is used. The title of Appendix B is "Binary-Binary RBM," but the model is often referred to as "Bernoulli-Bernoulli RBM" in the paper.
Limitations
The authors should discuss the limitations of their work more thoroughly.
Detailed answers to the questions:
- The referee is right; we should have been more careful when defining this quantity. The distribution of the hidden node is indeed of zero mean and variance of order 1/N. The value is indeed \sigma_h^2 = 1/N. This is the correct scaling to have a large-dimensional (large N) limit.
Let us clarify this point further. In a general formulation, we can consider that the probability distribution of the hidden node follows a Gaussian distribution with variance \sigma_h^2. In this case, the effective model on the visible nodes is given by the Hamiltonian H(s) = -(\sigma_h^2/2) (\sum_i w_i s_i)^2, where we assume that the weights are of order 1. This model can be analyzed as a function of \sigma_h. In practice, if \sigma_h^2 scales as N^\alpha with \alpha < -1, the system is dominated by its entropy and remains in the paramagnetic phase, where the average of each variable vanishes since the interaction is too weak. If \alpha > -1 the interaction is very strong and the system is frozen in the configurations that maximize |\sum_i w_i s_i|, i.e. s_i = \pm sign(w_i). Therefore the correct scaling, in order to observe a transition (as the magnitude of the weights changes) between a disordered phase without a particular orientation and an ordered phase where the variables are parallel to \xi, is weights of order 1 and a hidden variance of order 1/N. For practical experiments, it is not necessary to impose this scaling, since the learning process brings the system into the interesting regime where information is stored in the model parameters, as illustrated by our experiments. A minimal sketch of the underlying computation is given below.
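For concreteness, a minimal sketch of the Gaussian integration and of the entropy-energy balance behind this argument (notation as above; the integral is the standard Gaussian one):

```latex
% BG-RBM with one Gaussian hidden node h of zero mean and variance \sigma_h^2:
%   p(s,h) \propto \exp\!\left( -\frac{h^2}{2\sigma_h^2} + h \sum_i w_i s_i \right).
% Integrating out h gives the effective Hamiltonian on the visible nodes:
\int dh \, \exp\!\left( -\frac{h^2}{2\sigma_h^2} + h \sum_i w_i s_i \right)
  \;\propto\; \exp\!\left( \frac{\sigma_h^2}{2} \Big(\sum_i w_i s_i\Big)^{\!2} \right)
  \;\;\Longrightarrow\;\;
  \mathcal{H}(s) \;=\; -\frac{\sigma_h^2}{2} \Big(\sum_{i=1}^{N} w_i s_i\Big)^{\!2}.
% With w_i = O(1), the energy of aligned configurations is O(\sigma_h^2 N^2) while
% the entropy is O(N); the two compete only if \sigma_h^2 = O(1/N), i.e. \alpha = -1.
```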
- The sentence at lines 161-162 refers to the type of phase transition occurring in the system. In the case with multiple features, the first phase transition is very similar to the one with one feature, in the sense that it is of the same "nature". In this regime, corresponding to the first phase transition, everything happens as if there were just one main direction, the leading principal direction of the data. We will correct this sentence to make it clearer in the final version.
- The theory is written for the binary-Gaussian RBM in the main text and for the binary-binary RBM in the Appendix. In the latter case, we show that in order to observe an interesting behavior we need an extensive (large) number of hidden nodes. On this aspect the gap is not large. The main difference between the theoretical and experimental parts lies in the dataset, which is much simpler in the former case. Based on article [8], we expect that even if the dataset presents various modes (few compared to the number of visible/hidden nodes), the system can still be described in a mean-field setting and, consequently, the phenomenology observed for one or two modes can be extended to many modes (but only a small number). This is confirmed by our experiments. We do not claim that this description is valid throughout the training process, and we can see that at large training times the mean-field description is not as accurate anymore. We will comment on this in the final version.
- This sentence might indeed be unclear and will thus be rewritten in the final version. By it we intended to clarify that, as the system learns the first direction that splits the data, the dataset projected onto the remaining directions is mostly Gaussian distributed. The reason is that along those other directions the projection is essentially a sum of random i.i.d. (non-sparse) weights multiplied by the spins s_i. Furthermore, we can provide in the Appendix, if the reviewer thinks it would be useful, a figure showing that the projections on these directions correspond to random noise as long as those directions are not learned; a toy illustration of this central-limit effect is sketched below.
- We thank the referee for pointing out that we should clarify panel E of Fig. 4 further. The aim of this figure is to show the "hysteresis" phenomenon, which is a clear signature, devised and observed in statistical physics, that a high-dimensional probability measure has undergone a phase transition at which it splits into two different lumps. The procedure consists in tilting the probability measure by introducing in the energy function a contribution that favors one lump over the other. If the measure is indeed concentrated on two distinct lumps, a small change in the tilt induces a sudden transition ("first-order transition" in physics). In our case, the lumps are associated with the learned patterns, and this extra contribution consists in the scalar product between the visible variables and the learned patterns times h, the field controlling the strength of the tilting. In the presence of a first-order phase transition, one generically finds the phenomenon of hysteresis, i.e. the transition from one lump to the other can be retarded because of metastability, thus leading to the characteristic hysteresis loops that we indeed show in Fig. 4 E (see e.g. den Hollander, Frank. "Metastability under stochastic dynamics." Stochastic Processes and their Applications 114.1 (2004): 1-26, and Bovier, Anton. "Metastability." Methods of Contemporary Mathematical Statistical Physics, Lecture Notes in Mathematics 1970, Springer (2009): 177-221, for rigorous treatments, and Chaikin, Paul M., and Tom C. Lubensky. Principles of Condensed Matter Physics. Cambridge University Press, 1995, for a physics treatment). This figure therefore gives direct evidence of the decomposition of the measure into distinct lumps corresponding to the learned patterns, and that this decomposition takes place at the second-order phase transition happening during learning. A minimal sketch of the tilting experiment follows. We will explain these points better in the revised version.
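A minimal sketch of the tilting experiment, on a toy Mattis-like energy rather than the full RBM measure used in the paper (all parameter values and names are ours, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)
N, J = 500, 1.5                        # system size, coupling (J > 1: ordered phase)
xi = rng.choice([-1, 1], size=N)       # learned pattern (Mattis model)
s = -xi.copy()                         # start in the "-" lump

def gibbs_sweep(s, h, n_sweeps=20):
    """Heat-bath updates for E(s) = -J/(2N) * M^2 - h*M, with M = sum_i xi_i s_i."""
    M = float(xi @ s)
    for _ in range(n_sweeps):
        for i in rng.permutation(N):
            M_rest = M - xi[i] * s[i]
            b = xi[i] * (J * M_rest / N + h)          # local field on spin i
            s_new = 1 if rng.random() < 1.0 / (1.0 + np.exp(-2 * b)) else -1
            M += xi[i] * (s_new - s[i])
            s[i] = s_new
    return s, M / N

# Sweep the tilting field h up and then back down; m(h) traces a hysteresis loop.
fields = np.concatenate([np.linspace(-0.3, 0.3, 31), np.linspace(0.3, -0.3, 31)])
for h in fields:
    s, m = gibbs_sweep(s, h)
    print(f"h = {h:+.2f}   m = {m:+.3f}")
# Because of metastability, the jump from m<0 to m>0 on the way up happens at a
# larger h than the reverse jump on the way down: a hysteresis loop as in Fig. 4E.
```

The width of the loop shrinks as the number of Gibbs sweeps per field value grows, which is the standard signature of metastability.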
- To start with ref. [8]: this work presents a mean-field theory describing the static behavior of RBMs, assuming a specific analytical form for the weight matrix. The phase transition for such a setting is derived there, but only the static behavior is analyzed theoretically. Our present work instead characterizes the dynamical behavior of the RBM: it shows at which rate the weight matrix is shaped by the dynamics, it describes the different mechanisms in detail (evolution of the weights toward a preferred direction, the corresponding time evolution, etc.), and we further show that this is what happens when performing a training in this regime. We also show how the weights evolve in the case where the clusters of the dataset are correlated. Concerning the relationship between Sec. 4 and Sec. 5, the reviewer's comments helped us to understand that we have to clarify the relationship between the SVD decomposition used in Sec. 5 and the projections used in Sec. 4. There is a direct connection between the PCA and the preferred directions used in the theoretical analysis, but it was not made clear enough in the current version. It can be inferred from the appendix, but the point was not made explicit anywhere; we will discuss this and make it explicit in the revised version. In fact, in the equation after line 491 of the appendix, we decompose the correlation matrix of our model on the pattern vectors: this is the SVD decomposition for this very simple model. It is therefore clear that, since the model exhibits a growth first toward \eta, the first eigenvector of the correlation matrix, and then along the second eigenvector, we are indeed proving that the dynamics is driven by the PCA of the dataset. We think that the part on the divergence of the susceptibility is the clearest link between the two sections. Section 5 shows that the analytical behavior is confirmed in numerical experiments with real datasets: first, the relation between the PCA and the SVD of the RBM weight matrix; second, the phase transition, i.e. how the susceptibility diverges (as the system size increases and with a known exponent), as well as the divergence of the mixing time.
We will make the link between the two parts clearer in the final version; a sketch of the corresponding numerical check follows.
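Concretely, the check connecting the two sections can be sketched as follows (a sketch under our own naming conventions, not the code used in the paper): compare the top left singular vectors of the RBM weight matrix with the top principal components of the data at each training checkpoint.

```python
import numpy as np

def svd_pca_overlaps(W, X, k=5):
    """Overlap between the top-k left singular vectors of the RBM weight
    matrix W (shape: n_visible x n_hidden) and the top-k principal
    components of the dataset X (shape: n_samples x n_visible)."""
    U, svals, _ = np.linalg.svd(W, full_matrices=False)
    Xc = X - X.mean(axis=0)
    # PCA components = eigenvectors of the data correlation matrix,
    # i.e. right singular vectors of the centered data matrix.
    _, _, Vt = np.linalg.svd(Xc / np.sqrt(len(X)), full_matrices=False)
    # |cos| of the angle between matched directions (the sign is irrelevant).
    return np.array([abs(U[:, a] @ Vt[a]) for a in range(k)]), svals[:k]

# Called at each training checkpoint, overlaps near 1 indicate that the RBM
# modes align with the principal directions; they switch on one by one as
# the corresponding singular value crosses its critical value.
```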
We thank the reviewer for reading our paper and for their comments; they are instrumental in improving the quality of our work. We think that the gap between the theoretical part and the experiments highlighted by the reviewer is partly due to a gap in the presentation (which we will revise accordingly) and not in the analysis. Indeed, for simplicity, we have chosen to present the binary-Gaussian RBM in the main text, but we do analyze the binary-binary RBM in the Appendix. There is a clear relationship between the projections and projected weights in the theoretical section and the singular value decomposition described in the empirical evaluation. We think that this was not clarified in the current version, but it can easily be explained; we will discuss this and make it explicit in the revised version. Apart from that, the discrepancy between the theoretical and the experimental parts is due to simplifications in the theoretical part (e.g. all hidden nodes have the same weights), which were made to keep the analysis tractable.
We answer the questions in detail in the official comments and comment on them very briefly here:
Questions:
- The referee is right; we should have been more careful when defining this quantity. The distribution of the hidden node is indeed of zero mean and variance of order 1/N. The value is indeed \sigma_h^2 = 1/N. This is the correct scaling to have a large-dimensional (large N) limit. We provide a detailed explanation in the official comment.
- The sentence at lines 161-162 refers to the type of phase transition occurring in the system. In the case with multiple features, the first phase transition is very similar to the one with one feature, in the sense that it is of the same "nature". In this regime, corresponding to the first phase transition, everything happens as if there were just one main direction, the leading principal direction of the data. We will correct this sentence to make it clearer in the final version.
- The theory is written for the binary-Gaussian RBM in the main text and for the binary-binary RBM in the Appendix. In the latter case, we show that in order to observe an interesting behavior we need an extensive (large) number of hidden nodes. On this aspect the gap is not large. The main difference between the theoretical and experimental parts lies in the dataset, which is much simpler in the former case. Based on article [8], we expect that even if the dataset presents various modes (few compared to the number of visible/hidden nodes), the system can still be described in a mean-field setting and, consequently, the phenomenology observed for one or two modes can be extended to many modes (but only a small number). This is confirmed by our experiments. We do not claim that this description is valid throughout the training process, and we can see that at large training times the mean-field description is not as accurate anymore. We will comment on this in the final version.
- This sentence might indeed be unclear and will thus be rewritten in the final version. More details are provided in the official comments.
- We thank the referee for pointing out that we should clarify panel E of Fig. 4 further. The aim of this figure is to show the "hysteresis" phenomenon, which is a clear signature, devised and observed in statistical physics, that a high-dimensional probability measure has undergone a phase transition at which it splits into two different lumps. A complete explanation is provided in the official comments.
- A detailed comparison between ref. [8] and our work is provided in the official comments and will be discussed better in our final version. Concerning the relationship between Sec. 4 and Sec. 5, the reviewer's comments helped us to understand that we have to clarify the relationship between the SVD decomposition used in Sec. 5 and the projections used in Sec. 4. There is a direct connection between the PCA and the preferred directions used in the theoretical analysis, but it was not made clear enough in the current version. We give a detailed explanation in the official comments and will make it clearer in the final version of our work.
Minor:
- In the Mattis model, \xi corresponds to a preferred direction (pattern) of the model, and it is a frozen (quenched) variable; it can be any fixed binary vector. We will add a section in the appendix describing in more detail the statistics of the model, as well as the conditions on \xi.
- We thank the referee for pointing out these issues with Fig. 1; we will correct them and define the quantities properly.
- The referee is right; we will correct the typographical error.
- The learning rate comes in front of the r.h.s. of Eq. (3); we will add it in the final version (see the schematic form after this list). The learning rate naturally impacts the learning behavior; the crucial ingredient is to keep the ratio learning_rate/N small.
- We plan, as mentioned before, to first add a section describing the Mattis model, and then a more specific one explaining the BG-RBM in a more general context.
- We agree with the referee, we will carefully read the whole work to avoid confusion in the definitions and correct the typographical errors.
- We thank the referee; we will check for this problem in the final version.
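For reference, a schematic form of the update rule with the learning rate made explicit (a generic form of the RBM log-likelihood gradient; the precise notation is that of Eq. (3) in the paper):

```latex
w_{ia}^{t+1} \;=\; w_{ia}^{t} \;+\;
\gamma \left( \langle s_i h_a \rangle_{\mathrm{data}}
            - \langle s_i h_a \rangle_{\mathrm{model}} \right),
\qquad \text{with } \gamma/N \text{ kept small.}
```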
Limitations: We will add a new section Limitations, in which we will discuss the limitations of the current analysis: (i) the extent to which these results are applicable to more complicated models, (ii) the limitations of the analytical setup, and (iii) the limitations of the numerical setup (problems with insufficient MCMC sampling).
Thank you for the authors' careful responses and explanations. After reading the responses, I have resolved most of my concerns regarding the paper. However, I remain uncertain about the readability of the manuscript, as I am unable to review the revised version, which is expected to differ significantly from the current draft. I understand that this limitation is due to the conference's policy, and while the authors may be disappointed, it makes it difficult for me to increase the presentation score. Nonetheless, I acknowledge the significance of this work and have updated my score accordingly (4 -> 6). I hope the authors will take care to clarify the notations and statements, and include the limitations of this work in the final version.
This paper presents an analysis of the phase transitions of RBMs through an analysis of the weight matrix. They find the dynamics first tend toward the center of the data modes, then diversify into the individual modes. The theoretical results are supported by empirics on 3 ML datasets.
Strengths
- This expands the important field of exploring the dynamics of energy-based models.
- The figures are generally well presented (conditioned on the note below).
- The authors offer a compelling argument for the scalability of these phase transitions, and the pipeline from theory to CelebA is well done.
Weaknesses
- This paper would be right at home in a physical review journal, but given the audience of NeurIPS, I would recommend increasing its accessibility to the very ML-focused audience (e.g. matching the notation used in the ML world, labeling axes with descriptive words and putting the equations in the text, numbering all equations, etc.).
- The paragraph at line 199 is too long.
Questions
- How do you imagine these transitions changing with respect to sampling convergence? Perhaps out of scope of this work, but it would be interesting to see all of these plots generated with the x-axis being sampling steps instead, and see how that looks.
- What would be the biggest takeaways/generalizations to be made from RBM phase transitions to more complex ML models?
Limitations
The authors sufficiently addressed societal limitations.
We thank the reviewer for reading our paper and for their comments. We do not entirely agree with the reviewer regarding the target audience of this article.
First, the RBM itself is a model introduced by Hinton and Sejnowski for the purposes of machine learning. This model has been used in ML for decades: it was used as a building block of deep neural networks for a while, and it is still used for ML tasks today, notably for interpretability, in particular in low-data regimes (e.g. in neuroscience, population genetics, ...). While it is true that our analysis relies on phenomena more commonly studied in physics, our work shows that understanding these phenomena is crucial to fully understand the training of RBMs and how the weights are shaped during training. To name just a few elements: the second-order phase transition implies a sudden increase in mixing time, which directly affects the quality of RBM training; knowing that RBMs learn the PCA directions helps to monitor training and to understand the structure of the weights; and we show how the exponential growth of the eigenvalues of the weight matrix is determined by the variances of the clusters.
Questions:
- It is not clear how the transitions would change in general. One should expect the first singular values to still correctly match the PCA components, because at the beginning of the learning process the mixing time is very small, which means that the estimation of the negative term by the CD-1 scheme (only one step, initialized at the examples in the minibatch) is roughly sufficient. This also means that the first transition should be very similar to that of the PCD-100 training studied in this article. It is not yet clear what should happen once the transitions start to follow one another, as the mixing times then significantly exceed the single step used by the algorithm. To test this, we repeated the analysis for training on MNIST using the CD-1 algorithm. We show the equivalent of Fig. 3 A,B,C in Fig. 1 of the attached PDF file. We find that the picture of the cascade of transitions remains unchanged, but the overall behavior is different: the susceptibility increases sharply with the growth of the corresponding eigenvalue, but then decreases again. In Fig. 2 of the attached PDF file, we compare the first 5 singular vectors with the first 5 PCA components of the dataset, which show extremely similar patterns (up to an overall sign). This picture is very similar to what we observe when analyzing the HGD dataset, where the mixing times become so large that even PCD-100 is no longer able to thermalize properly. We will include a subsection in the paper where we briefly discuss the impact of the MCMC estimation of the gradient on the transitions; a minimal sketch of the CD-1 estimator in question is given below.
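For clarity, a minimal sketch of the CD-1 estimator referred to above, for a {0,1} RBM without biases (function and variable names are ours; this is a generic textbook form, not the paper's code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_gradient(W, v_data, rng):
    """One-step contrastive divergence for a {0,1} RBM (no biases, for brevity).
    v_data: minibatch of visible configurations, shape (B, n_visible).
    The negative chain is initialized at the minibatch itself (CD-1)."""
    # Positive phase: hidden activations driven by the data.
    p_h_data = sigmoid(v_data @ W)                        # (B, n_hidden)
    # One Gibbs step: sample h, reconstruct v, recompute hidden activations.
    h = (rng.random(p_h_data.shape) < p_h_data).astype(float)
    p_v_model = sigmoid(h @ W.T)
    v_model = (rng.random(p_v_model.shape) < p_v_model).astype(float)
    p_h_model = sigmoid(v_model @ W)
    # Log-likelihood gradient, estimated with a single sampling step.
    B = len(v_data)
    return (v_data.T @ p_h_data - v_model.T @ p_h_model) / B

# Training loop sketch: W += gamma * cd1_gradient(W, minibatch, rng). PCD-100
# differs in that the negative chains persist across updates and take 100 Gibbs
# steps each, which matters once the mixing time grows.
```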
- Our analysis suggests that similar phenomena probably occur in more complex ML models. First, for Deep Boltzmann Machines, a previous work [Ichikawa, Y., & Hukushima, K. (2022). Statistical mechanical study of deep Boltzmann machines given weight parameters after training by singular value decomposition. Journal of the Physical Society of Japan, 91(11), 114001] shows that the SVD of the weight matrix is crucially important as well. Second, in more general EBMs, it is also highly probable that such phase transitions occur and that the mixing time increases. Finally, the cascade of phase transitions observed here is very akin to the ones recently observed in the sample generation of diffusion models [23-25]. For the latter, it is conceivable that during the training dynamics the model of the score first learns the most prominent features, which need less data (and hence less training time, at least in an online learning schedule), and only later the finer ones.
I appreciate the authors' response, and the changes will improve the quality of the paper; as such, I have updated my score to reflect this (5 -> 6).
This work examines the relation between the internal representations of an RBM and the data that it is trained on as the training procedure goes forward. The work begins with a theoretical exposition for Binary-Gaussian RBMs trained on CW-like models with 1 or 2 preferred encoding patterns. The theoretical analysis shows that the weights first develop along directions that aggregate the directions of the data modes before settling into directions that capture each of the data modes individually. The analysis is extended to Binary-Binary RBMs. Several empirical results are presented. For simple models, the theoretical findings are numerically validated. RBMs are also learned for the HGD, MNIST, and CelebA datasets. The behavior of these learning situations mirrors certain behaviors of the simple cases. The RBM first learns patterns along the largest principal components of the data before undergoing a phase transition and settling into more nuanced representations.
Strengths
- The dynamics of generative model training and the relations between the data and internal representations of generative models is an important area of study. I appreciate that this work undertakes the challenging task of unraveling these behaviors.
- This work offers interesting evidence that, across simple and more complex datasets, RBM learning has somewhat predictable behavior. In particular, RBM weights will first learn to represent the data directions with the highest variation before learning more refined representations that better capture modal structure.
- Extensive derivations and empirical results are included.
Weaknesses
- A major weakness of this work is the scope. While broad claims are made about implications for other generative models, the analysis presented seems specific to the RBM architecture. Studying more recent generative models with stronger capacity would greatly increase the impact of this work, although such analysis is very difficult.
- The work is quite dense, and it can often be difficult to follow the lines of reasoning and conclusions based on derivations. While the topic is complex and this is to some extent inevitable, I recommend the authors make efforts where possible to streamline and clarify the presentation.
Questions
Are there conclusions from this work which can be extended to more powerful generative models such as diffusion models, deep EBMs, etc.?
Limitations
Limitations are not discussed in detail.
We thank the reviewer for carefully reading the manuscript and for their interest in it.
We agree with the reviewer that the next natural step of this project will be to investigate these features in more complex generative models. At the same time, we believe that it is important to establish secure foundations to build on, and we feel that this work is already too dense and compact to include further analysis of a different type of model, which would require different techniques and settings.
Nevertheless, we believe that the applicability of these results goes beyond RBMs for several reasons:
- It is very likely that the same kind of transitions are also present in the early stages of deep Boltzmann machine learning, since the same perturbative analysis of refs. [8,9] performed for RBMs was recently reproduced for deep Boltzmann machines in Ref. [Ichikawa, Y., & Hukushima, K. (2022). Statistical mechanical study of deep Boltzmann machines given weight parameters after training by singular value decomposition. Journal of the Physical Society of Japan, 91(11), 114001].
- Cascades of phase transitions have also been observed in other generative models, e.g. in the training of Gaussian mixture models [18-22] or, very recently, in the sample generation of diffusion models [23-25]. For the latter, it is conceivable that during the training dynamics the model of the score first learns the most prominent features, which need less data (and hence less training time, at least in an online learning schedule), and only later the finer ones.
If we have the opportunity, we will discuss this point in a new "Limitations" section and try to make the paper more accessible by taking advantage of the additional page.
After reading the author responses to myself and other reviewers, I maintain my current score and raise my contribution score. My main recommendation for future versions of the paper is to make the text as clear and readable as possible.
This paper analyses the phenomenon of phase transitions in the training dynamics of restricted Boltzmann machines (RBM) in theory and practice. In the theory part, the phase transition is characterised for a toy model: an RBM with one hidden node per feature is fitted to data characterised by one feature vector or by two correlated feature vectors. The phase transitions are detected as diverging correlation functions between visible states of the RBM, and via exponential fitting of the RBM weights to the target features.
In the empirical part, the phase transitions are explored in the training of a RBM on the Human Genome Dataset, MNIST, and CelebA. It is observed that the singular values of the weight matrix and the susceptibility undergo sudden changes during training, indicating a phase transition.
Strengths
- The theoretical evaluation seems sound and in general the paper is well written.
- The paper explores an interesting phenomenon in training dynamics of RBMs. The experimental evaluation of the phase transition is compelling and highlights the emergence of phase transitions well. Similar studies may be applicable to other types of generative models.
Weaknesses
- It seemed to me that the notation switches between different symbols to denote visible states/spins/sites.
- I think the paper could be improved by making a stronger connection between theory and empirical evaluation. For example, do the projections and projected weights in the theoretical section correspond exactly to the discussed singular value decomposition described in the empirical evaluation? Can the susceptibility be analysed in the theoretical section? More consistent nomenclature of the terms described would help with reading and understanding the paper.
- The numerical analysis would benefit from paragraphs or subsections.
- The analysed dynamics of the RBM weights don't fully reflect the actual training dynamics. In practice, RBM training requires self-sampling from the model, which causes biases in the estimated negative term. I would expect that the sampler is adversely affected by the phase transitions in the model (e.g. a sudden increase in mixing time). Considering that sampling in an RBM is relatively tractable (compared to other types of EBM), I am wondering if the training dynamics of e.g. one-step contrastive divergence can be analysed in a similar fashion. However, the authors do address the effect of the phase transition on the MCMC sampling in the experimental section.
Questions
- Do the projections and projected weights in the theoretical section correspond exactly to the discussed singular value decomposition described in the empirical evaluation? Is a theoretical study of the susceptibility in the theoretical section feasible?
- Can the phase transitions be detected during training to, for example, adjust for the longer thermalisation time in the vicinity of the phase transition?
- When running a gradient based Langevin sampler during training, do your observations suggest spiking parameter updates in the vicinity of the phase transition due to second order discontinuities, thus creating potential training instabilities?
Limitations
The work analyses the phenomenon of phase transitions in RBM training dynamics, and is limited to this setting by definition. The scope is clearly defined in the paper. However, the authors checked the limitations box in the check list without providing further justification in the main text or in the checklist. A brief sentence about the scope of the work or a reference to lines in the main text would be appropriate.
We thank the reviewer for their comments and for their careful reading of our paper. First of all, we will of course revise the notation to keep it consistent between the two parts and also make the transition and connections between the two parts clearer. We thank the reviewer for their advice.
Weaknesses:
- We will ensure that the notation is homogeneous throughout the paper in the final version.
- Yes, the projections and projected weights in the theoretical section correspond to the singular value decomposition discussed in the empirical evaluation. We agree that the paper would clearly improve by discussing this connection and making it explicit; we will do so in the revised version. The susceptibility is analyzed in the theoretical section, see line 127: we show how the susceptibility diverges as the control parameter approaches its critical value, showing explicitly that the transition is of second order. We will add a comment on the critical exponent in the appendix.
- We will add structure to the numerical part in the final version, where an additional page can be added. It is true that the negative term is where the difficulty lies when training the RBM. In practice, at the beginning of the training the sampling is very easy (the chains mix very quickly) and very crude algorithms such as one-step CD can be used to compute the gradient. When the first phase transition occurs, the mixing time suddenly jumps, and at that point it is important to compute the negative term correctly. In this case, the sampling can introduce a strong bias, the theoretical analysis of which is not easy since the weights are not small and, for instance, 1-step CD would introduce non-linearities in the terms. Yet we can check in practice that the picture discussed in the paper does not change much when training with CD-1, as we show in the attached PDF for MNIST: eigenvalues still switch on one by one, the first eigenvectors are very similar to those of the PCA, and the growth of the eigenvalues is associated with a growth of the susceptibility along these directions. We propose to add a small subsection or appendix to the new version of the paper discussing out-of-equilibrium effects in more detail.
Questions:
- The directions onto which we project the gradient in the theoretical section indeed correspond to the principal directions of the dataset. In the case with one mode this is trivial (since the weight matrix is only a vector). In the multiple-feature case, it is easy to see by looking at the equation right after line 491, which decomposes the correlation matrix of the dataset onto the pattern vectors. We will make this clearer in the final version of the work. Furthermore, the divergence of the susceptibility is already derived in the theoretical section, see lines 126-127 and the equation in between. We will be clearer about the exponent with which it diverges and how it relates directly to a second-order phase transition.
- The first phase transition can easily be detected by monitoring the eigenvalues of the weight matrix (the singular values). In the model where the variables are {0,1}, the phase transitions occur when the singular values become higher than 4, while when the variables are ±1, the transition occurs when the singular values become higher than 1. In that sense, it is easy to detect the first phase transition, and in practice it is then possible to adapt the number of Monte Carlo steps near the transitions. The subsequent phase transitions can also be detected by monitoring the second, third, ... eigenvalues of the weight matrix, although it is not clear how many of them there will be (a minimal monitoring sketch is given after this list).
- We do not use a Langevin sampler: since we work with discrete variables, we rely on Markov chain Monte Carlo with parallel Gibbs sampling. To answer the reviewer's question: we do not observe spikes when updating the parameters unless we set a high learning rate. In that case, we do see spikes in the eigenvalues around the transition, but not in the experiments discussed in the current version of the manuscript.
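A minimal sketch of the monitoring procedure described in the first answer above (names and the checkpoint interface are ours, for illustration):

```python
import numpy as np

def detect_transitions(W_checkpoints, binary01=True):
    """Flag training checkpoints at which a new singular value of the weight
    matrix crosses its critical value (4 for {0,1} units, 1 for +/-1 units)."""
    w_c = 4.0 if binary01 else 1.0
    n_above_prev = 0
    events = []
    for t, W in enumerate(W_checkpoints):
        svals = np.linalg.svd(W, compute_uv=False)   # sorted in decreasing order
        n_above = int(np.sum(svals > w_c))
        if n_above > n_above_prev:                   # a new mode has condensed
            events.append((t, n_above, svals[:n_above]))
        n_above_prev = n_above
    return events

# Near each flagged checkpoint one can, e.g., increase the number of Monte
# Carlo steps used for the negative term to cope with the longer mixing time.
```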
Limitations:
We will add a new section Limitations, in which we will discuss the limitations of the current analysis: (i) the extent to which these results are applicable to more complicated models, (ii) the limitations of the analytical setup, and (iii) the limitations of the numerical setup (problems with insufficient MCMC sampling).
I thank the authors for their helpful response to my questions. I understand that analyzing CD-1 is more challenging, and I am satisfied with how the authors have addressed my concerns. I believe the conclusions of the paper remain valid; however, I would suggest discussing the non-equilibrium nature of RBM training in the paper.
With the suggested improvements in the exposition, I consider this work to be a valuable contribution to the understanding of restricted Boltzmann machines from a statistical physics perspective. I would like to raise my score to 7 and recommend this paper for acceptance.
Dear reviewer,
we attach a PDF providing more details on the learning of MNIST when using Contrastive Divergence with one Monte Carlo step (CD-1).
I have read all comments and responses. The reviews are in agreement, with four positive scores of 7, 6, 6, and 6. All reviewer concerns have been addressed. It is recommended to accept this manuscript.