Fast, accurate training and sampling of Restricted Boltzmann Machines
We design a new training pipeline for the Restricted Boltzmann Machine, capable of training quickly and accurately on hard multimodal datasets.
Abstract
Reviews and Discussion
Training RBMs is challenging and slow due to the multiple second-order phase transitions and the associated slow mixing of MCMC sampling. The paper introduces a pre-training method, which consists in integrating the principal directions of the dataset into a low-rank RBM through a convex optimization procedure. The Gibbs-Boltzmann equilibrium distribution of the pre-trained model can be efficiently sampled via a static Monte Carlo process. Starting from the pre-trained model, the standard Persistent Contrastive Divergence (PCD) training procedure for RBMs partially overcomes the problem of second-order phase transitions. The pre-training method is tested on the MNIST 0/1 dataset, a synthesized "Mickey" dataset, and the Human Genome dataset (HGD). The method is shown to outperform the PCD algorithm and the Jarzynski reweighting method (JarRBM). The paper also introduces a new method to sample from the trained model, called Parallel Trajectory Tempering (PTT), and compares it with the Annealed Importance Sampling (AIS) method.
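For reference, the PCD baseline referred to throughout this discussion can be sketched as follows. This is a minimal NumPy illustration with our own variable names, not code from the paper; the defaults mirror the settings mentioned later in the thread (PCD-100, learning rate 0.01).

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_h(v, W, b):
    p = sigmoid(v @ W + b)            # p(h=1 | v)
    return p, (rng.random(p.shape) < p).astype(float)

def sample_v(h, W, a):
    p = sigmoid(h @ W.T + a)          # p(v=1 | h)
    return p, (rng.random(p.shape) < p).astype(float)

def pcd_step(batch, chains, W, a, b, lr=0.01, k=100):
    """One PCD-k update; `chains` are the persistent fantasy particles."""
    # Positive phase: hidden activations with the data batch clamped.
    ph_data, _ = sample_h(batch, W, b)
    # Negative phase: advance the persistent chains by k Gibbs sweeps.
    v = chains
    for _ in range(k):
        _, h = sample_h(v, W, b)
        _, v = sample_v(h, W, a)
    ph_model, _ = sample_h(v, W, b)
    # Stochastic gradient ascent on the log-likelihood (updates in place).
    W += lr * (batch.T @ ph_data / len(batch) - v.T @ ph_model / len(v))
    a += lr * (batch.mean(0) - v.mean(0))
    b += lr * (ph_data.mean(0) - ph_model.mean(0))
    return v  # updated persistent chains
```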
Strengths
Technically sound.
The proposed pre-training method is shown to improve on other training schemes for RBMs (namely the standard PCD method, and the Jarzynski reweighting method).
The novel PTT method is shown to improve on standard Gibbs sampling.
Weaknesses
My main concern is about the usefulness of the RBM approach to generative modeling. (cf question Q1 below).
The PTT sampling method seems to require more memory than standard methods for RBMs.
It is unclear how novel the proposed pre-training method is. (cf question Q2 below)
The paper contains a link to a github repository that reveals the author’s identity (section 8 page 9): https://github.com/nbereux/fast-RBM
Minor. Line 465: Appendix A.2 references itself.
Questions
Q1. I am missing the big picture. Could you clarify the primary motivation for studying RBMs? How do you envision RBMs improving on generative models? In what scenarios could RBMs improve on other SOTA generative methods (e.g., transformer-based generative models)? And according to what criterion/metric can these RBMs improve over these SOTA generative methods?
Q2. Can you clarify what was done in the Reference below, and what is the contribution of the present work compared to this earlier work?
Aurélien Decelle and Cyril Furtlehner. Exact training of restricted Boltzmann machines on intrinsically low dimensional data. Physical Review Letters, 127(15):158303, 2021.
Limitations
See Weaknesses.
We thank the referee for their comments and questions.
Q1: The main reasons for using RBMs are twofold: they are easy to interpret and are particularly well suited for tabular datasets with discrete variables such as DNA or protein sequences, both in terms of generation power and sample efficiency, which makes them particularly suitable for inference applications in physics, bioinformatics or neuroscience. Transformer-based generative methods require a large amount of training data. Most applications in science cannot afford this luxury, as experimental annotations are expensive or private. Moreover, once these models are trained, and despite their impressive generative power, it is almost impossible to extract understandable information from their parameters, which is the typical goal of data-driven scientific studies.
We copy below a list of recent references published in high-impact journals using RBMs in biology:
- [Bravi, B., et al. Probing T-cell response by sequence-based probabilistic modeling. PLOS Computational Biology, (2021)]
- [Di Gioacchino et al. Generative and interpretable machine learning for aptamer design and analysis of in vitro sequence selection. PLOS Computational Biology, (2022)]
- [Bravi, B. et al. RBM-MHC: a semi-supervised machine-learning method for sample-specific prediction of antigen presentation by HLA-I alleles. Cell Systems, (2021)]
- [S. Quiroz Monnens et al. The Recurrent Temporal Restricted Boltzmann Machine Captures Neural Assembly Dynamics in Whole-brain Activity. eLife, (2024)]
- [Bravi, B. A transfer-learning approach to predict antigen immunogenicity and T-cell receptor specificity. eLife, (2023)]
- [Tubiana, J. et al. Funneling modulatory peptide design with generative models: Discovery and characterization of disruptors of calcineurin protein-protein interactions. PLOS Computational Biology, (2023)]
- [Bravi, B. Development and use of machine learning algorithms in vaccine target selection. npj Vaccines, (2024)]
We say that RBMs are interpretable precisely because their architecture is simple enough to allow a high level of analytical treatment. For example, their energy function can be rewritten in terms of an explicit many-body spin interaction system. This analogy makes the RBM a very powerful inference engine that overcomes the standard inverse Ising approaches, which are restricted to pairwise interactions. The structure of the probability distribution can also be explored with perturbative tools that allow one, for example, to hierarchically cluster data for phylogenetic reconstruction (see [8]) or to extract possible mutational paths between data points [Mauri, E., Cocco, S., & Monasson, R. (2023). Mutational paths with sequence-based models of proteins: from sampling to mean-field characterization. Physical Review Letters, 130(15), 158402.].
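For concreteness, the marginalization behind this many-body rewriting can be sketched as follows (our notation, a sketch only, assuming 0/1 visible and hidden units; see Ref. [7] in the paper for the full treatment):

$$
p(\mathbf v) = \frac{1}{Z}\sum_{\mathbf h} e^{-E(\mathbf v,\mathbf h)}
= \frac{1}{Z}\exp\!\Big(\sum_i a_i v_i + \sum_\mu \log\big(1 + e^{\,b_\mu + \sum_i w_{i\mu} v_i}\big)\Big),
$$

and expanding each $\log(1+e^{x})$ term in powers of the visible variables yields an effective energy

$$
E_{\mathrm{eff}}(\mathbf v) = -\sum_i \tilde h_i v_i - \sum_{i<j} J_{ij}\, v_i v_j - \sum_{i<j<k} J_{ijk}\, v_i v_j v_k - \cdots,
$$

i.e., a spin system with fields, pairwise couplings, and higher-order interactions, all determined by the RBM weights.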
Q2: The question of the novelty is answered in the Author rebuttal section.
Weaknesses:
- Memory-costly PTT: It is true that PTT sampling requires more memory than a standard sampling procedure, but it allows the generation of samples from models where it is otherwise simply impossible to generate samples in a reasonable amount of time. Memory is much easier to handle than time, especially since in our case the models are moved to GPU memory one at a time, and only when we need to generate samples from them; otherwise, they are kept in RAM. Even though the method requires more memory than standard Gibbs sampling, the amount is very limited compared to sampling large models. Moreover, pre-training allows keeping this memory cost very low (in our case, only 7 times more memory was needed than for standard sampling). We will mention this in the limitations section.
Minor comments:
- We will correct the problem with the auto-reference.
- We apologize for the non-anonymized GitHub link, we were not very sure how to do it otherwise.
I thank the authors for their detailed explanations. They have addressed my main concern, and the novelty compared to prior works on RBMs seems convincing to me too. Overall, their response motivates me to increase my score.
In the introduction section of the paper, I'd encourage you to further motivate the usefulness of the RBM approach compared to other generative models (especially compared to SOTA transformers).
The work proposes to pretrain RBMs with a recently developed convex approximation, the Restricted Coulomb Machine, and then fine-tune the model using standard techniques like PCD. Further, a novel sampling technique, PTT, is proposed that can sample from the final trained model by employing a sequence of model snapshots saved during training, which are then connected via replica exchange in the style of parallel tempering.
Experiments are conducted, and the results show that the sampled distributions, when projected onto the first few principal components, match the true distribution better. Moreover, log-likelihood comparisons based on single training runs show that the proposed method starts at much higher likelihood values due to the initialisation, and there is some evidence that the resulting model also reaches higher likelihood values. Further experiments for PTT show that it is more likely to jump between clusters of the distribution than PCD based on Gibbs sampling.
Disclaimer: This is an emergency review. I have not had the time to do a detailed analysis, or to implement/reproduce any of the results. While I am an expert in the field, I will adapt my confidence accordingly. I will not fully abide by the review format.
Edit: An edit has been performed that only included changes of the format, but not the content.
Strengths
- The use of the approximation using the Restricted Coulomb Machine as an initialisation is an interesting idea that is worth investigating, and the results in Fig 3C suggest that this pretraining approach is very effective.
- The idea of using the training models as sampling steps is an interesting approach as well.
In general, I can see that both techniques can become tools in a more general RBM toolbox; however, both of them seem to be incremental changes, even though they could impact RBM training a lot.
Weaknesses
The main weaknesses of the present work are in two areas: experiments/comparisons and language, of which I only deem the former critical, while the latter will limit the potential impact of the article in the machine-learning community. As a result of the weaknesses, this paper has a number of misleading claims or claims that are not supported by the data presented in the work.
Experiments/Comparisons:
- The authors dismiss using PT for being "expensive" and do not compare against it. This is against evidence that even single steps of PT chains with a moderate number of parallel chains can be more efficient than PCD using similar resources. This is especially interesting, since the authors use PCD-100 for training, which allows a lot of resources for PT. Note that the authors themselves reference [40], which shows an order-of-magnitude improvement over Gibbs sampling already with 10 chains. Another example is given by [*] https://www.sciencedirect.com/science/article/pii/S0004370219301948 where the authors used PT for training with k=50 chains and only a single sampling step per iteration. Note that this paper uses a very simple heuristic to improve PT sampling: choose the reference (temperature → ∞) distribution not as the uniform distribution, but as the marginal data distribution. This is in contrast to [40], which used the uniform distribution, leading to a significantly worse baseline.
- There are almost no comparisons of log-likelihood values or values of normalisation constants for the proposed technique. While some are given in Figure 3C, they are only single runs and only use approximated likelihoods. Due to this, the phrase "significantly higher log-likelihoods" in the conclusion is NOT supported by the data, given there is not enough data to test for significance or even measure the variability.
- The training is also cut short, or the RBMs trained are not powerful enough, since 3B shows clear artefacts in all samples, indicating that none of the machines approximate the dataset well. Since RBMs with enough latent variables are universal function approximators, we cannot get a definitive statement of whether the proposed pretraining does allow for better likelihoods. Since the experiments are not very expensive, this reviewer would propose to at least repeat the experiments in order to obtain error bars on Fig 3C.
- The learning rate of 0.01 used in the experiments seems to be on the high end. This is not only bad for PCD training, but also for obtaining high likelihood values, and could explain big parts of the leveling out of the graph in Fig 3C. While it is okay to use a high learning rate in the beginning, keeping it constant over the course of training seems like an oversight.
- For PTT especially, there are no good comparisons that assess the quality of the samples in terms of representing the true distributions. While visual examples are shown that display visually good mixing, the baseline to compare to is again PCD, and not PT, nor stacked tempering. Since, again, [40] showed that both alternatives clearly beat Gibbs sampling given the same amount of resources, we do not know how good this sampling scheme really is compared to strong baselines. This reviewer suggests to compare, in at least one experiment, the proposed approach to an approach with known normalisation constant and then measure how well estimators based on these samples work, similar to [*]. This would also partially verify 3F.
- As an addendum to the previous point: [*] showed that the performance of AIS improves significantly with the choice of reference distribution. Using the same distribution as [*], or the pretrained distribution proposed in this work, might diminish any performance gains by PTT. This would still be a major improvement to the state of the art.
Language:
- While in general well written, this work reads like targeting a physics community, not the ML community. As a result, this paper includes slang terms mostly encountered in statistical physics/thermodynamics, which do not have a clear meaning in the statistics community. This includes terms like "phase transitions" (first and second order), "equilibrium models" (not consequential for the article, nor the review, but this reviewer genuinely does not know what this term is supposed to mean), "critical transitions", "relaxation times", "free energy". Since most of these terms appear in sections where the authors try to explain the method and/or its consequences, a significant number of readers will not be able to understand those reasonings as they do not have a grasp of the physical analogies. This reviewer would propose to replace some of the terms by the statistical equivalent, or to introduce them.
- Some of the explanations and reasonings are misleading. The initial paragraph highlights that RBMs are supposedly interpretable. While they are simple models, binary RBMs are still universal function approximators, and thus it is highly unlikely that the latent space has any meaning that aligns with human-interpretable semantics. If the authors disagree with this, this reviewer encourages them to add a citation to line 25.
- Missing citations: the datasets used should be cited.
- Figure references missing: the article does not always refer to the figure it is talking about, e.g., line 205. In general, references should include the subfigure letter, as the article does later, e.g., lines 263+.
Questions
Suggestions are added under weaknesses.
Limitations
The authors mark "Yes" for point 2 "Limitations" on the checklist. This reviewer has not found that the authors discussed the limitations of their work. This especially includes the guidelines that say that authors should "reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs." However, in fairness, the authors answer "[NO]" in question 7, discussing the presence of error bars or significance tests to measure significance. However, since this is not part of the final publication, and the authors include misleading claims about significance of results, this must be discussed in the main text, or the phrasing weakened.
I have not found any information on runtime or CPU/GPU hours, but the CPU/GPU were reported, so point 8 is partially fulfilled.
On the PT: In the attached PDF, we include a sketch explaining the difference between first- and second-order phase transitions (attached Fig. 1). In Fig. 2 D-E, we illustrate the effect of the disappearance of a peak upon lowering the temperature in the HGD dataset, and the failure of the PT strategy on this dataset (Fig. 2C). We also compare these results with the data sampled using 10^7 ordinary Gibbs MCMC steps (which should be long enough to guarantee equilibrium given the measures of Fig. 4D).
The reviewer suggests using an improved version of PT with a non-uniform reference distribution. We were not aware of this modified algorithm; we will discuss this reference in the new version of the manuscript and compare it with PTT. We agree with the referee that this reference probability will add a bump at the center of the probability distribution, which should mitigate (though not eliminate) the effect of the first-order transitions. Yet, our guess is that the efficiency of this algorithm must depend a lot on the properties of the dataset and the spatial location of the barriers, and that it should also struggle with datasets with multiple clusters. We could not do a comparison during the rebuttal week due to the lack of available resources, but we will do it for the final version of the paper.
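For concreteness, one way to set up such a modified ladder (our own formulation, not necessarily the exact scheme of the suggested reference) is to interpolate between the RBM and a factorized distribution fitted to the empirical marginal frequencies $\hat p_i$ of the data:

$$
p_\beta(\mathbf v) \propto \exp\!\Big(-(1-\beta)\,E_0(\mathbf v) - \beta\,E_{\mathrm{RBM}}(\mathbf v)\Big),
\qquad E_0(\mathbf v) = -\sum_i \hat a_i v_i,\quad \hat a_i = \log\frac{\hat p_i}{1-\hat p_i},
$$

where $E_{\mathrm{RBM}}(\mathbf v) = -\log\sum_{\mathbf h} e^{-E(\mathbf v,\mathbf h)}$ is the RBM energy with the hidden layer summed out. At $\beta=0$ the replica samples independent variables with the data frequencies rather than from the uniform distribution, and configurations of adjacent replicas are exchanged with the standard replica-exchange Metropolis rule.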
This trapping phenomenon, associated with the existence of first-order transitions in temperature, is reproduced when trying to compute the partition function using a temperature annealing, and it is behind the failures observed in the computation of AIS. Instead, the trajectory tempering only crosses second-order transitions, which means that a high-temperature mode smoothly splits into two when the temperature is lowered, which avoids the disappearance of modes.
We thank the reviewer for their comments and suggestions. 1. PT: The reviewer disagrees with us on the overall performance of standard PT in RBMs. We have several comments on this. First, the reviewer disagrees with the statement that PT is costly and asserts that the reliability of PT in RBMs has been well established in previous references. It is more or less textbook knowledge in statistical physics that the PT algorithm does not work in the presence of a 1st order phase transition (Ph-Tr). The reason is that if new distant modes simply emerge when the temperature decreases, the chains coming from high temperatures will never nucleate them if the number of MCMC steps performed between temperature swaps is not long enough, because the barriers between modes are large. As discussed theoretically in [41], this is the case when dealing with datasets clustered in a low-dimensional space (which is the case for genetic or protein data). Schematically, for these datasets, one tries to encode several distant modes at $T=1$, which are usually not symmetric around the zero magnetization. Since the temperature is only coupled to the energy and not to the entropy, a change in temperature leads to "1st order" Ph-Tr (because the modes are more sensitive to temperature changes the farther they are from the origin, see [41]). As a result, there are modes that simply disappear from the equilibrium measure when the temperature is increased. On the other hand, the chains located at these vanishing clusters at low temperatures melt into other clusters when they visit the high temperatures. When iterating the standard PT process, the chains occupying these clusters will gradually disappear, giving the false impression that the configurations belonging to these clusters are not typical at $T=1$, when they are. In contrast, the "2nd order" transitions are characterized by a merging of the modes. If a "2nd order" Ph-Tr occurs at some temperature, this means that one hump gradually splits into two (or more) as the temperature decreases. The transitions that occur along the training trajectory are 2nd order transitions, and the transitions encountered when the temperature is increased (in clustered datasets) are 1st order Ph-Tr. Both statements are analytically proven in RBMs (see [26,27,28] for the trajectory, and [41] for the temperature).
We further elaborate on this point in the official comments.
2. LL: The LL was shown in Fig. 3 for MNIST and in Fig. 5 for the HGD. We will include the LL of the remaining dataset in the Supplemental. We will add error bars to the curves by repeating the estimation several times. Following the reviewer's suggestion, we will compare the different strategies to compute the LL in a controlled set-up where the partition function can be exactly computed. In particular, we will consider RBMs trained on the datasets discussed in the paper but with only a few hidden nodes (we can compute the exact LL as long as the number of hidden nodes is small enough to enumerate all hidden configurations). We will also illustrate these cases to discuss the differences between PT and PTT.
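To illustrate why a small hidden layer makes the exact LL tractable, the brute-force computation marginalizes the visible layer analytically and enumerates the $2^{N_h}$ hidden configurations. The sketch below is our own illustration of this standard computation, not the authors' code.

```python
import itertools
import numpy as np
from scipy.special import logsumexp

def exact_log_Z(W, a, b):
    """Exact log partition function of a binary-binary RBM.

    E(v, h) = -a.v - b.h - v.W.h, with v, h in {0, 1}.
    Feasible only for a small hidden layer (cost ~ 2**N_h).
    """
    n_h = len(b)
    log_terms = []
    for h in itertools.product([0, 1], repeat=n_h):
        h = np.array(h, dtype=float)
        # The sum over all visible configurations factorizes:
        # sum_v exp(a.v + v.W.h) = prod_i (1 + exp(a_i + (W h)_i))
        log_terms.append(b @ h + np.sum(np.logaddexp(0.0, a + W @ h)))
    return logsumexp(log_terms)

def exact_log_likelihood(V, W, a, b):
    """Average exact log-likelihood of a batch of visible configurations V."""
    log_Z = exact_log_Z(W, a, b)
    # Unnormalized log-probability with the hidden units summed out.
    log_p = V @ a + np.sum(np.logaddexp(0.0, b + V @ W), axis=1)
    return np.mean(log_p) - log_Z
```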
- It is not clear to us which artifacts the reviewer refers to. The samples in Fig. 3B are real samples from the learned RBM and not the probabilities for each pixel. The white dots that sometimes appear in the black background are rare events due to fluctuations at T=1. Training further, one starts to observe overfitting effects. If necessary, we can use the mean values of the pixels instead of samples, which smooths the images into gray levels.
- The learning rate we used seemed to work well in our case and was maintained for all experiments and methods, for the sake of comparison. While we agree that keeping the learning rate constant is generally not the best strategy, we also believe that it is an easier setting in which to compare different methods, avoiding the adjustment of more and more hyperparameters that would introduce hard-to-control effects. Moreover, adjusting the learning rate would affect both strategies using PCD (i.e., with or without pre-training) similarly. We propose to show experiments with smaller learning rates in the Appendix, showing that the overall behavior between training schemes remains unchanged. We will also include error bars in the LL figures to illustrate the typical fluctuations of this measure.
- We will include in the appendix the comparison between the samples obtained with PTT and those obtained after sampling the HGD model with "classical" MCMC steps (several hours long). We show this data in the attached PDF. They both match perfectly. We also show that PT sampling fails to reproduce the upper cluster. We could not compare with the modified PT during this rebuttal week, but we will do it in the final version.
Language:
- We thank the referee: we should have defined these quantities better, and we will do so in the final version.
- RBMs are much easier to interpret than other models based on DNNs. The main reasons for this are, first, that the weights of the latent variables interact directly with the visible nodes and therefore offer many possibilities for analysis: SVD, looking at their values as a function of the visible nodes, etc. For example, Ref. [15] shows that protein functionality can be inferred from the RBM's latent space. Second, Ref. [7] showed that the RBM can be mapped to a system of interacting visible nodes, from which not only pairwise couplings between nodes can be inferred, but also higher-order couplings. Third, Ref. [8] showed that the minima of the approximated free energy can be used to cluster data hierarchically. One can accurately approximate the free energy of the RBM because it has a very simple structure. There are also many recent papers that use the RBM for interpretive applications in biology. A list can be found in the rebuttal to reviewer SD84.
- We will add the citation in the final version.
- We will add more precise citations to the figures for the final version.
Regarding point 1
The rebuttal does not answer my points. I was not arguing that PT would perform better overall, but that it would perform better than PCD. This is well established for RBMs. That PT can fail is well known, but it fails less often than PCD - in both examples shown in the pdf the authors provided, PCD would fail, since the distances between the modes are large, while PT would only fail in one of the two cases.
The authors argue based on [41] that first-order transitions are more common. I have taken a look at [41] and it is not clear to me why the analysis should apply to the case of binary RBMs in general. The analysis is based on a continuous relaxation, arguing that the discrete spin vectors lie on a low-dimensional subspace of the data, and then some of the discrete variables are replaced by variables taking values in [-1,1] to model this subspace. This changes the topology of the space from the discrete metric to a continuous metric. I see no reason to doubt the analysis on the continuous topology. But it is not clear to me how the two types of phase transitions can be differentiated on the original discrete topology, or to what degree this is an artefact of the relaxation approach used. E.g., we know of continuous RBM types where all transitions are second order: the Gaussian-binary RBM (since it is a mixture of Gaussians whose means go to 0 as temp → ∞). Indeed, in the original work proposing PT for RBMs ([r1] http://proceedings.mlr.press/v9/desjardins10a/desjardins10a.pdf ), the authors showcase PT on a toy dataset that uses 4 modes in a 4x4 image dataset. This dataset fulfills the basic assumption of [41] that the data points cluster around a few common points that are distant in the discrete topology (and also the continuous topology), and [r1] shows that PT outperforms PCD by a wide margin at a similar computational budget.
point 2: Thanks for providing additional experiments. However, to make the claims regarding better LL, you need to also perform multiple runs, since all your results depend on the RNG and initialisation.
point 3+4: The artefacts I refer to are the errors in the samples produced. We would expect that a well-trained RBM would learn that large areas of the images are black, without any noise. The presence of this noise in all trained models signals that the dataset is not learned well. This is most likely due to the large learning rate used. Note that already [r1] and the work by Tieleman, 2008 ([r2] https://www.cs.toronto.edu/~tijmen/pcd/pcd.pdf ) used smaller learning rates with a decaying schedule.
About the learning rate: As discussed with other referees, we believe that it would be better to reduce the learning rate for practical applications. To allow comparison between methods, we apply the same scheme to all training methods. A high learning rate exacerbates the problems associated with the lack of thermalization during PCD training, not the other way around. Furthermore, the learning rate is not an absolute value: the optimal value for each machine strongly depends on the number of visible and hidden nodes used, and also on the number of MCMC steps. The machines discussed in Tieleman's paper do not have the same number of nodes as ours, so a direct comparison of learning rates does not make sense.
Our samples show no artifacts. Tieleman's work shows the probabilities, not the variables. This is obvious because the digit images are displayed in grayscale and not in black and white pixels. Displaying probabilities is common in the literature for binary variables and image data. When we display the probabilities, we get a perfect black in the background, as in Tieleman, and our digits are also better shaped and more diverse when we work with the full MNIST at this learning rate. Finally, the (random) presence of white pixels could be more related to the value of the bias, which is set at the beginning of the training. The bias is initialized to reproduce the empirical frequencies without weights, before learning starts. However, for pixels that do not fluctuate, a threshold is used to avoid infinite bias. The flipping of some spins is probably related to a "weak" threshold.
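A minimal sketch of the kind of bias initialization described above; the actual threshold value used by the authors is not specified here, and `eps` is our illustrative choice.

```python
import numpy as np

def init_visible_bias(data, eps=1e-4):
    """Initialize visible biases so that the weightless RBM reproduces the
    empirical pixel frequencies: sigmoid(a_i) = p_i.

    Pixels that never (or always) fire would give an infinite bias, so the
    frequencies are clipped to [eps, 1 - eps] before taking the logit.
    """
    p = np.clip(data.mean(axis=0), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))
```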
**Coulomb relaxation.** Concerning the question of whether the presence of a first-order phase transition is just an artifact of the Coulomb relaxation: this is actually slightly more involved to show for the RBM than for the Coulomb machine, because for Coulomb machines the energy term is directly proportional to the inverse temperature. However, we could add to the paper a formal argument which goes schematically as follows: writing down the local minima of the free energy (a function of the magnetization in the reduced intrinsic space) and looking at how the equilibrium is displaced with temperature shows that their energy varies in a non-linear way with temperature, and non-uniformly with respect to their position on the magnetization manifold, while the entropy contribution does not change. This necessarily implies that changing the temperature even slightly will break the highly sensitive equilibrium obtained between these states corresponding to the multimodal distribution, and will typically favor one state among all. This scenario is expected to occur as soon as the data are located in a low-dimensional intrinsic space whose dimension is much smaller than that of the embedding space, which is quite common in our point of view, at least for the type of data we are interested in.
We thank the reviewer for their reply. Our paper and also Ref. [41] focus on binary-binary RBMs describing highly clustered datasets, where the different modes are separated by large barriers that are already visible at the PCA level and where typical training approaches drastically fail.
On the comparison of PT over PCD: The cases discussed in the rebuttal are good examples of situations where PCD training can work better than PT training (the version of PT where a temperature swap is followed by only one or a few Gibbs steps). In these datasets, the equilibrium measure can be sampled by performing many accumulated Gibbs steps, but not by performing many PT steps. This is because visiting the higher temperatures makes the sampling less ergodic than staying at T=1: each time the chain visits the high temperatures, a cluster is suppressed, and standard sampling does not have enough time to re-nucleate it before the chain returns to the high temperatures.
You can solve this problem by significantly increasing the number of MCMC steps between temperature exchanges and/or the number of temperatures in the ladder, but in this case the overall acceleration of PT becomes less clear, because you are forced to sample in parallel systems that are not particularly useful for anything else. The total cost for a given number of sampling steps at T=1 is then that number multiplied by the number of temperatures in the ladder. For this reason, we said that PT is computationally very demanding. Ref. [40], for example, shows an order-of-magnitude improvement over AGS in the jumps' timescale in the case of MNIST 0/1, but at the cost of simulating 10 systems in parallel, which is much slower, as reported in Table S4. They also report no further acceleration when the number of chains is increased from 10 to 1000. For proteins and the 2D Ising model, the situation is even worse.
Finally, we would like to mention that the reference https://www.sciencedirect.com/science/article/pii/S0004370219301948 cited by the reviewer analyzes machines trained with CD-25 and CD-1. CD training produces very pathological models with a high LL but with a very strange dynamical behavior and, more importantly, models that are completely unable to generate proper samples (see Ref. [21] or Ref. [r2] suggested by the reviewer). We are not sure to what extent the results of this work are affected by this.
About the toy dataset: The dataset used in reference [r1] provided by the reviewer is not a good example. It is simply too small (N=16). The divergence of the times associated with bimodality increases as exp(N). For N=16, the thermalization problems associated with first-order phase transitions are much smaller than the ones we address in our work.
On first-order transitions in RBMs: First-order transitions in temperature are common when working with “clustered” data sets. This is not necessarily the case for most datasets where training algorithms are tested in RBMs.
We disagree with the reviewer's statement that first-order transitions do not occur in other versions of RBMs. Could the reviewer give us a reference stating that all transitions in the Gaussian binary RBM are second order? We would be very interested in such results for our research.
Both first- and second-order transitions have been reported in Gaussian mixture clustering, see for example [Lesieur, T., De Bacco, C., Banks, J., Krzakala, F., Moore, C., & Zdeborová, L. (2016, September). Phase transitions and optimal algorithms in high-dimensional Gaussian mixture clustering. In 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton) (pp. 601-608). IEEE] or [J. Schneider (Phys Rev E 1998), First-order phase transitions in clustering], the latter showing precisely that first-order transitions in temperature are frequent in this problem. Moreover, using the method proposed in [25], it can be measured that an RBM trained on a simple dataset consisting of a Bernoulli mixture of two clusters undergoes a first-order transition when the inverse temperature is interpolated from 1 to 0.
On the other hand, the same arguments about RBMs learning PCA through second-order phase transitions in the early stages of training also apply to binary-binary, binary-Gaussian, and Gaussian-binary RBMs, as in Ref [Decelle, A., & Furtlehner, C. (2021). Restricted Boltzmann machine: Recent advances and mean-field theory. Chinese Physics B, 30(4), 040202].
The manuscript proposes to apply
Aurélien Decelle and Cyril Furtlehner. Exact training of restricted Boltzmann machines on intrinsically low dimensional data. Physical Review Letters, 127(15):158303, 2021.
to initialise persistent contrastive divergence (PCD) learning for RBM training and estimating the log likelihood / partition function of RBMs.
Strengths
While one may argue that the novelty is limited because the interesting theoretical work was done in the abovementioned paper by Aurélien Decelle and Cyril Furtlehner, the idea is sound and the approach may be useful in practice.
Weaknesses
The novelty is limited because the interesting theoretical work was done in the abovementioned paper by Aurélien Decelle and Cyril Furtlehner.
My main criticism refers to the empirical evaluation, and I think these questions should be addressed:
- Was enough effort put into the baseline methods (including hyperparameter choice)?
- What would be an example where the PCA decomposition is misleading (i.e., not helping or even slowing down the process)?
- What about MNIST with all 10 digits?
Details (not ordered by importance):
- „On the diametrically opposite side (on interpretability) are generative ConvNets [9, 10], where the energy function is formulated as a deep neural network, which are capable of synthesizing photorealistic images but are almost impossible to interpret as a physical model.": Not clear, perhaps add half a sentence to elaborate.
- „second-order phase transition": define what this is already in the beginning.
- Beginning of section 2: I suggest to add the analysis in Fischer, Igel. Bounding the Bias of Contrastive Divergence Learning. Neural Computation, 2011 (https://direct.mit.edu/neco/article-abstract/23/3/664/7646/Bounding-the-Bias-of-Contrastive-Divergence?redirectedFrom=fulltext) to the discussion of the limitations of CD.
- „much better than those obtained with the standard Annealing Important Sampling (AIS) techniques": In [42], several methods are discussed, in particular one based on Bennett's Acceptance Ratio method (BAR), which performed in general better than standard AIS. How does the proposed method perform in comparison to BAR?
- What if linear PCA is not well suited to find a good representation of the data because of a highly non-linear latent structure?
- I am not fully happy with the selected benchmark tasks. In particular: How does the method perform on MNIST with all 10 digits?
Minor comments:
The reference list should be revised. Inconsistent capitalisation, author first name abbreviations, etc.
Questions
My main criticism refers to the empirical evaluation, and I think these questions should be addressed:
- Was enough effort put into the baseline methods (including hyperparameter choice)?
- What would be an example where the PCA decomposition is misleading (i.e., not helping or even slowing down the process)?
- What about MNIST with all 10 digits? See my comments below for further questions.
Limitations
I think the limitations should have been explored in more depth. What would be an example where the PCA decomposition is misleading (i.e., not helping or even slowing down the process)?
On the PCA: The reviewer doubts the reliability of the PCA to best describe the multimodality of the data. We agree with the reviewer that the slowest mode could be related to a non-linear decomposition of the dataset and that, if this were the case, our method would not be particularly useful to avoid it. Our method is designed for data that is highly clustered in a low-dimensional space, which is typically already captured at the PCA level. This is the case for the datasets analyzed in this work (and the typical situation encountered in DNA/protein datasets). We will try to make this distinction clearer and discuss this limitation in a new section: "Limitations".
RBM learning is triggered by the encoding of the covariance matrix (and thus the PCA); it is only at a second stage that it starts to encode the higher-order statistics. The problem is that if the dataset is very clustered (as in the cases discussed in this work), the permanent chains used to estimate the negative part of the gradient remain very far from equilibrium from that point on (chains typically get trapped in one of several isolated modes, no matter their statistical weight), leading to very biased trainings. In these situations, our method can really make a difference. In contrast, if the encoding of the first PCA components does not imply dividing the phase space into separate clusters (as is the case, for instance, when dealing with the entire MNIST dataset), the encoding of these first directions does not imply insurmountable mixing times (Fig. 3C of Ref. [21] in the manuscript estimated that the first transition implied a growth of thermalisation times of order only 100 MCMC steps, which can be easily handled by PCD), which means that pre-training brings no particular speed-up in that case. The situation could be different for larger dimensions due to critical slowing down (Ref. [28] explicitly computes the mixing times for training using resized versions of MNIST). For this reason, training the entire MNIST dataset is much easier than training the 0-1 MNIST one.
In our experience, however, for highly clustered datasets, the strongest time divergences appear at the beginning of training (when learning the PCA), because this is the moment when the different modes are separated by very large barriers; later stages are typically much easier to sample because the barriers become finite. This statement was quantified for some of the datasets used in this work in Ref. [25] (see for instance Fig. 8 in [25] for the 0-1 MNIST). This is not the case for the entire MNIST, where mixing times grow softly along the training (see Fig. 3 [21]).
To make these statements clearer, we propose to show the explicit evolution of the exponential correlation times of the different models along the training time, which show very sharp jumps at an initial stage when the first eigenvalues of W grow, and then decrease to much lower values that only very softly increase in time.
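A sketch of how such an exponential correlation time could be estimated from a sampling run. This is our own illustrative procedure, measuring the autocorrelation of the projection of a Gibbs chain onto the first principal direction; the authors' exact estimator may differ.

```python
import numpy as np

def autocorrelation(x, max_lag):
    """Normalized autocorrelation of a 1D time series x for lags 0..max_lag."""
    x = x - x.mean()
    var = x.var()
    return np.array([np.mean(x[:len(x) - t] * x[t:]) / var
                     for t in range(max_lag + 1)])

def exp_correlation_time(trajectory, pca_direction, max_lag=1000):
    """Estimate tau_exp from the decay of the autocorrelation of the
    projection m(t) = <v(t), u_1> of a Gibbs chain on the first PCA axis."""
    m = trajectory @ pca_direction           # shape: (n_steps,)
    c = autocorrelation(m, max_lag)
    # Fit C(t) ~ exp(-t / tau) on the initial decay, while C stays positive.
    valid = np.where(c > 0.05)[0]
    slope, _ = np.polyfit(valid, np.log(c[valid]), 1)
    return -1.0 / slope
```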
I have read the rebuttal.
Challenge: Explain what you mean by second-order phase transitions by defining it using rigorous math without using any analogy from physics.
I did not criticize the choice of relying on PCA. However, for each heuristic it is important to also think about the cases where it does not work. I would even suggest to include an example for a problem where the approach does not help or even leads to worse results (e.g., because PCA does not nicely decompose the problem).
Along the same line, I support to include and discuss the experiments on full MNIST. It is instructive to see that you do not get a speed-up here and that you have an explanation for this.
One of the goals of statistical physics is to determine the typical configurations to be expected for a given set of model parameters. In the case of RBMs, the question is what typical visible (and/or hidden) variables one expects to observe. For example, one can ask what typical energies one expects to observe for a set of parameters $\theta$. For this purpose, let us rewrite the partition function as a sum over all possible values of the energy $E$:
$$Z = \sum_E \mathcal{N}(E)\, e^{-E} = \sum_E e^{S(E) - E} = \sum_E e^{-N f(E)},$$
where $\mathcal{N}(E)$ is the number of states with energy $E$, $S(E) = \log \mathcal{N}(E)$ is the entropy, and $F(E) = E - S(E)$ is what we refer to as the free energy ($f = F/N$ is the free energy per variable, with $N$ the total number of variables). The intrinsic free energy $f$ is a self-averaging quantity, which means that it does not depend on $N$ for large values of $N$. $E$, $S$ and $f$ all depend on the model parameters.
For $N \to \infty$, the sum in $Z$ is dominated by the states with minimum free energy (and the same applies to the probability $p(E) \propto e^{-N f(E)}$). A first-order transition occurs in the parameters when the first partial derivative of $f$ with respect to one of the parameters is discontinuous (in physics, the parameter is usually the temperature). A second-order transition occurs when the first partial derivative is continuous, but the second partial derivative is discontinuous. The same discussion done with the energy $E$ can be done with other observables, like the magnetization $m$.
In the case of the RBM, the first learning transition can be mapped to a very simple model, the so-called Curie-Weiss model, as recently described in Ref. [Bachtis, D., Biroli, G., Decelle, A., & Seoane, B. (2024). Cascade of phase transitions in the training of Energy-based models. arXiv preprint arXiv:2405.14689]. The Curie-Weiss model is fully solvable, and for this reason the meaning of a second-order transition is perhaps easier to understand here. Let's give it a try. The Curie-Weiss model consists of a set of $N$ discrete variables $s_i \in \{-1,+1\}$. We define the (exponential) distribution over these variables as $p(\boldsymbol{s}) = \frac{1}{Z} \exp\big( \frac{\beta}{2N} ( \sum_{i} s_i )^2 \big)$, where $\beta$ is a parameter of the model and $Z$ the partition function. The question is to understand the "structure" of the distribution as a function of $\beta$. We expect that for $\beta$ small, each spin behaves as an isolated Bernoulli random variable, while for a large value of $\beta$, the distribution of the system is dominated by configurations where the variables have (almost) all the same sign. The key mathematical aspect is to study the system in the limit $N \to \infty$ and the transition between both regimes. A way to study this distribution is to study the moment generating function, adding a field term $h \sum_i s_i$ in the exponential and computing the partition function as a function of $h$. It is possible to show (rigorously in this precise model) that the probability of the magnetization $m = \frac{1}{N}\sum_i s_i$ is given by (in the large $N$ limit) $P(m) \asymp e^{-N f_\beta(m)}$, where $f_\beta$ is the large deviation function. The structure of $f_\beta$ is such that for $\beta < 1$ it is a convex function with only one minimum at $m=0$, while for $\beta > 1$ it has two symmetric minima $\pm m^*(\beta)$. The values of $m^*(\beta)$ (as a function of $\beta$) are continuous: $m^* = 0$ for $\beta \le 1$, and then positive until it saturates to $1$ as $\beta \to \infty$. In practice, it means that the distribution passes from a unimodal distribution to a bimodal one as $\beta$ increases. The point where this happens is at $\beta = 1$ and is called a second-order phase transition. A particularly interesting property is that at $\beta = 1$ the function $f_\beta$ develops a non-analyticity in its second-order derivative, which can be linked to long-range fluctuations.
For a recent review, see [Kochmański et al., Curie–Weiss magnet—a simple model of phase transition, European Journal of Physics, 2013]; for more rigorous results, see [Sinai, Theory of Phase Transitions: Rigorous Results, 1982].
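The unimodal-to-bimodal scenario can be visualized directly by plotting the large deviation function of the magnetization, which for the Curie-Weiss model has the closed form used below. This is a small numerical illustration of our own, not taken from the paper.

```python
import numpy as np

def cw_rate_function(m, beta):
    """Large deviation function f_beta(m) of the Curie-Weiss magnetization:
    P(m) ~ exp(-N f_beta(m)), with f = -beta*m^2/2 - entropy(m) (+ const)."""
    entropy = -((1 + m) / 2 * np.log((1 + m) / 2)
                + (1 - m) / 2 * np.log((1 - m) / 2))
    return -beta * m**2 / 2 - entropy

m = np.linspace(-0.999, 0.999, 2001)
for beta in (0.5, 1.0, 1.5):
    f = cw_rate_function(m, beta)
    minimizer = m[np.argmin(f)]
    # beta < 1: single minimum at m = 0 (unimodal P(m));
    # beta > 1: the argmin lands on one of the two symmetric minima +-m*(beta).
    print(f"beta = {beta:.1f}: a minimizer of f is m = {minimizer:+.3f}")
```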
As explained in Refs. [26,27], the encoding of the PCA arises naturally in the analysis of the learning dynamics of RBMs trained with maximum likelihood and initialized with small weights. This means that at the beginning of the learning, the singular vectors of the weight matrix align with some of the principal directions of the dataset. However, the behavior at longer learning times may differ strongly from this picture, and in fact, for many datasets (for instance images), we observe in practice that the SVD of the weight matrix becomes completely decorrelated from the PCA of the dataset at long training times. This has been shown empirically in [26], and we observe it in general. Yet, what we cannot tell is whether the model will first learn the PCA and stop there, or on the contrary will converge at the end of the training to more complex decompositions of the dataset. In the case of MNIST, we clearly see that for a "good" training, the directions are no longer perfectly aligned with the PCA of the dataset. For very clustered, low-dimensional datasets, the most important PCA directions seem to be very stable along the training.
Since standard maximum likelihood training is triggered by the encoding of some directions of the PCA, we find it very unlikely that the pre-training approach can be damaging for any dataset, as the weight matrix is completely free to evolve according to the gradient and the directions are not fixed. We simply pre-train what would normally be learned during the first epochs of the training if the negative part of the gradient were perfectly computed.
We know that image datasets are not typically clustered at the level of the PCA, even though some, like CIFAR, are very multimodal. We do not expect this method to be of much help there.
MNIST
We will include the full MNIST experiments in the final version of the paper. We agree with the reviewer that this discussion helps a lot to understand the limitations of the method.
We thank the reviewer for their careful reading of the paper and for their constructive comments. We attempt to respond to the comments individually below.
The concerns about the novelty of this paper have been answered in the section "Author Rebuttal", as they were shared by several reviewers.
Concerning the rest of the comments:
On the PCA: The reviewer doubts the reliability of the PCA to best describe the multimodality of the data. We agree with the reviewer that the slowest mode could be related to a non-linear decomposition of the dataset and that, if this were the case, our method would not be particularly useful to avoid it. Our method is designed for data that is highly clustered in a low-dimensional space, which is typically already captured at the PCA level. This is the case for the datasets analyzed in this work (and the typical situation encountered in DNA/protein datasets). We will try to make this distinction clearer and discuss this limitation in a new section: "Limitations". We further elaborate our answer in the official comments.
MNIST: We did not discuss the entire MNIST dataset because it is much easier to train than the 0-1 version, since the barriers are much less pronounced in that case. The full MNIST dataset can be well trained with a PCD-10 scheme, and in fact we do not observe any particular improvement from using the pre-training strategy (nor do we observe that it gets worse, because the quality is limited by the number of steps k used in the second stage of the training, PCD-k). Being larger, CelebA is a much better test for our algorithm, because PCD-10 struggles to produce decent images. We will include some analysis in the new version of the paper (and discuss the performance on the complete MNIST in the Supplemental).
Baseline methods: The referee wonders whether we made any effort to optimize the baseline methods for a meaningful comparison. Our approach only proposes to include a pre-training stage before initiating the standard training methods. In this sense, we tried to optimize the baseline methods first and used the same hyperparameters afterwards. Nevertheless, we agree with the referee that no discussion of the hyperparameters was included in the manuscript, and it is an important point. We will include some figures in the supplemental material showing the performance at different learning rates and numbers of added hidden nodes.
Comments:
- Interpretability: Simple EBMs can be directly mapped to ferromagnetic spin models in physics; for instance, a binary Boltzmann machine without hidden units is formally identical to a (pairwise) disordered Ising model, and thus the trained model parameters can be interpreted as physical pairwise interactions between the variables. This strategy (typically referred to as the inverse Ising approach) is widely used in data-driven interpretability approaches in biology, such as the direct coupling analysis (DCA) of protein families or the inference of neuron connections from spike data in neuroscience. Similarly, RBMs can be mapped to a slightly more complex spin model where interactions between spins are not only pairwise but multi-body. The interactions between variables in a deep convolutional generative neural network cannot be easily written in that way, and the most prominent interactions between variables cannot be easily inferred from the analysis of the network. Furthermore, the analogy between neural networks and spin systems allows us to apply all the machinery developed in physics during the last decades for exploring the equilibrium behavior of complex systems to analyze the learning or probe the probability distribution function. We will extend the paragraph to make this clear and add a section in the SM illustrating the mapping to a physical system.
- Following the same physical analogy, one can describe the encoding of features during training as a thermal annealing process, along which each new mode in the model is introduced through a phase transition analogous to a paramagnetic-to-ferromagnetic transition in temperature. These kinds of transitions are referred to in physics as second order because they are characterized by a divergence of the second-order statistics of the system's order parameter (in this case, the fluctuations of the magnetization, i.e., the variables projected on each of the PCA components). Second-order phase transitions, also known as critical transitions, have many interesting properties such as universality (identical exponents rule the divergences in very different physical systems satisfying the same basic symmetries and dimensions), the emergence of system-size correlations between variables, and the appearance of equally universal dynamical features, such as a sudden arrest of the dynamics, an effect known in physics as critical slowing down, which in simulations is reflected by a sharp growth of the mixing times with a power of the number of variables. These dynamical effects have recently been studied in the training of RBMs in [D. Bachtis et al., Cascade of phase transitions in the training of Energy-based models, arXiv:2405.14689, 2024].
We acknowledge that phase transitions are not common knowledge in computer science and statistics (another referee raised the same point), so we propose to include a whole pedagogical discussion in the Supplemental Material explaining the mapping between the RBM and a physical system, as well as the dynamical characteristics of first- and second-order phase transitions, as both are important for understanding the properties of the training process and the failures of the sampling one.
- We thank the referee for pointing out the interesting reference by Fischer & Igel, which was unknown to us and will be discussed in a further revision of this work.
- We also thank the reviewer for the good suggestion of comparing our estimation of the log-likelihood against the BAR method, which will be taken into account in the final version of the paper.
We thank all reviewers for their comments. We answer the question about the novelty of this new work in this Author rebuttal, as this question was raised by several reviewers. We answer the remaining questions directly in each reviewer's rebuttal.
On the novelty: Reviewers yudf and SD84 raised concerns about the novelty of the paper, especially regarding the connections with the Decelle & Furtlehner paper. While we rely on the mapping between the RBM and the Coulomb machine proposed in that paper to approximate the low-rank RBM, our work significantly extends its applicability to real data in three ways.
First, D&F's work only studies very simple, low-dimensional synthetic datasets with specific modes and regular features covering the low-dimensional space, and mainly focuses on theoretical aspects. The first added value of the present work is to move from theory to practice, where many details need to be specified to make the technique work for arbitrary data. In particular, we extend this technique up to four intrinsic dimensions, including the direction of the bias, which requires special treatment. This bias improvement is crucial for processing image data and allows us to achieve an additional direction. We have also corrected D&F's calculation of the entropy to ensure that true equilibrium samples can be obtained through a static Monte Carlo procedure, which is crucial for sampling the trained machines quickly. We will highlight these improvements to the method for real data in a new version.
The second novelty is that we propose to use this construction as a pre-training, which has not been investigated in previous work. We show that this combined scheme allows us to deal with real datasets highly clustered in a low-dimensional space (characteristic of genetic or protein data), where standard methods fail. These datasets are much more complex than the ones treated in D&F, as they are also clustered along more directions than the ones we can approximate with the low-rank RBM. In summary, our combined approach not only allows us to match statistics across a few dimensions, but also to capture the fluctuations of high-dimensional real datasets. The use of RBMs on this type of data is particularly interesting from a scientific point of view, as RBMs work very well with tabular discrete data even in situations where data is scarce (which is the case for biological datasets), while allowing systematic extraction of biologically interpretable information from them.
And third, beyond the construction of the Restricted Coulomb Machine, we propose a new and general sampling strategy. This new strategy is supported by recent developments in the theoretical understanding of the learning process in RBMs. In particular, we exploit the progressive encoding of modes through second-order phase transitions during training, in combination with the fast sampling of low-rank RBMs, to propose a strategy that is very similar to the standard parallel tempering method, but where the parameters of the models (saved along the training process) are exchanged instead of the temperature. We show that this strategy can be used both to generate equilibrium samples in highly clustered models, where generating samples using standard methods is not feasible in a reasonable time, and to obtain reliable log-likelihood estimates.
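To make the exchange step concrete, here is a minimal sketch consistent with the description above (our simplified illustration, not the released code): two chains attached to consecutive model snapshots swap their configurations with a Metropolis rule based on the two models' energies, exactly as in parallel tempering but with the temperature ladder replaced by the training trajectory.

```python
import numpy as np

rng = np.random.default_rng(0)

def free_energy(v, W, a, b):
    """RBM free energy of visible configuration v, hidden units summed out."""
    return -(v @ a) - np.sum(np.logaddexp(0.0, b + v @ W), axis=-1)

def ptt_swap(v_lo, v_hi, params_lo, params_hi):
    """Attempt to exchange the configurations of two chains attached to
    consecutive model snapshots saved along the training trajectory."""
    f_lo_lo = free_energy(v_lo, *params_lo)   # F_lo(v_lo)
    f_lo_hi = free_energy(v_hi, *params_lo)   # F_lo(v_hi)
    f_hi_lo = free_energy(v_lo, *params_hi)   # F_hi(v_lo)
    f_hi_hi = free_energy(v_hi, *params_hi)   # F_hi(v_hi)
    # Detailed balance for exchanging the two configurations.
    log_acc = (f_lo_lo + f_hi_hi) - (f_lo_hi + f_hi_lo)
    if np.log(rng.random()) < log_acc:
        return v_hi, v_lo   # swap accepted
    return v_lo, v_hi       # swap rejected
```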
In the attached PDF file, you will find some figures answering reviewer NWGy's questions about the PTT sampling method and its differences from the standard parallel tempering method.
The paper proposes to initialise RBMs based on a recently developed convex approximation (Aurélien Decelle and Cyril Furtlehner, 2021) for further PCD-based training. Reviewers were in general not negative about the paper, and found the suggested methodology interesting. However, some issues remained after the rebuttal. Specifically, the write-up should be improved by adapting it to the vocabulary of the ML community and, more importantly, the experimental evaluation should be improved (including multiple runs to allow evaluation of the statistical significance of the results, proper hyperparameter selection, and results on full MNIST).
On a side note: The authors forgot to remove the GitHub link that reveals their identity, which I did not take into account but which could lead to a direct rejection.