Restricted Spectral Gap Decomposition for Simulated Tempering Targeting Mixture Distributions
We bound the restricted spectral gap of the simulated tempering chain for mixture distributions and show that simulated tempering Metropolis–Hastings efficiently samples from Gaussian mixtures.
Abstract
Reviews and Discussion
There has been a lot of work in recent years on establishing the complexity of Langevin, MALA, or related MCMC algorithms when sampling from a log-concave density. However, such approaches fall short when the target distribution is severely multi-modal. A related line of work studies the complexity of tempering-based algorithms (such as parallel tempering, simulated tempering, or tempering SMC) under the weaker assumption that the underlying Markov kernels mix well at least locally, in various subsets forming a partition of the sampling space. This paper continues this line of work and uses the concept of restricted spectral gaps to derive new, tighter complexity bounds for simulated tempering.
The first part derives new results under very general assumptions on the target distribution and the local kernels, whereas the second part specialises these results to the case where the target distribution is a Gaussian mixture and the local kernels are random walk Metropolis kernels.
Strengths and Weaknesses
Strengths
- The paper is nicely written; I enjoyed reading it.
- From a cursory reading, it looks like a significant improvement over previous results of the same type. In particular, the paper is quite close in spirit to Ge et al. (2018), but Ge et al. consider the continuous-time setting (which is typically easier to deal with, because no Metropolis correction is required), and this paper uses novel tricks such as the projected Markov chain (see Remark 3). Disclaimer: I have not read Ge et al. (2018).
Weaknesses
- A journal paper might be a better format for presenting this kind of result.
- The direct relevance to ML is a bit debatable, especially the part on sampling from Gaussian mixture distributions (since it is a toy problem).
Questions
- It would be nice if the authors could discuss a bit more the relevance of their work to multi-modal target distributions arising in ML, e.g., posterior distributions for neural networks.
Limitations
- Line 42: there are more recent/relevant references for SMC samplers than Liu & Chen, e.g., Jasra et al. (2006, JRSSB, read paper), the book of Chopin and Papaspiliopoulos (2020), or Syed et al. (https://arxiv.org/abs/2408.12057).
Final Justification
Solid piece of work, I was happy with the initial version (rating 5), and I see no reason to change it.
Formatting Issues
none
We sincerely appreciate your careful review of our work and your positive feedback. Below please find our point-by-point responses.
Weakness 1. We appreciate the perspective that the depth and technical nature of our results might be well-suited to a journal format. However, we expect that our contributions will be of strong interest to conference audiences as well, since quite a few similar works [1, 2] have been presented at conferences like NeurIPS and ICLR. We hope our paper will likewise contribute meaningfully to ongoing discussions in this area.
Weakness 2. We thank the reviewer for the thoughtful question. Theorem 5 establishes the convergence guarantees for STMH when sampling from a mixture of Gaussians with a shared covariance matrix. This result can be naturally extended to target distributions that are sufficiently close to such Gaussian mixtures, as shown in [1]. In such cases, the time complexity would additionally depend on the closeness between the actual distribution and the Gaussian mixture approximation. We also anticipate that similar techniques can be adapted to handle mixtures of log-concave distributions that are translates of each other, or more broadly, distributions that are well approximated by such mixtures.
Additionally, there are realistic applied settings where our assumptions may approximately hold, making the proposed framework potentially applicable. For instance, [3] showed that the posterior distribution for Bayesian variable selection (which is one of the most common model selection problems in statistics) can be approximated by a mixture of Gaussians under certain conditions.
Another relevant application arises in the context of (Bayesian) deep neural networks. Consider the case where the target is $\pi \propto e^{-f}$. In the context of DNNs, $f$ is the loss composed with a neural network function. In this setting, the complexity of sampling will depend on the structure of $f$. If the widths of the neural networks tend to infinity, several works have shown that $f$ satisfies a PL-type condition, and hence by recent works like [4, 5, 6], $\pi$ satisfies Poincaré or log-Sobolev type isoperimetric inequalities. Therefore, algorithms like Langevin Monte Carlo, proximal samplers, or MALA could be used to sample from the posterior. However, in the practical regime where the width is finite, the corresponding $\pi$ does not satisfy any isoperimetric inequality and is highly multimodal, and the proposed algorithm is likely to be useful.
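For reference, the Polyak–Łojasiewicz (PL) condition alluded to above is the standard gradient-domination inequality (a textbook definition, not notation specific to this paper):

$$\frac{1}{2}\,\|\nabla f(x)\|^2 \;\ge\; \mu \left( f(x) - \min_y f(y) \right) \quad \text{for all } x \text{ and some } \mu > 0,$$

under which results of the type cited in [4, 5, 6] relate the Poincaré or log-Sobolev constant of $\pi \propto e^{-f}$ to the PL constant $\mu$.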
Question 1. This has been addressed in the response to Weakness 2.
Limitation. We thank the reviewer for pointing out the more relevant references related to sequential Monte Carlo methods. We will incorporate them into the revised version of the manuscript.
[1] H. Lee, A. Risteski, and R. Ge. "Beyond log-concavity: Provable guarantees for sampling multi-modal distributions using simulated tempering Langevin Monte Carlo." Advances in neural information processing systems 31 (2018).
[2] O. Chehab, A. Korba, A. J. Stromme, and A. Vacher. "Provable convergence and limitations of geometric tempering for Langevin dynamics." In The Thirteenth International Conference on Learning Representations (2025).
[3] I. Castillo, J. Schmidt-Hieber and A. Van der Vaart. "Bayesian linear regression with sparse priors." The Annals of Statistics 43.5 (2015): 1986-2018.
[4] Y. Gong, N. He and Z. Shen. "Poincare inequality for local log-Polyak-Łojasiewicz measures: Non-asymptotic analysis in low-temperature regime." arXiv preprint arXiv:2501.00429 (2025).
[5] A. Y. Chen and K. Sridharan. "Optimization, isoperimetric inequalities, and sampling via Lyapunov potentials." arXiv preprint arXiv:2410.02979 (2024).
[6] S. Chewi and A. J. Stromme. "The ballistic limit of the log-Sobolev constant equals the Polyak-Łojasiewicz constant." arXiv preprint arXiv:2411.11415 (2024).
I am happy with the authors' reply. The "weaknesses" I mentioned were minor points anyway.
This paper presents a new theoretical framework for analyzing simulated tempering algorithms, particularly in the context of sampling from multimodal distributions. The authors focus on bounding the restricted spectral gap, which allows for analyzing mixing times without requiring global spectral gap assumptions—often hard to establish in high dimensions or multimodal settings. Their main contribution is a discrete-time Markov chain decomposition theorem that yields lower bounds on the restricted spectral gap of the simulated tempering chain. This framework is applied to a simulated tempering Metropolis–Hastings (STMH) sampler targeting Gaussian mixture models, showing that the algorithm mixes in time polynomial in the mode separation and logarithmic in the target accuracy.
优缺点分析
Strengths:
- Quality: The technical quality of the paper is high. The decomposition theorem is rigorously developed and soundly argued. The authors make a meaningful extension to discrete-time chains, which is valuable for practical MCMC algorithms like Metropolis–Hastings. The proofs, though deferred to the appendix, are sketched well in the main text and the assumptions are clearly stated.
- Clarity: Overall, the paper is clearly written, especially for readers familiar with MCMC theory and spectral gap analysis. However, certain key steps (e.g., interpretations of Assumptions 2 and 3, or the intuition behind Equation (2)) could benefit from additional motivation or examples. The paper assumes a fairly expert audience.
- Significance: The contribution is significant for the MCMC and sampling community, especially in contexts where standard spectral gap bounds are either loose or intractable. The approach improves upon prior work (e.g., Ge et al., 2018) by simplifying the projection chain and relaxing some assumptions. Polynomial dependence on d (mode separation) is significantly better than many standard samplers (e.g., unadjusted Langevin or proximal methods) which suffer exponential dependence on d.
- Originality: The paper offers a novel discrete-time decomposition for restricted spectral gap analysis. While inspired by prior continuous-time analyses, the authors substantially modify the approach, especially in how they define and analyze the projected chain. The use of restricted spectral gap (vs. full spectral gap) as the main analytical tool is also a meaningful shift.
Weaknesses:
- Limited assumptions (e.g., Assumption 2): the paper does not offer constructive guidance or examples on how to ensure this assumption holds in practice for general mixture models.
- The dependence on dimension d is exponential in their result (hidden in constants due to the use of random walk Metropolis–Hastings).
- Limited experiments: the empirical study focuses only on a symmetric 2D Gaussian mixture, which is a toy example. There are no comparisons with baseline methods (e.g., Langevin, annealed Langevin, or proximal samplers), and no error bars or statistical significance reported.
Questions
On Assumption 2: While the theoretical condition in Equation (2) is essential to the decomposition, in practical settings (especially for general Metropolis–Hastings chains), verifying or constructing M_ij that satisfy the inequality seems challenging. Can the authors provide further guidance or heuristics on how to check or enforce this assumption in practice?
Limitations
The conclusion discusses some limitations, in particular on Assumptions 2 and 3.
Final Justification
kept my score
Formatting Issues
no
First of all, we sincerely appreciate your careful review of our manuscript and your thoughtful feedback. Below please find our point-by-point responses.
Weakness 1. We agree that Assumption 2 can be difficult to verify in general and typically needs to be assessed on a case-by-case basis. However, since our analysis is based on the restricted spectral gap, it is sufficient to verify this assumption only within a subset of the state space—often a compact set—rather than globally. This makes the assumption more tractable in practice. Specifically, for other Metropolis-Hastings samplers such as MALA (Metropolis-adjusted Langevin algorithm), inequality (2) in Assumption 2 essentially reduces to bounding the acceptance and proposal probabilities. Since we are focusing on compact subsets of the state space, obtaining such bounds is usually feasible. Inequality (3) in Assumption 2 essentially assumes a bound on the restricted spectral gap of each component chain. Again, on compact spaces, there are standard techniques for obtaining such bounds, such as the path methods of [1]. Whether the resulting mixing time bound is sharp has to be checked on a case-by-case basis.
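For concreteness, the quantities referred to here are the standard Metropolis–Hastings ingredients (generic notation, not necessarily the paper's exact conventions): with proposal density $q(x, y)$ and target $\pi$, the acceptance probability is

$$\alpha(x, y) = \min\left\{1, \; \frac{\pi(y)\, q(y, x)}{\pi(x)\, q(x, y)}\right\},$$

and on a compact set, upper and lower bounds on $\pi$ and $q$ translate directly into bounds on $\alpha$, and hence on the transition kernel, which is what inequality (2) requires.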
Regarding the dependence on the dimension d, we acknowledge that the constants in our complexity bound scale exponentially with d when using the random walk Metropolis–Hastings (RWMH) kernel. This is a known limitation of RWMH in high-dimensional settings and has already been discussed in Remark 5. We will make it more explicit in the revised version.
[1] Yuen, Wai Kong. "Applications of geometric bounds to the convergence rate of Markov chains on R^n." Stochastic Processes and their Applications 87.1 (2000): 1-23.
Weakness 2. Thank you for the feedback. The goal of our experiment was to illustrate the theoretical result, and we found that the empirical behavior aligned well with the theory. However, we agree that the example used is a toy setting, and we have not included comparison with baseline methods or error bars. In the revised version, we will follow your suggestion to improve the presentation of this toy 2D Gaussian mixture example and strengthen the empirical evaluation by adding a more realistic example.
Question 1. This has been addressed in the response to Weakness 1.
Thank you for your answer, and for the other answers to reviewers, which I have read as well; they are clarifying. I keep my positive opinion on the paper and will maintain my score.
The paper analyzes the spectral gap of simulated tempering under a model of multimodality. In contrast to much related work, one strength of Section 2 of the manuscript is that it does not assume a specific algorithm for the per-chain base MCMC algorithm. The authors then specialize the general framework to a specific algorithm in Section 3 (namely, random walk MH) to demonstrate the utility of their approach.
Strengths and Weaknesses
I appreciate the honesty of the authors in stating that "It is important to note that this approximate STMH sampler cannot be implemented in practice, as it requires knowledge of the component distributions". I am OK with that leap of faith. Just as statisticians find value in a data model that is wrong but useful, theory can benefit from this kind of model-based approach. Of course, it is always good to describe the limitations of the model, and while a good effort has been made already by the authors, I will add that it would be good to mention that real-world multimodal posteriors have an exponential number of components, and that irregular posteriors are not necessarily difficult because of multimodality, concentration to a manifold being another example.
I really enjoyed the modularity of the theory, with the general framework in Section 2 being applicable to other types of samplers. This modularity reminds me of the work of Biron-Lattes et al. (JASA, 2024), but from a different perspective.
A bit more discussion on whether the analysis has methodological recommendations associated with it would be good, if something can be said.
The paper briefly mentions in Remark 5 that the new bound has an exponential dependence in the dimensionality. I feel this should be flagged in the introduction, as well as by adding a row in Table 1 to include the proposed bound.
The paper falls into the usual fallacy of using upper-bound comparisons to make statements about comparing algorithms, for example (line 254) "while STMH achieves a more favorable logarithmic dependence on ε": what is being compared there are upper bounds, so no statement should be made about "STMH achieving a more favourable dependence".
The introduction promises results on the approximation of the target distribution. However, the closest formally stated result in the main text consists of bounds in TV with respect to P, which is ST's stationary distribution. Now, in contrast to the typical scenario where the target distribution is a marginal of the augmented target (such as with PT or HMC), with ST the target appears as a conditional, not a marginal. Hence I was not able to see exactly which result delivers the abstract's promises. Theorem 5 seems a likely candidate, but it is only informally stated and the reader is sent to "Appendix C", in which again I could not exactly locate a result on the TV of conditionals of ST.
The description of the classical ST algorithm in the supplement makes the notation very confusing. Algorithm 1 is presented as taking as input a deterministic number of steps. Then, in Algorithm 2, the inner loop calls Algorithm 1 to obtain a fixed number of samples. However, I infer from the importance sampling estimator in the line below that these samples are only the random subset at level s. But an algorithm with that kind of random stopping time does not match how Algorithm 1 was defined. So I recommend fixing in Algorithm 1 what the stopping time is, making it random, and also making explicit how the output samples are collected (to emphasize that it is NOT the list of all samples produced, in contrast to standard MCMC algorithms). This makes me a bit worried that there is a gap between the actual algorithm and the analysis, because I did not see an analysis of the random stopping time involved in getting a prescribed number of samples. (Alternatively, if the algorithm meant s to be random, then the algorithm is just ill-defined, since there could have been zero visits, making the procedure undefined.)
Relatedly, the manuscript is often unclear on whether the adaptive MCMC algorithm estimating the normalization constants is being analyzed, or the homogeneous MCMC ST from Def 2.
Questions
My main question is the issue of the marginal/conditional described above.
Limitations
Yes
Final Justification
I am satisfied with the authors' responses to my comments.
Formatting Issues
Looks good.
We sincerely appreciate your careful review of our work and your insightful and constructive feedback. Below please find our point-by-point responses.
A bit more discussion on whether the analysis has methodological recommendations associated with it would be good, if something can be said.
We thank the reviewer for the thoughtful question. First, the main methodological contribution is that our theory provides practical guidance on how to tune algorithm parameters, such as the temperature ladder, as described in Appendix C.4. The conditions we have on the temperature ladder are not artifacts of our proof techniques (though a proof is not provided): intuitively, we need (1) the two adjacent temperatures to be close enough so that temperature swaps are likely to be accepted, and (2) the highest temperature to be large enough so that the tempered target can be explored efficiently by an algorithm like random walk Metropolis–Hastings or MALA. The conditions we impose make these intuitions precise. Second, our theory characterizes the dependence of the complexity of STMH on key parameters such as mode separation, component weights, and condition number. In practice, one can use it to approximately estimate how the runtime would change if the problem becomes more difficult.
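As a purely illustrative sketch (our assumption for exposition, not the paper's exact prescription), these two requirements are commonly met with a geometrically spaced inverse-temperature ladder,

$$\beta_i = \beta_{\min}(1+\delta)^i, \qquad i = 0, \dots, L, \qquad \beta_L = 1,$$

where a small spacing $\delta$ keeps adjacent tempered distributions close (so swaps are accepted), and a small $\beta_{\min}$ flattens the target enough for efficient local exploration.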
The paper briefly mentions in Remark 5 that the new bound has an exponential dependence in the dimensionality. I feel this should be flagged in the introduction, as well as by adding a row in Table 1 to include the proposed bound.
We thank the reviewer for the suggestion. We agree that the exponential dependence on dimension should be made more transparent, and we will highlight this point in the revised version.
The paper falls in the usual fallacy of using upper bound comparisons to make statements about comparing algorithms, for example "while STMH achieves a more favorable logarithmic dependence on $\varepsilon$" (line 254) - what is being compared there are upper bounds, so no statement should be made about STMH achieving a more favourable dependence.
We agree with the reviewer that the original phrasing may be misleading. Our intention was simply to highlight that the bound we obtain for STMH represents a state-of-the-art theoretical result among existing known upper bounds. We will revise the wording to make this distinction clear in the revision.
Question 1. We thank the reviewer for raising this truly insightful question. We did consider this very subtle issue when developing our proof, and the current version has already addressed it, as we now explain. We recognize that the writing needs to be improved in a few places to further clarify the related argument.
First, to clarify, Algorithm 1 is always run for a fixed number of steps and returns the sample collected at the last step (we will clarify this in the revision). If the resulting sample does not come from the desired temperature level, then Algorithm 1 is simply re-run for the same fixed number of steps. We do not run Algorithm 1 for a random number of steps until the desired temperature is reached (so no random stopping time is involved). Next, in Lemma 4, we analyze the total variation distance between the conditional distributions given the temperature index, which can be applied to the output of Algorithm 1 conditioned on the temperature level. Finally, to properly estimate the overall complexity of the whole algorithm, including the estimation of normalizing constants, in Lemma 27 we analyze how many times we need to re-run Algorithm 1 (each with the same fixed number of steps) in order to obtain a sample from the desired temperature level. Note that we essentially followed the simulated tempering algorithm construction given in [1], but this final step (i.e., Lemma 27) apparently was missing in the proof of [1]. We will make this reasoning more explicit in the revised version.
Using a random stopping time (e.g., the first hitting time after N steps of a given temperature level) could be potentially problematic, since, due to the Markovian nature of the algorithm, the distribution of the sample so collected can be severely biased. Of course, we are aware that our construction of Algorithm 1 is not practical; it is designed to facilitate the proof by producing independent samples for estimating normalizing constants.
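In code, the re-running construction described above amounts to a simple rejection-style wrapper. Below is a hedged sketch with hypothetical names (sample_at_level and run_chain are ours, not the paper's), assuming run_chain starts a fresh tempering chain and returns its final (state, level) pair after a fixed number of steps:

```python
def sample_at_level(run_chain, desired_level, n_steps, max_restarts=10_000):
    # Each call to run_chain is an independent, fixed-length run of the
    # tempering chain; no random stopping time is involved.
    for _ in range(max_restarts):
        x, k = run_chain(n_steps)
        if k == desired_level:
            return x  # keep only the final state, and only at the desired level
    raise RuntimeError("desired temperature level was never reached")
```

Because each run is independent and of fixed length, the accepted outputs are i.i.d. draws from the conditional law of the final state given the level, which is what the normalizing-constant estimation step requires; the number of restarts needed is what Lemma 27 controls.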
[1] Ge, Rong, Holden Lee, and Andrej Risteski. "Simulated tempering Langevin Monte Carlo II: An improved proof using soft Markov chain decomposition." arXiv preprint arXiv:1812.00793 (2018).
Thank you for the rebuttal. One part of the answer, "If the resulting sample does not come from the desired temperature level, then Algorithm 1 is simply re-run for the same fixed number of steps", makes me slightly uncomfortable, since such a strategy could be highly inefficient in practice. Classical ST theory justifies using the full trace via regeneration theory. Otherwise, the other answers look good.
Thanks for the comment! We absolutely agree with you that such a formulation of Algorithm 1 is not ideal. It is constructed this way purely to facilitate the proof: we need i.i.d. samples to quantify the sample size required for a desired accuracy in estimating normalizing constants.
In reality, as the reviewer points out, one can estimate the normalizing constants using the whole trajectory (instead of re-running Algorithm 1 many times and only collecting the very last sample conditioned on a particular temperature level). But the dependence among these samples complicates the non-asymptotic error analysis of the resulting estimator, which would require different techniques. To our knowledge, to apply regeneration theory or Markov chain concentration inequalities to bound the error of the estimator, one typically needs to first obtain a bound on the (unrestricted) spectral gap of the chain, which is challenging in our case.
Since this step is not the main focus of the entire proof, we choose to use this formulation of Algorithm 1, same as in Ge et al. [1].
Hi,
Thanks for your review and engagement with the authors.
Can you update your comment to indicate how their response affects your score (if it does)? You indicate you're still uncomfortable about their response; can you say whether that's a blocker to acceptance?
Thanks.
The work provides a framework for analysing the convergence of STMH algorithms using a restricted spectral gap. First, in Theorem 2, they provide a general bound on the spectral gap. Next, in Theorem 5, the authors provide a time bound on obtaining samples (with an STMH algorithm) within \epsilon TV distance of a (multivariate normal mixture) target distribution. They frame the theoretical result they achieve amongst other theoretical results in the literature and give comparisons. Finally, they offer an empirical result and observe how close their bound is in practice.
Strengths and Weaknesses
Strengths:
- The problem of sampling from mixture distributions is well known and is a fundamental task in MCMC sampling. Work towards this problem is certainly valuable.
- Concise introduction, which efficiently lays out the context of the problem. It also provides a literature review which appears comprehensive and is easy to follow and understand.
- The mathematics is easy to follow, and is well structured and concise. The bound achieved in Theorem 5, a time bound on achieving samples within \epsilon TV distance from the target, seems very strong and useful. I also greatly appreciate the comparison to STLMC; it provides details of the relative strengths of their results and provides some intuition.
- I appreciate the details regarding the proof of Theorem 5, and also like the fact this is included in the main text. It offers a lot of intuition for the various pieces of the proof, and gives readers an understanding from a high level.
Weaknesses:
- I would like to see more justification for this work: what examples in the literature are there of people using STMH in practice? What implications will this work have on the wider field, how should it be interpreted, etc.?
- I would appreciate some text on the interpretation of a spectral gap. I would also appreciate a more gentle/intuitive introduction to the simulated tempering algorithm.
- I would appreciate a more intuitive (in words) explanation of many of the assumptions of the paper (namely, Assumptions 1 and 2). Are they standard assumptions? Are they likely true in reality? What are the consequences on performance if they aren't true? More generally, there is very little intuition given to many of the assumptions, constants, and bounds throughout. Perhaps a large readership will be familiar with such concepts, but to those outside that group this is hard to interpret and thus use.
- I would appreciate more justification as to why we should care about convergence results on the STMH algorithm - what are the implications of such results? What is the practical advice to be gained from this paper?
- There is very little discussion on their empirical results: why is it only 'approximately' linear (Figure 1)? My understanding is that given the theoretical results it should be exactly linear. Is the discrepancy down to the assumptions not holding in practice?
Minor:
- Why is Theorem 2 not Theorem 1 (same with Theorem 5, shouldn't this be Theorem 2)?
- What is meant by [Line 178] 'which can be compared to the continuous-time decomposition theorem of Ge et al. [2018]' - is it that the bounds align? If so, say that. Or is it that the results can now be directly compared?
- Section 3 studies STMH on a multivariate Gaussian mixture; I think this should be clear from the section title.
Questions
- My understanding is that r_i is a hyperparameter of the STMH algorithm which a practitioner would have to specify beforehand. As such, is it realistic to specify r_i as in Line 220? When one knows the target density (e.g., multivariate Gaussian mixture) perhaps this is possible, but this is not a situation in which practitioners will find themselves?
- To what extent can we understand whether the assumptions on which Theorem 5 relies are true for the MVN-mixture case? Are the empirical results merely an inspection of the underlying assumptions and 'what happens if they are not true'? More discussion on this would be appreciated, and would further justify the existence of the results section.
- What about those same assumptions applied to more realistic applied settings? Do you have any thoughts regarding what would happen?
Limitations
Yes
Final Justification
The authors have responded clearly to my biggest concerns.
The other reviewers clearly agree the paper has merits, is easy to read, and is interesting.
I leave my initial scores intact, as I still feel my score reflects the merits of the paper.
Formatting Issues
N/a
First of all, we sincerely appreciate your careful reading of our work and your detailed feedback. Below please find our point-by-point responses.
Weakness 1. Thanks for the question. Simulated tempering (ST) is a widely used MCMC sampling technique (see, e.g., the textbook [1]). The use of STMH dates back at least to the seminal work of [2] on the Ising model, and it remains one of the most popular methods for tackling metastability in the chemical/statistical physics literature [3]. In Bayesian statistics, the theory and methodology of tempering-based methods have also been studied extensively since the work of [4].
ST can be combined with any local MCMC kernel. In this work, we choose to focus on the random walk MH algorithm, which is a simple but illustrative and analytically tractable example. We analyze the complexity of STMH using a new decomposition theorem, which can also be used to analyze the convergence of ST combined with other local schemes such as MALA. Further, we expect that this technique can be adapted for analyzing other tempering-based algorithms such as parallel tempering (which has been utilized in many deep learning tasks [5, 6]), and those more advanced variants of ST such as non-reversible ST [7]. Although random walk MH could be a naive choice in some applied settings, the broader contribution of this work lies in the general theoretical framework we develop, which can inform the analysis and design of general tempering-based samplers.
[1] S. Brooks, A. Gelman, G. Jones, and X. Meng, eds. "Handbook of Markov Chain Monte Carlo." CRC Press, 2011.
[2] E. Marinari and G. Parisi. "Simulated tempering: a new Monte Carlo scheme." Europhysics Letters 19.6 (1992): 451.
[3] Z. You, L. Li, J. Lu, and H. Ge. "Integrated tempering enhanced sampling method as the infinite switching limit of simulated tempering." The Journal of Chemical Physics 149.8 (2018).
[4] C. J. Geyer and E. A. Thompson. "Annealing Markov chain Monte Carlo with applications to ancestral inference." Journal of the American Statistical Association 90.431 (1995): 909-920.
[5] C. Smith, Q. T. Campbell, and T. Albash. "Optimizing temperature distributions for training neural quantum states using parallel tempering." Physical Review E 111.5 (2025): 055306.
[6] W. Deng, Q. Feng, L. Gao, F. Liang, and G. Lin. "Non-convex learning via replica exchange stochastic gradient MCMC." In International Conference on Machine Learning, pp. 2474-2483. PMLR, 2020.
[7] M. Biron-Lattes, T. Campbell, and A. Bouchard-Côté. "Automatic regenerative simulation via non-reversible simulated tempering." Journal of the American Statistical Association 120.549 (2025): 318-330.
Weakness 2. We thank the reviewer for the suggestion. Roughly speaking, the spectral gap of a Markov chain quantifies the average-case convergence rate towards the stationary distribution; a larger gap implies faster mixing and thus more efficient sampling. However, in many scenarios, the spectral gap can be difficult to bound or can vanish quickly, and the restricted spectral gap serves as a more robust and informative technique.
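For reference, the standard variational definition behind this intuition (textbook material, not this paper's specific notation) is: for a reversible kernel $P$ with stationary distribution $\pi$,

$$\mathrm{Gap}(P) = \inf_{f:\, \mathrm{Var}_\pi(f) > 0} \frac{\mathcal{E}(f, f)}{\mathrm{Var}_\pi(f)}, \qquad \mathcal{E}(f, f) = \frac{1}{2} \iint \bigl(f(x) - f(y)\bigr)^2 \, \pi(dx)\, P(x, dy),$$

and a restricted spectral gap takes the same infimum over a restricted family of test functions (or over a subset of the state space), which is why it remains informative even when the unrestricted gap is tiny.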
To explain simulated tempering more intuitively: the key idea is to augment the state space with a temperature index that allows the chain to move between a family of distributions, ranging from the original target distribution (low temperature) to a flattened version (high temperature). At higher temperatures, the probability landscape becomes smoother, allowing the chain to move more easily between modes. Transitions between temperature levels are governed by a Metropolis–Hastings rule, and samples from the target distribution are obtained by collecting states only at the lowest temperature. This helps overcome the energy barriers between modes and promotes global exploration.
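To make this intuition concrete, below is a minimal, self-contained sketch of simulated tempering with a random walk Metropolis–Hastings move on a 1D two-mode Gaussian mixture. Everything here (the ladder, step size, and the unit pseudo normalizing weights r, which suffice for this symmetric toy target) is an illustrative assumption, not the paper's exact Algorithm 1:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_target(x):
    # Unnormalized log-density of a symmetric two-mode Gaussian mixture.
    return np.logaddexp(-0.5 * (x - 4.0) ** 2, -0.5 * (x + 4.0) ** 2)

betas = [1.0, 0.5, 0.25, 0.1]   # inverse-temperature ladder (illustrative)
r = [1.0, 1.0, 1.0, 1.0]        # pseudo normalizing weights (assumed here)

x, k = 0.0, 0                   # state: position and temperature index
samples = []
for _ in range(50_000):
    # Local RWMH move at the current inverse temperature betas[k].
    y = x + rng.normal(scale=1.0)
    if np.log(rng.random()) < betas[k] * (log_target(y) - log_target(x)):
        x = y
    # Propose a neighboring temperature level; accept via an MH rule on the
    # augmented target p(x, k) proportional to r[k] * pi(x)**betas[k].
    j = k + rng.choice([-1, 1])
    if 0 <= j < len(betas):
        log_acc = (betas[j] - betas[k]) * log_target(x) + np.log(r[j] / r[k])
        if np.log(rng.random()) < log_acc:
            k = j
    if k == 0:                  # keep only lowest-temperature (target) states
        samples.append(x)

print(f"collected {len(samples)} samples; sample mean = {np.mean(samples):.3f}")
```

At high temperature the flattened landscape lets the walker cross between the modes at ±4, and restricting collection to level 0 yields draws approximately from the original target (the sample mean should be near 0 by symmetry).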
Weakness 3. Thanks for the suggestion. Assumptions 1 and 2 are motivated by the structure of mixture models, also used in prior works such as [8]. Intuitively, the sampler's convergence can be arbitrarily slow if (i) the mixture components are very far away from each other, or (ii) the mixing of the local chain targeting a single component is very slow (e.g. due to non-log-concavity). So some conditions are always needed to achieve a meaningful bound. We also want to clarify that these assumptions are only made for the decomposition theorem, and when applying the decomposition theorem to Gaussian mixtures, we rigorously prove that these assumptions hold. We agree that a more intuitive explanation would improve clarity, and in the revised version, we will include more discussion on the intuition behind the assumptions and interpretation of key constants and bounds.
[8] R. Ge, H. Lee, and A. Risteski. "Simulated tempering Langevin Monte Carlo ii: An improved proof using soft Markov chain decomposition." arXiv preprint arXiv:1812.00793 (2018).
Weakness 4. Thanks for the question. First, our result characterizes the complexity of the algorithm (e.g. polynomial dependence on D), which tells practitioners how the runtime of the algorithm scales as the problem becomes more difficult. Note that the dependence on other key parameters such as component weights and condition number is given in our proof of Theorem 5 in Appendix C.4.3. Second, our result also provides practical guidance for how the tuning parameters of ST should be selected such as the temperature ladder; see Appendix C.4.
Weakness 5. We thank the reviewer for the thoughtful question. In Figure 1, we empirically study the convergence behavior of STMH as a function of mode separation. The observed relationship is approximately linear rather than exact, likely because (i) we use the rate at which the empirical sample mean approaches the target mean as a proxy for convergence, rather than measuring the total variation distance directly, and (ii) our theoretical result provides only an upper bound on the convergence time, rather than an exact characterization.
Minor 1. We will correct it in the revision. Thanks for pointing it out.
Minor 2. Our result is similar in spirit to the decomposition theorem in Ge et al. (2018) but not directly comparable, as our result applies in a discrete-time setting under different assumptions. We will clarify this in the revision.
Minor 3. Thanks for the suggestion. We will update the section title accordingly.
Question 1. Thank you for the question. To clarify, the true normalizing constants are not needed to run the algorithm, and we define r_i in this way exactly because they are typically unknown in practice. When implementing the STMH algorithm, one needs to evaluate the acceptance probability given in Equation (1), and this definition of r_i ensures that the acceptance probability does not depend on the unknown normalizing constants and relies solely on the estimates. In Appendix C.4, we show how sufficiently good estimates for r_i can be determined, which makes the whole STMH algorithm fully implementable. We understand that the writing might have caused confusion and will clarify this in the revision.
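In generic simulated tempering notation (our assumption for illustration; Equation (1) of the paper is not reproduced here), a level swap from $i$ to $j$ at state $x$ is accepted with probability of the form

$$\alpha = \min\left\{1,\; \frac{\hat{r}_j\, \pi(x)^{\beta_j}}{\hat{r}_i\, \pi(x)^{\beta_i}}\right\},$$

which involves only the estimates $\hat{r}_i$ and the unnormalized target, never the true normalizing constants.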
Question 2. Thanks for the thoughtful question. Theorem 5 establishes the convergence guarantees for STMH when sampling from a mixture of Gaussians with a shared covariance matrix. This result can be naturally extended to target distributions that are sufficiently close to such Gaussian mixtures, as shown in Ge et al. (2018). In such cases, the time complexity would additionally depend on the closeness between the actual distribution and the Gaussian mixture approximation. We also anticipate that similar techniques can be adapted to handle mixtures of log-concave distributions that are translates of each other, or more broadly, distributions that are well approximated by such mixtures. However, as demonstrated by Ge et al. (2018), some seemingly mild violations of the assumptions, such as component covariance matrices differing by a constant factor, can lead to exponential time complexity for any reasonable algorithm with similar guarantees.
Question 3. Following up on our answer to Q2, there are realistic applied settings where our assumptions may approximately hold, making the proposed framework useful. For instance, [9] showed that the posterior distribution for Bayesian variable selection (which is one of the most common model selection problems in statistics) can be approximated by a mixture of Gaussians under certain conditions. Another relevant application arises in the context of (Bayesian) DNNs. Let the target be $\pi \propto e^{-f}$. In the context of DNNs, $f$ is the loss composed with a neural network function, and the complexity of sampling will depend on the structure of $f$. If the widths of the neural networks tend to infinity, several works have shown that $f$ satisfies PL-type conditions, and hence recent works like [10, 11, 12] could be used to show that $\pi$ satisfies Poincaré or log-Sobolev type inequalities. Therefore, algorithms like LMC, proximal samplers, or MALA can be used to sample from the posterior efficiently. However, in the practical regime where the width is finite, $\pi$ does not satisfy any isoperimetric inequalities and is highly multimodal, which is where the proposed algorithm is likely to be useful.
[9] I. Castillo, J. Schmidt-Hieber and A. Van der Vaart. "Bayesian linear regression with sparse priors." The Annals of Statistics 43.5 (2015): 1986-2018.
[10] Y. Gong, N. He and Z. Shen. "Poincare inequality for local log-Polyak-Łojasiewicz measures: Non-asymptotic analysis in low-temperature regime." arXiv:2501.00429 (2025).
[11] A. Y. Chen and K. Sridharan. "Optimization, isoperimetric inequalities, and sampling via Lyapunov potentials." arXiv:2410.02979 (2024).
[12] S. Chewi and A. J. Stromme. "The ballistic limit of the log-Sobolev constant equals the Polyak-Łojasiewicz constant." arXiv:2411.11415 (2024).
I thank the authors for their thorough rebuttal - they have addressed all my concerns and have consequently increased the clarity of the paper.
W1: Thanks for clarifying this point. I think that the fact the proposed analysis and theoretical work can be extended to other popular MCMC sampling algorithms is a real strength of the paper.
W2 + W3: Thanks for the clarification here. I reiterate that whilst many readers will already be familiar with such concepts, I think the proposed additions to the main text will further this paper's appeal and impact.
Q1 + Q3: Thanks for clearing this up.
Q2: This is interesting, I feel this should be noted in the main text.
My concerns were mostly asking for additional clarification in the text, which the authors have promised to follow up with. The authors succinctly addressed these points and have suggested clear ways of answering my questions, thus, I feel, strengthening the overall submission. That being said, I feel my initial score of 5 is a fair assessment, and I think this will make a great paper at NeurIPS.
This paper tackles the problem of sampling from mixture distributions. The authors consider simulated tempering with a local MCMC sampler and develop a lower bound on the restricted spectral gap of the resulting algorithm which allows analyzing mixing times without restrictive global assumptions. These results were obtained using an improved theoretical framework for analyzing simulated tempering algorithms. The proposed method is applicable to a broader range of problems than existing approaches – i.e. can be used to develop efficient algorithms for them – and was validated on mixtures of Gaussians.
The reviewers identified a number of strengths:
- The problem is important and hard and the approach to solve it is novel – substantially modifying existing techniques.
- The technical quality of the paper is high – the decomposition theorem is rigorously developed and the argument sound.
- The results are significant and the authors are honest about the limitations of their work.
The weaknesses of the paper were minor – e.g. improve clarity and presentation of both the method and results, provide constructive guidance on how to actually use the method for real problems, and clarify exponential dependence on dimension hidden in a constant. All of these concerns were addressed through the author discussion and all reviewers feel that the paper is a clear accept.