AutoStep: Locally adaptive involutive MCMC
We introduce a new locally-adaptive MCMC algorithm that automatically adjusts its step size parameter on the fly while maintaining π-invariance.
Abstract
Reviews and Discussion
This paper introduces AutoStep MCMC, a novel class of locally adaptive involutive MCMC methods that dynamically select the step size parameter at each iteration based on the local geometry of the target distribution. The proposed method extends previous adaptive MCMC techniques by integrating step size adaptation within involutive MCMC frameworks, ensuring π-invariance, irreducibility, and aperiodicity. Theoretical results establish non-asymptotic guarantees on expected energy jump distance and computational cost per iteration. The experiments demonstrate the robustness and efficiency of their AutoStep method for RWMH and MALA using KSESS on various distributions. They compare with several baseline adaptive samplers, including NUTS, adaptive RWMH, adaptive MALA, drHMC and slice sampling.
update after rebuttal
Thanks to the authors for the additional work and clarification. After reading the initial response to my questions, as well as the other reviews and replies in general, I will keep my score.
Questions for Authors
- Can AutoStep MCMC scale to high-dimensional models, such as posterior sampling for large BNNs?
- Can you provide more computational complexity analyses of AutoStep MCMC?
- Can AutoStep MCMC be extended to stochastic gradient MCMC like SGLD and SGHMC while preserving its key advantages?
Claims and Evidence
The paper makes the following key claims, which are generally well-supported:
- AutoStep MCMC is π-invariant, irreducible, and aperiodic under mild conditions. This is supported by rigorous proofs (Proposition 4.2, 4.5, and Corollary 4.6).
- AutoStep MCMC adaptively selects step sizes to balance exploration and acceptance rate, improving mixing efficiency. The claim is substantiated by theoretical bounds and empirical evaluations on benchmark distributions.
Methods and Evaluation Criteria
The methodology is well-structured and clearly builds on established MCMC principles. The key contribution is the introduction of a dynamically adaptive step size selection mechanism that does not rely on global tuning but instead adjusts step size based on local properties of the target distribution. The evaluation criteria include acceptance rates, KSESS per unit cost and min KSESS per unit cost, which are appropriate for assessing MCMC performance.
Theoretical Claims
The theoretical foundations of the paper are solid, with detailed proofs demonstrating the correctness of the proposed method.
- Proofs of π-invariance, irreducibility, and aperiodicity (Proposition 4.2, 4.5, and Corollary 4.6).
- Bounds on expected energy jump distance and computational cost per iteration (Proposition 4.11 and Corollary 4.10).
Experimental Design and Analysis
The experimental setup is comprehensive, with evaluations on:
- Synthetic distributions (Gaussian, Laplace, and Cauchy) to assess the benefits of using the symmetric step size criterion, the efficiency of the round-based tuning procedure, and the robustness to the initial step size.
- Bayesian inference problems (linear regression, orbit fitting, mRNA transfection models).
- Comparisons with various adaptive MCMC baselines.
Supplementary Material
The supplementary material was reviewed, including proof details, experimental setting, parameter tuning and additional results.
Relation to Existing Literature
The paper is well-situated within the existing MCMC literature, particularly within the adaptive MCMC and involutive MCMC frameworks. It builds upon:
- Adaptive MCMC methods, by introducing a locally adaptive approach.
- Involutive MCMC methods, by integrating adaptive step size selection.
- Recent work on adaptive step size selection in MALA, which the authors improve upon by ensuring irreducibility and aperiodicity.
The discussion could be further enriched by addressing connections to other step size tuning techniques, such as stochastic gradient-based methods in deep learning.
Missing Important References
The paper adequately cites prior work but could benefit from a broader discussion of step size adaptation in high-dimensional Bayesian inference.
Other Strengths and Weaknesses
Strengths:
- Novel and well-motivated approach to adaptive step size selection.
- The paper is theoretically rigorous, with solid proofs establishing key properties.
- The experimental results indicate the proposed method outperforms competing baselines.
- The method is robust to the step size initialization, enhancing its practical usability.
Weaknesses:
- There are no detailed analyses or empirical results on the computational trade-offs (e.g., the additional overhead of adaptive step size selection).
- The experiments are limited to toy problems. The scalability of the proposed method to high-dimensional problems is unclear.
- Some step size adaptation choices (e.g., choice of thresholds) could be further justified.
Other Comments or Suggestions
- Provide more discussion on the computational cost of AutoStep MCMC compared to non-adaptive methods.
- Conduct a scalability study on high-dimensional problems (e.g., large Bayesian posterior models).
- Discuss the applicability of AutoStep MCMC to more gradient-based samplers beyond MALA (e.g., HMC, NUTS).
Can AutoStep MCMC be extended to stochastic gradient MCMC like SGLD and SGHMC while preserving its key advantages?
Thanks, that's a great point! Our focus in this paper is on exact, invariant MCMC methods, but it is indeed possible to extend AutoStep to SG MCMC. One approach would be to resample a mini-batch at each iteration and feed it to Alg. 2 to estimate a step size.
However, as noted by Johndrow et al. (2020), subsampling offers little speedup or accuracy gain over exact methods. While approximate MCMC (without MH correction) can be useful during burn-in for estimating a reasonable step size, in the stationary phase the computational gains from subsampling may not outweigh the loss in robustness.
Johndrow, J. E., Pillai, N. S., & Smith, A. (2020). No free lunch for approximate MCMC. arXiv preprint arXiv:2010.12514.
Can AutoStep MCMC scale to high-dimensional models?
Thank you for bringing that to our attention. We have run three additional high-dimensional models (each with 30 seeds) to address your concern: the 111d HMM example from posteriordb, a 133d stochastic volatility model (Kim et al., 1998), and a 312d IRT model (Curtis et al., 2010).
The table below presents the minESS/sec metrics, which generally show good performance. The per-second timing includes the full computational time (burn-in, step size selection, all aspects of tuning, etc.). We did not include KSESS results, as generating reliable reference distributions via PT requires days of computation, which was not feasible within the discussion period.
| Sampler | Model | q5 | Median | q95 |
|---|---|---|---|---|
| AutoStep RWMH | hmm_example | 19.3966 | 31.3897 | 360.596 |
| AutoStep RWMH | stochastic_volatility | 1.33824 | 2.04454 | 2.3855 |
| AutoStep RWMH | irt | 4.63187 | 4.71388 | 4.7435 |
| AutoStep MALA | hmm_example | 10.4717 | 13.4765 | 33.5944 |
| AutoStep MALA | stochastic_volatility | 4.78546 | 5.9194 | 7.23065 |
| AutoStep MALA | irt | 0.838859 | 0.959374 | 1.023 |
There are no detailed analyses and empirical results about the computational trade-offs (e.g., additional overhead of adaptive step size selection)... Provide more discussion on the computational cost of AutoStep MCMC compared to non-adaptive methods.
Please note that we provide ESS/sec (Fig. 5) metrics in our experiments, which account for the full computational time (including burn-in, the step size selection, all aspects of tuning, etc). These metrics are intended to reflect the trade-off between statistical efficiency and computational cost. Beyond this, we also have theoretical results (Prop 4.9, Cor 4.10) that put bounds on the expected number of doubling/halving steps per iteration, which capture the dominant additional computational cost over fixed step size methods.
Some step size adaptation choices (e.g., choice of thresholds) could be further justified.
There is certainly room for further research on better choices of the distribution over the acceptance threshold bounds. The only strict requirement is that the support covers the full range of possible thresholds, to guarantee irreducibility. Intuitively, one must also ensure that the thresholds are "typically" bounded away from 0 and 1 to avoid many overly aggressive or conservative steps. But aside from that, there is room for creativity: for example, one might consider making the distribution favor known optimal acceptance rates of MCMC algorithms. That being said, in our experimentation we found that a uniform distribution over the threshold range achieved the goal of maintaining a reasonable step size, and we believe further refinements of this choice are unlikely to have a larger influence on performance than other improvements (like different mass matrix adaptation).
Discuss the applicability of AutoStep MCMC to more gradient-based samplers beyond MALA (e.g., HMC, NUTS).
We kindly refer you to our earlier response to Reviewer JzQk, where we discuss the applicability of AutoStep to HMC and NUTS.
Can you provide more computational complexity analyses of AutoStep MCMC?
Note that the cost is asymptotically dominated by the average number of log-prob evaluations per iteration. The number of log-prob evaluations is in turn controlled by the number of doubling/halving steps in AutoStep. Prop 4.9 places an upper bound on this quantity, which provides the desired result. To apply this result, problem-specific analysis is generally required; however, in Cor 4.10, we are able to specialize the result to a representative class of target distributions in the large/small step size regime (i.e., when the method is poorly tuned). This result bounds the factor by which the cost grows, indicating that the additional cost of AutoStep over fixed step size methods should generally be small.
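To make the source of this cost concrete, here is a minimal Python schematic of a doubling/halving search of the kind described above. This is our own sketch, not the authors' Algorithm 2; `accept_prob`, the thresholds `a`/`b`, and the cap `max_evals` are placeholder names.

```python
# Schematic sketch only (not the paper's Algorithm 2): search for a step size whose
# acceptance probability falls inside the drawn threshold interval (a, b).
# Each call to `accept_prob` stands in for one log-density (and possibly gradient)
# evaluation, so the number of doublings/halvings controls the per-iteration cost.
def select_step_size(accept_prob, eps_init, a, b, max_evals=60):
    eps = eps_init
    p = accept_prob(eps)
    evals = 1
    if p > b:                      # step too conservative: grow it
        while p > b and evals < max_evals:
            eps *= 2.0
            p = accept_prob(eps)
            evals += 1
    elif p < a:                    # step too aggressive: shrink it
        while p < a and evals < max_evals:
            eps *= 0.5
            p = accept_prob(eps)
            evals += 1
    return eps, evals              # evals = 1 + number of doubling/halving steps


# Toy usage with a placeholder acceptance model and thresholds a=0.2, b=0.8.
if __name__ == "__main__":
    import math
    toy_accept = lambda eps: math.exp(-eps)
    print(select_step_size(toy_accept, eps_init=1e-4, a=0.2, b=0.8))
```

In this sketch, a badly initialized step size (here 1e-4) is recovered in roughly a dozen doublings, illustrating why the extra cost is measured in log-prob evaluations per iteration.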
The authors propose a method to tune the step size of MCMC algorithms so that, at each iteration, the acceptance rate is not too high (exploitation) or too low (exploration).
update after rebuttal
I thank the authors for their clarification on a minor point that I raised. My overall assessment has not changed and I will maintain my score.
Questions for Authors
Claims and Evidence
The authors claim a full theoretical analysis that does indeed seem quite complete.
Methods and Evaluation Criteria
Yes.
Theoretical Claims
Not in detail.
Experimental Design and Analysis
The experiments mainly substantiate the claim that tuning the step size makes the sampling algorithm robust to the initial step size, which should be the case by design.
Figure 5 in particular shows that by a certain measure of efficiency (KSESS), the authors' method performs better. However, there is no visual evidence that tuning the step size helps the MCMC algorithm locate the target modes faster, and therefore converge faster.
Supplementary Material
No.
Relation to Existing Literature
The authors clearly discuss a related paper by Bou Rabee et al.
Missing Important References
Related works are clearly discussed by the authors. Could the authors discuss the link between their paper and Cyclical MCMC, which varies the step size cyclically between bigger and smaller values to alternate between mode exploration and exploitation?
Other Strengths and Weaknesses
Strength: the paper is very clearly written.
Other Comments or Suggestions
Thank you for your time and insightful questions. We hope that our answer has addressed all your concerns.
There is no visual evidence that tuning the step size helps the MCMC algorithm locate the target modes faster, and therefore converge faster.
Please note that locating modes is not a primary goal of MCMC algorithms; the goal is to take draws from a target distribution that yield accurate estimates of target expectations. In our work, the focus in particular is on ensuring efficient sampling from models with varying scale. The KSESS/sec results presented in Figure 5 of our manuscript capture both sampling correctness and computational efficiency, and this is the standard approach used in the field for comparing sampling algorithms.
Could the authors discuss the link between their paper and Cyclical MCMC, which varies the step size cyclically between bigger and smaller values to alternate between mode exploration and exploitation?
Cyclical MCMC (Zhang et al., 2020) and AutoStep both modulate the step size, but in fundamentally different ways. Cyclical MCMC follows a prescribed, periodic schedule and is not designed to tune the step size or adapt to the target distribution. Furthermore, Cyclical MCMC is not guaranteed to provide asymptotically correct estimates of expectations unless the number and spacing of step sizes in one cycle is controlled carefully to account for the mixing behaviour of the kernels.
In contrast, AutoStep adapts the step size based on the local geometry of the target and is always an exact method, i.e., it comes with a guarantee of π-invariance.
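For reference, the kind of prescribed cosine step size schedule used in cyclical MCMC can be sketched as follows (a rough illustration in the spirit of Zhang et al., 2020; the function and parameter names are ours). The schedule depends only on the iteration counter and never consults the current state or the target.

```python
# Rough sketch of a cyclical (cosine) step size schedule: large steps at the start
# of each cycle for exploration, small steps at the end for exploitation.
import math

def cyclical_step_size(t, total_iters, num_cycles, eps0):
    cycle_len = math.ceil(total_iters / num_cycles)
    frac = (t % cycle_len) / cycle_len        # position within the current cycle, in [0, 1)
    return 0.5 * eps0 * (math.cos(math.pi * frac) + 1.0)

# The schedule is a fixed function of t; it never looks at the target distribution.
print([round(cyclical_step_size(t, total_iters=100, num_cycles=4, eps0=0.1), 4)
       for t in range(0, 100, 5)])
```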
The paper proposes an MCMC method (called AutoStep MCMC) with locally adaptive step size selection. This method generalizes previous involutive methods (e.g., RWMH, MALA, HMC), allowing the step size to be drawn at each iteration from a state-dependent conditional distribution. The class of involutive MCMC methods is considered. Theoretical properties of the MCMC kernel are studied (e.g., invariant distribution, aperiodicity). Robustness and scalability properties are also studied.
Questions for Authors
Is it possible to provide numerical experiments with mixtures of distributions?
Claims and Evidence
yes
Methods and Evaluation Criteria
yes
Theoretical Claims
I haven't checked the proofs of the main results
Experimental Design and Analysis
yes
Supplementary Material
I checked the part with numerical experiments
Relation to Existing Literature
The paper contributes to the long list of locally adaptive MCMC methods
Missing Important References
I would add a comparison with modern methods using adaptive kernels (e.g., normalizing flows); see "Local-Global MCMC kernels: the best of both worlds" (NeurIPS 2022) or "Adaptive Monte Carlo augmented with normalizing flows" (PNAS 2022). Or modern sampling methods, e.g., GFlowNets for continuous distributions or optimal control methods for sampling ("Theoretical guarantees for sampling and inference in generative models with latent diffusions", COLT 2019).
Other Strengths and Weaknesses
Strengths: A new locally adaptive MCMC method with established theoretical properties (e.g., invariant distribution, aperiodicity). Robustness and scalability properties are also studied.
Weaknesses: I would add more numerical experiments with mixtures of distributions (e.g., high-dimensional Gaussian mixtures) or latent distributions of generative models (e.g., GANs); see, e.g., the paper "Your GAN is Secretly an Energy-based Model and You Should Use Discriminator Driven Latent Sampling".
Other Comments or Suggestions
I would add a comparison with modern methods using adaptive kernels (e.g., normalizing flows); see "Local-Global MCMC kernels: the best of both worlds" (NeurIPS 2022) or "Adaptive Monte Carlo augmented with normalizing flows" (PNAS 2022). Or modern sampling methods, e.g., GFlowNets for continuous distributions or optimal control methods for sampling ("Theoretical guarantees for sampling and inference in generative models with latent diffusions", COLT 2019).
We are grateful for your thoughtful feedback. We are glad to address your questions one by one.
Is it possible to provide numerical experiments with mixtures of distributions?
Thank you for the suggestion. Please first note that AutoStep is designed to help handle multiscale behaviour in targets (e.g., funnel structure); our methods are not expected to perform substantially differently than a well-tuned fixed-step-size method on multimodal / mixture targets. To help address multimodality, one could instead incorporate AutoStep in a parallel tempering scheme. One potential advantage of AutoStep in that setting is that it will choose an appropriate step size for each chain in the ensemble (which will be different at each temperature), unlike fixed step size methods that each must be tuned individually.
Please also note that the orbital model (see Fig. 13) already has multiple modes/ridges/varying geometries/etc.
I would add comparison with modern methods using adaptive kernels (e.g. using normalizing flows, Local-Global MCMC, adaptive Monte Carlo augmented with normalizing flows, GFlowNets, etc.)
We appreciate the suggestion, though we note that many of the listed methods (e.g., normalizing flow-based MCMC, GFlowNets) typically require significant training time and are not exact methods. An interesting recent result from He et al. (2025) shows that almost all neural samplers require several orders of magnitude more target evaluations compared to parallel tempering, itself an expensive meta MCMC algorithm.
Beyond this, neural methods and AutoStep use quite different approaches, and a direct comparison may therefore not be particularly meaningful. That being said, we agree that integrating AutoStep into such frameworks in future work could be an interesting extension.
He, J., Du, Y., Vargas, F., Zhang, D., Padhy, S., OuYang, R., ... & Hernández-Lobato, J. M. (2025). No Trick, No Treat: Pursuits and Challenges Towards Simulation-free Training of Neural Samplers. arXiv preprint arXiv:2502.06685.
This work proposes a framework for adaptive MCMC that enables sampling parameters to be optimized for the current location in the state space at each sampling step. In particular, the work focuses on adaptively adjusting the step size parameter which is found in common MCMC algorithms. The key challenge for such an approach is maintaining detailed balance, since naive adjustment of the step size could disrupt the MCMC process so that it no longer follows the intended stationary distribution. Building upon established properties of involutive functions for MCMC sampling, the work proposes to augment the sampling space with a distribution over soft acceptance bounds and a distribution of the step size parameter that depends on the current state, MCMC randomness, and the soft acceptance bounds. By carefully formulating this involutive function, detailed balance can be guaranteed by standard results about involutive MCMC, and irreducibility and aperiodicity proofs follow in a straightforward way from the irreducibility and aperiodicity of the non-augmented MCMC samplers. A theoretical analysis of the time for selecting a step size and for the expected movement with each sampling step is presented. Experiments are conducted showing that the method achieves similar efficiency regardless of the initial step size parameter tuning, favorable comparison with the related adaptive step size selection method AutoMALA, and competitive performance with other adaptive MCMC samplers.
## After rebuttal
My view of the paper remains similar. The proposed method is a straightforward and natural way to incorporate adaptive step sizes into standard MCMC algorithms. Although the method does not currently surpass strong samplers like NUTS consistently, it is possible that future design choices based on this sampler could provide further improvement.
Questions for Authors
- This work demonstrates detailed balance, irreducibility, and aperiodicity for the proposed sampler. However, it must also be shown that the proposed sampler is not null recurrent to establish ergodicity. Can this be shown?
- Can you provide a detailed explanation of the shortcomings of ESS and superiority of KSESS? A convincing answer to this question could cause me to increase my rating score.
- Given the strong performance of NUTS in Figure 5, is it possible to combine the proposed method with NUTS to further improve performance?
Claims and Evidence
The theoretical claims appear valid to me. Involutive MCMC in the augmented sampling space that includes a distribution over acceptance probabilities and step size parameter is a very clean way to include adaptive step size selection while still satisfying detailed balance. Experimental results provide convincing evidence that the method is extremely robust to the initial step size tuning over several orders of magnitude. The comparison with existing adaptive MCMC techniques shows competitive performance with existing adaptive methods as claimed, although the proposed method does not achieve top performance on any specific scenario.
Methods and Evaluation Criteria
The overall methodology of using involutive MCMC to build a framework for adaptive step size selection seems very appropriate for the problem at hand. The experimental section provides a fairly thorough comparison of the proposed method with representative SOTA methods. One major choice that does not follow standard methodology is the use of a newly proposed metric, KSESS, to measure MCMC sampling efficiency rather than the typical ESS method. It is claimed that ESS did not accurately characterize sampling performance, but few details are provided about the inadequacy of ESS and the superiority of KSESS. Further explanation of this choice would greatly help to solidify the evaluation methodology.
Theoretical Claims
A variety of theoretical claims are made in this paper. The most important ones are that the proposed method satisfies detailed balance and that the proposed method is irreducible and aperiodic. I carefully checked these claims and they appear valid. These proofs are fairly straightforward thanks to the clever MCMC formulation. Theoretical results about the expected runtime of step size selection and the state space movement of the proposed method are presented. These claims appear reasonable but I did not carefully check the proof details.
Experimental Design and Analysis
Overall, the experimental design and analysis is appropriate. As mentioned above, the one major area where the experiments do not follow the typical protocol is the use of KSESS as a replacement for the standard ESS metric to measure sampling efficiency. More discussion of this choice would strengthen the paper.
Supplementary Material
Proof and experiment details are provided in the appendix. A separate supplementary materials file is not provided.
Relation to Existing Literature
This work seeks to provide a principled framework for including an adaptive step size parameter in common MCMC sampling algorithms. Prior works are limited by being restricted to a pre-selected set of step sizes, the requirement to use the same step size across the entire state space, or limitations on how frequently the step size can be adjusted during the MCMC process. The present work provides a clear and straightforward way to adjust step sizes that can span a vast range of orders of magnitude, is adapted at each new state space location, and can be adapted at each step. Such methodology has the potential to be widely adopted by the community. Nonetheless, while experiments show the proposed method has reasonable performance, it does not have top performance in any given scenario.
Missing Important References
N/A
Other Strengths and Weaknesses
N/A
Other Comments or Suggestions
N/A
Thank you for your insightful questions. We hope that our answer will address all your concerns.
The proposed method does not achieve top performance on any specific scenario.
While AutoStep does not outperform all other methods across all examples, we would like to emphasize that it consistently performs well across a wide range of scenarios. This broad robustness is due to the local adaptivity of AutoStep: it is resilient to the choice of initial value and the geometry of the targets, albeit with some additional cost associated with tuning at each step (Figure 2).
It is claimed that ESS did not accurately characterize sampling performance, but few details are provided about the inadequacy of ESS and the superiority of KSESS.
Thank you for bringing this up; this is an important point! Standard ESS estimates do not incorporate knowledge of the target distribution directly, and so can indeed be misleading if the sampler fails to explore the target distribution (Elvira et al., 2022). As an extreme illustrative thought experiment, if the sampler fails severely and does not move at all, the sequence of states is indistinguishable from an iid sequence from a Dirac delta, and typical ESS estimates will report a high value despite the correct value being approximately 1. We saw a version of this issue in our experiments; for example, in an experiment using adaptive RWMH on the kilpisjärvi model, the sampler produced samples with almost no spread (variance on the first dimension was 2.472e-6), and yet the minESS reported was 45.47. In contrast, KSESS reported 0.75, which correctly identified the sampling failure. In other words, the traditional ESS estimate was nearly two orders of magnitude higher than it should have been.
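To make the thought experiment above concrete, here is a small self-contained sketch (our illustration, not the evaluation code used in the paper) showing that a standard autocorrelation-based ESS estimate assigns a near-full effective sample size to a chain that barely moves:

```python
# Illustration only: a sampler stuck near one point still receives a large
# autocorrelation-based ESS, because its output looks like an iid sequence.
import numpy as np

def naive_ess(x, max_lag=100):
    """n / (1 + 2 * sum of positive-lag autocorrelations), truncated at the first non-positive lag."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    denom = np.dot(xc, xc)
    rho_sum = 0.0
    for lag in range(1, max_lag):
        rho = np.dot(xc[:-lag], xc[lag:]) / denom
        if rho <= 0:
            break
        rho_sum += rho
    return n / (1.0 + 2.0 * rho_sum)

rng = np.random.default_rng(0)
n = 10_000
stuck_chain = 3.0 + 1e-6 * rng.standard_normal(n)   # chain with almost no movement
print(naive_ess(stuck_chain))   # close to n, even though the chain explored nothing
```

A target-aware diagnostic (such as the KSESS used in the paper) would instead flag this chain as nearly useless.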
Note that we do not recommend using the KSESS estimate in practical data analysis, as it requires knowledge of the target or expensive computation to obtain a gold-standard sample set. We use KSESS only for the purpose of comparing different sampling methods in a research context. We will clarify this in the camera-ready.
Elvira, Víctor, Luca Martino, and Christian P. Robert. "Rethinking the effective sample size." International Statistical Review 90.3 (2022): 525-550.
It must also be shown that the proposed sampler is not null recurrent to establish ergodicity. Can this be shown?
Thanks for the question! Please note that for ergodicity (i.e., an MCMC law of large numbers for π-a.e. initial state), it is sufficient that the chain is π-invariant and irreducible. Our theory provides these two statements. See, e.g., Theorem 5 of Geyer's lecture notes on Markov chains at https://www.stat.umn.edu/geyer/8112/notes/markov.pdf — per the discussion on pp. 12-13, stationarity can be replaced by π-a.e. initialization and π-invariance. We do note that if one wants a central limit theorem, more analysis and conditions are required.
Given the strong performance of NUTS in Figure 5, is it possible to combine the proposed method with NUTS to further improve performance?
This is a very interesting point, thanks for bringing this up! The AutoStep framework is naturally applicable to all involutive methods with step size parameters. We think it is likely that with the right augmentation, one could view NUTS as such an involutive MCMC scheme, but we are not totally certain and leave a detailed development of that to future work. It can certainly be applied to HMC which is known to be an involutive scheme.
However, HMC (and NUTS if it turns out to be involutive) involve a fairly expensive involution. Based on our experiments to date, we suspect that methods with cheaper involutions (like RWMH, MALA) are likely to benefit more from using AutoStep. In particular, we have conducted preliminary experiments with AutoStep applied to HMC with a reasonable path length, and the results suggest that the minESS per cost achieved by AutoStep HMC is approximately 30% of that achieved by traditional HMC with hand-tuned "optimal" parameters on benchmark problems. However, those results are very preliminary, and it is possible more engineering effort could improve AutoStep HMC further.
I read the other reviews and author responses. I still have an overall positive view of this paper.
Regarding performance vs. other methods like NUTS: In my view, this is still the main limitation of the paper. I feel convinced that the method can improve the performance of basic methods like RWMH and MALA, which are widely used. The sampler design is very nicely done and it seems like a useful contribution for the community. This paper provides a potential framework for improving more powerful samplers like HMC and NUTS in future works. If this was demonstrated, the paper would be very strong.
Regarding ESS/KSESS: That explanation makes sense, and I am more convinced that KSESS is appropriate for this work. I suggest including these details in revisions for clarity.
Regarding ergodicity: From my understanding, the Birkhoff Ergodic Theorem referenced by the rebuttal applies only if the Markov chain is initialized from its invariant distribution. My understanding is that positive recurrence is still needed to show that a Markov chain initialized from an arbitrary distribution will be ergodic and therefore converge to the stationary distribution (I could be wrong about this). It might be worth including some discussion of this in the paper, although it is a relatively minor point and the same issue applies to generic MCMC design and is not restricted to the proposed method.
Thank you for your positive feedback! We are glad to hear that your concerns regarding the KSESS metric have been resolved. We will include additional details in the camera-ready version. Below, we provide further clarification on the two remaining points you raised.
Regarding ergodicity
Your understanding of the Birkhoff Ergodic Theorem is correct! However, positive recurrence is not needed to allow initialization from another distribution. More precisely, the Birkhoff theorem yields a measurable set of full π-measure such that initializing the chain anywhere in that set gives the desired convergence almost surely. Consider now any other initialization distribution that is dominated by π (a weak condition in practice). Since π dominates it, any set of π-measure one also has measure one under the initialization distribution; in other words, initializing from that distribution will also almost surely place the chain in the set on which convergence holds. We'll add a brief clarification in the text to make this point more precise.
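In symbols (the set $A$ and initial law $\mu_0$ are our notation, not taken from the paper): the Birkhoff theorem gives a measurable set $A$ with $\pi(A) = 1$ on which ergodic averages converge, and
$$\mu_0 \ll \pi \;\Longrightarrow\; \pi(A^c) = 0 \;\Rightarrow\; \mu_0(A^c) = 0 \;\Rightarrow\; \mu_0(A) = 1,$$
so a chain initialized from $\mu_0$ starts in $A$ almost surely and the law of large numbers carries over.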
Regarding performance vs. other methods like NUTS
Thank you for your positive feedback on our work and for recognizing the improvements to widely used methods; we really appreciate it! We completely agree that extending AutoStep to NUTS and HMC would be an exciting direction. We are currently working on that project, but we believe the current contribution stands on its own as a complete paper, and we plan to present the extension in future work.
All reviewers agreed that this is a nice paper which introduces a clear and sound methodology for incorporating adaptive step sizes into MCMC algorithms without requiring a fixed, finite adaptation period. While there were some concerns that the method might not be as performant, when applied to random walk MH (or MALA), as a well-tuned NUTS sampler (and thus the method as-presented likely would not be the state-of-the-art in many scenarios), I would agree that the reasons to accept this paper would clearly outweigh this concern. As this is a novel approach to a long-standing concern, it may well additionally inspire future follow-on work.