Hamiltonian Monte Carlo Inference of Marginalized Linear Mixed-Effects Models
Abstract
Reviews and Discussion
This work considers Bayesian inference via MCMC methods (specifically, Hamiltonian Monte Carlo) in linear mixed-effects models (LMMs). Due to linearity and Gaussianity, the random effects can be analytically integrated out, so that the MCMC algorithm only needs to target a distribution on a smaller space which does not include the random effects. This typically leads to more efficient MCMC algorithms. However, as the authors show, a naive marginalisation is costly, incurring a cost that is cubic in the number of observations and in the size of the vector of random effects. The main contribution is to show that, by exploiting sparsity, the random effects can be integrated out in a way that (typically) incurs the same computational complexity as evaluating the likelihood without any marginalization. The authors demonstrate that marginalization in this way improves the performance of the HMC algorithm compared with "no marginalization", both in terms of effective sample size and effective sample size per second.
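For readers skimming this thread, a minimal sketch of the marginalization being discussed, written in generic notation that may differ from the paper's:

$$
y = X\beta + Zu + \varepsilon,\qquad u \sim \mathcal{N}(0, \Sigma_u),\qquad \varepsilon \sim \mathcal{N}(0, \sigma^2 I_n)
\;\;\Longrightarrow\;\;
y \mid \beta, \Sigma_u, \sigma^2 \sim \mathcal{N}\!\big(X\beta,\; Z\Sigma_u Z^\top + \sigma^2 I_n\big).
$$

Evaluating this marginal density naively requires the inverse and log-determinant of the $n \times n$ covariance matrix, i.e. an $O(n^3)$ cost (or, after a direct application of the matrix inversion lemma, a cubic cost in the total number of random effects); the sparsity-exploiting implementation avoids both.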
Strengths
- I think this work makes a good contribution to the field: The proposed implementation of the marginalization (especially when applied to all random effects) does seem to greatly enhance the efficiency of the HMC algorithm in some applications and, as the authors prove rigorously, this does not come at the cost of an increased computational complexity.
- The presentation is good and the writing is clear.
- There is sufficient empirical demonstration, in my view.
Weaknesses
Title
I don't think the title (and the emphasis on HMC in the abstract and introduction) is entirely adequate. The wording "marginalized Hamiltonian Monte Carlo" implies to me that some component specific to HMC (e.g. the momentum variable) is analytically integrated out. However, the marginalization here is completely orthogonal to HMC. Indeed, it could be exploited within any other MCMC algorithm, or even within other Monte Carlo schemes (like sequential Monte Carlo samplers).
Presentation
Overall, I think the presentation is good. Just a few minor comments:
- In Table 1 and Figures 2, 3 and 4, referring to the "no marginalization" strategy as "HMC" is confusing since all methods use HMC. This is also inconsistent with, e.g., Table 2. Better replace "HMC" by "no marginalization".
- A number of symbols are not defined, or not clearly defined, on Page 2 (some symbols are only defined much later). More generally, the fact that symbols without the subscript denote the collection over all values of the subscript should be mentioned (unless I missed it).
- From the context, it is clear what the authors mean by "recovery" and "ancestral sampling" but unless these terms are somehow standard in the literature, I think they should be more clearly explained to avoid confusion.
- Perhaps the distinction between "classifications" and "groups" (around Lines 88 to 92) could be made more clear, especially because the examples in parentheses say "subject", "gender", and "age" in both cases which is a bit confusing.
- I couldn't quite understand Section 6.3. What does "marginalization in a vectorized way" mean?
- Check the bibliography; it has many mistakes, e.g., missing capital letters in proper nouns and journal names.
Other
- L241: document -> documentation
Questions
- Can this be extended to multivariate observations?
- As the authors mention, the "no marginalization" implementation seems to be slower than the proposed implementation that performs (some) marginalization. This seems surprising. Do the authors have an explanation for this?
Limitations
I don't foresee any negative societal impact.
I don't think the title (and the emphasis on HMC in the abstract and introduction) is entirely adequate.
Thanks for this comment. Please see the general response. Briefly, we propose to change the title and revise the introduction to clarify that the method is generally applicable, but we mainly focus on HMC for the scope of the paper, for the reasons described in the general response.
... Better replace "HMC" by "no marginalization". A number of symbols are not or not clearly defined ... "recovery" and "ancestral sampling" ... should be more clearly explained ... the distinction between "classifications" and "groups" could be made more clear ... Check the bibliography
We appreciate your meticulous efforts in reviewing our submission. We will make a complete revision of this manuscript to incorporate your suggestions.
What does "marginalization in a vectorized way" mean?
We refer to our method as vectorized marginalization, in contrast to the automatic marginalization in [1], which works only with scalar random variables. Compared to that baseline, we achieve the same goal of marginalization, but with complete vectorization in the computation. As far as we know, no existing works in automatic marginalization solve LMMs with vectorization.
Can this be extended to multivariate observations?
For multivariate observations, sparsity still exists in the model structure and our method should work with no issue.
... "no marginalization" ... slower than ... marginalization. ... Do the authors have an explanation for this?
There are two benefits from marginalization that may improve the sampling speed. The first is the reduced state dimension, which leads to less discretization error for the numerical integrator and thus a larger step size. The second is better geometry. There are certain structures, like funnels, that make adaptive HMC converge to a small step size. Such structures can be removed by marginalization, which makes HMC more efficient.
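To make the two parameterizations concrete, here is a toy random-intercept example in NumPyro (an illustrative sketch only, with made-up variable names, not the implementation from the paper):

```python
import jax.numpy as jnp
import numpyro
import numpyro.distributions as dist

def model_no_marginalization(group_idx, n_groups, y=None):
    # HMC explores (tau, sigma, u) jointly; small tau induces a funnel in (tau, u).
    tau = numpyro.sample("tau", dist.HalfNormal(1.0))
    sigma = numpyro.sample("sigma", dist.HalfNormal(1.0))
    with numpyro.plate("groups", n_groups):
        u = numpyro.sample("u", dist.Normal(0.0, tau))  # random intercepts
    numpyro.sample("y", dist.Normal(u[group_idx], sigma), obs=y)

def model_marginalized(group_idx, n_groups, y=None):
    # u is integrated out analytically: y ~ N(0, sigma^2 I + tau^2 Z Z^T),
    # so HMC only samples (tau, sigma); u can be recovered afterwards by
    # exact (ancestral) sampling from p(u | tau, sigma, y).
    tau = numpyro.sample("tau", dist.HalfNormal(1.0))
    sigma = numpyro.sample("sigma", dist.HalfNormal(1.0))
    n = group_idx.shape[0]
    Z = jnp.eye(n_groups)[group_idx]                  # one-hot random-effects design
    cov = sigma**2 * jnp.eye(n) + tau**2 * Z @ Z.T    # dense here; the paper exploits sparsity
    numpyro.sample("y", dist.MultivariateNormal(jnp.zeros(n), covariance_matrix=cov), obs=y)
```

In the marginalized version the funnel between tau and u is no longer part of the sampled space, which is the geometric effect described above.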
[1] Lai, J., Burroni, J., Guan, H., & Sheldon, D. (2023, July). Automatically marginalized MCMC in probabilistic programming. In International Conference on Machine Learning (pp. 18301-18318). PMLR.
Thank you for the clarifications!
The paper addresses the challenge of marginalizing latent variables in linear mixed-effects models (LMMs) to improve Hamiltonian Monte Carlo (HMC) sampling. The authors propose an algorithm for automatic marginalization of random effects in LMMs, leveraging fast linear algebra techniques. The experiments evaluate the performance of the proposed marginalisation on LMMs from several disciplines by using the No-U-Turn sampler in NumPyro and assessing performance via effective sample size (ESS), number of divergent transitions, and running time.
Strengths
This work focuses on an important problem in Bayesian modelling and inference: specifically the automatic marginalization of continuous latent variables in probabilistic models. While prior work exists that explores such approaches for discrete random variables, this is less so the case for continuous variables, so I believe the contribution is valuable to the community.
To the best of my knowledge this contribution is original, as I am not familiar with other work that explores such automatic marginalization of continuous variables for probabilistic programming.
I also thought the idea that marginalization, similarly to reparameterization, can alleviate difficult characteristics such as funnels in HMC is well articulated and personally it gave me a new perspective.
Weaknesses
My two main concerns about the paper are regarding the experiments and about the significance of the results.
Experimental Details:
Overall, I find that the experiments lack sufficient details and metrics:
- There is not enough information about hyperparameters used for each method and model, e.g. the step size for NUTS. Does the approach rely solely on NumPyro's adaptation routine? If yes, what parameters were used for that?
- The evaluation can use additional diagnostics and metrics to provide more evidence that marginalization improves inference: for example trace plots, and visualisation of divergences can be very useful. Effective sample size alone is not enough for properly assessing the quality of inference.
- I don’t think the number of divergences is a reliable metric without additional context, such as their location in the posterior distribution or the step size used. In the case of Section 6.2, the divergences are not necessarily indicative of any problems. In my opinion, the 12 divergences observed for R1, R2 are just 0.012% of all drawn samples, which is negligible. Where are those divergences located in the posterior plot? Are they grouped in one place?
- The choice of priors for various models seems to me quite broad and lacks justification.
Significance of Results
The claim that marginalization is always beneficial is not novel, as it follows from the Rao-Blackwell theorem. The paper mentions Rao-Blackwellization, but doesn’t go into details.
Edit: the rebuttal clarified on all of these points.
Questions
- Can you provide more detailed convergence diagnostics for your experiments, such as trace plots, for example?
- Are divergences grouped in specific regions of the posterior distribution?
- What is the impact of varying HMC parameters (e.g., step size, or adaptation routine parameters) on the number of divergences and the effective sample size?
- Could you elaborate on the choice of priors in your experiments? Why were these particular priors selected?
Edit: the rebuttal has answered all my questions.
Limitations
The last section of the paper addresses a few limitations of the approach, including addressing distributions beyond normal. However, the paper could benefit from further discussion of one, in my opinion, fundamental limitation of this approach, which is that it only considers cases where analytical marginalization is possible. How many practical cases are covered by this constraint?
Thank you for the questions about the experiments. We respond to your questions below, where we will reference additional results uploaded in the PDF response. Broadly, these additional results show additional metrics and experiments consistent with our original findings that marginalization improves inference.
Can you provide more detailed convergence diagnostics for your experiments, such as trace plots, for example?
In the additional PDF, we have included trace plots for the ETH instructor evaluation model as Figure 1. We will include more trace plots to analyze convergence in the next version of the paper. We also compute $\hat{R}$ for shorter chains on this model and present the results in Table 1 of the additional PDF. The results show that proper marginalization is effective in improving the efficiency of HMC for this model, with denser trace plots and better $\hat{R}$ values.
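For completeness, these diagnostics can be reproduced from a fitted NumPyro run with ArviZ; a generic sketch with placeholder variable names, not the exact script used for the additional PDF:

```python
import arviz as az

# `mcmc` is a fitted numpyro.infer.MCMC object
idata = az.from_numpyro(mcmc)
print(az.summary(idata, var_names=["tau", "sigma"]))  # reports ess_bulk, ess_tail, r_hat
az.plot_trace(idata, var_names=["tau", "sigma"])      # per-chain trace plots
```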
Are divergences grouped in specific regions of the posterior distribution?
We notice that the divergences of R1,R2 are related to specific combinations of values of some of the parameters. Figure 3 in the additional PDF shows the locations of divergences. Also, the number of divergences varies with different random seeds. While M2 and M2,R1 never report divergences, the numbers of divergences for R1,R2 are 12, 12, 528, 1373, 233 out of 10,000 samples for the first five random seeds. See Table 2 in the additional PDF for more details. We will include these divergence results in Section 6.2.
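For readers who want to reproduce this kind of divergence analysis, one way to extract divergence locations in NumPyro is via extra fields (a generic sketch with placeholder names such as `model`, `data`, and `seed`; not necessarily how the figures in the additional PDF were produced):

```python
import jax
from numpyro.infer import MCMC, NUTS

mcmc = MCMC(NUTS(model), num_warmup=1000, num_samples=10000)
mcmc.run(jax.random.PRNGKey(seed), data, extra_fields=("diverging",))  # record divergent transitions

samples = mcmc.get_samples()
diverging = mcmc.get_extra_fields()["diverging"]  # boolean array, one entry per draw
# keep only the divergent draws, e.g. to scatter-plot them against the posterior
divergent_draws = {name: value[diverging] for name, value in samples.items()}
```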
Does the approach rely solely on NumPyro's adaptation routine? … What is the impact of varying HMC parameters (e.g., step size, or adaptation routine parameters) on the number of divergences and the effective sample size?
We mainly use the default adaptation routine of NUTS in NumPyro, which has an adaptive step size with dual averaging, an adaptive diagonal mass matrix, a target acceptance probability of 0.8, and a maximum tree depth of 10. For the ETH instructor evaluation model, sampling is difficult without marginalization, so we set the maximum tree depth to 12. It is possible to reduce the step size by increasing the target acceptance probability. We have tested with different target acceptance probabilities. On the ETH instructor evaluation model, we find the benefits of marginalization are consistent across different choices of the target acceptance probability (Figure 2 in the additional PDF). Notably, marginalizing may not be beneficial in ESS in the original setting, but there are improvements when the target acceptance probability is 0.9. On the grouse ticks model, divergences still exist after increasing the target acceptance probability to 0.95 (Table 2 in the additional PDF).
Could you elaborate on the choice of priors in your experiments? Why were these particular priors selected?
For datasets with existing models such as those in [1], we follow their choices of the priors. For other models, we use generic and weakly informative priors [2]. Our conclusions are orthogonal to the choice of priors.
The paper mentions Rao-Blackwellization, but doesn’t go into details.
Thanks for this question. We will expand on the relationship to Rao-Blackwellization in the paper. In short, Rao-Blackwellization reduces the variance of an estimator for some quantity by taking the conditional expectation over some variables and sampling the remaining variables. In this paper, on the other hand, we improve the speed of obtaining samples from the remaining variables (by improving mixing times and/or reducing the cost per iteration). So while Rao-Blackwellization is spiritually similar, the Rao-Blackwell theorem does not apply to our situation. The point of our method is to obtain samples from the non-marginalized variables faster, not to integrate over the marginalized variables in some expectation. If one were interested in an expectation (of all the variables) one could apply Rao-Blackwellization on top of the samples returned by our method.
More formally, if one is interested in some expectation $\mathbb{E}[f(\theta, u)]$ (writing $u$ for the marginalized random effects and $\theta$ for the remaining variables), one can view Rao-Blackwellization as transforming the estimator as
$$
\frac{1}{N}\sum_{i=1}^{N} f\big(\theta^{(i)}, u^{(i)}\big) \;\longrightarrow\; \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}\big[f\big(\theta^{(i)}, u\big) \,\big|\, \theta^{(i)}\big],
$$
where $N$ is the number of samples (if the samples are coming from MCMC, essentially the same equation holds, with $N$ being the effective sample size).
In our paper, by marginalizing $u$ we aim to have a more effective MCMC method for $\theta$. Then, for each sample $\theta^{(i)}$, we recover $u^{(i)}$ via exact sampling from $p(u \mid \theta^{(i)}, y)$. So, essentially, we would improve the estimate of the expectation of some function $f$ through the transformation
$$
\frac{1}{N}\sum_{i=1}^{N} f\big(\theta^{(i)}, u^{(i)}\big) \;\longrightarrow\; \frac{1}{M}\sum_{i=1}^{M} f\big(\theta^{(i)}, u^{(i)}\big),
$$
where $M > N$ is a larger number of samples. That is, the improvement simply comes from more (effective) samples, due to faster mixing and/or more iterations per second.
Finally, since the distribution of $u$ given $\theta$ is tractable, one could imagine applying both our technique and Rao-Blackwellization, with the transformation
$$
\frac{1}{N}\sum_{i=1}^{N} f\big(\theta^{(i)}, u^{(i)}\big) \;\longrightarrow\; \frac{1}{M}\sum_{i=1}^{M} \mathbb{E}\big[f\big(\theta^{(i)}, u\big) \,\big|\, \theta^{(i)}\big],
$$
where again $M > N$. (In general it might be necessary to use ancestral sampling from $p(u \mid \theta, y)$ to estimate the inner expectation, but in many cases this should be quite cheap.) This would be interesting to explore more fully in future work.
[1] Nicenboim, B., Schad, D., & Vasishth, S. (2021). An introduction to Bayesian data analysis for cognitive science. Under contract with Chapman and Hall/CRC statistics in the social and behavioral sciences series.
[2] Gelman, A., et al. (2024). Prior Choice Recommendations.
Dear authors, thank you for your response. I appreciate the extensive rebuttal and the additional results you presented in the attached pdf. Thank you, in particular, for the plots visualising where divergences happen - I think these really help highlight that there is indeed a problem with the geometry that your method seems to address.
This has addressed all major concerns I had and I will increase my score accordingly. Looking forward to seeing the updated version of the paper when available.
The paper studies marginalization in linear mixed-effects models with a Gaussian likelihood. Marginalizing out variables is a common trick to improve the convergence of MCMC. As pointed out, for linear mixed-effects models with a Gaussian likelihood, naively doing so will actually increase the computation, as there are a few linear algebra operations that are costly. The paper introduces two tricks using the matrix inversion lemma to speed them up. Some experiments are performed to demonstrate the effectiveness and to compare against the reparameterization alternatives.
Strengths
originality
The paper presents a simple improvement to a known problem.
quality
The technique presented in the paper is relatively simple and straightforward. Such techniques are widely used in the Gaussian processes community.
clarity
The paper content is easy to follow; however, the title is a bit misleading, as the technique is not specific to HMC but generally applicable to all MCMC algorithms.
significance
The model family studied in this paper is pretty narrow, and even within this scope it's unclear how the presented method compares to other automatic marginalization methods.
Weaknesses
Models studied are narrow
Only models with Gaussian or log-Gaussian likelihoods are studied.
Experiments are not enough
- In the experiments there are only naive baselines. Other alternative automatic marginalization methods are discussed in the related work section but not compared against.
- Only 1 dataset (section 6.1) and 1 toy target (section 6.2) are studied.
Questions
Is the method specific to HMC?
Limitations
Limitations are discussed in Section 7; however, I do think those limitations make the paper a bit weak (not enough contribution) for acceptance.
Overall, reviewer UJHx suggests that similar techniques are known or standard and faults the paper for not comparing to them, but provides no details or citations. It is impossible to accept these claims without justification. We respond to specific comments below.
Such techniques are widely used in Gaussian processes community.
Although the first step is applying the matrix lemmas, we use additional structures specific to LMMs afterward to obtain a fast running time. This goes beyond what is commonly done in GPs.
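For reference, the two standard identities in question, written for a marginal covariance of the form $\sigma^2 I_n + Z\Sigma_u Z^\top$ (generic notation, not necessarily the paper's), are the matrix inversion (Woodbury) and matrix determinant lemmas:

$$
\big(\sigma^2 I_n + Z\Sigma_u Z^\top\big)^{-1} = \sigma^{-2} I_n - \sigma^{-4}\, Z\big(\Sigma_u^{-1} + \sigma^{-2} Z^\top Z\big)^{-1} Z^\top,
$$
$$
\det\!\big(\sigma^2 I_n + Z\Sigma_u Z^\top\big) = \det\!\big(\Sigma_u^{-1} + \sigma^{-2} Z^\top Z\big)\,\det(\Sigma_u)\,\sigma^{2n}.
$$

These identities move the cubic cost from the $n \times n$ system to the dimension of the random effects; the additional LMM-specific structure referred to above is what keeps even that remaining cost in line with an unmarginalized likelihood evaluation.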
Only models with Gaussian or log-Gaussian likelihood is studied.
We have discussed the scope in our general responses. One point of the paper is that normal marginalization covers many cases with continuous observations, given proper transformation functions such as the log function.
In the experiments there is only naive baselines. Other alternative automatic marginalization methods are discussed in the related work section but not compared against.
Automatic marginalization methods exist in the literature, but all except [1] have very different scopes and don’t apply to our setting, e.g., they marginalize discrete variables only or are tailored to sequential Monte Carlo or algebraic exact inference algorithms. The comparison with [1] is in Section 6.3. We are not aware of other baselines that are applicable to the models we consider. Reviewer UJHx provides no details or citations to claim that baselines are missing.
Only 1 dataset (section 6.1) and 1 toy target (section 6.2) are studied.
Reviewer UJHx counts 2 datasets, but our experiments include 13 datasets in total across Sections 6.1 to 6.4.
Is the method specific to HMC?
Our method works for any MCMC algorithm that accepts a black-box model. We highlight using HMC from practical perspectives. See the general responses.
[1] Lai, J., Burroni, J., Guan, H., & Sheldon, D. (2023, July). Automatically marginalized MCMC in probabilistic programming. In International Conference on Machine Learning (pp. 18301-18318). PMLR.
Thanks for the clarification. Two of my main concerns are resolved and I've raised my score accordingly.
This paper considers the problem of Bayesian inference in linear mixed-effects models. This is a popular class of models used throughout many areas of application. Inference in such models depends on being able to determine values of latent variables (fixed and random effects) given the observed data. This is typically done by writing a model of interest in a probabilistic programming language and then running an automated inference algorithm on it. For example, the Stan package runs a variant of Hamiltonian Monte Carlo on the model. In general, this can be a complex problem from a sampling point of view. The main contribution of this paper is to speed up vectorized random effects marginalization, which simplifies the sampling problem and leads to faster sampling. The authors note that marginalization tends to always be beneficial, and illustrate their proposed methodology on a well-chosen set of real-life examples.
Strengths
I like the clear motivation and formulation of the problem, including the challenges and limitations of the proposed methodology. The problem is of obvious interest and practical relevance and would be of interest to a wide audience. The presented solution is clean and straightforward in its use of determinant computation and matrix inversion lemma. The experimental results are convincing and of interest.
Weaknesses
The proposed methods are limited to Gaussian models, which the authors note. I would have appreciated some more commentary in this direction. The marginalization approach seems sufficiently interesting to be used widely, whenever possible.
Questions
What sort of prior structures allow for this to be applied?
Are there methods that can be developed that perform approximate marginalization in some way?
How hard is it to automate marginalization in practice?
Limitations
Yes.
What sort of prior structures allow for this to be applied?
In the context of LMMs, the distributions of the effects are assumed to be normal. We do not restrict the priors of other variables.
Are there methods that can be developed that perform approximate marginalization in some way?
In the general response, we briefly discussed one direction of generalization to models with discrete likelihoods via data augmentation. We think approximation, such as Laplace approximation, is another way to generalize exact marginalization. A pioneering work in this direction is [1]. However, the approximation leads to biased MCMC algorithms. Pseudo-marginalization could be used to correct the bias, but in our preliminary work, the computational advantages and tradeoffs are quite unclear; for example, embedding Newton-based optimization for Laplace’s method inside pseudo-marginal HMC required the computation of third order derivatives.
How hard is it to automate marginalization in practice?
There are two levels of automatic marginalization. The first is to marginalize given high level abstractions of the model, such as R formulas. It is possible to compile the abstractions into concrete models (e.g., probabilistic programs) while marginalizing one or more of the effects using our methods. The second is to marginalize given user-written probabilistic programs of LMMs. In such a case, some compilation or tracing technique is needed to convert the user’s program to a model representation suitable for manipulation. For example, the authors of [2] used program tracing to construct a graphical model representation that could be programmatically analyzed and transformed. To apply this methodology to LMMs, a special parser would also be needed to match the models to LMMs. We think these are interesting extensions of our work in the aspect of probabilistic programming languages.
[1] Margossian, C., Vehtari, A., Simpson, D., & Agrawal, R. (2020). Hamiltonian Monte Carlo using an adjoint-differentiated Laplace approximation: Bayesian inference for latent Gaussian models and beyond. Advances in Neural Information Processing Systems, 33, 9086-9097.
[2] Lai, J., Burroni, J., Guan, H., & Sheldon, D. (2023, July). Automatically marginalized MCMC in probabilistic programming. In International Conference on Machine Learning (pp. 18301-18318). PMLR.
Thank you for these clarifications.
General response
Title of the work
It’s a good point that the method is not limited to HMC. We will adjust the title to better reflect our contributions. We propose the alternative "Hamiltonian Monte Carlo Inference of Marginalized Linear Mixed-Effects Models". In our revision, we will also rework the introduction to distinguish between the contribution of marginalization and the application to HMC.
While marginalization could be applied with any MCMC algorithm that targets a log density, we chose HMC as the kernel (and in the title) for the following reasons:
- HMC is the working algorithm of the influential package BRMS, and has shown great success in solving real world linear mixed-effects models (LMMs).
- With HMC, the benefits of marginalization can be justified as reducing the state dimension $d$, thereby reducing the computation time of reaching an independent sample with the leapfrog integrator, whose cost scales approximately as $O(d^{5/4})$ under standard assumptions. Such a bound may not exist in the literature for all MCMC algorithms, so we were conservative in defining the scope of our method.
Normal / log-normal likelihoods
Normal and log-normal likelihoods in the paper cover a lot of studies with continuous response variables, and it is straightforward to generalize our method to other continuous domains with proper transformation functions. The package lme4 [1] has over 79,000 citations, and a substantial number of those studies have continuous observations. We have included 9 studies in cognitive science in our experiments as a demonstration. Also, in the Bayesian workflow, normal models can act as template models [2] when developing more complicated models. We believe the scope of this work has broad interest and will lead to exciting future work.
Exact marginalization in the context of discrete likelihoods (e.g., Bernoulli, binomial, Poisson) often involves data augmentation tricks such as [3] and other related works. Generalizing our methods to those likelihoods with such tricks may also require a novel MCMC algorithm, which is an interesting avenue for future work but beyond the scope of this paper.
[1] Bates, D., Mächler, M., Bolker, B., & Walker, S. (2014). Fitting linear mixed-effects models using lme4. arXiv preprint arXiv:1406.5823.
[2] Greengard, P., Hoskins, J., Margossian, C. C., Gabry, J., Gelman, A., & Vehtari, A. (2023). Fast methods for posterior inference of two-group normal-normal models. Bayesian Analysis, 18(3), 889-907.
[3] Frühwirth-Schnatter, S., Frühwirth, R., Held, L., & Rue, H. (2009). Improved auxiliary mixture sampling for hierarchical models of non-Gaussian data. Statistics and Computing, 19, 479-492.
This manuscript considers Bayesian inference in a popular class of models, linear mixed-effects models. The authors discuss strategies to improve the performance of Hamiltonian Monte Carlo sampling in this class of models, including how to use ideas from numerical linear algebra to accelerate the running time dramatically.
During the initial review period, the reviewers generally praised the work, identifying only some minor concerns, which were adequately addressed during the author-reviewer discussion period. Following the discussion period, there was consensus in favor of accepting the work.