Amortized Variational Transdimensional Inference
Abstract
Reviews and Discussion
The paper proposes a new normalizing flow architecture to handle “transdimensional posteriors”, that is, posteriors across models (with potentially different support) and the model index. They provide some theoretical insights into their model and evaluate it on some relatively simple benchmarks.
Strengths and Weaknesses
Strengths:
- The paper addresses an important problem of doing efficient inference across a large model space
- The theory seems sound based on the limited time I took to study it.
Weaknesses:
- Autoregressive flows are in some sense outdated compared to coupling flows, and nobody really uses autoregressive flows in practice anymore. I don’t quite understand why this kind of flow was chosen. Is this necessary or just an arbitrary decision?
- The solution to the model index surrogate seems a bit unnecessarily technical. Why would we consider a GP rather than just using a neural network, as your second proposal seems to suggest?
- The experiments are not super convincing to me. Are the achieved results good enough to justify the use and interpretation in practice? E.g., how good is the entropy achieved in Experiment 1?
Questions
- In which sense is the approach “amortized”? Do you mean amortized over models? Or some other space? I am asking because the word “amortized” is used in many different ways in the literature.
- Do I understand it correctly that we are using reverse KL for learning and hence need an analytic likelihood?
- I consider the use of autoregressive flows as a weakness (see above). Why is that necessary? Could we use other flow architectures or do we need the specific autoregressive structure for the transdimensional part?
- You say that rather than training a density for each individual model separately, it is preferable to construct a single density. In which circumstances specifically is this likely to be true?
Limitations
Limitations were not really mentioned in the discussion I believe. E.g., what are potential scalability issues to high dimensional model and/or parameter spaces? I assume there are some limitations in that regard but I don't see them discussed. Similarly, the limitations of the chosen architectures don't seem to be discussed really.
Final Justification
The authors responded thoughtfully to my concerns, partially addressing them.
Formatting Issues
none
We thank the reviewer for their time in evaluating our submission. While we note the critical tone of the review, we appreciate the opportunity to respond and clarify aspects of our work. We have made a sincere effort to address all points raised. All comments and questions will be addressed individually in an inline fashion.
Analytic likelihood: Do I understand it correctly that we are using reverse KL for learning and hence need an analytic likelihood?
The paper does indeed use a reverse KL loss for learning as mentioned in equation 2. The implementation requires a target density that can be evaluated up to a normalizing constant and is differentiable (usually using standard autograd tooling built into widely accepted libraries like PyTorch, TensorFlow or JAX). In the Bayesian setting, where the target is proportional to the product of the likelihood and prior, this requirement translates to only needing the likelihood to be differentiable and evaluable up to normalization. The likelihood does not need to have a closed‐form density; in fact, this is a motivating problem behind transdimensional inference, where the normalizing constant of each model is intractable. We will aim to clarify the text on this point.
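For concreteness, here is a minimal sketch of a reverse-KL objective with reparameterized flow samples; the names `flow` and `log_target_unnorm` are placeholders for illustration, not our code:

```python
import torch

def reverse_kl_loss(flow, log_target_unnorm, num_samples=256):
    # Draw reparameterized samples from the variational flow q_phi and obtain
    # their log-density as a by-product of sampling.
    x, log_q = flow.sample_and_log_prob(num_samples)
    # The target only needs to be evaluable up to a normalizing constant and
    # differentiable; autograd supplies the gradients.
    log_p_tilde = log_target_unnorm(x)
    # Monte Carlo estimate of KL(q || p), up to the unknown log-normalizer.
    return (log_q - log_p_tilde).mean()

# hypothetical usage: loss = reverse_kl_loss(flow, log_posterior); loss.backward()
```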
Flow choice: I consider the use of autoregressive flows as a weakness... Why is that necessary? Could we use other flow architectures or do we need the specific autoregressive structure for the transdimensional part?
In the variational inference setting, inverse autoregressive flows are preferred over coupling flows due to:
- Faster sampling during training (Kingma et al. 2017; Durkan et al. 2019). Autoregressive transformations use MADE, which has an asymmetric computational cost depending on the direction of the transformation. Inverse autoregressive flows are implemented so that sampling, and evaluation of the density of those samples as a by-product, requires only a single parallelizable pass; this is the only operation used in the variational inference optimization loop (see the sketch below). Conversely, density evaluation of out-of-distribution data points incurs a sequential, per-dimension cost due to the sequential inverse evaluation, which is the drawback often cited for these flows as generative models. Since our use case is primarily to search over a large model space in a variational inference setting, and to obtain expectations with respect to the amortized variational density, this is not a limitation in our setting.
- Higher expressiveness overall (Coccaro et al. 2024).
The desired factorization that is proved in Proposition 2.2 does rely on the autoregressive property of the flow; however, we do not see a barrier to deriving a CoSMIC architecture specifically for coupling flows in applications outside variational inference, such as simulation-based inference. This would be an interesting topic for future research. We will add some comments along these lines in the Discussion.
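To illustrate the cost asymmetry mentioned above, a minimal sketch using the nflows package (which our implementation builds on); the composition and layer sizes are illustrative only, not our exact architecture:

```python
import torch
from nflows.flows.base import Flow
from nflows.distributions.normal import StandardNormal
from nflows.transforms.base import CompositeTransform, InverseTransform
from nflows.transforms.autoregressive import MaskedAffineAutoregressiveTransform
from nflows.transforms.permutations import ReversePermutation

D = 8  # illustrative saturated parameter dimension

# Wrapping the MADE-based transform in InverseTransform yields an IAF:
# sampling (and the log-density of those samples) is a single parallel pass,
# while inverting arbitrary points requires D sequential passes.
transform = CompositeTransform([
    InverseTransform(MaskedAffineAutoregressiveTransform(features=D, hidden_features=32)),
    ReversePermutation(features=D),
    InverseTransform(MaskedAffineAutoregressiveTransform(features=D, hidden_features=32)),
])
flow = Flow(transform, StandardNormal([D]))

# Cheap: samples plus their log-density as a by-product (all the VI loop needs).
x, log_q = flow.sample_and_log_prob(1024)

# Expensive: evaluating external points triggers the sequential inverse passes.
log_q_external = flow.log_prob(torch.randn(16, D))
```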
Jointly learning densities for models: You say that rather than training a density for each individual model separately, it is preferable to construct a single density. In which circumstances specifically is this likely to be true?
This approach is preferable in large model-space settings. For instance, the model space of directed acyclic graphs (DAGs) scales super-exponentially: in the non-linear DAG example, the model space of 10-node DAGs has cardinality approximately $4.2 \times 10^{18}$. In such scenarios, it is infeasible to store individual normalizing flows for each model. We will revise our statement to include this explicit example.
Entropy gap + architectural limitations: How good is the entropy achieved in Experiment 1?
Figure 2 offers a holistic assessment of inference quality relative to a sampling baseline using RJMCMC. The entropy is reduced as flow expressivity increases; it also decreases with the size of the model probability (Figures 2, 5-10). The entropy gap scales with the model-space size (Figure 4) when the flow architecture is held fixed. This matches expectations about neural density expressivity and indicates that flow design should be tailored to the problem domain. We will improve our description of this in the Experiments by bringing Figure 11 plus discussion to the main paper as a visual indicator of approximation quality, and clarify the limits of the flow architecture in the Discussion.
Simplicity of benchmarks: They ... evaluate it on some relatively simple benchmarks.
We agree that the models used are initially relatively simple. That said, the robust variable selection problem is designed to be easily scaled to increased complexity in dimension and model space cardinality (see Figure 4, Appendix E.1). We also tackle non-linear DAG discovery, which is NP-hard and non-trivial in both computational complexity and setup. Our goal is not to dominate state-of-the-art benchmarks but to introduce a novel variational method generally applicable to challenging domains with few comparable approaches. We will include a statement along these lines in the revision.
Scalability: What are potential scalability issues to high dimensional model/parameter spaces?
Thank you for raising this question. Our response to reviewer cEAN on scalability provides some background on potential scalability issues for high-dimensional parameter spaces and high-cardinality model spaces. In addition, in Section 3 we derive scaling properties for the GP-surrogate model distribution. In particular, high-cardinality model spaces do require careful construction of a context encoder. This encoder is learned along with the flow, and is intended to non-linearly project the model context to a higher dimension to improve the expressiveness of the residual block described in Appendix A.3. We will add a paragraph to the discussion and experiment setup describing this consideration in the revised version, and refer the reader to the appendices for details.
Model index surrogate: seems ... unnecessarily technical vs just using a neural net
Thank you for this astute observation. We agree that in practice it is safe to default to Monte Carlo gradient estimation using a categorical distribution in low-cardinality problems, or a neural net like MADE in high-cardinality problems. However, the GP approach offers a different framework with its own flexibility, delivers competitive performance, and has the benefit of the theoretical convergence properties that we provide in Section 3. We will add a summary of this clarification to the discussion in the revised paper.
Amortisation: In which sense is the approach “amortized” (clarification)?
Here we mean that, as discussed in Section 3, we train one shared recognition network that (i) is conditioned on the model index to produce the model-specific conditional density in a single forward pass, and (ii) undergoes heavy optimization during training so that any subsequent inference over the posterior/target support requires only relatively cheap evaluation of the model distribution and of the IAF (no per-model VI runs). We will improve the text in Section 3 to make this clearer.
Additional References
Coccaro, Andrea, Marco Letizia, Humberto Reyes-González, and Riccardo Torre. 2024. “Comparison of Affine and Rational Quadratic Spline Coupling and Autoregressive Flows Through Robust Statistical Tests.” Symmetry 16 (8). https://doi.org/10.3390/sym16080942.
Thank you for responding in detail to my questions and concerns. I will raise my score to 4.
The paper proposes a flow-based variational approximation method for inference over models with varying parameter dimensionality, supporting targets where some parameters can be present or absent and where the dimensionality of each parameter group can vary as well. Varying dimensionality within each group is handled by 'dimension saturation' that expands the dimensionality to some assumed maximum, whereas a few alternative methods are proposed for addressing the discrete choice over the groups. The method is demonstrated in two experiments but its properties are not analysed in detail.
Strengths and Weaknesses
The paper addresses an important problem formulation and extends the flexibility of flow-based approximations. The solution is novel and highly non-trivial, despite largely leveraging existing tools as part of the solution. The paper provides good formal treatment of the topic, defining all necessary quantities clearly, and provides useful theoretical guarantees for some non-trivial components. For modelling the weights over the parameter groups, the authors provide multiple alternative solutions even though a single one would already have been sufficient for publication.
The breadth of the technical contribution is hence commendable, but this comes with a cost: the presentation is very dense and the paper is difficult to read for anyone who is not already familiar with all of the technical components and theories needed for the development. Moreover, the point of presenting three alternative solutions in Section 3 is a bit lost when the evaluation largely skips this question and just presents the overall performance without even clearly stating which variant was used. This is a bit problematic also in terms of the conclusions: Section 6 states the choice depends on the cardinality of the model space, but this is not really backed up by empirical evidence. Since the MC estimator does not seem to be empirically validated in detail, I wonder whether it is useful to describe it in the main paper. Perhaps that method variant itself could be in the appendix, leaving space for bringing in some of the extra experimental evidence from the appendix to the main paper, or for opening up the method description for a broader audience?
The experimentation is done nicely on two rather different kinds of problems, but the presentation in the main paper is a bit superficial and we do not learn very much about how the method really works, since the experiments mostly just confirm it works. For example, for Section 5.1 there are missing details that are only in the Appendix (the true ) and Figure 2 only compares alternative versions of the proposed method (that all look fairly similar) without providing any context in the form of alternative methods like sampling. The appendix includes such a comparison (Figure 11 showing the approximation is extremely similar to MCMC) that would have been highly useful already in the main paper, or at least the main paper should clearly state that the result is there.
Questions
- How well does the Monte Carlo estimator for high cardinality work in practice? If used for a problem of low cardinality, is it roughly as good as the GP solution or substantially worse? If worse, should we expect it to be accurate for high cardinality?
- Corollary 3.2 and the paragraph below it talk about convergence of the UCB-based solution but make no reference to , implying that the mean alone (using ) would already work as well. Can you clarify what is happening here? I probably missed something about the claim.
Limitations
Yes
Final Justification
The paper is in general of high quality with no major open issues. The authors addressed my questions during the review well, but they were minor aspects and there is no need for me to change the overall evaluation.
Formatting Issues
None
We thank the reviewer for their thoughtful and constructive comments, and we appreciate their interest in our paper. The feedback provided has helped us refine our approach, and we hope that our responses clarify and strengthen the manuscript. All comments and questions will be addressed individually in an inline fashion.
Presentation and clarity: Presentation is dense and paper is difficult to read. The paper should open up to a broader audience.
On reflection we agree, and will aim to use the additional page for the camera-ready version to improve overall paper accessibility and readability for a wider audience. This will be in addition to the specific changes discussed below.
Model weights variants: The point of 3 alternative solutions is lost when the evaluation largely skips this question. The performance of the MC gradient estimator, as it depends on the model space, is also not validated in detail.
Thank you for this insight. We will a) improve the text in Section 3 to make clearer the potential value and need for using different model weights estimators, and b) explicitly state in all experiments which particular estimator was used. Here, every experiment uses the Monte Carlo gradient estimator, whether as the main approach (Figures 2 & 3, Table 1) or in comparison to the variants (Figure 4). We discuss the MCG estimator performance in a separate comment below.
Detail of experimentation: The presentation in the main paper is a bit superficial and we do not learn very much about how the method really works.
Thank you – we agree. In line with the above responses we will use the additional manuscript space to expand upon experimental details and content. Some of this will entail bringing some of the content currently in the Appendix into the main paper (e.g. Figure 4 plus discussion, and if possible, Figure 11 and discussion). It will also involve improving the quality of the discussion, and bringing missing detail regarding the experimental setup (such as the true ) into the main paper.
High cardinality MC estimator: How well the Monte Carlo estimator for high cardinality works in practice? If used for problem of low cardinality, is it roughly as good as the GP solution or substantially worse? If worse, should we expect it to be accurate for high cardinality?
In Appendix E.1, Figure 4, we conducted a comparison between the diagonal surrogate, categorical Monte Carlo gradient, and neural (MADE, Germain et al. 2015) Monte Carlo gradient (MCG) approaches on problems of increasing model space cardinality, at two different levels of misspecification, for the robust variable selection problem type. We fixed the number of iterations and the architecture of the normalizing flow (RQS). From this, we observed that the MCG methods are competitive with the Gaussian surrogate approach, and in some cases performed slightly better (though this is subject to tuning of both approaches). The benefits and drawbacks of each approach cannot be summarised in a broad statement; rather, they are problem specific. For instance, we provided theoretical guarantees for the surrogate approach in Section 3. Also, there could be scope for future research into exploiting model space structure for the surrogate approach. The benefit of the MCG methods is, as you summarised, applicability to very high cardinality model spaces, where we use a neural (MADE) architecture for the model distribution. Figure 4 shows comparable empirical performance on the robust variable selection problem in model space cardinalities up to . The drawbacks are not concrete: we do not show the same theoretical guarantees, but this could be a topic of future research.
In order to communicate this comparison more clearly, we will move this content (Figure 4 plus discussion) to the main paper, and improve the clarity and quality of the discussion.
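As a point of reference, here is a minimal sketch of the kind of Monte Carlo gradient estimator referred to above: a plain score-function estimator over a categorical model distribution. It is illustrative only; the estimator and variance-reduction scheme in the paper may differ, and `theta_sampler` and `log_target_unnorm` are placeholders:

```python
import torch

def mcg_loss(model_logits, theta_sampler, log_target_unnorm, num_samples=64):
    """Score-function (REINFORCE-style) surrogate loss for the discrete model index.

    model_logits: learnable logits over the model space (low-cardinality case).
    theta_sampler(m): returns reparameterized flow samples and log q(theta | m).
    log_target_unnorm(theta, m): unnormalized log target density.
    """
    q_m = torch.distributions.Categorical(logits=model_logits)
    m = q_m.sample((num_samples,))                 # sampled model indices
    theta, log_q_theta = theta_sampler(m)          # flow samples given m
    elbo_terms = log_target_unnorm(theta, m) - log_q_theta - q_m.log_prob(m)
    # Pathwise gradients flow through theta; the score-function term handles
    # the discrete model index (detach() gives the usual REINFORCE reward).
    surrogate = q_m.log_prob(m) * elbo_terms.detach() + elbo_terms
    return -surrogate.mean()
```

In the high-cardinality case, the categorical `model_logits` would be replaced by a neural (e.g. MADE-style) parameterization of the model distribution.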
UCB convergence: Corollary 3.2 and the paragraph below that talks about convergence of the UCB-based solution but makes no reference to , implying that already the mean (using ) would work as well. Can you clarify what is happening here? I probably missed something about the claim.
Indeed, the value of the UCB exploration coefficient should not affect the asymptotic convergence rate according to our theoretical analysis, though in practice it helps the algorithm in the short term. This independence arises because the result in Corollary 3.2 depends mainly on maintaining a positive sampling probability over all models in the model space, which allows us to apply the second Borel-Cantelli lemma to bound the posterior variance of the GP surrogate (Lemma B.2) and control the KL divergence of the approximation.
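Schematically, the argument is as follows (in generic notation that may differ from the paper's; $q_t(m)$ is the sampling probability of model $m$ at step $t$ and $\sigma_t^2(m)$ the GP surrogate's posterior variance):

$$
\begin{aligned}
q_t(m) \ge \epsilon > 0 \ \ \forall t
&\;\Longrightarrow\; \sum_{t=1}^{\infty} \Pr(m_t = m \mid \mathcal{F}_{t-1}) = \infty \\
&\;\Longrightarrow\; m \text{ is visited infinitely often a.s. (second Borel-Cantelli lemma)} \\
&\;\Longrightarrow\; \sigma_t^2(m) \to 0 \text{ a.s. for every } m,
\end{aligned}
$$

and none of these steps depends on the UCB coefficient, which only reweights exploration in finite time.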
Thank you for the clarifications. I have no further questions at this point, and no reason to change my overall evaluation.
Spending the extra page to expand the background and ensure self-sufficiency is a good idea, and Figure 4 would indeed be highly useful in the main paper. In case you cannot move all of the planned supplementary material to the main paper (e.g. Figure 11), you can focus on ensuring the reader is aware that the information is provided in the appendix. That is, briefly explain the gist and the implications, while possibly still leaving the details and the figures in the appendix.
Thank you for your advice. We agree, and will focus on these points with the extra page.
The paper introduces a flow-based framework for variational inference on a space composed of discretely indexed parameter sets of differing dimensions, i.e., for transdimensional inference. To achieve this, independent auxiliary variables are introduced to increase the dimension of each set up to a prescribed value (defined as the maximum dimension within the transdimensional space). The authors refer to this approach as CoSMIC.
By demonstrating that the learned distribution is factorizable with respect to these auxiliary variables (Corollary 2.3), the authors construct a Gaussian Process (GP)-based surrogate for approximating the posterior distribution (Section 3.1). They also introduce a low-variance gradient estimator for the learning objective for when the discrete component of the transdimensional space exceeds the computational memory limit (Section 3.3). An empirical analysis of the proposed methods in simulated experiments confirms their effectiveness (Section 5).
All in all, the work is well-articulated and clearly motivated. In spite of the limited experimental evidence, I believe that CoSMIC paves the road for further research on amortized VI for problems like variable selection and structure learning.
Strengths and Weaknesses
Strengths
- The paper is well-written, clearly motivated, and mathematically precise.
- The authors propose a simple and effective solution to an important problem in the field of variational inference.
- The authors clearly distinguish the use cases for the surrogate-based and the Monte Carlo approximations of the target distribution, depending on the cardinality of the discrete component of its support.
- Applications on variable selection and structure learning highlight the method’s usefulness on well-known problems.
- Experimental results attest to the effectiveness of CoSMIC. However, the empirical analysis could be more comprehensive; see below.
- Reproducibility: the code has been publicly released in the supplementary material.
Weaknesses
- We are often interested in sparse solutions to the presented applications of variable selection (Section 5.1) and structure learning (Section 5.2). However, CoSMIC requires padding each parameter set to a prescribed dimension, which may unnecessarily increase the method’s complexity. Could the authors elaborate on the introduction of sparsity priors to the proposed solutions?
- I did not completely understand Figure 2’s bottom row. Is there a discernible pattern when looking at the plots from left to right? I would expect the cross-entropy to decrease as we increase the complexity of the variational density.
- Bayesian structure learning and causal discovery with non-linear mechanisms have been topics of major interest in recent years. VGB [1], JSP-GFN [2], DiBS [3], and BCD Nets [4] are just a few examples. The authors should discuss the relatively restricted baseline choice (DAGMA) in further detail.
- [Minor] Proposition 3.1 is straightforward; any distribution over a discretely indexed set can be uniquely represented as a distribution over the set of indices. The paper would be strengthened if the corresponding space had been used for a discussion on, e.g., further baseline methods.
[1] Bayesian learning of Causal Structure and Mechanisms with GFlowNets and Variational Bayes. arXiv.
[2] Joint Bayesian Inference of Graphical Structure and Parameters with a Single Generative Flow Network. NeurIPS 2023.
[3] DiBS: Differentiable Bayesian Structure Learning. NeurIPS 2021.
[4] BCD Nets: Scalable Variational Approaches for Bayesian Causal Discovery. NeurIPS 2021.
Questions
See Weaknesses.
Also, when discussing reversible jump MCMC (Section E.4), the defined ratio for the death move is always greater than or equal to one. I believe should refer to the ratio within the minimum operator. Is that right?
Limitations
Limitations of CoSMIC have been properly addressed throughout the manuscript.
Final Justification
This paper introduces a novel and principled approach for approximated Bayesian inference in spaces of varying dimension. The authors have clearly addressed my concerns, and I will keep my positive score.
Formatting Issues
The paper is adequately formatted.
We sincerely thank the reviewer for their detailed and insightful comments, as well as for their strong interest in our work and methodology. Their expertise is evident in the depth of the review, and we are grateful for the opportunity to respond. We have carefully considered each of the points raised and have done our best to address all concerns. All comments and questions will be addressed individually in an inline fashion.
Experimental detail: Empirical analysis could be more comprehensive.
We agree, and we will use the additional page for the camera-ready version to expand upon experimental details and content. Some of this will entail bringing some of the content currently in the Appendix into the main paper (e.g. Figure 4 plus discussion, and if possible, Figure 11 plus discussion). It will also involve improving the quality of the discussion (emphasising important points), and bringing missing detail regarding the experimental setup (currently in Table 2) to enable more tangible understanding and discussion of the most relevant points.
Sparsity: If we’re interested in sparse modelling solutions, do we really need to increase the modelling complexity to ? Can we elaborate on introducing any "sparsity priors" to the proposed solutions?
Thank you for this insightful question. In terms of what is implemented in the experiments: We use a prior that induces “extra” sparsity in the real-data example for non-linear DAGs (Sachs et al 2005). The use of this prior is to further down-weight the probability of graphs with more edges in order to reach an acceptable level of closeness to the “consensus” graph in Sachs et al 2005. In no other experiment do we use sparsity-inducing priors.
As to the wider question: Introducing sparsity-inducing priors will tend to produce a posterior with little weight on non-sparse solutions (unless justified by the data). Because we know (see experiments and Discussion section) that most of the effort in approximating the posterior focuses on high posterior model probability models, this means that even though the flows are defined in -dimensional space, the extra dimensions used in the higher-dimensional but lower-probability models will have little impact on minimising the loss. From this perspective the ability of VTI to approximate the posterior is not sensitive to this extra dimensionality, even though in practice the CoSMIC setup requires its specification. In general though, this question does open the door to future research on how to reduce these practical requirements (beyond hard-specification of a reduced ), perhaps via a more dynamic (run time) setup, though this may present implementational difficulties. Thank you for raising this question – we will include a discussion on this point in the Discussion of the revised version.
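To make this concrete, here is a schematic of a dimension-saturated variational density in generic notation that may differ from the paper's, where $u_m$ denotes the auxiliary padding coordinates for model $m$ and $\nu$ their auxiliary density:

$$
q_\phi(\theta_m, u_m \mid m) \;=\; q_\phi(\theta_m \mid m)\,\prod_{j} \nu(u_{m,j}).
$$

When the same auxiliary density is used to saturate the target, these padding factors (approximately) cancel in the reverse-KL objective, so low-probability, high-dimensional models contribute little additional optimization effort.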
Figure 2: I did not completely understand Figure 2’s bottom row. Is there a discernible pattern when looking at the plots from left to right? I would expect the cross-entropy to decrease as we increase the complexity of the variational density.
We apologise for the lack of clarity in Figure 2. You are correct to expect the cross-entropy to decrease when looking from left to right. In Figure 2 there are actually two problem types: medium misspecification (panels 1, 2, 3) and high misspecification (panels 4, 5, 6). Left to right (increasing flow expressivity) within each problem type, we observe decreasing cross-entropy. We will improve the plot with a clear dividing line between the two problem types to improve readability, and also improve the caption and surrounding text to make this clearer.
DAG experiment baselines: Authors should discuss the relatively restricted baseline choice (DAGMA) in further detail.
Thank you, we agree and will include the cited papers in the discussion. So far, we have run additional benchmarks using DiBS/DiBS+ (Lorch et al. 2021) in addition to DAGMA for the 10-node MLP DAG synthetic study (please see the revised results in the tables below; these have been added to Figure 3 in the revised paper), and we expect to run further benchmarks before the end of the reviewing period. We expect the additional benchmark performance to be close to VTI performance. (As an aside: we have found a small error in our plotting code, and DAGMA performance is now much closer to VTI, in line with expectations.) What we are aiming to demonstrate with this analysis is that the performance of a general methodology like VTI can be competitive with an application-specific methodology (like DiBS or DAGMA). In general, of course, one would expect the application-specific methodology to perform best. We will improve the discussion on this point in the Experiments section, along with the new benchmark results.
Proposition 3.1: is relatively straightforward ... The paper would be strengthened if the space had been used for a discussion ...
We agree, and will recover this space to be used for the improvements discussed above/below.
MCMC acceptance probability: Also, when discussing reversible jump MCMC (Section E.4), the defined ratio for the death move is always greater than or equal to one. I believe should refer to the ratio within the minimum operator. Is that right?
Thank you for picking up on this error. We mistakenly wrote that the acceptance probability is the reciprocal of $r_{\mathrm{birth}}$, whereas it should read
$$\alpha_{\mathrm{birth}} = \min\left(1, r_{\mathrm{birth}}\right), \qquad \alpha_{\mathrm{death}} = \min\left(1, \dfrac{1}{r_{\mathrm{birth}}}\right).$$
We will correct this in the manuscript. The acceptance probability is implemented correctly in the code.
Extra baseline results for Figure 3
F1
| Sample size | VTI | DAGMA | DiBS | DiBS+ |
|---|---|---|---|---|
| 16 | 0.165±0.021 | 0.275±0.034 | 0.013±0.006 | 0.035±0.014 |
| 32 | 0.248±0.029 | 0.376±0.030 | 0.056±0.019 | 0.100±0.030 |
| 64 | 0.370±0.042 | 0.467±0.032 | 0.098±0.028 | 0.162±0.049 |
| 128 | 0.374±0.035 | 0.454±0.028 | 0.164±0.035 | 0.236±0.049 |
| 256 | 0.420±0.035 | 0.472±0.031 | 0.227±0.035 | 0.338±0.033 |
| 512 | 0.466±0.033 | 0.499±0.035 | 0.266±0.033 | 0.405±0.045 |
| 1024 | 0.516±0.038 | 0.522±0.045 | 0.309±0.028 | 0.437±0.025 |
SHD
| Sample size | VTI | DAGMA | DiBS | DiBS+ |
|---|---|---|---|---|
| 16 | 24.143±0.610 | 32.889±1.149 | 22.089±0.869 | 21.857±0.941 |
| 32 | 24.274±0.856 | 29.667±2.629 | 21.689±0.912 | 21.136±0.941 |
| 64 | 22.218±1.045 | 23.111±1.746 | 21.278±0.954 | 20.226±1.195 |
| 128 | 24.226±0.931 | 24.444±1.268 | 20.622±0.991 | 19.486±1.262 |
| 256 | 24.908±1.040 | 22.778±1.412 | 20.067±0.955 | 18.090±0.957 |
| 512 | 26.296±1.190 | 22.333±1.918 | 20.100±0.932 | 16.889±1.201 |
| 1024 | 25.610±1.576 | 20.889±1.908 | 19.800±0.844 | 17.000±1.018 |
Brier
| Sample size | VTI | DAGMA | DiBS | DiBS+ |
|---|---|---|---|---|
| 16 | 20.844±0.792 | 32.889±1.149 | 21.993±0.901 | 21.811±0.961 |
| 32 | 20.102±0.820 | 29.667±2.629 | 21.264±1.010 | 21.101±0.941 |
| 64 | 18.428±0.962 | 23.111±1.746 | 20.526±1.088 | 20.168±1.193 |
| 128 | 19.966±0.653 | 24.444±1.268 | 19.298±1.209 | 19.394±1.248 |
| 256 | 21.283±0.975 | 22.778±1.412 | 18.049±1.171 | 17.923±0.968 |
| 512 | 23.956±1.257 | 22.333±1.918 | 17.268±1.131 | 16.888±1.201 |
| 1024 | 24.173±1.587 | 20.889±1.908 | 16.224±1.022 | 17.000±1.018 |
AUROC
| Sample size | VTI | DAGMA | DiBS | DiBS+ |
|---|---|---|---|---|
| 16 | 0.527±0.008 | 0.531±0.020 | 0.503±0.002 | 0.509±0.004 |
| 32 | 0.553±0.013 | 0.593±0.022 | 0.515±0.005 | 0.527±0.008 |
| 64 | 0.612±0.021 | 0.657±0.018 | 0.526±0.008 | 0.547±0.015 |
| 128 | 0.611±0.018 | 0.651±0.016 | 0.545±0.011 | 0.570±0.017 |
| 256 | 0.633±0.021 | 0.662±0.019 | 0.564±0.012 | 0.603±0.013 |
| 512 | 0.663±0.021 | 0.676±0.023 | 0.576±0.013 | 0.632±0.020 |
| 1024 | 0.706±0.026 | 0.693±0.027 | 0.590±0.012 | 0.641±0.011 |
Thank you for the detailed and clear answers. I will keep my score.
This paper introduces an extension of conditional normalizing flows for transdimensional target distributions, where the parameter dimensionality depends on the discrete model index . The authors propose a novel variational transdimensional inference (VTI) framework with two implementations: one leveraging Bayesian optimization via Gaussian process surrogates, and the other using Monte Carlo gradient estimation, which is more suitable for large-scale settings. Theoretical results are provided under certain assumptions, including convergence guarantees for marginal target distributions. The method is validated through numerical experiments on Bayesian robust variable selection and structure discovery in directed acyclic graphs (DAGs), demonstrating its effectiveness in challenging transdimensional inference problems.
Strengths and Weaknesses
Strengths:
- As far as I am concerned, designing variational inference approaches for transdimensional posterior estimation where models have varying parameter dimensionality, is new. This represents a novel and important contribution to the field.
- The paper also provides theoretical results regarding the convergence guarantees for the marginal distribution over the model space .
Weaknesses:
- The experimental evaluation is limited in scope. The proposed method is not compared with alternative approaches in the first application (Bayesian misspecified robust variable selection), and only a single baseline is considered in the second (Bayesian nonlinear DAG discovery). Including comparisons with additional relevant baselines (e.g., [1], [2]) would better contextualize the method's performance.
- The paper offers limited discussion of the computational cost and scalability of the method, particularly for the Bayesian optimization variant involving Gaussian process surrogates. A more detailed analysis of runtime, memory requirements, and scaling behavior—especially in high-dimensional model spaces—would enhance the practical relevance of the work.
- Variational inference methods are known to be prone to mode collapse, potentially underrepresenting multi-modal posterior distributions. It remains unclear whether the proposed approach is susceptible to this issue, and if so, whether any design choices (e.g., CoSMIC flows or masking strategies) help to mitigate it. A discussion or empirical check on this point would be valuable.
[1] Cheng, C., & Shang, Z. (2022). Robust Bayesian variable selection with model misspecification. Journal of Multivariate Analysis, 190, 104960. https://doi.org/10.1016/j.jmva.2022.104960
[2] Miller, J. W., & Dunson, D. B. (2019). Robust Bayesian inference via coarsening. Journal of the American Statistical Association, 114(527), 1113–1125. https://doi.org/10.1080/01621459.2018.1429275
Questions
- Are there other existing methods that could be adapted to the tasks used in your experiments? Including a broader set of baselines would help clarify whether the proposed approach achieves state-of-the-art performance.
- The problem setting of transdimensional target distributions is compelling. Could you provide additional real-world application domains where such distributions naturally arise, beyond the two studied examples?
- How is the variational distribution amortized over the model space ? More specifically, how is the conditional dependence on the model incorporated into the formulation of ? A more detailed description of the conditioning mechanism would be helpful for understanding and reproducing the method.
Limitations
- The method relies on careful design of conditional flows (CoSMIC flows) and masking strategies, which may require nontrivial tuning and domain knowledge to generalize to new tasks.
- The empirical evaluation is limited in baseline comparisons and application domains, making it difficult to assess how broadly the method generalizes or whether it consistently outperforms existing approaches.
Final Justification
The authors have adequately addressed my concerns, and I will keep my positive score.
Formatting Issues
None
We thank the reviewer for taking the time to engage with our manuscript and for their feedback. We appreciate the effort to provide a thorough assessment of the work, and we value the opportunity to improve our submission. All comments and questions will be addressed individually in an inline fashion.
Baseline comparisons: Including comparisons with additional relevant baselines ... would better contextualize the method’s performance.
We agree that baselines must be appropriate to the developed methods. The goal of this paper is to introduce the very first method to achieve a flexible, trans-dimensional variational algorithm (VTI), which we obtain by developing the CoSMIC architecture for an inverse autoregressive flow. In this sense, the relevant baseline is the ground-truth, to show that the method both works and performs as expected. This is the case with the first experiment (Bayesian misspecified robust variable selection) in which we compare to the ground truth obtained by RJMCMC output. To our knowledge, there are no other baseline methods capable of producing a variational approximation to the transdimensional posterior.
For the second experiment (DAGs), we agree that additional baselines are more important. Here, because of the immense size of the model space a ground truth is difficult to obtain, and so we evaluated VTI performance via typical DAG diagnostics in Figure 3 and Table 1. We agree that in this setting comparison to a single method (DAGMA) is not sufficient. So far, we have run the additional benchmarks DiBS/DiBS+ (Lorch et al. 2021) (please see the revised results in the tables in the response to zSgg). We expect to run further benchmarks before the end of the reviewing period. We expect the additional benchmark performance to be within the range of DiBS/DiBS+ and DAGMA. What we are aiming to demonstrate with this analysis is that the performance of a generic methodology like VTI can be competitive with an application-specific methodology. One would expect the application-specific methodology to perform best. We will improve the discussion on this point in the Experiments section. The additional results will be added to Figure 3.
We will use the additional page for the camera-ready version to expand upon experimental details and content. Some of this will entail bringing some of the content currently in the Appendix into the main paper (e.g. Figure 4 plus discussion, and if possible, Figure 11 plus discussion). It will also involve improving the quality of the discussion (emphasising important points), and bringing missing detail regarding the experimental setup (currently in Table 2) to enable more tangible understanding and discussion of the most relevant points.
Computational costs: The paper offers limited discussion of the computational cost and scalability of the method, particularly for the Bayesian optimization variant involving Gaussian process surrogates. A more detailed analysis of runtime, memory requirements, and scaling behavior—especially in high-dimensional model spaces—would enhance the practical relevance of the work.
Although we do not provide a comprehensive study of the computational costs and scaling in all settings (model space cardinality, maximum dimension of the model parameters), we do provide a limited study of performance when scaling the model space and parameter dimension simultaneously in Figure 4, Appendix E.1 (as discussed in the response to reviewer ctpC). Since our work builds on IAFs, it is more natural to refer to, e.g., Papamakarios et al. (2021), which provides a general overview of the computational costs. As a rule of thumb, we found in practice that the size of, and choice of transforms for, the flow composition are subject to the problem type, parameter dimension, and model space cardinality. Problems requiring a highly expressive parameter posterior should use rational quadratic spline transforms in the flow composition. Problems with high parameter dimension may require additional transforms in the composition. Problems with high model space cardinality may require a higher number of blocks in MADE (hidden layers) and/or additional transforms in the flow composition, as well as a context encoder. A brief description of the choice of flow architecture is in Appendix A.2.
For the surrogate-based estimator, we apply incremental, fixed-cost updates over the model space at each time step of the stochastic gradient descent loop. Although a naïve implementation of GPs would incur a cost per SGD step , equations 25 and 26 in the appendix describe the updates for the posterior mean and variance of a GP surrogate which we used in practice. In them, the kernel matrix is only evaluated over a -sized batch, hence leading to a matrix inversion cost per SGD step. Evaluating entries of the posterior predictive variance over a model space with cardinality costs per SGD step, as . Over a total of steps of SGD, we have a cost of due to surrogate updates. Hence, for moderately sized model spaces, these fixed-cost updates incur a lower cost if we can update the posterior covariance matrix over the entire model space.
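For intuition, here is a generic sketch of a batched Gaussian-process conditioning update of the kind described above, in our own notation rather than the paper's Equations 25 and 26:

```python
import numpy as np

def gp_batch_update(mean, cov, idx, y, noise_var=1e-2):
    """One incremental GP conditioning step on a batch of noisy observations.

    mean: (M,) current posterior mean over the model space.
    cov:  (M, M) current posterior covariance over the model space.
    idx:  (B,) indices of the models observed at this SGD step.
    y:    (B,) noisy ELBO-like observations at those models.
    Cost per step: one (B x B) inversion plus matrix products over the model space.
    """
    K = cov[np.ix_(idx, idx)] + noise_var * np.eye(len(idx))   # (B, B)
    gain = cov[:, idx] @ np.linalg.inv(K)                      # (M, B)
    new_mean = mean + gain @ (y - mean[idx])
    new_cov = cov - gain @ cov[idx, :]
    return new_mean, new_cov

# Hypothetical usage over a model space of size M with batch size B:
M, B = 200, 8
mean, cov = np.zeros(M), np.eye(M)
idx = np.random.choice(M, size=B, replace=False)
y = np.random.randn(B)
mean, cov = gp_batch_update(mean, cov, idx, y)
```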
We agree these ideas need greater prominence, and so will provide both enhanced and additional discussion as part of moving Figure 4 content (plus discussion) from the Appendix to the main paper.
Mode collapse: Variational inference methods are known to be prone to mode collapse... . It remains unclear whether the proposed approach is susceptible to this issue, and if so, whether any design choices (e.g., CoSMIC flows or masking strategies) help to mitigate it. A discussion or empirical check on this point would be valuable.
This is an important question that we did not address in detail. In extending VI to the transdimensional setting we inherit the same strengths and weaknesses of classical VI, and we make no claim to have eliminated the challenges of mode-collapse. Users of this method would need to take the same steps to manage it in the flows as in the non-transdimensional setting. In the case of mode collapse for the model distribution , we address this (within the limits of the specification of ) with our exploration versus exploitation strategies discussed in Section 3.
Thank you for raising this point. We will include a detailed discussion around this issue in the revised manuscript.
Other applications: Could you provide additional real-world application domains where such distributions naturally arise, beyond the two studied examples?
Trans-dimensional posteriors commonly arise in many other areas, including phylogenetic tree topology search (Everitt et al. 2020), geoscientific inversion (Sambridge et al. 2013), and gravitational wave detection (Littenberg and Cornish 2009). The revision will include these citations.
Amortization: How is the variational distribution amortized over the model space ?
As discussed in Section 3, we train one shared recognition network that (i) is conditioned on the model index to produce the model-specific conditional density in a single forward pass, and (ii) undergoes heavy optimization during training so that subsequent inference, across any model or dataset, requires only a relatively cheap feed-forward evaluation (no per-model VI runs). We will improve the text in Section 3 to make this clearer.
Conditioning mechanism detail: how is the conditional dependence on the model incorporated into the formulation of ?
An equation describing the conditioning mechanism is provided in Appendix A.3, lines 564-571, given inputs and context , where the context is what we use as the conditioner. This approach is inherited from the nflows package and based on the work in Papamakarios et al. (2017) and Durkan et al. (2019). Additionally, we use a context encoder to enhance expressivity over the model space, described briefly in Appendix A.2 lines 550-552. We appreciate that this could be better explained, and so we will expand the description in the Appendix to more clearly describe the architecture of the context encoder for each experiment.
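For readers less familiar with this mechanism, a minimal sketch of context conditioning in nflows with a toy context encoder; the layer sizes, the encoder, and the affine transforms are illustrative only and not our exact architecture (which uses the residual blocks and transforms described in Appendices A.2-A.3):

```python
import torch
import torch.nn as nn
from nflows.flows.base import Flow
from nflows.distributions.normal import StandardNormal
from nflows.transforms.base import CompositeTransform, InverseTransform
from nflows.transforms.autoregressive import MaskedAffineAutoregressiveTransform
from nflows.transforms.permutations import ReversePermutation

D, C_RAW, C_ENC = 8, 10, 32   # saturated dim, raw context dim, encoded context dim

# Toy context encoder: non-linearly projects the model representation
# (e.g. a binary inclusion vector or flattened adjacency) to a higher dimension.
context_encoder = nn.Sequential(nn.Linear(C_RAW, C_ENC), nn.ReLU(),
                                nn.Linear(C_ENC, C_ENC))

transform = CompositeTransform([
    InverseTransform(MaskedAffineAutoregressiveTransform(
        features=D, hidden_features=64, context_features=C_ENC)),
    ReversePermutation(features=D),
    InverseTransform(MaskedAffineAutoregressiveTransform(
        features=D, hidden_features=64, context_features=C_ENC)),
])
flow = Flow(transform, StandardNormal([D]))

raw_context = torch.bernoulli(0.5 * torch.ones(16, C_RAW))       # 16 model indicators
context = context_encoder(raw_context)                           # learned jointly with the flow
samples, log_q = flow.sample_and_log_prob(32, context=context)   # 32 draws per context
```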
Tuning complexity: The method relies on careful design of conditional flows (CoSMIC flows) and masking strategies, which may require nontrivial tuning and domain knowledge to generalize to new tasks.
Thank you for raising this point. We respectfully disagree that the design demands non-trivial masking strategies; in our opinion it is substantially less specialized than competing methods. We provide a cookbook approach to designing the masking (Figure 1 and Appendix A.2), which enhances automatability. Extending to other problem types is more straightforward than, for example, designing new RJMCMC proposals. The need for domain knowledge extends only to that required for the setup and specification of a typical Bayesian transdimensional inference problem, in the choice of priors and likelihood.
In order to make this clearer for the reader, we will improve the Discussion on this point in the revised version.
References
Everitt, R.G., Culliford, R., Medina-Aguayo, F., Wilson, D.J. 2020. “Sequential Monte Carlo with Transformations.” Statistics and Computing 30: 663–76.
Littenberg, T.B. and Cornish, N.J., 2009. Bayesian approach to the detection problem in gravitational wave astronomy. Physical Review D—Particles, Fields, Gravitation, and Cosmology, 80(6), p.063007.
Sambridge, M., Bodin, T., Gallagher, K., Tkalcic, H. 2013. “Transdimensional Inference in the Geosciences.” Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 371 (1984): 20110547.
Thanks for your detailed response. The only remaining issue is about the representation of different models , or in your language, the context, which seems important for the conditional dependence on  in the amortization approach. Are the models (or model index) treated as integers here or something else? For example, when  corresponds to different graphs in DAG learning, how is  represented?
Another question is about the index for the model-specific parameters. In the current formulation, it seems that all the models use the same order (i.e., index) for the parameters. As flow-based models (e.g., normalizing flows) are sensitive to the order of parameters, I am not sure if this would lead to extra constraints on the models that can be amortized over. Are the parameters across different models required to have similar alignment beforehand?
Are the models (or model index) treated as integers here or something else? For example, when  corresponds to different graphs in DAG learning, how is  represented?
There is no hard restriction on the representation of , so long as each model has a unique representation. Proposition 3.1 on line 144 introduces the idea that any discrete distribution over a finite support has a unique representation as a categorical distribution, i.e. any discrete where is finite can be mapped to a categorical variable. By way of example, in the first example on robust variable selection, is represented as a binary string that is directly equivalent to , i.e. the included variables are 1, the excluded are 0 in the string. In the non-linear DAG example, the representation of a model is described in Appendix F.7 on line 884 (we link to this appendix on line 206, although we agree the mention of the construction of the model indicator could be made clearer to the reader. We will rectify this in the camera ready version.) In this example, is a concatenation of the Lehmer code for the permutation matrix and the flattened binary string from the upper triangular matrix . Additionally, it is permissible to apply a domain-specific transform before it is input as a context to the flow composition . In this case, we compute the adjacency matrix directly from and and use that as the context. It should be noted that is not the context, but rather the model indicator. The context is any function of that preserves the unique mapping to each model. In the case of DAGs, an adjacency matrix uniquely represents the DAG and is therefore appropriate as the context input to the flow composition .
We believe this level of flexibility will enable users to map the proposed method to a large variety of inference problems.
Another question is about the index for the model-specific parameters. In the current formulation, it seems that all the models use the same order (i.e., index) for the parameters. As flow-based models (e.g., normalizing flows) are sensitive to the order of parameters, I am not sure if this would lead to extra constraints on the models that can be amortized over. Are the parameters across different models required to have similar alignment beforehand?
We follow the typical approach in Papamakarios et al. (2021) where flows are composed with reverse permutations between each transformation. This spreads the sensitivity to parameter order approximately evenly and yields good results. We describe the model-specific reverse permutation in Appendix A.2 on line 553. Unfortunately not all of these details could fit in the main paper body, but in the camera-ready version we will endeavour to link the concepts you have raised to the relevant appendices.
The reverse permutation used in normalizing flows is there to make the transformation more effective. The normalizing flow itself still depends on the specific order of the input. In your case, this would be a potential problem if different models have different parameters whose components cannot be aligned consistently. It would be good to add some discussion on this and its potential impact on your approach.
In your case, this would be a potential problem if different models have different parameters whose components cannot be aligned consistently.
In order to help us understand you correctly, could you please provide a toy example that illustrates your scenario, specifically what you mean by alignment, so that we can see if the framework meets your criteria? For instance, if we have two models, one being , the second , then we can fix either  or  without any issue. The input permutation will ensure both are left-aligned in the input to the flow transform. The ordering of variables, as shown by the robust variable selection and DAG examples, only needs to be specified within each model. Is this along the lines of what you are asking?
Yes, in your example, these parameters can be aligned. However, suppose  corresponds to different graphs, and  is some parameter associated with the graphs, say the length of the edges or the transition probabilities along the edges. When these edges cannot be aligned across different graphs, how can you assign an order for the parameters consistently? If you ignore the potential misalignment, the amortization would be hard as the variational approximation needs to figure it out by itself.
When these edges cannot be aligned across different graphs, how can you assign an order for the parameters consistently?
In a simplified DAG example where only linear dependencies (edge weights) are considered, the mapping of  to edge weights is strictly determined by the permutation ; this ensures consistency of the mapping across the model space. The length of  is  and this directly maps to an upper triangular matrix of the same dimension as the upper triangular binary matrix . Thus we have the weighted adjacency matrix .
In the non-linear DAG case, we describe the deterministic mapping of to the MLP parameters in Appendix F.
For better clarity, we will include the simplified case we described above in the appendix for the reader to better understand the mapping. In both cases, it is deterministic and consistent across all models by construction.
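A minimal sketch of the simplified linear case described above; the function names, mask construction, and the exact matrix convention are illustrative assumptions rather than the paper's code:

```python
import numpy as np

def weighted_adjacency(perm, U_mask, theta):
    """Deterministically map flow outputs to a weighted DAG adjacency matrix.

    perm:   (d,) permutation giving the topological node ordering.
    U_mask: (d, d) strictly upper-triangular binary matrix (edge indicators).
    theta:  (d*(d-1)/2,) flow outputs for the candidate edge weights.
    """
    d = len(perm)
    W = np.zeros((d, d))
    W[np.triu_indices(d, k=1)] = theta   # same fixed ordering for every model
    W *= U_mask                          # zero out weights of absent edges
    A = np.zeros((d, d))
    A[np.ix_(perm, perm)] = W            # relabel canonical order to node labels
    return A

# Hypothetical 4-node example:
perm = np.array([2, 0, 3, 1])
U_mask = np.triu(np.ones((4, 4)), k=1) * (np.random.rand(4, 4) > 0.5)
theta = np.random.randn(6)
A = weighted_adjacency(perm, U_mask, theta)
```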
If you ignore the potential miss-alignment, the amortization would be hard as the variational approximation needs to figure it out by itself.
This raises an interesting question that could potentially lead to future research. The question is whether it is easier for the optimization to preserve the same mappings across models. Regarding your suggestion that these variable mappings may be misaligned across models: would ignoring this make the optimization less efficient and produce a less expressive approximation of the target distribution? More fundamentally, this then poses the following questions:
- Is there shared information between models?
- If so, would careful manual construction of the neural networks underpinning the flows allow us to exploit this versus allowing the optimization to determine this for us?
In answer to (1), it is very easy to construct an example where there is no shared information between models, even over a large model space. In the robust variable selection example, the marginal posterior of each variable across all models is significantly different, and this was by design using a misspecified likelihood (Figure 11 in Appendix E visualises this phenomenon). We believe there is no general answer to (1). Other recent work (Kucharský et al. 2025) investigates using a conditional flow in the setting of simulation-based inference (SBI, i.e. not IAF-based VI, and not transdimensional to our knowledge) to amortize over the continuous parameters of each mixture component in a mixture model. In this scenario, it cannot be expected that each mixture component would share variables, and yet they demonstrated applicability of the amortization in several realistic scenarios when compared with MCMC.
Whether or not there is an answer to (2) could be a topic for future research. This may involve trialling embeddings, meta-flows, or any of the many options available in the neural architecture toolbox. We will add discussion around these points in the camera ready version.
Kucharský, Š. and Bürkner, P.C., 2025. Amortized Bayesian mixture models. arXiv preprint arXiv:2501.10229.
Thank you for the discussion. I am now in favor of accepting the paper.
The paper tackles the challenging question of how to perform amortized variational inference in settings where there is a prior over parameter spaces of different dimensions. While the approach leverages existing tools like Gaussian process surrogates, the overall methodology is highly novel and enables amortized inference methods to directly tackle challenging and important model selection problems like feature selection and structure learning. The paper is well-written although fairly dense, so the authors have promised to use the additional camera-ready page to improve the accessibility of the writing.
The paper stands out for a spotlight presentation because:
- It makes multiple technical contributions that could be of wide interest, including a new type of trans-dimensional normalizing flow and the new amortization method
- It tackles a problem not currently addressed in the literature (using amortized inference for model selection)
- The empirical results are compelling and demonstrate the breadth of application