PaperHub
Overall score: 6.8/10
Decision: Poster · 5 reviewers (min 3, max 5, std. dev. 0.7)
Individual ratings: 4, 5, 4, 3, 5
Average confidence: 3.6
Novelty: 2.8 · Quality: 3.0 · Clarity: 3.2 · Significance: 3.0
NeurIPS 2025

Large Language Bayes

OpenReview · PDF
Submitted: 2025-05-10 · Updated: 2025-10-29

Abstract

Keywords
bayesian inference, variational inference

Reviews and Discussion

Review
Rating: 4

This manuscript introduces Large Language Bayes (LLB), a method for automating Bayesian modeling from natural language descriptions using a large language model (LLM). The method first generates multiple candidate models from a user's prompt. A final prediction is then computed as a weighted average of the posteriors from each model, where each model's weight is determined by its marginal likelihood, i.e., how well it fits the observed data.
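The weighting scheme the summary describes reduces to a softmax over per-model log marginal likelihoods, followed by sampling from the resulting posterior mixture. A minimal sketch (illustrative only; function names are hypothetical and this is not the authors' implementation):

```python
import numpy as np

def bma_weights(log_evidences):
    """Self-normalized weights w_k proportional to p(x|m_k)."""
    a = np.asarray(log_evidences, dtype=float)
    a = a - a.max()                 # stabilize the exponentials
    w = np.exp(a)
    return w / w.sum()

def mixture_predict(samples_per_model, log_evidences, n_draws, rng=None):
    """Draw from the weighted posterior mixture sum_k w_k p(z|x,m_k)."""
    rng = np.random.default_rng(rng)
    w = bma_weights(log_evidences)
    out = []
    for _ in range(n_draws):
        k = rng.choice(len(w), p=w)             # pick a model by weight
        s = samples_per_model[k]
        out.append(s[rng.integers(len(s))])     # pick one of its posterior draws
    return np.array(out)
```

Normalizing in log space matters here: log evidences across candidate models can easily differ by hundreds of nats, which would overflow a naive softmax.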

Strengths and Weaknesses

  • Strengths
    • The paper presents an accessible tool for Bayesian modelling using LLMs, supported by a theoretical analysis.
    • Experiment results are encouraging.
  • Weaknesses
    • Sampling models from the LLM appears to be highly sensitive to multiple factors, including the prompt, generation order, and choice of model. The manuscript could mitigate practical concerns about this sensitivity with further experiments. (See Questions.)
    • The result in Theorem 2, addressing the posterior approximation quality, isn't achievable in practice, given the vast search space of models.

Questions

  1. When sampling PPL models, are the output generations individualized, meaning they each have their own “message thread”?
    • If yes, how did the authors deal with the LLM generating/suggesting the same model multiple times? Any statistics on this? Wouldn’t this behavior possibly be a bias built into the LLM? Do you have any thoughts on how to handle this when writing the model with the PPL?
    • If not, wouldn’t prior model output influence subsequent ones? That is, creating a chain of biases.
  2. How would this method compare with traditional Bayesian learning, where the optimization problem is to find the model parameters from a distribution that could, potentially, have a prior?
  3. Any thoughts on using the approximate ~p(z|x,m) as a score for dynamically prompting the LLM? In an iterative approach, one could use the posterior approximation as a guide for the LLM prompt template, aiming for more accurate models.

Limitations

  • Reproducibility of results is dependent on LLM size/quality and prompting
  • Inference demands limit the method to smaller use-cases

Final Justification

The authors have addressed my concerns in their rebuttal, although the response didn't address my point with the formality it would require. However, after reading other reviewers' points of view, I believe my concern is outweighed by some of the good strengths highlighted by my peers. I am changing my recommendation to a weak accept to reflect the discussion in the rebuttal phase.

Formatting Issues

N/A

Author Response

Thanks for your review!

When sampling PPL models, are the output generations individualized, meaning they each have their own “message thread”?

This is a good question. Yes, each sampled model had its own "thread", as mathematically required for each model to be a valid IID sample from p(m|t).

If yes, how did the authors deal with the LLM generating/suggesting the same model multiple times? Any statistics on this? Wouldn’t this behavior possibly be a bias built into the LLM? Do you have any thoughts on how to handle this when writing the model with the PPL?

If the LLM wanted to suggest the same model multiple times, it was free to do so. We simply treated those as separate samples and processed them independently. This is correct, mathematically, but is potentially computationally wasteful. (It's silly to run MCMC multiple times on the same model.)

Note that it's extremely rare for the LLM to generate exactly the same model. However—unfortunately—it's possible to create equivalent models, and these are somewhat challenging to detect. For example, equivalent models might use different variable names, or declare their variables in a different order. For these reasons we unfortunately can't easily offer any hard statistics. However, our qualitative impression is that this happens fairly often for the very simple models (e.g. coins), while for the more complex models, a large fraction of the generations are unique. (This can be seen, to some degree, by looking at the distribution of estimated marginal likelihoods in the detailed experimental results in section B.)

How would this method compare with traditional Bayesian learning, where the optimization problem is to find the model parameters from a distribution that could, potentially, have a prior?

We're very sorry, but we can't quite understand this question. We're confused by the description, which doesn't correspond to traditional Bayesian learning as we understand it. We'd like to reply, so we'd be grateful for a bit more detail on what's being asked here.

Any thoughts on using the approximate ~p(z|x,m) as a score for dynamically prompting the LLM? In an iterative approach, one could use the posterior approximation as a guide for the LLM prompt template, aiming for more accurate models.

Another good question! We have considered several variants of this idea, but haven't validated one yet.

Here's one option. Imagine you generate one model at a time and do inference on it before generating the next model. One could conceivably take some of the top scoring models and provide them as additional examples as part of the prompt. Done naively, this would, of course, bias the results. However, one could potentially correct this with importance sampling—after the chain of thought text, generate the model with the extra examples, compute the probability of that generation, and then also compute the probability using the LLM without the extra example models. In this way, one could hope that the model would converge to the true results more efficiently. We caution that we haven't actually done this yet. But we believe this general kind of strategy (along with the ideas we allude to on lines 303-307) may greatly improve the method.
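The correction described above is ordinary importance sampling over generations: sample from the biased (example-augmented) proposal, then reweight by the ratio of original-prompt to augmented-prompt probabilities. A bare-bones sketch, with all callables hypothetical and the scheme explicitly untested per the response:

```python
import numpy as np

def adaptive_sample(generate, logprob_aug, logprob_base, n_rounds):
    """Sketch of the proposed importance-sampling correction.

    `generate()` samples one model from the LLM with top-scoring models
    added to the prompt (the biased proposal q); `logprob_aug` and
    `logprob_base` score a generation under the augmented and the original
    prompt respectively. Weighting each draw by p/q corrects the bias,
    exactly as in standard importance sampling.
    """
    models, log_w = [], []
    for _ in range(n_rounds):
        m = generate()
        models.append(m)
        log_w.append(logprob_base(m) - logprob_aug(m))   # log p(m) - log q(m)
    a = np.asarray(log_w)
    w = np.exp(a - a.max())                              # stable normalization
    return models, w / w.sum()
```

As the response notes, scoring the model alone (rather than the chain-of-thought plus model) is the hard part; this sketch simply assumes such scores are available.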

Review
Rating: 5

The paper proposes a procedure by which to use LLMs alongside Stan or another PPL for posing probabilistic inference problems: prompt the language model with an informal description of the generative model and the target variables for posterior prediction, have it generate probabilistic programs, and then ensemble those probabilistic programs to predict the target variables. They demonstrate that this can work on some example problems.

Strengths and Weaknesses

Strength: manages to still research Bayesian inference while making use of the latest computational capabilities

Weakness: model averaging as a weighting strategy can suffer from high variance and unstable weighting in the case of model misspecification (see: https://proceedings.mlr.press/v238/reichelt24a.html), which is almost sure to happen when converting from natural language to probabilistic program via LLMs. Self-normalized importance sampling similarly has bias and variance issues of its own.

Strength: the authors created a fresh benchmark of probabilistic inference problems instead of using standard ones that could be in training corpora.

Questions

Can the authors systematically characterize which varieties of approximate inference they tried out in order to get the scheme we see in this manuscript? How do the experimental results vary if hyperparameters or the choice of Monte Carlo algorithms are varied?

Limitations

I would like to see an answer to my questions above to really get a sense of whether the inference results shown are robust enough to "deploy" to people who are not PPL full-timers.

Final Justification

I'm impressed with the authors' response to my question and the simplicity of their model averaging method, so I've raised my score from Borderline to Accept. The computational load of such a procedure is of course quite large, but after all you don't train or use LLMs when you're compute-poor.

Formatting Issues

N/A

Author Response

Thanks for your review!

model averaging as a weighting strategy can suffer from high variance and unstable weighting in the case of model misspecification

We believe these issues are captured by our analysis in section 5.1. In particular, see the expression for the asymptotic variance of the estimate in Eq. 9 (or the more careful analysis in C.2). This variance is determined by the divergence between the LLM prior p(m|t) and the posterior p(m|x,t). So the main question, as we see it, is whether the LLM prior p(m|t) is reasonably broad. We allude to these issues on lines 228-231, though it may be appropriate to give this discussion more emphasis. (Please also see our response to eRLK for more discussion of this point.)
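Eq. 9 is not reproduced in this thread; for readers following along, the standard self-normalized importance sampling result the response appeals to has the following shape (notation assumed from the surrounding discussion, with K models sampled from the prior and f the quantity being estimated):

```latex
\operatorname{Var}[\hat{\mu}]
  \approx \frac{1}{K}\,
  \mathbb{E}_{p(m\mid t)}\!\left[
    \left(\frac{p(m\mid x,t)}{p(m\mid t)}\right)^{2}
    \bigl(f(m)-\mu\bigr)^{2}
  \right],
\qquad
\mathbb{E}_{p(m\mid t)}\!\left[
  \left(\frac{p(m\mid x,t)}{p(m\mid t)}\right)^{2}
\right]
  = 1 + \chi^{2}\!\bigl(p(m\mid x,t)\,\big\|\,p(m\mid t)\bigr).
```

So the variance blows up exactly when the posterior over models concentrates where the LLM prior places little mass, which is the sense in which the prior must be "broad".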

Can the authors systematically characterize which varieties of approximate inference they tried out in order to get the scheme we see in this manuscript? How do the experimental results vary if hyperparameters or the choice of Monte Carlo algorithms are varied?

Sure. For the MCMC step, we immediately settled on using Stan's default algorithm with no modifications. We never experimented with any variants of this. The only question was how to estimate the marginal likelihood. To do this, we experimented a little bit with doing importance-weighted VI using normalizing flows. However, we found that this was a mess, as it's very difficult to make stochastic optimization over 10,000 models fully automatic. Thus we discarded it and switched to the current method.

The final method, while pretty expensive, is absolutely bulletproof and has very few hyperparameters to set. All that happens for each model is:

  1. Do MCMC using Stan's default algorithm.
  2. Compute the mean and covariance of those MCMC samples.
  3. Create a Gaussian with that mean and covariance.
  4. Repeat N times: Draw M samples from the Gaussian, and estimate the importance-weighted ELBO.

(See Algorithm 9 in section F)

There are literally no hyperparameters in this algorithm other than N and M. We used M=25 and N=10,000, simply on the basis that they seemed reasonable. We never experimented with tuning either of them.
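Steps 2-4 above (everything after the Stan run itself) can be sketched as follows. This is an illustrative reconstruction, not the paper's Algorithm 9: `log_joint` is an assumed callable returning log p(x, z | m) for the model in question, and the small jitter added to the covariance is our own numerical safeguard:

```python
import numpy as np

def iw_elbo_from_mcmc(mcmc_samples, log_joint, M=25, N=10_000, rng=None):
    """Estimate a lower bound on log p(x|m) from MCMC samples.

    Moment-match a Gaussian q to the (n, d) array of samples, then average
    N importance-weighted ELBO estimates of M draws each.
    """
    rng = np.random.default_rng(rng)
    mu = mcmc_samples.mean(axis=0)
    cov = np.cov(mcmc_samples, rowvar=False) + 1e-8 * np.eye(mcmc_samples.shape[1])
    L = np.linalg.cholesky(cov)
    d = len(mu)
    log_norm = -0.5 * d * np.log(2.0 * np.pi) - np.log(np.diag(L)).sum()

    estimates = np.empty(N)
    for i in range(N):
        eps = rng.standard_normal((M, d))
        z = mu + eps @ L.T                            # M draws from q
        log_q = log_norm - 0.5 * (eps ** 2).sum(axis=1)
        log_w = np.array([log_joint(zi) for zi in z]) - log_q
        m = log_w.max()                               # IW-ELBO: log mean weight
        estimates[i] = m + np.log(np.exp(log_w - m).mean())
    return estimates.mean()
```

Each inner estimate is the log of an average of M importance weights, so in expectation the result lower-bounds log p(x|m), which is the "gentle failure" property the response goes on to describe.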

Provided that Stan's MCMC method succeeds, this method is extraordinarily robust. After creating it on small test problems, we only ran it once to generate all the results. There was no secret fiddling. There may be some bias to the marginal likelihoods if the posterior for some model is poorly approximated by a Gaussian and importance-weighted VI cannot correct this. However, notice that if this happens, the failure is "gentle"—the resulting ELBO will be lower than the marginal likelihood, meaning that model gets lower posterior weight than it deserves. In this way, the method is very "safe".

Review
Rating: 4

This paper proposes a simple but powerful way to turn a plain-language problem description into a full Bayesian analysis without hand-coding a model. An LLM is prompted to emit many candidate formal models in a probabilistic-programming language; each model's posterior and marginal likelihood are approximated. The authors give a practical algorithm, theoretical justifications, and experiments on synthetic datasets to support their proposed idea.

Strengths and Weaknesses

Strength

  1. I think the paper is very well-written; it is very easy to follow what the authors wish to express.
  2. The paper provides a practical way for LLM users to perform principled Bayesian predictions, which leads to nice practical impact.
  3. Theoretically, one only needs to match the target variable, which in my opinion provides strong flexibility.

Weakness

  1. The experiments are done on synthetic data. While I would assume it is not easy to find real-world data in this case, can the authors provide some discussion on how one can apply their method in practice?
  2. Typo: line 9 (justified "an" analyzed).

Questions

  1. Is my understanding of "only need to match target" correct?
  2. Can the authors provide some discussion on how one can apply their method in practice?
  3. How does the computational cost behave when applying the proposed algorithm?

Limitations

Yes

Final Justification

The authors have addressed my concerns; I maintain my score.

Formatting Issues

NA

Author Response

Thanks for your review!

Is my understanding of "only need to match target" correct?

It is true that the user only needs to specify the target variables of interest. While running, the LLM can and does invent other random variables. This is extremely convenient since most applied statistics problems involve, e.g., hierarchical modeling and partial pooling. But here the LLM takes care of all that for you. (Please let us know if we misunderstand the question.)

Can the authors provide some discussion on how one can apply their method in practice?

Sure. In practice, all the user needs to do is provide a model text (as a .txt file) and a data file (in our implementation, as a .json file). Then, our implementation uses a custom system prompt and sequence of example generations to generate a set of candidate Stan models (See sections F.1, F.2, and G). Then, each of these models is compiled into a Stan executable and (if it compiles) MCMC inference is run. Finally, those MCMC samples are used to construct a Gaussian distribution. That Gaussian distribution is then used for importance-weighted VI to estimate a marginal likelihood. Finally, all the MCMC samples are weighted by the marginal likelihoods and returned to the user.
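The pipeline just described can be condensed into a short orchestration loop. All helper names here (`sample_model`, `fit_mcmc`, `log_marginal`) are hypothetical stand-ins for the LLM call, the Stan compile-and-sample step, and the Gaussian importance-weighted VI step respectively; this is a sketch of the control flow, not the released implementation:

```python
import numpy as np

def run_llb(problem_text, data, n_models, sample_model, fit_mcmc, log_marginal):
    """Sketch of the LLB pipeline: generate, fit, weight.

    `sample_model(text)` draws one candidate Stan program from the LLM,
    `fit_mcmc(code, data)` compiles and runs MCMC (raising RuntimeError on
    invalid programs, which are simply discarded), and `log_marginal(...)`
    estimates log p(x|m). All three are assumed interfaces.
    """
    draws, log_evid = [], []
    for _ in range(n_models):
        code = sample_model(problem_text)
        try:
            samples = fit_mcmc(code, data)
        except RuntimeError:            # failed compile or failed inference
            continue
        draws.append(samples)
        log_evid.append(log_marginal(samples, code, data))
    if not log_evid:
        return [], np.array([])
    a = np.asarray(log_evid)
    w = np.exp(a - a.max())             # stable softmax over log evidences
    return draws, w / w.sum()
```

The loop body is embarrassingly parallel, which matches the cluster setup the response describes: one CPU job per generated model, with only the final weighting done locally.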

Practically speaking, running this algorithm is a challenge. We run the LLM on a cluster with an A100 GPU using continuous batching. The generations are saved as text files. Then, separate (CPU) jobs are submitted for each to run MCMC and estimate the marginal likelihoods. The final combination is easy and can be done on any local computer.

How does the computational cost behave when applying the proposed algorithm?

Please see section F for specifics on this point. Note also that we also give some ideas for future directions to improve efficiency on lines 303-307.

Comment

Dear authors,

The response has addressed my question. I maintain my opinion on accepting the paper.

Review
Rating: 3

This paper attempts to make Bayesian inference widely available by automatically deducing the appropriate Bayesian model and performing inference under the model with the user of the system only being required to provide some data and a text description of the latent variables that they are interested in.

The paper outlines a very straightforward recipe that can be very easily replicated. First, use an LLM to draw samples of valid PPL programs and perform posterior inference over the latent variables with each program. Then use these posterior samples to do variational inference of the marginal likelihood of the data given the model. Finally, return a likelihood-weighted average over all the model posteriors.

The paper also provides other methods where a pure variational inference approach can be used.

There are some results provided on a few examples with a few hundred data points, including examples where the data type of the latent variable is not explicitly stated.

Strengths and Weaknesses

Strength : Novelty

While there are other papers on PPL generation, including generation of PPL programs from LLMs, there are none that explicitly perform model averaging as described here in a simple practical recipe.

The authors have also provided a second inference method using a purely variational approach, which might be novel.

Strength : Broad Applicability

The ability to make Bayesian Inference broadly applicable with a simple text prompt has the possibility to introduce an unprecedented level of rigor in scientific discourse where probabilistic inference is needed for an accurate interpretation of data but where statistical expertise is generally lacking.

Weakness : Detached from BMA literature

This paper is a very natural continuation of the Bayesian Model Averaging literature but it appears to attempt to negate that connection by a rather offhand remark, "However, existing algorithms for Bayesian model averaging (e.g. reversible jump MCMC [8, chapter 11]) do not appear to be applicable here, since they involve explicitly iterating over all models (here very large or possibly infinite) and/or using evaluations of p(m|t)."

This is not a fair remark. In the BMA literature, model averaging does use the posterior of the model given the data as the weight, which is very similar to the current paper. The current paper draws samples of the model using an LLM and then weights these samples using the likelihood of the data given the model. Now, depending on how one chooses to interpret the samples from the LLM, this can be considered technically identical.

In any case, there is a fair amount of literature in BMA including frequentist methods which could all be considered relevant. It is unclear if the two methods described in this paper are indeed novel unless a thorough comparison is made to BMA.

Weakness : Scalability

Now, PPL inference is already quite intractable and this paper adds another layer of intractability by generating a large number of PPLs and performing inference on all of them. This makes the approach appear to be impractical.

The paper demonstrates results on only very simple examples with a few hundred data points at best. This doesn't provide any reassurance on the scalability of the approach.

Weakness : Baseline Comparisons

The experiments don't really compare the generated posteriors to those that would be generated if, say, an expert were to write a PPL program. It would be fine to compare against the performance of one generated by a non-expert.

There is no discussion of whether the generated posteriors are in any way calibrated which makes it hard for the reader to convince themselves that the proposed method here is sound.

Questions

Why has the proposed method in this paper not been compared to related work in BMA?

Why have the algorithmic variants in section 5.3 been provided when there are no experiments to back up the claims there?

What kinds of models would experts and non-experts have generated for these scenarios and how do the highest weight models compare to those?

How well calibrated are the posteriors?

Limitations

yes

Final Justification

The authors have failed to place their paper in context with other BMA algorithms under a factually incorrect pretext. The authors claim that all other BMA algorithms explicitly iterate all models or require the ability to compute a model's prior probability. This claim is false. There are many decades of BMA algorithms that the authors are glossing over.

Without properly placing this paper in the context of other BMA algorithms, it is not possible to assess the novelty of this work. As such, we have to assume that this work is not novel.

Formatting Issues

None

Author Response

Thanks very much for your review!

Regarding this quote:

"However, existing algorithms for Bayesian model averaging (e.g. reversible jump MCMC [8, chapter 11]) do not appear to be applicable here, since they involve explicitly iterating over all models (here very large or possibly infinite) and/or using evaluations of p(m|t)."

We believe this remark is correct as written, although we concede that we weren't as clear as we could have been.

We certainly agree that the final posterior can be seen as an instance of Bayesian model averaging, and we tried to be explicit about that (See lines 50-51).

However, there is a crucial distinction. In the setup of this paper, we assume that it is not possible to evaluate p(m|t). We briefly allude to this on lines 75-79, but we should be more explicit as it is a subtle point.

There are two issues here:

First, most commercial LLM providers (e.g. OpenAI) simply do not provide probabilities for their generations—probably to make it harder for other companies to distill their models.

Second, for open-source LLMs (like llama, which we use) generation probabilities are available. However, in order to encourage the LLM to generate better models, we ask it to generate a "chain of thought" where it informally describes the modeling strategy. (See the examples in Section G.) We have found this to be crucial for generating high quality models. But the impact is that the generation probabilities for the model (alone) are not available. Rather, the generation probabilities give you something like p(informal text, formal model). Finding the probabilities of the model (alone) is completely intractable as it would require marginalizing out all possible informal texts.

Regarding scalability, we must be straightforward that we simply do not intend to claim great scalability. The current inference method is, frankly, extremely expensive, as it requires doing MCMC on ~1000 models. (See section F for some discussion of timing.) However, we believe that there is great opportunity for faster inference methods (see lines 303-307) that would greatly increase scalability. At the same time, we respectfully suggest that given that this is the first method to directly define this kind of posterior directly from user text, it is useful to establish the basic sanity of the framework, and that even a method that can only (currently) address modestly-sized problems constitutes a valid contribution.

Why have the algorithmic variants in section 5.3 been provided when there are no experiments to back up the claims there?

We have given a series of increasingly complex algorithmic variants so that it is possible to discuss and analyze the different algorithmic components separately. Our final algorithm (given explicitly as Alg 7, section D) uses all the enhancements discussed in this section, and these are crucial for making inference possible.

What kinds of models would experts and non-experts have generated for these scenarios and how do the highest weight models compare to those?

We were advised that doing this would be considered human subjects research and thus subject to extremely cumbersome legal and IRB requirements. Our hope was that the reader of the paper could largely play this role. For example, in Figures 6-8 (appendix B.1) it is shown that for the rain example, essentially all posterior weight is given to a single model, which we provided in the text as m^(742). Realistically, we believe it is reasonable to expect the reader of the paper to look at the given informal text and this model and judge for themselves if the final model should be considered reasonable. The same is true for the other models.

How well calibrated are the posteriors?

We ask for a bit more detail on what is requested here. (We may be able to provide it.) Typical notions of calibration (as, e.g., in a Brier score decomposition) require doing inference a number of times and comparing the predictions to observed outcomes. This would not be practical given the expense of this method. For the city temperature problem, we actually did provide some small amount of evidence in Figure 30.

Comment

I do not know anything about human-subjects research implications, so I will take the authors' word for it.

However, I do not accept the author's claim that this work is not related to BMA literature.

The authors are defending the explicit text in the paper, which is factually incorrect. BMA does not require "explicitly iterating over all models". BMA can draw samples from the prior over models, which is effectively what the current paper is doing. I understand that the authors can't get the probability p(formal model | informal text), but by repeatedly sampling from an LLM they are effectively sampling from this prior probability over models. In other words, exactly BMA.

I do sympathize with the authors regarding the inference time complexity, and I would have been happy to discount some of those objections if this work was properly placed in context.

Comment

We do not claim that the work is unrelated to the BMA literature. We must protest that this is not a fair summary of the paper or our response. On the contrary, the posterior is not just related but mathematically equivalent to BMA, the only difference being that the prior comes from an LLM. This is stated prominently in the paper right after the main derivation on lines 50-51:

Note that Eq. (4) can be seen as an instance of Bayesian model averaging [16 , 32 ], just with a prior over models that’s defined using an LLM.

This is also stated before the quote under dispute:

As mentioned in Sec. 2, LLB can be seen as an instance of Bayesian model averaging [16, 32]. However, existing algorithms for Bayesian model averaging (e.g. reversible jump MCMC [8, chapter 11]) do not appear to be applicable here, since they involve explicitly iterating over all models (here very large or possibly infinite) and/or using evaluations of p(m|t).

We accept that this may not be as clear as possible. But we stress that the second sentence is only intended to refer to algorithms for BMA. It is true that many BMA algorithms explicitly iterate over the space of models. It is also true that many BMA algorithms (including reversible jump MCMC) require evaluations of the prior density. (We should have been clearer that by "evaluations of p(m|t)" we meant evaluation of the density, rather than sampling m~p(m|t).)

All that this text is trying to say is that standard BMA algorithms all appear to have at least one of these two properties, and thus cannot easily be applied to this problem since the space of models is infinite and the prior cannot be evaluated.

We don't see any factual dispute here. The only issue appears to be the intended meaning of the text. Rest assured, we do not want to create the misleading impression that the posterior in this work is unrelated to BMA, and we accept that this passage could create that impression, even if that is not what we intend. We will revise the text in question to make it absolutely clear that it is only computational issues that make it difficult to apply standard BMA algorithms.

Review
Rating: 5

This work deals with a key limitation of probabilistic programming: model specification. Domain experts, such as clinicians, are typically not familiar with the syntax and semantics of probabilistic models, posing a significant barrier to quantitative analysis and requiring a dedicated data scientist or engineer to encode their domain knowledge into the model. Knowledge elicitation is often error-prone and cumbersome. This work addresses this problem using LLMs, allowing experts to describe their problems in English and translating the problem description (t) into a probabilistic program (m) using a decoder-only LLM, such as GPT. The program m describes a model over target variables (z) given observed variables (x), that is, p(z | x, m). Since LLMs are stochastic, the program m can be interpreted as having been sampled from p(m | t). The authors exploit this fact to make the program more robust by defining it as an ensemble, a set of sampled programs, with the final density over the target being a weighted average of each model's density. They consider two methods: flat averaging and Bayesian model averaging (BMA). While the former weighs each sampled model equally (that is, proportional to the prior distribution p(m | t)), the latter uses the posterior over the model p(m | x, t) ∝ p(m | t) p(x | m), which has the additional p(x | m) marginal likelihood term. Exact inference is exceedingly difficult. First, the LLM's distribution over the model m given the text description p(m | t) is usually inaccessible; only samples drawn from it are accessible. Second, the model's posterior density over the target p(z | x, m) and its marginal likelihood p(x | m) are hard to compute exactly. The former is estimated via NUTS, while the latter is estimated via variational inference, setting the variational distribution to be a Gaussian matching the empirical mean and variance of the samples.
The authors demonstrate the efficacy of their proposed method on five newly created domains: the fully Bayesian averaged model outperforms the flat average, especially in challenging domains.

Strengths and Weaknesses

Strengths

  1. This framework exploits the complementarity between LLMs and probabilistic programs: LLMs excel at natural language processing but cannot reason; probabilistic programs excel at reasoning but cannot deal with unstructured information.
  2. Empirical evaluation on newly constructed domains
  3. Model-agnostic approximate inference procedure for Bayesian model averaging.

Weaknesses

  1. Generated models failing to compile due to syntactic invalidity seems to be a major limitation. This seems to get worse in complex domains like Gold. This might be addressed by using constrained generation, e.g., Loula et al. (2025)
  2. While the authors construct new domains to avoid leakage with the LLM's training data, the domains themselves are relatively simple. It appears that the complexity of the domain, the size of the text description, and its conflicts with the LLM's embedded information would affect the quality of models.

References

Loula, João, et al. "Syntactic and Semantic Control of Large Language Models via Sequential Monte Carlo." ICLR, 2025.

Questions

  1. Since the models are LLM-generated and the experts are not familiar with probabilistic programming, there seems to be a risk of the model encoding undesirable biases from the LLM’s training data.

  2. Section 5.1 says that "These results underline the importance that the LLM-defined prior be relatively broad." Can you elaborate on this point? Would imposing syntactic validity constraints on the LLM output make it less likely to assign low p(m | t) to models having high marginal likelihood?

Limitations

yes

Final Justification

Based on the other reviews and the authors' response, I understand the manuscript better. I will keep my positive evaluation and increase my confidence to 3

Formatting Issues

none

Author Response

Thanks very much for your review! We particularly appreciate the summary, which is, as far as we can see, accurate in every detail.

Generated models failing to compile is indeed a significant issue. Though, in practice it's not as severe as you might think. We take advantage of continuous batching (parallel inference) when generating models from the LLM which speeds things up enormously. The good news about invalid models is that they are quickly rejected during the compilation step, meaning no computational effort needs to be expended on them during inference. (See section F for some more discussion of this point.)

We have indeed considered using constrained generation. In fact, we initially were generating models using llama.cpp, which supports constrained generation (https://github.com/ggml-org/llama.cpp/blob/master/grammars/README.md), and we noticed that Stan has a published BNF language syntax (https://mc-stan.org/docs/reference-manual/syntax.html). The truth is that we ultimately did not pursue this simply because llama.cpp did not support continuous batching, and we could not find any LLM software that supported both continuous batching and constrained generation. As continuous batching made generation 1-2 orders of magnitude faster, we could not forego it.

We apologize for being somewhat unclear with the statement in section 5.1 that "These results underline the importance that the LLM-defined prior be relatively broad.”

What we were trying to say here is that the variance in Eq. 9 is determined by the chi-squared divergence between p(m|x,t) and p(m|t). To clarify notation, here p(m|t) is the conditional distribution over models from the LLM given only the model text, while p(m|x,t) would be the conditional distribution when you also condition on the observed data x.

Consider, for example, a case where the LLM is extremely "overconfident" in that it assigns almost all probability mass to "bad" models m where the marginal likelihood p(x|m) is very low. In this case, it is conceivable that the posterior p(m|x,t) ∝ p(m|t) p(x|m) would be more influenced by the marginal likelihood p(x|m) than the LLM prior p(m|t) and so could concentrate in regions where p(m|t) is very low. If that were to happen, the chi-squared divergence between p(m|x,t) and p(m|t) would likely be very large, meaning that estimates from this method would have very high variance.

For this reason, we believe it is important that the LLM be instructed to be quite "open minded" about what a good model might be. This, of course, depends on the base LLM. But one can also try to encourage this by providing a good system prompt and a good set of examples, as we sought to do.

Imposing syntactic validity constraints would not have much impact on this issue. Since we reject all invalid models, the only impact of imposing those constraints would be to improve the fraction of valid models. Essentially, whether you find 1000 syntactically valid models by generating and rejecting or by imposing validity constraints, the resulting accuracy would be the same.

Ultimately, we suspect that general progress in LLMs will improve the fraction of valid models, as it has done in the generation of other kinds of code (e.g. python code), enough that this will become a fairly minor issue. We also believe there is great opportunity for better inference methods that would make it practical to address more complex problems (see lines 304-307). We hope that this paper establishes the basic idea of defining posteriors directly from informal user text, and that with future progress in inference methods it will be practical to address large-scale problems.

Comment

There are only 24 hours left in the discussion period, and so far we've only heard from one of five reviewers. If any reviewers have further comments, we are eager to have a more detailed discussion.

Final Decision

This article combines large language models (LLMs) and probabilistic programming to define a joint distribution over models, latent variables, and data. The LLM is used to sample candidate models from informal descriptions, while the probabilistic program performs per-model posterior inference and marginal likelihood estimation. The approach instantiates Bayesian model averaging (BMA): the posterior is approximated via self-normalized importance sampling over models, relying on (i) sampling models with the LLM, (ii) estimating per-model posteriors using Stan or similar, and (iii) approximating per-model marginal likelihoods via variational inference. Experiments are conducted on synthetic data.

After discussion, the paper remains borderline, with a majority leaning toward acceptance.

Strengths:

  • Clear writing.
  • Novel combination of LLMs with probabilistic programs.
  • Model-agnostic approximate inference.
  • Generality of the framework.

Limitations:

  • Experiments are limited to synthetic data.
  • Computational cost is high, limiting applicability.
  • The importance sampling-based BMA is relatively simple and may suffer from high-variance estimates.
  • Some inaccuracies in the discussion of the BMA literature.

The authors’ response satisfactorily addressed most concerns; however, issues regarding the treatment of the BMA literature remain and should be corrected in the final version.