PaperHub
Score: 7.3 / 10
Poster · 4 reviewers
Ratings: 4, 5, 5, 4 (min 4, max 5, std 0.5)
Average confidence: 3.3
Novelty: 2.3 · Quality: 2.8 · Clarity: 2.8 · Significance: 3.0
NeurIPS 2025

Rollout Roulette: A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods

OpenReview · PDF
Submitted: 2025-05-12 · Updated: 2025-10-29
TL;DR

We introduce a Particle Filtering approach to Inference Scaling that is robust against inherently imperfect reward models and performs significantly better than previous methods.

Abstract

Keywords
Large Language Models · Inference Scaling · Particle Filtering

Reviews and Discussion

Review (Rating: 4)

This paper introduces a particle filtering algorithm for inference-time scaling to solve early pruning and diminishing-returns issues in search-based LLM inference. It casts test-time compute as posterior inference over a state-space model—using the LLM as a transition model and a process reward model (PRM) as an emission model—and iteratively resamples, mutates, and reweights a diverse set of candidate sequences (particles) to maintain exploration–exploitation balance under uncertainty. The key insight is that preserving trajectory diversity prevents premature pruning and yields robust, efficient scaling without exponential rollout growth.
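As a minimal sketch of the described resample–mutate–reweight loop (illustrative placeholder callables for the LLM and the PRM; the weight handling and defaults are assumptions, not the paper's implementation):

```python
import numpy as np

def particle_filter(prompt, propose_step, prm_score, n_particles=8, n_steps=20, seed=0):
    """Sketch of PRM-guided particle filtering over reasoning steps.

    propose_step(text) -> one sampled reasoning step continuing `text` (stand-in for the LLM).
    prm_score(prompt, partial) -> score in (0, 1) for the partial solution (stand-in for the PRM).
    """
    rng = np.random.default_rng(seed)
    particles = [""] * n_particles            # partial trajectories
    weights = np.ones(n_particles)            # unnormalized weights

    for _ in range(n_steps):
        # Mutate: extend every particle by one step (these calls can be batched in parallel).
        particles = [p + propose_step(prompt + p) for p in particles]
        # Reweight: fold the PRM score of each new partial trajectory into its weight.
        weights = weights * np.array([prm_score(prompt, p) for p in particles])
        # Resample with replacement via a softmax over weights: stochastic, so
        # low-scoring (but possibly correct) paths keep a non-zero chance of surviving.
        probs = np.exp(weights - weights.max())
        probs /= probs.sum()
        idx = rng.choice(n_particles, size=n_particles, p=probs)
        particles = [particles[i] for i in idx]
        weights = np.ones(n_particles)        # standard PF reset after resampling (assumption)

    return particles
```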

Strengths and Weaknesses

Strengths

  1. Quality / Clarity: The writing is well structured and the figures clearly illustrate the algorithm (e.g., the particle update pipeline). Method details and advantages are presented in a direct, easy-to-follow manner.
  2. Significance: Early pruning in search-based LLM inference is a critical bottleneck; the proposed particle filtering (PF) algorithm is well matched to this problem, as it preserves candidate diversity and balances exploration–exploitation under reward uncertainty.
  3. Originality: Applying particle‐based Monte Carlo filtering to inference-time scaling in LLMs is novel, providing a principled Bayesian framing and opening a new axis of compute allocation beyond deterministic search.
  4. Quality: Comprehensive experiments on MATH500 and AIME 2024 show 4–16× faster scaling rates than deterministic baselines. Ablation studies on compute allocation, PRM aggregation, and temperature yield useful insights into each module’s contribution.

Weaknesses

  1. Significance / Clarity: Regarding Lines 43–46, there appears to be a trade‐off: early pruning can save compute by discarding unlikely trajectories and reallocating resources to high‐probability regions. Could you clarify under what conditions—such as problem difficulty or reward‐model uncertainty—additional exploration becomes beneficial?
  2. Originality / Significance: Monte Carlo Tree Search (MCTS) also balances exploration and exploitation to avoid premature pruning, and the PF algorithm proposed in this paper shares strong operational similarities with it. The paper does not discuss MCTS: how does PF differ from or relate to MCTS? Including MCTS as a baseline and comparing performance would strengthen the evaluation.

Questions

Please refer to the “Weaknesses” section above.

Limitations

yes

Final Justification

Thank you to the authors for the clear and direct rebuttal. Your responses successfully addressed my primary concerns regarding the trade-offs of exploration and the comparison to MCTS. I am maintaining my "Borderline Accept" score and now have stronger confidence in the paper's contributions.

My main reservations were about the lack of comparison to MCTS and the conditions under which the proposed exploration strategy is beneficial. The rebuttal addressed these points well:

  1. MCTS Comparison: The clarification of the fundamental differences between Particle Filtering (PF) and MCTS—particularly PF's robustness to reward noise and superior parallelism—was insightful. The new empirical results showing PF outperforming a recent MCTS baseline on the AIME benchmark provide the necessary evidence to position this work effectively within the existing literature.
  2. Value of Exploration: Your explanation that reward model uncertainty makes exploration valuable for most non-trivial problems is a reasonable and convincing argument. It clarifies the trade-off and better frames the problem setting where your method is most impactful.

After the rebuttal, I believe the paper is much stronger. The key remaining weakness is that the experimental comparison to MCTS, while valuable, is still limited to a single setting. However, the conceptual arguments and the new data are sufficient to demonstrate the novelty and value of the proposed PF approach.

Formatting Issues

No obvious format issues.

Author Response

Thank you for your thoughtful and constructive review - we're glad you enjoyed the narrative clarity, novelty of applying particle filtering to inference-time scaling, and the strength of our experimental results and analysis.

 

Conditions Under Which Additional Exploration Is Helpful: It’s true that in easier problem settings - where the language model has a very high probability of answering correctly - extensive exploration may not be necessary, and early pruning can save compute. That said, reward models are inherently uncertain: they’re not verifiers, but transformer models trained to assign scores, and scoring partial solutions is particularly noisy and difficult. Because of this built-in uncertainty, for most problems, and especially for non-trivial tasks, the stochasticity in particle filtering is beneficial - it helps retain potentially correct paths that might otherwise be prematurely discarded.

 


MCTS Comparison: Thank you for this thoughtful comment. While both PF and MCTS seek to balance exploration and exploitation, they differ fundamentally in design and suitability for inference-time scaling of LLMs.

  • Design:

    • MCTS builds a tree and backpropagates scores from full rollouts or value estimates.

    • PF maintains a flat particle population, updates weights step-by-step, and resamples stochastically.

  • Value Function Sensitivity:

    • MCTS can use value estimates to avoid full rollouts, but errors in these estimates backpropagate and bias future decisions.

    • PF avoids this by using only local stepwise scores in a stochastic manner - more robust to reward noise.

  • Parallelism & Scalability:

    • MCTS requires sequential updates and backtracking, limiting parallelization.

    • PF updates all particles in parallel, aligning better with modern inference engines like vLLM.

  • Empirical Comparison:

    • We ran MCTS (from Zhang et al. 2024) vs. PF on AIME 2024 with LLaMA3.1-8B and budget 32.

    • PF: 16.6% accuracy, MCTS: 6.6%—PF outperforms MCTS under equal compute.

We will include this empirical comparison and discussion in the final version.
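As a rough sketch of the parallelism point (the `batched_generate` callable is a hypothetical stand-in for a vLLM-style batched API, not code from the paper), every particle can be advanced with a single batched call per step, with no tree traversal or backpropagation between calls:

```python
from typing import Callable, List

def advance_particles(particles: List[str],
                      batched_generate: Callable[[List[str]], List[str]]) -> List[str]:
    """Extend all particles by one reasoning step with a single batched LLM call."""
    continuations = batched_generate(particles)   # one batch, no tree traversal or backprop
    return [p + c for p, c in zip(particles, continuations)]
```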

References:

Zhang, D., Huang, X., Zhou, D., Li, Y., & Ouyang, W. (2024). Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B. https://arxiv.org/abs/2406.07394

 


Thank you again for your thoughtful review. We hope these clarifications and updates address your concerns and further strengthen your confidence in the work.

Comment

Thanks. My concerns have been largely addressed.

Review (Rating: 5)

The authors consider using a particle filter (PF) based approach for the problem of inference-time scaling of language models (LMs). Specifically, the authors use process reward models (PRMs) in the PF, stochastically resampling sequences with higher PRM scores. The authors compare this against other inference-time methods such as best-of-n and beam search, testing empirically across a range of datasets (math, chemistry, finance) and models (LMs and PRMs), showing significantly improved performance with the PF method.

Strengths and Weaknesses

Strengths:

Paper tackles an important problem setting. Ideas presented are reasonable, worth exploring, and could lead to significant advances in state-of-the-art. The theory/math that I checked (not everything) seems correct.

Good consideration of baselines, alternatives, ablations. Good evaluation over a wide range of different models (both generator/base LMs and PRMs). Fairly comprehensive experimental results across several different datasets. Experimental results appear to show very strong performance (though see comments below).

Paper is reasonably structured and not too hard to understand.

Weaknesses:

Line 47 - "shift-from" dash should be em dash

Line 97 - $x_{1:T}$ is improperly formatted

Lines 140-141: "Deterministic search can irreversibly discard trajectories that may have low early scores but high posterior mass overall" - actually this is true of particle filters as well. If you have a limited number of particles (which is always the case in practice) and high enough scores assigned by your PRM, then resampling can concentrate on only those high-score sequences and discard the low-score sequences. And once this happens, i.e., you have resampled away some early completions that had low scores, you also cannot recover from it, since your propagation only generates subsequent tokens and cannot revise previous tokens.

Lines 145-146 - please use em dashes for readability

Line 152 - PG should be defined here instead of later; here is the first usage

Would be nice if you had MCTS as a baseline that you also compared against (this is a minor point).

Having no confidence intervals in any of the results is a big negative. It's understandable that an academic budget limits the number of experiments you could do, but at least for the final results (I'm not asking for the hyperparameter search/exploration phase), even if only for a few selected runs, you could do at least 3 seeds. This would at least give the reader a rough idea of variability/confidence interval width/statistical significance. With no confidence intervals/statistical significance tests, it's difficult to evaluate the validity and significance of results (having them isn't perfect either, but is much more informative than a single run).

Lines 210-213 - I appreciate testing on more benchmarks; that is good. However, I find the reasoning and exposition regarding these testing generalization beyond mathematical reasoning questionable. These are different domains, but they still rely on mathematical reasoning to a large degree. The authors themselves note that the NumGLUE task "targets numerical reasoning in scientific contexts." Similarly, FinanceBench (https://github.com/patronus-ai/financebench/blob/main/data/financebench_open_source.jsonl) has a lot of numerical reasoning based questions. So I would argue that instead of, as the authors claim in lines 229-230 that "such RMs implicitly learn broader reasoning evaluation capabilities during their training, not limited strictly to mathematical content", there's a simpler explanation, which is that Qwen2.5-Math-PRM-7B as the reward model works well for these tasks because they're close enough to mathematical reasoning tasks.

Lines 260-263 - if using a PRM as an ORM works best, then why is the whole paper focused on PRMs? It seems to me the authors should compare against using ORMs directly in their PF.

Lines 301-309 - "TSMC relies on a ground-truth verifier"; this is false, TSMC can also be used with ORM/PRMs. "requires joint training of a generator and value function" - this is also not really correct, the generator (proposal) can be trained as in Zhao et al. (authors' reference [33]) but does not have to be trained; e.g., you could just use the base LM. The value function also does not necessarily have to be trained, for example, as in Lew et al. (authors' reference [15]). "The authors of TSMC did not release code" - this is also false, see for example https://github.com/Silent-Zebra/twisted-smc-lm for the code for [33]; if this comment is for only reference Feng et al. [8], then this should be clarified. Finally, with respect to comparing vs [8], for a fair comparison, you'd have to use the same PRM (I believe [8] uses weaker PRMs).

PF/SMC-related works are the most relevant to this paper, and it's unclear how much originality this paper adds (the core novelty appears to me to stem from the application of PRMs in PFs to the language modelling (specifically mathematical reasoning) setting).

One other issue with the evaluations is, as the authors acknowledge, using a reward model at intermediate steps can be expensive. Baselines like BoN require only a single call to the PRM/ORM at the end, whereas the PF requires multiple intermediate calls. Therefore, while the authors control for the number of generated tokens/samples, the authors are likely giving the PF significantly more wall-clock time or total FLOPs. Ideally, the authors would be able to demonstrate that either under similar FLOPs or wall-clock time, PF still outperforms baselines.

Questions

Figure 4c - why is the greedy line below the 2^0 point? Does this suggest that greedy is worse than a single random sample? And if so, why does Figure 4d show that greedy is better than a single random sample? Are these quirks of the models? Is the plot wrong? Or does random sampling have high variance (something that confidence intervals would be very useful to get an understanding of)?

Lines 246-249 - this is an interesting finding, and somewhat unintuitive. Why do the authors think this is happening? Also, a related question - why do the authors think that the Qwen PRM performance stagnates after just 4 samples?

Overall, I'm on the fence about this work. There's a lot that I like, but there's also a lot that I don't like. I think the results could be quite significant for the broader community, but I have doubts about the results. I'm choosing borderline accept instead of borderline reject because I think the strength of the positives outweighs the negatives. If the authors address all my points convincingly I'd consider increasing to accept.

Limitations

Authors do discuss limitations, which is good, but only briefly; there is much more the authors could discuss (e.g., my points above).

Authors claim the work has no potential negative social impacts, which I don't think is fair. The critical assumption made here is that better LM performance (particularly on reasoning tasks, and for smaller models) can only lead to positive outcomes. While I would agree that the potential positives outweigh potential negatives, so overall the work would have positive social impact, there is potential for risks such as harm from more capable models in the hands of bad actors.

Final Justification

As I mentioned in my comments, I am happy with the discussion with authors and have an improved evaluation of the paper now. I have increased to accept (but this is conditional on/assuming the authors making all the changes they have outlined in their responses).

Formatting Issues

No major formatting concerns.

Author Response

Thank you for your detailed review and for highlighting the importance of our problem setting, the soundness of our approach, and the strength and breadth of our experimental evaluation - we appreciate your careful assessment. We address additional comments below.

 

Deterministic Search vs. Particle Filtering: Thank you for the thoughtful comment. While it's true that particle filtering can also discard low-scoring trajectories, the key distinction is that it does so stochastically rather than deterministically. This means there is always a non-zero probability of retaining low-scoring (but potentially correct) paths, unlike deterministic search which prunes them irrevocably. As a result, particle filtering offers greater robustness to early reward noise, especially under imperfect PRMs, by maintaining exploration over diverse hypotheses.

An example of this behavior is illustrated in Figure 2 of our paper, which shows a real example of the full PF method: the first step of the trajectory that ultimately leads to the correct final answer receives the lowest initial score. Yet, because PF retains a non-zero chance of continuing such paths, this trajectory survives and ultimately yields the correct solution.
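As a toy numerical illustration of this non-zero survival probability (the scores below are invented for illustration and are not from the paper): with four particles, deterministic top-2 pruning drops the low-scoring path with certainty, whereas multinomial resampling over softmax weights keeps it alive a substantial fraction of the time.

```python
import numpy as np

rng = np.random.default_rng(0)
prm_scores = np.array([0.15, 0.80, 0.75, 0.70])   # particle 0: low early score, possibly correct

# Deterministic top-2 pruning always discards particle 0.
kept = np.argsort(prm_scores)[-2:]

# Stochastic resampling with replacement over softmax weights sometimes keeps it.
probs = np.exp(prm_scores) / np.exp(prm_scores).sum()
survival = np.mean([
    0 in rng.choice(len(prm_scores), size=len(prm_scores), p=probs, replace=True)
    for _ in range(10_000)
])
print("deterministic keeps:", kept, "| P(particle 0 survives resampling) ≈", round(survival, 2))
```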

 


MCTS: While both Particle Filtering (PF) and Monte Carlo Tree Search (MCTS) aim to balance exploration and exploitation, they differ fundamentally in design and suitability for inference-time scaling in LLMs. To directly compare the two, we implemented the MCTS baseline from Zhang et al. and evaluated it against PF using the same base model (Llama 3.1 8B Instruct) and compute budget (32 rollouts). On the AIME 2024 benchmark, PF achieved 16.6% accuracy, significantly outperforming MCTS, which achieved only 6.6%. We will include this empirical comparison and accompanying discussion in the revised version.

References:
Zhang, D., et al. (2024). Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B.

 


Confidence Intervals: Thank you for this important suggestion — we completely agree. To address this, we ran BoN, Beam Search, and Particle Filtering across 3 different seeds using 3 different generator models on the AIME 2024 benchmark. Below, we report mean accuracy along with standard deviation (±), which we will include in the revised version to better reflect result variability and statistical significance.

| Model | Method | AIME 2024 |
| --- | --- | --- |
| Qwen-2.5 1.5B Instruct | BoN | 0.033 ± 0.019 |
| Qwen-2.5 1.5B Instruct | Beam Search | 0.122 ± 0.022 |
| Qwen-2.5 1.5B Instruct | Particle Filtering (Ours) | 0.155 ± 0.022 |
| Qwen-2.5 Math 1.5B Instruct | BoN | 0.122 ± 0.011 |
| Qwen-2.5 Math 1.5B Instruct | Beam Search | 0.167 ± 0.019 |
| Qwen-2.5 Math 1.5B Instruct | Particle Filtering (Ours) | 0.222 ± 0.011 |
| Qwen-2.5 Math 7B Instruct | BoN | 0.144 ± 0.011 |
| Qwen-2.5 Math 7B Instruct | Beam Search | 0.222 ± 0.011 |
| Qwen-2.5 Math 7B Instruct | Particle Filtering (Ours) | 0.300 ± 0.019 |

 


Additional Benchmarks: Thank you for the observation — we agree that many questions in FinanceBench and NumGLUE involve significant numerical reasoning, and that this likely contributes to the strong performance of our math-trained PRM. While these tasks also include elements such as reading comprehension and scientific context, we acknowledge that the overlap with mathematical reasoning is substantial. We will revise the discussion to better reflect this and avoid overstating the generality of the PRM.

 


PRM vs. ORMs: We appreciate the question. While "PRM as ORM" scores entire partial trajectories, it is still fundamentally different from using an outcome reward model (ORM). PRMs are explicitly trained to evaluate partial solutions step-by-step, whereas ORMs are trained to judge only final outputs. Since particle filtering requires scoring and resampling based on intermediate generations, PRMs remain the appropriate choice—even when used in aggregation modes resembling full-sequence evaluation. We will clarify this distinction in the revision. Notably, as we show in Figure 5b, other methods for using PRM scores, including the Last score, also perform very well.
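As a concrete sketch of the aggregation modes referred to here (illustrative code, not the paper's implementation; the mode names echo the ablation in Figure 5b): a step-level PRM yields a vector of per-step scores for a partial trajectory, and different reductions of that vector give the weight used for resampling, whereas an ORM only ever scores a finished answer.

```python
import math
from typing import List

def aggregate_prm_scores(step_scores: List[float], mode: str = "last") -> float:
    """Collapse per-step PRM scores of a partial trajectory into a single weight."""
    if mode == "last":    # use only the most recent step score
        return step_scores[-1]
    if mode == "min":     # the weakest step dominates
        return min(step_scores)
    if mode == "prod":    # multiplicative accumulation across steps
        return math.prod(step_scores)
    raise ValueError(f"unknown aggregation mode: {mode}")

# A trajectory whose first step looked weak but whose later steps recover:
scores = [0.3, 0.9, 0.95]
print(aggregate_prm_scores(scores, "last"), aggregate_prm_scores(scores, "prod"))
```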

 


TSMC: Thank you for this thoughtful and detailed clarification—we appreciate the opportunity to improve the precision of our comparisons to TSMC. You are correct that our original statements could be clarified further.

  • Our original statements referred specifically to the formal setup introduced in Feng et al. [8], which trains an optimal value function via Contrastive Twist Learning. This approach relies on both a trained generator and supervision from a ground-truth verifier, as detailed in Sections 3.4, 4, and Appendix C of [8]. You are correct that TSMC doesn't require a ground-truth verifier at inference time—the verifier is only used during training to shape the twist function.

  • This highlights a key difference in our premises: while TSMC (e.g., [8], [33]) focuses on learning theoretically optimal twist functions, our work assumes that PRMs are inherently imperfect and noisy. We ask a different question—how to best scale LLM inference given an off-the-shelf PRM—with a method that requires no training or tuning, offering greater robustness in practice.

  • You're absolutely right that Zhao et al. [33] released their code. Our comment specifically referred to Feng et al. [8], which we focused on since it applies TSMC to math. To our knowledge, their implementation and models are not publicly available—we’ll clarify this in the final version.

  • Regarding empirical comparisons, we acknowledge that [8] uses a trained value function and argues that off-the-shelf PRMs are weaker. Our goal was to highlight that even under this disadvantage—using a fixed off-the-shelf PRM (Qwen2.5-Math-PRM-7B)—our method outperforms [8], achieving 75.4% on MATH500 with 128 samples, compared to 60.8% in [8] using 240 samples, a trained generator, and a trained value function.

We’ll revise the final version to clearly reflect that [8] is focused on learning optimal PRMs, while our work addresses how to conduct inference-time search when the PRM is fixed and noisy.

 


Runtime Exploration: Thank you for the thoughtful comment. We ran wall-clock time comparisons on a 50-question subset of MATH500 using Qwen2.5-Math-1.5B-Instruct and vLLM as the inference engine:

  • BoN (budget=32): 211.69 sec
  • BoN (budget=64): 392.14 sec
  • Beam Search (budget=32): 549.48 sec
  • Particle Filtering (budget=32): 401.63 sec

We find that PF at budget 32 takes comparable time to BoN at budget 64, while achieving higher accuracy (84.6 vs. 82.8), indicating a favorable compute–accuracy tradeoff.

Regarding FLOPs: in theory, PF, BoN, and Beam Search have similar cost under perfect KV-cache reuse (where past token KV-caches are fully reused in decoding). However, in practice, step-wise inference leads to low KV cache hit rates because inference engines like vLLM aren't optimized for token-by-token generation. This causes duplicated compute and increased latency. Importantly, these differences are driven more by current engineering constraints than by the underlying algorithm.

(As further evidence of how much these engineering implementation details can dominate: running DVTS, a better version of beam search, took days.)

We will include these latency results and a discussion of the practical vs. theoretical tradeoffs in the revised version.

 


Greedy Lines in Figure 4c: The small discrepancy in Figure 4c arises because greedy decoding uses temperature 0, whereas the multi-sample methods—including the 1-sample case at $2^0$—use temperature 0.8. Thus, the "1 sample" point reflects a stochastic sample that may occasionally outperform deterministic greedy decoding, depending on the model. In the 1.5B case (Fig. 4c), this occurred; for the 7B model (Fig. 4d), greedy slightly outperformed the sample.

That said, the differences are minute—on the order of 0.005 to 0.01 in accuracy—and well within the expected variance from sampling.

 


Performances of Different PRMs:

  • Thank you for highlighting this insightful finding.
  • EurusPRM-Stage2 improves at higher budgets likely due to greater robustness to noise with more particles:
    • At low budgets, PF relies heavily on accurate early-step scoring, and EurusPRM may be less reliable at early steps due to noisier or less calibrated partial scores.
    • With more particles, PF can better tolerate this noise—allowing good trajectories to survive and improve final accuracy.
  • Qwen2.5-Math-PRM-7B stagnates early possibly because of limited answer diversity:
    • Most viable completions are already surfaced with a few samples. Additional samples contribute diminishing returns.
  • We note that this ablation used only 100 MATH500 questions (vs. 500 in main results), so variability is higher, especially at small budgets.
  • We’ll clarify these points in the revision.

 


Social Impact: Thank you for raising this important point. We will revise our broader impact statement to note that while our method improves inference robustness, it could amplify harm if misused in sensitive applications.

 


We will be sure to fix all formatting related concerns in the camera ready version of this work.

We appreciate your thoughtful feedback. We hope that our clarifications on the novelty of our approach and the strength of our experimental results have further reinforced your confidence in the significance of this work.

Comment

Thanks for your response! The additional discussion is helpful, and I appreciate your additional results, which increase my confidence in your findings and conclusions.

Quotes abbreviated below for character limit.

PRM vs. ORMs

Yes, ORMs are only trained to judge final outputs, but they can also provide scores over partial sequences. One might expect the PRM to be better on partial sequences since that's what it's explicitly trained for, but I think it would be good to have empirical validation of this. This is probably getting a bit off topic from the point of the paper though so we can stop discussing this.

TSMC section

Overall I think much more, and much clearer, discussion needs to be had with regards to this relationship. I agree with the other reviewer APTW who also highlights the connection to particle based methods as a weakness that needs to be better addressed and clarified.

Our original statements referred specifically to the formal setup introduced in Feng et al. [8], [...]

My suggestion would be to dedicate an entire paragraph, possibly even an entire subsection, to clarifying the relationship with TSMC. If you are referring to a specific work, then avoid lumping in other works in the same bucket. But I wouldn't recommend referring too much to a specific work either, without also comparing to other closely related works (I mean specifically from the conceptual/theoretical standpoint here).

This highlights a key difference in our premises: while TSMC (e.g., [8], [33]) focuses on learning theoretically optimal twist functions, our work assumes that PRMs are inherently imperfect and noisy. We ask a different question—how to best scale LLM inference given an off-the-shelf PRM—with a method that requires no training or tuning, offering greater robustness in practice.

The first question you ask is the same one that [8] focuses on - how to best scale LLM inference given an off-the-shelf PRM. If the key difference is in avoiding training/tuning, this is the part that should be highlighted. And what do you mean by "greater robustness" here?

You're absolutely right that Zhao et al. [33] released their code. Our comment specifically referred to Feng et al. [8], which we focused on since it applies TSMC to math. To our knowledge, their implementation and models are not publicly available—we’ll clarify this in the final version.

As I mentioned above, it's best not to conflate different works and lump all TSMC works in the same bucket. You should compare with TSMC works broadly; I encourage general comparison, but I discourage making comments that come across as general comparison but are actually based on a specific work.

Regarding empirical comparisons, we acknowledge that [8] uses a trained value function and argues that off-the-shelf PRMs are weaker. Our goal was to highlight that even under this disadvantage—using a fixed off-the-shelf PRM (Qwen2.5-Math-PRM-7B)—our method outperforms [8], achieving 75.4% on MATH500 with 128 samples, compared to 60.8% in [8] using 240 samples, a trained generator, and a trained value function.

This wasn't what I was trying to get at with my original comment. My point was that you're using different off-the-shelf PRMs, and that your off-the-shelf PRM is likely quite a bit better than the off-the-shelf PRM that [8] uses, which I think makes the comparison unfair, and advantaged in favor of your work.

We’ll revise the final version to clearly reflect that [8] is focused on learning optimal PRMs, while our work addresses how to conduct inference-time search when the PRM is fixed and noisy.

As far as I'm aware, [8] is not about learning optimal PRMs. Twist functions are not the same as PRMs. You could try to use them interchangeably, but they're trying to learn different things; optimal twists are defined in relation to the target distribution which also generally contains the base model as part of it. Twist functions are akin to value functions in soft RL (see [33] for the theoretical details of this connection), which are not the same as PRMs. [8] discusses a connection to PRMs in 3.3, noting some similarities and other key differences.

I get the feeling the authors have not really understood the connection between their work and existing TSMC works, both from the original paper and from the follow-up comments. I would appreciate if the authors could really dig into this and better understand and discuss. I would be happy to discuss this with the authors in more detail, but would like to see the authors do a better, much more detailed attempt first.

Runtime Exploration

Thanks for these results. They are great, and I think should go in the main paper (with confidence intervals on accuracy).

That said, the differences are minute—on the order of 0.005 to 0.01 in accuracy—and well within the expected variance from sampling.

Confidence intervals for this plot (based on sampling variance) would be great to address this.

Comment

Thank you for your thoughtful response. We completely agree that a deeper dive into the relationships between this paper and related SMC work is very necessary.

 

Here, we elaborate on works our reviewers have mentioned:

[18] - Syntactic And Semantic Control Of Large language Models Via Sequential Monte Carlo

[8] - Step-by-step Reasoning For Math Problems Via Twisted Sequential Monte Carlo

[33] - Probabilistic Inference in Language Models via Twisted Sequential Monte Carlo

 

We’d like to begin by acknowledging that particle-based Monte Carlo methods are a foundational tool with deep roots in probabilistic inference, and that many recent papers—including ours, [18], [8], and [33]—build on this shared framework. No single work lays exclusive claim to these tools; rather, each adapts them to its own application domain and practical constraints. The differences across works often stem from how the SMC procedure is instantiated for a particular task.

 

[18] - This work explores token-level SMC within probabilistic programs, targeting constrained generation. While our work shares some high-level conceptual parallels, its goals are quite different. We focus on step-level particle filtering for inference-time scaling, emphasizing robustness to noisy, imperfect reward models—especially when applying LLMs to diverse reasoning problems.

[33] - This work introduces the Twisted SMC (TSMC) framework, showing how to learn intermediate twist functions to approximate target distributions that are defined solely via terminal rewards. This paper presents a general and theoretically rich foundation, proposing the contrastive twist learning (CTL) approach to train twist functions in language modeling scenarios. Its contributions lie largely in the development of this theoretical framework.

[8] - This follow-up to [33] applies the TSMC framework to multi-step symbolic reasoning in math problem-solving. [8] adapts TSMC to the math domain by:

  • Training a separate proposal model (generator) to sample high-quality candidate trajectories;
  • Using a ground-truth verifier to score correctness of full solutions;
  • Learning a twist function via CTL that approximates expected future reward, trained on those samples.

They show that TSMC outperforms beam search and other decoding strategies in this setting. Since [8] operates in the same domain as our paper—mathematical reasoning—we focus our detailed comparison on this work going forward, referring to it as “TSMC.”


 

Our work and TSMC both leverage sequential resampling and particle-based sampling, but differ meaningfully in motivation, assumptions, and implementation. While both use particle methods, Twisted SMC and standard particle filtering are distinct algorithms, and we summarize the core differences below:

 

  1. Different Motivations and Assumptions

TSMC seeks to improve sampling efficiency by learning an optimal twist function—a modification of the target distribution that ideally reduces the variance of importance weights. To learn this twist, [8] and [33] train a value function using ground-truth supervision. (They also show that using this value function outperforms off-the-shelf, frozen PRMs.) This is in strict contrast to our work that assumes a fixed, imperfect process reward model (PRM) and explores how to use it most effectively at inference time—without training any additional components. This distinction is important: we frame inference-time scaling as posterior inference under noisy supervision, not as amortized variational inference over a reweighted posterior.

 

  2. Value Function vs PRM

TSMC aims to train an optimal twist function, which is formally a value function proportional to the expected final reward conditioned on a partial trajectory. While both value functions and PRMs can provide step-level or token-level signals, they are different theoretical objects, serving different purposes and trained under different objectives. The value function used in TSMC is specifically trained to approximate expected future reward, which enables it to reweight trajectory distributions toward those that lead to correct final outputs. In contrast, our work uses fixed PRMs that do not always capture expected future correctness and our method does not rely on supervision or make assumptions about their optimality (PRMs output correctness of a single step). Instead, we treat PRMs as potentially noisy and imperfect, and the particle filtering method is explicitly designed to be robust in such settings.

 

  3. Training-Free vs Learned Components (continued in second response)

 

  4. Practical Use Cases (continued in second response)

 

Where [8] and our Particle Filtering work meet: (continued in second response)

Comment

(continued from above response)

 

  3. Training-Free vs Learned Components

TSMC methods like [8] and [33] typically involve training one or more components - the twist function and the proposal policy—using supervised data or synthetic feedback. Our method is fully training-free: we assume both the generator (LLM) and scorer (PRM) are frozen, and aim to use them effectively at inference time. This enables use in settings where fine-tuning is expensive or unavailable.

 

  4. Practical Use Cases

TSMC is well-suited to domains where high-quality ground-truth supervision is available and can be leveraged to learn twist functions. Our method targets the complementary regime: settings where no ground-truth verifier is available, and the only feedback comes from a fixed, imperfect PRM. In such scenarios, training a twist function may be brittle or costly, whereas our particle filtering with PRMs approach offers robustness via resampling and exploration. This benefit is evident not only in our results on mathematical reasoning, but also in our experiments on chemistry and finance datasets—although these were numerical domains, the PRMs were far from optimal, and yet particle filtering still achieved strong performance. By framing inference-time scaling as particle filtering under noisy supervision, we enable strategies such as Particle Gibbs and Parallel Tempering for improved exploration.

 


Where [8] and our Particle Filtering work meet:

The main results of [8] and their theoretical differences with our work are highlighted above. The main method introduced in their paper steers the weighting and resampling steps with a trained value function, which is a key departure from our approach that relies entirely on fixed, potentially noisy process reward models without any learned components.
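Stated compactly in symbols (notation assumed for illustration rather than copied from either paper): the twist in TSMC is trained to approximate the expected terminal reward of a prefix, while the particle filter here reweights with fixed, possibly noisy step-level PRM scores.

```latex
% Learned twist in TSMC [8, 33]: approximately the expected terminal reward of a prefix
\psi_t(x_{1:t}) \;\approx\; \mathbb{E}\left[\, R(x_{1:T}) \mid x_{1:t} \,\right]

% Particle filtering with a frozen PRM (this work): weights accumulate local step scores
w_t^{(i)} \;\propto\; w_{t-1}^{(i)}\, \hat{r}\!\left(c, x_{1:t}^{(i)}\right)
```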

 

However, in section 4.4 of [8], the authors test out different versions of their method to study the impact of variance on estimators - instead of using the weight from the trained value function, Variants A and B swap out the trained value function for a pre-trained PRM in two different ways.

 

  • Variant A: Step Correctness PRM (from PRM800K) where they feed the PRM’s step level prediction directly into the sampler.

  • Variant B: Value Estimate PRM - treat the PRM’s aggregated process reward as a surrogate value function.

 

The authors later note that the Variant A step-correctness option under-performs because it ignores generator likelihood, while the Variant B value-estimate option does better but still trails the learned twist. The Variant A described in 4.4 is equivalent to a minimal particle filter that (i) keeps only the most recent PRM score as its weight and (ii) runs just one SMC sweep. It therefore captures the core skeleton of PF, but in practice that means Variant A and PF behave the same only if the PF configuration is last-step weights, one sweep, no Gibbs or tempering.

 

Importantly, while [8] observes that using step correctness scores from a PRM to guide the SMC process significantly underperforms their primary TSMC implementation and alternative variants (see Figure 3 in [8]) - citing the loss of generator likelihood as a key limitation - our findings demonstrate a contrasting outcome. Despite relying on a noisy PRM, our particle filtering approach achieves strong performance, highlighting its robustness and practical effectiveness in inference-time settings.

 


We acknowledge the conceptual overlap in particle-based reasoning and appreciate the reviewer’s feedback that this connection should be made clearer. We will add a dedicated subsection outlining this relationship and clarifying that our work is not proposing a new particle method, but rather adapting standard SMC to a setting with imperfect stepwise supervision (PRMs), in contrast to twist-based sampling with learned functions.

Comment

I thank the authors for their continued engagement and detailed response. I appreciate the deep dive the authors did and now believe the authors have a much better understanding. While I still have minor issues with the framing, overall I think this discussion is good, and would be happy to see the authors include this discussion in their final version.

Overall, the discussion with the authors has improved my evaluation. Whereas previously I was on the fence between giving "borderline reject" or "borderline accept", and just barely gave "borderline accept", I am now on the fence about giving "borderline accept" or "accept".

Review (Rating: 5)

The paper formulates PRM-guided LLM inference as a state-space model and applies particle filtering to infer a posterior distribution over the optimal trajectory. Trajectories sampled from this posterior are scored with an outcome reward model and the best trajectory is returned. The method achieves higher performance on a range of reasoning tasks while using the same computational budget.

Strengths and Weaknesses

Strengths:

  • The method is theoretically sound and grounded in the principles of probabilistic inference.
  • The empirical performance gains in the experiments are significant and consistent across different settings.
  • Multiple ablations show that the performance remains high across a range of hyperparameter choices.

Weakness:

  • It is not fully clear to me how the methods would compare in terms of runtime. For the results in Table 1 and 2, it is mentioned that the budget (i.e. number of generated sequences) was the same for all methods. Does this also hold for the runtime? It is mentioned earlier that the total runtime is comparable to generating N full completions independently. This should be more than the runtime of beam search, if I understand it correctly.

  • The conclusion mentions:

We hope that the formal connection of inference scaling to probabilistic modeling established in this work will lead to systematic solutions for current limitations of these methods and pave the way for brining advances probabilistic inference algorithms into LLM inference-time scaling in future work.

Applying techniques from probabilistic inference to LLM inference-time scaling has already been explored in other work:

Uncertainty-Guided Optimization on Large Language Model Search Trees (Grosse et al., 2024)

The settings differ to some extent (and obviously the probabilistic inference techniques also differ), but claiming that this connection is newly established in this work is nevertheless an exaggeration.

Questions

  • Which stopping criterion for the searches was used in empirical comparison? I would suspect that this has a strong influence on the performance in reasoning tasks.

Limitations

yes

Final Justification

All questions were addressed and the authors provided additional results regarding the runtime. I am not 100% confident in the runtime-efficiency of the proposed method, but since Reviewer wjrz seems to think that the new results regarding the runtime are great, I am also not 100% confident that I did not misread something. All in all, I still lean towards acceptance due to the paper being theoretically well motivated.

Formatting Issues

no formatting concerns

Author Response

Thank you for your thoughtful and constructive review—we’re grateful for your recognition of the paper’s theoretical grounding, strong empirical results, and robust ablations across hyperparameters.


Runtime Exploration: Thank you for the thoughtful comment. We ran wall-clock time comparisons on a 50-question subset of MATH500 using Qwen2.5-Math-1.5B-Instruct and vLLM as the inference engine:

  • BoN (budget=32): 211.69 sec

  • BoN (budget=64): 392.14 sec

  • Beam Search (budget=32): 549.48 sec

  • Particle Filtering (budget=32): 401.63 sec

We find that PF at budget 32 takes comparable time to BoN at budget 64, while achieving higher accuracy (84.6 vs. 82.8), indicating a favorable compute–accuracy tradeoff.

Regarding FLOPs: in theory, PF, BoN, and Beam Search should have similar cost under perfect KV-cache reuse (where past token KV-caches are fully reused in decoding). However, in practice, step-wise inference leads to low KV cache hit rates because inference engines like vLLM are not optimized for token-by-token generation. This results in duplicated compute and increased latency. Importantly, these differences are driven more by current engineering constraints than by the underlying algorithm.

(As further evidence of how much these engineering implementation details can dominate: running DVTS, a better version of beam search, took days.)

We will include these latency results and a discussion of the practical vs. theoretical tradeoffs in the revised version.

Related Work: Thank you for pointing this out and for highlighting the relevant prior work. We’ll revise the conclusion to better reflect this and adjust our claims accordingly.

Stopping Criterion: Thank you for the question. For all methods, including particle filtering and the baselines, we used a consistent stopping criterion: each generation runs until the <EOS> token is produced or the model hits a predefined maximum length. This ensures that completions are full solutions and not prematurely truncated. We will clarify this detail in the revised version.
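A minimal sketch of this criterion (the `step_generate` callable, token name, and budget below are placeholders rather than values from the paper):

```python
def generate_until_done(step_generate, prompt: str,
                        eos_token: str = "<EOS>", max_new_tokens: int = 2048) -> str:
    """Extend a trajectory step by step until <EOS> appears or the length budget is hit."""
    text, new_tokens = prompt, 0
    while new_tokens < max_new_tokens and eos_token not in text:
        step = step_generate(text)            # one stepwise LLM call (placeholder)
        text += step
        new_tokens += len(step.split())       # crude token count, for the sketch only
    return text
```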


Thank you again for your thoughtful comments. We hope the above information further solidifies your support of this work.

Comment

Thanks a lot for the clarifications and additional results!

I think in order to thoroughly support a claim of a favorable runtime–accuracy trade-off it would be necessary to show the entire Pareto front of runtime budget and accuracy (including error bars, since as is I am not sure whether a 2% increase in accuracy is impressive or not). However, as the method is theoretically well justified I will maintain my high rating nevertheless.

Review (Rating: 4)

Inference-time scaling approaches based on deterministic rules suffer from early pruning -- pruning promising sequences due to incorrectly assigning low scores. This paper proposes a particle-based Monte Carlo method to alleviate this problem. The idea is to keep $N$ candidate incomplete sequences at each step, sample new tokens for every sequence from an LLM, re-weight each new sequence using a PRM, and finally re-sample with replacement from the candidate pool using the corresponding weights. This process is designed to trade off exploration and exploitation by using PRMs to exploit but using sampling to explore. Experimental results are reported for the Qwen and Llama families of LLMs with a Qwen-based frozen PRM. The proposed sampling approach shows improvements on several reasoning benchmarks including math, chemistry, and finance compared to several baselines. On MATH500, the approach improves open-source Qwen-2.5 Math 7B Instruct to reach the performance of o1-preview. While using other PRMs still gives promising results, a strong PRM is necessary for best performance. The approach is also shown to be robust to different sampling temperatures.

Strengths and Weaknesses

Strengths

  1. The paper is written clearly and is easy to follow. The motivation is clear and the approach is explained in clear language.

  2. The approach outperforms several baselines on multiple benchmarks while being simple to implement.

Weaknesses

  1. The approach is similar to other particle-based MC methods. In particular, [18] also introduces a particle-based MC sampling approach for inference-time scaling. Based on Eq. (4) in their paper, if you define the potential function as $\phi(x_{<t}, x_t) = \exp(\sum_{i<t} w_i + w_t)$, where the $w_i$ are defined as in your paper, you get a very similar rejection sampling approach. Currently, the related work lacks a detailed comparison and needs improvement to highlight the key differences. If they are closely related, they should also be added as baselines in the empirical comparison.

  2. Why do you update $w_t$ multiplicatively? While you define $r$ as a probability, the weights are treated merely as unnormalized logits; you use a softmax over them for re-sampling. Multiplicative weights make the softmax more complicated and require at least some ablation studies for justification.

  3. Some details are missing. Please correct/address the followings:

  • $w_t$ is initialized to 1.0.
  • Why is it a proportionality rather than an equality in $w_t \propto w_{t-1} \hat{r}$?
  • Sampling is without replacement.
  • In line 88, you define the Bernoulli distribution using $x_t$, but the left side has $x_{<t}$.
  • In Eq. (2), a subscript is missing; in line 111, a subscript/superscript is missing from the summation.

Questions

  1. Please compare to relevant work on MC methods in more detail and add related baselines to empirical comparison.

  2. Why are weights updated multiplicatively rather than additively?

  3. Please correct or address other issues explained in weaknesses.

Limitations

Yes

Final Justification

The authors clarified most of my concerns.

Formatting Issues

N/A

Author Response

Thank you for your thoughtful review. We address your comments below.

 

Comparison to Other Works: Thank you for the thoughtful comment.

  • We agree that particle-based Monte Carlo methods are a shared foundation across multiple works and note that particle filtering is a decades-old concept with long-standing use in probabilistic inference. As such, none of these papers - including ours or [18] - are unique in applying such techniques to language models.

 

  • Despite their shared theoretical inspirations, the goals of the two works are quite distinct.

    • [18] focuses on token-level SMC within probabilistic programs for constrained generation.

    • Our work develops step-level particle filtering for inference-time scaling, designed to be robust to noisy and imperfect reward functions when scaling LLM inference across diverse reasoning tasks.

 

  • Because [18] addresses a different setting and is not directly applicable to our evaluation domains (e.g., MATH, AIME), we did not include it as a baseline. We instead compare to and outperform TSMC for Math [8], which also uses SMC for LLM reasoning and is closer in scope to our work. We exceed [8]'s performance by 14.6 points using half the budget on the same dataset (MATH500) and model (DeepSeekMath7B), and we do so without needing fine-tuning or ground-truth verifiers.

 

  • We also implemented the MCTS baseline from Zhang et al. and evaluated it against PF using the same base model (Llama 3.1 8B Instruct) and compute budget (32 rollouts). On the AIME 2024 benchmark, PF achieved 16.6% accuracy, significantly outperforming MCTS, which achieved only 6.6%. We will include this empirical comparison and accompanying discussion in the revised version.

 


Updating $w_t$ multiplicatively

Thank you for raising this point — you're absolutely right that the choice to multiply weights across steps deserves clarification.

  • We use a multiplicative update so that each particle’s weight reflects the quality of the entire partial trajectory, not just the most recent step. This aligns with standard particle filtering practice, where the total weight captures cumulative likelihood.

  • That said, we acknowledge that multiplicative updates could potentially compound noise, so we explore alternative reward aggregation schemes (e.g., using only the final step score, minimum score, etc.) as ablations in Appendix A and Section 4.5 (see Lines 131–135) and show that these offer different trade-offs between stability and expressivity.

We’ll make this connection more explicit in the main text.
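A short sketch of the multiplicative update and the softmax-based resampling it feeds into (illustrative code; the exact aggregation used in the paper is the subject of the ablations cited above):

```python
import numpy as np

def reweight_and_resample(weights, step_scores, rng):
    """Multiply in the new PRM step scores, then resample particles with replacement."""
    weights = weights * step_scores                  # multiplicative update: w_t ∝ w_{t-1} * r̂_t
    probs = np.exp(weights - weights.max())          # softmax over unnormalized weights
    probs /= probs.sum()
    idx = rng.choice(len(weights), size=len(weights), p=probs, replace=True)
    return idx, weights[idx]

idx, w = reweight_and_resample(np.ones(4), np.array([0.2, 0.9, 0.7, 0.6]),
                               np.random.default_rng(0))
print(idx, w)
```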

 


Other Details

  • $w_t$ is initialized to 1.0:

    • Yes, each particle’s initial weight $w_t$ is initialized to 1.0. We will make this explicit in the revised algorithm description.
  • Proportionality rather than equality in $w_t \propto w_{t-1} \hat{r}$:

    • We use a proportionality symbol rather than equality to reflect that the weights are unnormalized at each step. The normalization is performed later during the resampling step via a softmax over all particle weights. We will clarify this in the text to avoid confusion.
  • Sampling is without replacement:

    • Thank you for raising this point. To clarify, our resampling is performed with replacement, consistent with standard practice in particle filtering. This means that a particle at step $t$ can be selected multiple times and continued by multiple particles at step $t+1$. An example of this behavior is illustrated in Figure 2 of our paper. We will make this explicit in the revision.
  • Line 88 Bernoulli Distribution:

    • Thank you for the careful reading - you're absolutely right that the emission distribution as written is formally incorrect. It should be written as $p(o_t \mid c, x_{\leq t})$, since the emission probability depends explicitly on the newly sampled token $x_t$.
    • In our model, the reward function $r(c, x_t)$ directly determines the likelihood of observing $o_t$, and thus the correct conditioning set includes $x_t$. We’ll revise the notation to reflect this.
    • This kind of dependency is common in state space models, where although emissions are conditioned on past latent states, they also rely on the current state sampled at each time step.
  • Subscript Formatting:

    • Thanks for catching this. We will fix both issues in the final version.

 

References: Zhang, D., et al. (2024). Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B.


Thank you again for your thoughtful review. We sincerely hope these clarifications and updates address your concerns and strengthen your confidence in the work.

Comment

Thank you for the clarifications. I updated my score accordingly.

Comment

To Reviewer APTW:

Thank you again for your thoughtful and constructive review. We wanted to follow up and share that we’ve addressed each of your concerns in our rebuttal, particularly:

  • Comparison to [18]: We clarified that while both approaches use particle-based MC methods, the goals and evaluation domains differ significantly. Our method focuses on robust inference-time scaling for LLMs using noisy reward models, whereas [18] focuses on SMC for constrained generation with ground-truth supervision. We've also added stronger baselines (including MCTS from Zhang et al.) to support this distinction.

 

  • Multiplicative weight updates: We now explain more clearly that our use of multiplicative updates aligns with standard particle filtering to reflect the entire partial trajectory. We also included ablations on alternative update schemes to show the trade-offs involved.

 

  • Clarifications and fixes: All the smaller issues you raised (initialization, sampling strategy, Bernoulli notation, and missing subscripts) have been fixed or clarified in the revised draft.

 

We hope these revisions help clarify the novelty and rigor of our method. If anything remains unclear or if you have additional thoughts, we’d be really grateful for your continued feedback. We’d also appreciate it if you would consider updating your score if you now have more confidence in the contribution. Thank you again for your time and valuable feedback!

Final Decision

This paper introduces a novel and effective method for inference-time scaling of LLMs by adapting particle-based Monte Carlo methods to address the issue of "early pruning" in deterministic search. The proposed particle filtering approach robustly maintains a diverse set of candidate solutions, balancing exploration and exploitation to achieve impressive performance gains on various reasoning tasks. The work is well written, theoretically sound, and supported by a comprehensive set of experiments demonstrating a significantly better scaling rate than existing baselines. While the results are quite strong, the paper could be further improved by expanding its discussion of related work to more clearly differentiate the approach from other sequential Monte Carlo methods (like TSMC) and search algorithms (like MCTS), perhaps including them as empirical baselines. Additionally, a more detailed analysis of runtime efficiency and confidence intervals for the results would add further weight to the already compelling findings. Nevertheless, the paper presents a valuable and impactful contribution to the field. I recommend accepting this paper.