PaperHub
Overall score: 6.8 / 10
Decision: Rejected (4 reviewers)
Reviewer ratings: 4, 5, 4, 4 (min 4, max 5, std 0.4)
Confidence: 3.3
Novelty: 3.0
Quality: 3.0
Clarity: 3.3
Significance: 3.0
NeurIPS 2025

Optimal Policy Minimum Bayesian Risk

OpenReview · PDF
Submitted: 2025-05-11 · Updated: 2025-10-29
TL;DR

A new method to integrate a reward model, a risk/similarity function and a generator for inference scaling. The method is based on the optimal policy and admits an efficient version.

Abstract

Keywords
Inference Scaling · Best-of-N · Minimum Bayesian Risk · Optimal Policy

Reviews and Discussion

Review
Rating: 4

This paper introduces Optimal Policy Minimum Bayesian Risk Decoding (OP-MBRD), a decoding algorithm that generalizes standard minimum Bayes risk decoding (MBRD) using KL-regularized RL. The method defines an optimal policy distribution that interpolates between the model’s own generation distribution and a reward-weighted distribution. The authors derive an importance-sampled estimator for expected similarity between outputs under this optimal policy and show asymptotic consistency (Eq. 14).
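For readers less familiar with this line of work, the summary above can be made concrete with the standard KL-regularized forms it alludes to. The notation below is assumed; the paper's exact Eq. 6–11 and Eq. 14 may differ in details such as normalization:

```latex
% Closed-form optimal policy of the KL-regularized objective (assumed notation):
\pi^{*}(y \mid x) \;=\; \frac{1}{Z(x)}\, p_R(y \mid x)\,
  \exp\!\big(R(y, x)/\beta\big)

% MBRD over \pi^{*}, approximated by self-normalized importance sampling with
% candidates y_1, \dots, y_N drawn from the generator p:
\hat{y} \;=\; \arg\max_{y_i} \sum_{j=1}^{N} w_j\, S(y_i, y_j),
\qquad
w_j \;\propto\; \frac{p_R(y_j \mid x)}{p(y_j \mid x)}\,
  \exp\!\big(R(y_j, x)/\beta\big)
```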

Building on this, the paper proposes a sample-efficient variant that adjusts the number of generated candidates based on an interpretable acceptance-weighted metric. Experiments on MATH-500 and HumanEval using open-source models (Qwen, Granite, Phi-4) show that OP-MBRD performs robustly across model and task settings, often matching or outperforming baselines such as best-of-N sampling and MBRD with reward scores. The method is particularly effective under lower-resource settings or for weaker models, is modular and can be readily integrated with existing decoders.

Strengths and Weaknesses

Strengths:

  1. Proposes a novel MBRD formulation based on KL-regularized optimal policy inference, integrating reward modeling and similarity-based decoding in one framework (Section 4.1, Eq. 6–11).

  2. Derives a self-normalized importance-sampling estimator with convergence guarantees, generalizing both majority voting and reward-weighted decoding (Eq. 11, Eq. 14).

  3. The efficient variant (OPE-MBRD) introduces a principled mechanism for adaptive sampling based on a soft estimate of example difficulty, improving compute efficiency when the reward model is well-calibrated (Figure 1 right).

  4. Experimental results show that OP-MBRD generally performs comparably or better than BoN and MBRD(EM*R), particularly for smaller models and in low-sample regimes (Figure 1, Figure 2).

Weaknesses:

  1. The method’s performance is sensitive to PRM calibration. For example, using a math-trained PRM on coding tasks in HumanEval leads to poor adaptation by OPE-MBRD (Figure 2 right), reducing its reliability under domain mismatch.

  2. All experiments fix the reference model to be the generator, limiting the method’s ability to exploit stronger teacher models or mixture-of-agents style configurations. This undermines part of the theoretical generality of OP-MBRD.

  3. The paper does not compare against recently competitive inference-time techniques such as speculative decoding, stepwise verifier-guided reranking, or tree-of-thought search. This makes it difficult to assess how OP-MBRD compares against stronger baselines beyond BoN and standard MBRD.

  4. It seems like Section 4 could use some additional figures or algorithmic pseudocode. The estimator derivation is difficult to parse without a strong RL background.

Questions

Have you tried using pR ≠ p, e.g., by using a stronger teacher model or an ensemble as the reference? This would better demonstrate the flexibility of the OP-MBRD framework.

In cases where PRMs are misaligned (e.g., math PRM applied to code), could the model detect or correct for this dynamically? Would adaptive β tuning help?

Can you provide any timing comparisons or latency overhead metrics for OP-MBRD vs BoN and MBRD(EM*R)? How does wall-clock cost scale?

Could OP-MBRD be integrated with multi-step inference strategies (e.g., beam search or tree-of-thought reasoning)? Would the optimal policy still be applicable at intermediate steps?

Have you explored the method’s robustness under noisy PRMs? Some form of regularization may be necessary for stable deployment in open-ended domains.

Limitations

Yes.

Formatting Issues

NA.

Author Response

We thank the reviewer for their constructive comments. We provide inline responses below and answer the questions at the end.

The method’s performance is sensitive to PRM calibration

We agree that we introduce an additional parameter β, but in general β = 0.1 shows robust performance in most scenarios. In fact, the Phi-4 results are very similar for β = 0.1, which leaves Qwen-1.5b + Qwen-PRM as the only case where β needed to be far from the nominal 0.1. This is also an understandable case, since the model is rather small and the Qwen-PRM exceptionally good. It should also be pointed out that the dev-adjusted value generalized well to test.

All experiments fix the reference model to be the generator,

We agree this is a valid criticism. We would like to raise the following points:

  • The reference policy plays a lesser role in the use of the KL-controlled objective for inference scaling / decoding compared to its standard use in RL, since there are no weight updates and thus less risk of the policy drifting towards reward hacking.

  • Even if we do not provide empirical results, its behavior is mathematically well defined. For a reward of zero, it converges to conventional MBRD from the reference policy (approximated by self-normalized importance sampling, see Section 4.4). For a non-zero reward, OP-MBRD reweights these samples to account for the effect of the reward, and this relation is mediated by the parameter β (see the short sketch after this list).

  • We did perform initial experiments using Granite as the generator and Phi-4 and Phi-4-reasoning as reference policies. We observed performance above BoN but below MBRD. We felt further exploring this line would be beyond the scope of the paper, but we can complete and include this analysis in the final manuscript.
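As a consistency check on the two cases above (the notation here is assumed and may differ from the paper's equations):

```latex
w_i \;\propto\; \frac{p_R(y_i \mid x)}{p(y_i \mid x)}\,
  \exp\!\big(R(y_i, x)/\beta\big)
\quad\Longrightarrow\quad
\begin{cases}
R \equiv 0: & w_i \propto p_R(y_i \mid x)/p(y_i \mid x)
  \quad \text{(self-normalized IS estimate of MBRD under } p_R\text{)}\\[4pt]
R \not\equiv 0: & \text{reward-based reweighting, with the strength mediated by } \beta
\end{cases}
```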

not compare against recently competitive inference-time techniques such as speculative decoding, stepwise verifier-guided reranking, or tree-of-thought search

We believe the main strength of the proposed approach is its simplicity. It adds little complexity to BoN/MBRD while being more performant across scenarios, and it comes with a well-founded derivation, guarantees, and an efficient version. The methods mentioned above can be seen as complementary and can be combined with OP-MBRD. For example, speculative decoding allows generating from an expensive generator while using a cheaper one to propose completions. This could be used in MBRD/BoN or OP-MBRD to speed up generation, but it serves a different purpose. Regarding step-wise and tree-of-thought methods, see the answers to the questions at the end.

The estimator derivation is difficult to parse without a strong RL background.

Indeed, this can be improved. We added a complete derivation as additional material in the original submission (we should have indicated this in the manuscript). We can move this to the appendix, add an algorithm block description, and add more details / appendix references in the main paper body. Feedback on possible improvements to the additional material is also welcome.
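As an illustration of what such an algorithm block might look like, here is a minimal sketch of the OP-MBRD loop for the p_R = p case discussed elsewhere in this thread. The names `generate`, `reward`, and `similarity` are placeholders for the generator, PRM, and similarity/risk function, and the exact weighting is an assumption for illustration, not the paper's Eq. 11:

```python
import math

def op_mbrd(prompt, generate, reward, similarity, n=64, beta=0.1):
    """Minimal OP-MBRD sketch (assumed form, p_R = p case)."""
    # 1) Sample N candidates from the generator policy.
    candidates = [generate(prompt) for _ in range(n)]
    # 2) Score each candidate with the (process) reward model.
    rewards = [reward(prompt, y) for y in candidates]
    # 3) Self-normalized importance weights towards the optimal policy.
    #    With p_R = p the density ratio cancels and only exp(R / beta) remains.
    unnorm = [math.exp(r / beta) for r in rewards]
    total = sum(unnorm)
    weights = [u / total for u in unnorm]
    # 4) MBRD: return the candidate with the highest weighted expected similarity.
    def mbrd_score(y_i):
        return sum(w * similarity(y_i, y_j) for w, y_j in zip(weights, candidates))
    return max(candidates, key=mbrd_score)
```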

Questions

Have you tried using pR ≠ p,

Yes, see the comment above about Phi-4 / Phi-4-reasoning.

In cases where PRMs are misaligned (e.g., math PRM applied to code), could the model detect or correct for this dynamically? Would adaptive β tuning help?

We did look at the range of PRM values for the coding tasks, and the PRM assigned clearly lower values on average than for math tasks. This is probably an effect of the PRM working out-of-domain. Results still improve notably, so the PRM is not totally ineffective. Tuning β specifically for code may help, but tuning the PRM for coding tasks is likely to help calibration more. A PRM that is well tuned to a generator and a task, as in the Qwen experiments, seems to have large positive effects on performance, particularly on the savings attained by the efficient version of OP-MBRD.

Can you provide any timing comparisons or latency overhead metrics for OP-MBRD vs BoN and MBRD(EM*R)? How does wall-clock cost scale?

The main difference regarding latency is between MBRD methods (including OP-MBRD, MBRD(EM*R), etc.) and BoN methods.

  • MBRD requires computing pair-wise similarity between outputs, which scales quadratically with N. This is not a problem for e.g. math, where the similarity function is an exact match and very fast. It can get slower for other metrics like ROUGE and large N. This can, however, be sped up with caching and pre-computation of operations such as tokenization.

  • BoN methods and the MBRD(EM*R) / OP-MBRD variants also require evaluation with an RM or a PRM. This is linear in N but can be expensive if the reward model is big. (A short sketch of both cost profiles follows this list.)
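A toy illustration of the two cost profiles, assuming hypothetical `similarity`, `tokenize`, and `reward` callables (not from the paper):

```python
def pairwise_similarity(candidates, similarity, tokenize):
    # O(N^2) comparisons; tokenization is pre-computed once per candidate
    # so the quadratic loop only pays for the comparison itself.
    toks = [tokenize(y) for y in candidates]
    n = len(candidates)
    return [[similarity(toks[i], toks[j]) for j in range(n)] for i in range(n)]

def reward_scores(prompt, candidates, reward):
    # One reward-model call per candidate: linear in N, but each call can be
    # expensive if the reward model itself is large.
    return [reward(prompt, y) for y in candidates]
```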

Could OP-MBRD be integrated with multi-step inference strategies (e.g., beam search or tree-of-thought reasoning)? Would the optimal policy still be applicable at intermediate steps?

This is a great point; there are two main possibilities:

  • It is indeed possible to use the same principle for intermediate steps. The only difference is that the normal use of MBRD implies an argmax over the Q score, whereas in this case it is more useful to take a top-k and discard the lowest-scoring components, as would be done in a beam search algorithm (see the sketch after this list). Regarding ToT, the same applies to state evaluation during decoding.

  • Further, if any of these methods produce N > 1 hypotheses at the end of decoding, normal OP-MBRD can be used for final post-processing. Importance weights from ancestral sampling can be accumulated in the same way as is done for the PRM in normal OP-MBRD (see Section 4.2).
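A hypothetical sketch of the top-k pruning mentioned in the first bullet, scoring partial hypotheses by a weighted MBRD score as a beam search would. This is assumed code for illustration, not the paper's procedure:

```python
import math

def topk_by_weighted_mbrd(partials, rewards, similarity, beta=0.1, k=4):
    # Reward-based weights over the current partial hypotheses.
    unnorm = [math.exp(r / beta) for r in rewards]
    total = sum(unnorm)
    weights = [u / total for u in unnorm]
    # Q score: weighted expected similarity of each hypothesis to the others.
    def q(hyp):
        return sum(w * similarity(hyp, other) for w, other in zip(weights, partials))
    # Keep the top-k instead of a single argmax, as in beam search.
    return sorted(partials, key=q, reverse=True)[:k]
```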

Have you explored the method’s robustness under noisy PRMs? Some form of regularization may be necessary for stable deployment in open-ended domains.

It is true that the dependency on an RM's ability to generalize across domains is a limitation of both the baselines and the presented method. The experiments cover 3 different PRMs with different performance levels and, given the results, one could argue that the variance in PRM performance affects baselines like MBRD(EM*R) more than OP-MBRD itself.

Comment

Thanks for the rebuttal. I will maintain my positive score.

Review
Rating: 5

This paper proposes Optimal Policy Minimum Bayes Risk Decoding (OP-MBRD) for LLMs. MBRD aims to decode from a distribution that maximizes some utility, for example a reward function, in expectation. For entropy-regularized RL with a relative-entropy penalty to some reference distribution, it is well known that the optimal policy admits a closed form. Due to the intractable sum over the string space V^+, it requires one to sample from an unnormalized energy-based model. This paper leverages techniques from previous work on sampling from unnormalized energy-based models to develop OP-MBRD and its Rao-Blackwellized version for LLM decoding.
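To unpack the intractability point above, the self-normalized importance-sampling approximation that makes expectations under the optimal policy tractable can be written as follows (assumed notation; the paper's exact formulation may differ):

```latex
% Unnormalized target over the string space V^+:
\tilde{\pi}(y) \;=\; p_R(y \mid x)\, \exp\!\big(R(y, x)/\beta\big),
\qquad
Z \;=\; \sum_{y \in V^{+}} \tilde{\pi}(y) \;\; \text{(intractable)}

% Self-normalized importance sampling with y_1, \dots, y_N \sim p(\cdot \mid x):
\mathbb{E}_{\pi^{*}}\!\left[f(y)\right]
\;\approx\;
\sum_{i=1}^{N} \frac{\tilde{w}_i}{\sum_{j=1}^{N} \tilde{w}_j}\, f(y_i),
\qquad
\tilde{w}_i \;=\; \frac{\tilde{\pi}(y_i)}{p(y_i \mid x)}
```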

Strengths and Weaknesses

Strengths

  • Very clear and enjoyable writing

  • Thorough discussion of related work

  • Elegant solution with minimal implementation overhead

  • Strong empirical performance on MATH and HumanEval

Weaknesses

  • Not a weakness, but I am not particularly familiar with the string of works mentioned in L171 (DPG, GDC++, RSO [Liu et al., 2024], BRAIn). So I am not sure if there are any similar works worth noting or comparing against.

Questions

N/A

Limitations

N/A

Final Justification

Good paper, I'll keep my positive evaluation.

Formatting Issues

N/A

Author Response

We thank the reviewer for their constructive comments. We provide inline responses below and answer the questions at the end.

not particularly familiar with the string of works mentioned in L171 (DPG, GDC++, RSO [Liu et al., 2024], BRAIn). So I am not sure if there are any similar works worth noting or comparing against.

This line of methods is indeed less well known than other RL alternatives such as PPO and DPO. They are mentioned here because they provide the inspiration for the proposed method. It should, however, be noted that all these approaches concern reinforcement learning, while the approach presented here is an inference scaling / decoding technique. The overlap is merely in leveraging the fact that one can sample from the optimal policy via importance sampling / rejection sampling. The RL approaches use it to derive an approximate gradient update rule for the KL-controlled reward maximization objective (the same objective from which PPO and DPO are derived). The approach presented here uses it to perform Minimum Bayesian Risk Decoding from the optimal policy, thus yielding a well-defined way to integrate a generator policy, a similarity metric, and a reward model for inference scaling.

Review
Rating: 4

This paper introduces Optimal Policy Minimum Bayesian Risk Decoding (OP-MBRD), a novel decoding method for large language models that integrates minimum Bayes risk decoding with concepts from KL-controlled reinforcement learning. The proposed approach offers three primary contributions: (1) improved robustness and accuracy over standard MBRD and best-of-N sampling, especially under performance gaps between generators and reward models; (2) an asymptotic guarantee that OP-MBRD converges to MBRD over a reward-weighted distribution; (3) a sample-efficient variant (OPE-MBRD) that dynamically adjusts the number of generated samples based on problem difficulty using expected optimal policy samples. Experiments on MATH-500 and HumanEval benchmarks demonstrate consistent gains across diverse open-source models (Qwen, Granite, Phi-4), particularly for well-calibrated generator-reward model pairs. The method retains the simplicity of traditional MBRD while adding only one hyperparameter (β).

Strengths and Weaknesses

Strengths

  • Clarity. The paper is well-written and logically structured. Technical concepts are clearly explained.
  • Theoretical contribution. The paper effectively bridges KL-regularized RL (optimal policy derivation) and MBRD, providing a principled reward-integration framework.
  • Significance. OPE-MBRD’s adaptive sampling reduces compute costs without accuracy loss, addressing key inference-scaling challenges.
  • Experiments. Thorough evaluation across various model sizes (1.5B–20B) and diverse tasks (MATH-500/HumanEval), demonstrating consistent gains.

Weaknesses

  • Statistical robustness. The result in Fig 1/2 lacks variance quantification (e.g., error bars/confidence intervals), limiting reproducibility assessment.
  • Novelty. The proposed OP-MBRD primarily combines BoN and MBRD paradigms. The theoretical formalization is rigorous but it minimally extends beyond prior MBRD-with-rewards concepts.

Questions

  • Will you add error bars/standard deviations to results (e.g., Fig 1 pass@1 curves) to clarify performance stability?

  • Can OP-MBRD handle larger LLMs (e.g., 90B llama 3.2)? Would sample diversity effects differ at scale?

Limitations

Yes.

Final Justification

The paper proposed a novel decoding method for large language models that integrates minimum Bayes risk decoding with concepts from KL-controlled reinforcement learning. Most of my concerns are resolved during the rebuttal period. Given the mediocre algorithmic novelty of the proposed method, I recommend weak acceptance.

Formatting Issues

No

Author Response

We thank the reviewer for their constructive comments. We provide inline responses below and answer the questions at the end.

Statistical robustness. The result in Fig 1/2 lacks variance quantification (e.g., error bars/confidence intervals), limiting reproducibility assessment.

We agree this can be improved. The main reason to avoid error bars was visibility. To compensate, we indicated in the figure footer the maximum standard deviation observed across samples. A second reason is that our main claim is that the method matches either BoN or MBRD(EM*R) across all scenarios, while those methods are only good in a subset of those scenarios. The differences for the comparisons relevant to this claim are rather large. We will, however, explore better visualizations of data dispersion. We would also like to note that all experiments are done with multiple repetitions (e.g. 256 for N=1 or 4 for N=64).

The proposed OP-MBRD primarily combines BoN and MBRD paradigms

Indeed, the idea is rather simple. We would like to make the point that this is a strength rather than a weakness, since improving upon these well-established methods is not trivial and the idea is theoretically well founded.

Questions

Will you add error bars/standard deviations to results (e.g., Fig 1 pass@1 curves) to clarify performance stability?

We will do this. We may have to rethink how to represent these results to guarantee visibility, possibly with a separate graph for the efficient version.

Can OP-MBRD handle larger LLMs (e.g., 90B llama 3.2)? Would sample diversity effects differ at scale?

There is no limitation on the algorithm regarding model size, other than the larger cost of sampling. As pointed out in the limitations section, a possible hypothesis is that bigger models will benefit more from self-consistency, i.e., yield stronger MBRD results compared to BoN. Since OP-MBRD seems able to match the best results of MBRD/BoN across different scenarios, we can assume it will not be affected, or may even benefit, in terms of this gap.

Comment

Thanks for the reply. I have no further questions.

Review
Rating: 4

This paper introduces Optimal Policy Minimum Bayesian Risk Decoding (OP-MBRD), an inference-time method to enhance LLM reasoning by improving upon and unifying existing techniques like Best-of-N (BoN) and standard MBRD. The core contribution is a framework that performs MBRD over a theoretically grounded "optimal policy" derived from KL-constrained policy optimization over a reward model (i.e., the standard RLHF setting). As sampling from this optimal policy is intractable, the paper uses importance sampling to approximate its expectations using samples from a base generator. The paper also proposes a sample-aware variant, OPE-MBRD, that adaptively adjusts the number of samples based on problem difficulty. Through empirical validation on MATH and HumanEval benchmarks, the paper demonstrates that OP-MBRD achieves robust performance compared to baselines and that OPE-MBRD offers a favorable accuracy-compute trade-off.

Strengths and Weaknesses

Strengths

The paper's core idea is simple yet interesting. The approach of combining principles from reward model training with inference-time decoding is a promising direction. The writing is clear and easy to understand, and the set of experiments conducted on MATH and HumanEval is quite thorough, providing a solid demonstration of the proposed method's capabilities.

Weaknesses

Lack of Algorithmic Explanation and Theoretical Grounding: Although the paper is easy to read conceptually, it is hard for the reader to piece together the algorithm that was finally implemented. Including a formal algorithm block would be very useful for understanding the final method and for reproducibility. Furthermore, while the theoretical results are cited from other papers, the submission would be much stronger if it provided some brief primers to contextualize the theory and explain why the assumptions hold for this specific application.

Experimental Limitation with p_R = p: A second weakness concerns the scope of the experimental validation. The paper makes a crucial simplifying assumption by setting the reference policy p_R to be the same as the generator policy p in all experiments (line 259). This nullifies the log(p_R / p) term in the core formulation (Eq. 10), effectively reducing the method to a form of normalized, reward-weighted sampling. While acknowledged, this sidesteps a key question of how OP-MBRD performs when the generator and reference models differ, which is the standard setting in RLHF and where the framework's full potential lies, and the empirical validation seems to not be comparing against policies trained using RLHF which makes the empirical section incomplete.
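A one-line check of the reduction mentioned above (assumed notation, not the paper's exact Eq. 10): with p_R = p the density ratio in the importance weights cancels, leaving only the reward term.

```latex
w_i \;\propto\; \frac{p_R(y_i \mid x)}{p(y_i \mid x)}\,
  \exp\!\big(R(y_i, x)/\beta\big)
\quad\overset{p_R \,=\, p}{\Longrightarrow}\quad
w_i \;\propto\; \exp\!\big(R(y_i, x)/\beta\big)
```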

Questions

Algorithmic Clarification: Regarding the importance weights in Equation 9, how is the intractable max_{y'} R̃(y', x) term calculated in the general case where p_R ≠ p? The paper explains the simplification for p_R = p (line 263), but it is unclear how this would be handled otherwise. Is the maximization simply performed over the N generated samples?

Empirical Setup Clarification: This is a question about the baseline policies. The introduction discusses RL-trained models, but the experimental setup excludes them (lines 235-237). Could you please confirm if the generator models used for all methods (including baselines like BoN and MBRD) are standard instruction-tuned, or if they have undergone some training with RL for reasoning?

Limitations

Yes, limitations discussed.

Final Justification

Good paper, I will recommend acceptance hoping authors make the proposed changes.

Formatting Issues

Nothing major, but is S(N) on line 173 defined anywhere earlier in the text?

Author Response

We thank the reviewer for their constructive comments. We provide inline responses below and answer the questions at the end.

submission would be much stronger if it provided some brief primers / algorithm

We can add a full derivation of the algorithm, including results already present in other papers, to the appendix. We added such a derivation as additional material in the original submission (we should have indicated this in the manuscript). Feedback on possible improvements to this is welcome. In addition, we will add an algorithm block description and more details / appendix references in the main paper body.

simplifying assumption by setting the reference policy [...] as the generator

This is a valid criticism. We would like to raise the following points:

  • The reference policy plays a lesser role in the use of the KL-controlled objective for inference scaling / decoding compared to its standard use in RL, since there are no weight updates and thus less risk of the policy drifting towards reward hacking.

  • Even if we do not provide empirical results, its behavior is mathematically well defined. For a reward of zero, it converges to conventional MBRD from the reference policy (approximated by self-normalized importance sampling, see Section 4.4). For a non-zero reward, OP-MBRD reweights these samples to account for the effect of the reward, and this relation is mediated by the parameter β.

  • We did perform initial experiments using Granite as the generator and Phi-4 and Phi-4-reasoning as reference policies. We observed performance above BoN but below MBRD. We felt further exploring this line would be beyond the scope of the paper, but we can complete and include this analysis in the final manuscript.

not [...] comparing against policies trained using RLHF which makes the empirical section incomplete

See answer to questions below.

Questions

how is the intractable term calculated in the general case

The usual way in rejection sampling is to use a sample estimate, i.e., selecting the highest reward from the sample set. This has negligible computational cost and does not change OP-MBRD performance (it just shifts the logits). We did test this but left the results out due to space. The only effect is that the savings of the efficient version are worse for low N (compared to the fixed 1.0 limit that can be used with a PRM or any upper-bounded reward). We also did additional experiments where setting a scaling factor (α > 1), as in α · max_{y'} R(y', x), on dev for the sample estimate increases the rejection rate and allows the savings to be recovered on both dev and test. This could be useful for non-upper-bounded rewards or for pushing towards more aggressive savings in efficient mode. We can include these results in the appendix.
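A small sketch of the acceptance-style weighting described above, with the maximum reward either fixed to a known upper bound (e.g. 1.0 for a PRM) or estimated from the sample set with an optional scaling factor α. The function name and exact form are assumptions for illustration, not the paper's Eq. 9:

```python
import math

def acceptance_weights(rewards, beta=0.1, alpha=1.0, upper_bound=None):
    """Acceptance-style weights relative to an estimated or fixed max reward."""
    # For an upper-bounded reward (e.g. a PRM in [0, 1]) a fixed limit can be
    # used; otherwise fall back to the (optionally scaled) sample maximum.
    r_max = upper_bound if upper_bound is not None else alpha * max(rewards)
    return [math.exp((r - r_max) / beta) for r in rewards]
```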

have [baseline models] undergone some training with RL for reasoning?

We use models downloaded directly from Hugging Face with no further modifications. According to their technical reports, all models have undergone some level of RL training: GRPO for Qwen, DPO and PPO for Granite, and DPO for Phi-4.

Comment

Thank you for clarifying my questions. I still believe this is a good paper. Incorporating the suggested changes will improve clarity and enhance understanding. Discussing the above points would also be a valuable addition; even the extra experiments can be deferred to the appendix.

Comment

We thank the reviewer for carefully reading and responding to our rebuttal, and for the constructive suggestions. We will make sure to add the suggested details and clarifications in the final version.

Final Decision

The reviewers collectively commend the authors for a well-written and clear paper that introduces a principled and interesting approach, OP-MBRD, which elegantly unifies concepts from KL-regularized reinforcement learning with minimum Bayes risk decoding. However, a significant and recurring concern is the critical experimental limitation where the reference policy is set to be the same as the generator policy. This limitation sidesteps a core aspect of the paper's theoretical framework and fails to evaluate the method in the more general and practical RLHF setting, thereby undermining the full scope of the contributions claimed. Further weaknesses include the lack of a formal algorithm or pseudocode, a need for more context on the theoretical underpinnings, and an evaluation that omits comparisons to more recent competitive inference-time techniques. While the foundational idea is promising, these substantial concerns regarding the experiments and implementations lead to the recommendation of a weak rejection, with encouragement for the authors to address these points in a future submission.