Policy Filtration for RLHF to Mitigate Noise in Reward Models
We design a method to filter samples during RLHF to mitigate the inaccuracy of the reward model.
Abstract
Reviews and Discussion
This paper finds that the reliability of the reward model varies across responses assigned different rewards. Motivated by this observation, it proposes filtering the samples whose rewards may be unreliable to improve the signal-to-noise ratio during policy learning, resulting in Policy Filtration for Proximal Policy Optimization (PF-PPO). To choose a proper policy filtering strategy, the authors use the coefficient of determination (R²) between the rewards and actual scores on filtered samples as the metric for identifying promising strategies, since it measures how well the rewards retained by PF-PPO indicate real performance. The authors provide extensive experiments to validate the effectiveness of PF-PPO on code generation and math reasoning tasks. In code generation, PF-PPO achieves state-of-the-art performance among 7-billion-parameter models on HumanEval (+7.9%), MBPP (+0.7%), and LeetCode Contest (+10.0%), the latter being a newly created and more challenging benchmark. In math reasoning, PF-PPO yields performance increases across different reward models and benchmarks (Ape210K and CMATH).
Questions for the Authors
Please see the weaknesses above.
Claims and Evidence
The claims made in this paper are supported by clear and convincing evidence.
Methods and Evaluation Criteria
The proposed methods and evaluation criteria (e.g., benchmark datasets) make sense.
Theoretical Claims
There is no theoretical result in this paper.
Experimental Design and Analysis
The experiments look reasonable.
Supplementary Material
I didn’t read the supplementary material.
Relation to Prior Literature
This paper is relevant to the literature.
Essential References Not Discussed
None
Other Strengths and Weaknesses
Strengths:
- This paper studies a well-motivated and important problem in the RLHF and LLM community, i.e., mitigating noise in reward models for RLHF. This topic and the experiments on math reasoning and code generation are timely.
- The authors conduct experiments on multiple LLM benchmarks, including HumanEval, MBPP, LeetCode Contest, Ape210K, and CMATH, to evaluate their approach, and show improved performance compared to the baselines.
- This paper is well-written and clearly organized.
Weaknesses:
- The proposed Policy Filtration PPO approach seems very straightforward. The main idea is to generate multiple responses, use a weight vector (assigning more weight to responses with higher rewards) to combine them into a new response, and adopt this combined response in LLM fine-tuning. Is there any other novelty in this approach?
- The Policy Filtration PPO approach seems very heuristic. There is no theoretical result or derivation that guides the design of this approach.
- The authors should explain more about why the Policy Filtration PPO approach achieves better performance in experiments, beyond the simple idea of “generating multiple candidate responses and choosing the best one”. For example, why is the quality of samples (i.e., responses) important for PPO? Why does improving sample quality in PPO lead to better performance?
Other Comments or Suggestions
Please see the weaknesses above.
Author Response for Reviewer kkSM
We thank the reviewer for highlighting these points.
Novelty of this paper
While simple, PF-PPO’s contribution lies in its universal effectiveness for RLHF noise mitigation. Key innovations:
- Empirical Validation: Extensive experiments across domains (see extended results) demonstrate consistent improvements.
- Practical Impact: Addressing reward noise—a critical but underexplored RLHF challenge—via a task-agnostic approach.
We also invite the reviewer to check the new experimental results on various tasks (in “Response to All Reviewers”) and to take into account our effort to demonstrate the universal effectiveness of our method.
Lack of theoretical derivation
Though heuristic, PF-PPO is motivated by empirical evidence that mid-reward samples harm convergence (likely due to conflicting gradients). Future work will formalize this via gradient variance analysis.
Why does sample quality matter in PPO?
PPO relies on accurate advantage estimates. Noisy rewards distort advantage estimates, leading to suboptimal updates. Ablations (e.g., PF-PPO vs PPO-M) show that filtering unreliable samples stabilizes training by ensuring gradients derive from high-confidence data.
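To illustrate this point, here is a minimal sketch (not taken from the paper; function and variable names are illustrative) of standard GAE for a single response, showing how noise in one terminal reward leaks into every token-level advantage.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over one response (token-level rewards/values)."""
    advantages = np.zeros(len(rewards), dtype=float)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD error at step t
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# With a single terminal reward, perturbing it by noise eps shifts the advantage at
# step t by roughly (gamma * lam)**(T - 1 - t) * eps, so reward-model noise biases
# every PPO policy-gradient update along the response.
```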
Response to All Reviewers
We sincerely thank the reviewers for their constructive feedback and insightful suggestions.
One major advantage of our method is its simplicity, universality, and effectiveness. To further validate its broader effectiveness, we conducted experiments across diverse domains using Doubao-25k as the policy and reward model backbone. Tasks included logic reasoning, math, code generation, STEM problems, complex tasks, instruction following, knowledge QA, and language understanding. Each task has a distinct evaluation set and verifier to assess response correctness. Results (accuracy improvement over vanilla PPO) are shown below; statistically significant changes (exceeding the threshold determined by the number of test cases) are bolded. These results demonstrate PF-PPO’s consistent effectiveness across tasks.
| Task (Evaluation Set Size) | BO1 Accuracy % (Δ vs. PPO) | BO5 Accuracy % (Δ vs. PPO) |
|---|---|---|
| Logic Reasoning (1203) | 48.9 (+2.3) | 63.8 (+2.8) |
| Math (1759) | 69.7 (+1.1) | 79.9 (+2.3) |
| Code (3933) | 55.8 (-0.2) | 67.4 (+0.1) |
| STEM (4466) | 54.7 (-0.1) | 63.1 (+0.1) |
| Complex Tasks (2990) | 9.5 (+1.0) | 14.9 (+0.6) |
| Instruction Following (1525) | 49.6 (+1.7) | 59.8 (+1.8) |
| Knowledge (775) | 47.3 (+1.9) | 58.3 (+1.8) |
| Language Understanding (680) | 63.8 (+1.6) | 68.4 (+3.8) |
We thank the reviewers for their rigorous feedback, which has strengthened our analysis. The revised manuscript will incorporate these responses, clarify methodological details, and emphasize PF-PPO’s broader applicability.
This paper introduces Policy Filtration for Proximal Policy Optimization (PF-PPO), a reinforcement learning from human feedback (RLHF) method that addresses reward model noise by selectively training on samples where rewards are most reliable. Observing that reward models are more accurate for extreme (high/low) rewards than moderate ones, PF-PPO filters responses using strategies like Best-Random (BR) and Best-Worst (BW), guided by the coefficient of determination to optimize reward-actual score alignment. Evaluated on code generation (HumanEval, MBPP, LeetCode) and math reasoning (Ape210K, CMATH) tasks, PF-PPO is claimed to achieve state-of-the-art performance for 7B models, outperforming standard PPO and other baselines by reducing reward over-optimization and enhancing training efficiency through noise mitigation.
Questions for the Authors
N/A
Claims and Evidence
It is unclear to me why BW performs the best.
Is the reward model mentioned in line 144 the same as the reward model used elsewhere in the paper?
Methods and Evaluation Criteria
While R² correlates with performance, the causal link between it and policy improvement is assumed rather than proven. It seems to be calculated using SFT policy responses, and its generalization to RL-trained policies is not rigorously validated.
Theoretical Claims
No theoretical claims. Providing some fundamental theoretical results would be better.
Experimental Design and Analysis
A more detailed discussion of the experimental results would help me understand the insights behind them, for example, why the current design works and what the implications are.
Supplementary Material
No
Relation to Prior Literature
Limited new contributions
Essential References Not Discussed
No
Other Strengths and Weaknesses
It is unclear why the coefficient of determination (R²) is used, how it is calculated in the experiments, and why it measures reliability.
Other Comments or Suggestions
N/A
Author Response for Reviewer 2Fwc
We appreciate the reviewer’s questions.
Why does BW perform the best?
BW selects extreme samples with high/low rewards, which are most reliably aligned with actual scores (Fig. 1). Mid-reward samples often mix correct/incorrect elements (e.g., non-standard solutions with partial correctness), confusing the policy. BW maximizes the proportion of high-confidence samples, and thus improves learning efficiency.
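For concreteness, here is a minimal sketch of the BR and BW selection rules as described in the paper; it is an illustration with assumed names and a per-prompt sample count, not the authors' implementation.

```python
import random

def bw_select(responses, rewards):
    """Best-Worst (BW): keep the highest-reward response with probability 0.5,
    otherwise the lowest-reward one; mid-reward responses are filtered out."""
    ranked = sorted(zip(responses, rewards), key=lambda pair: pair[1])
    return ranked[-1] if random.random() < 0.5 else ranked[0]

def br_select(responses, rewards):
    """Best-Random (BR): keep the highest-reward response with probability 0.5,
    otherwise a uniformly random sampled response."""
    ranked = sorted(zip(responses, rewards), key=lambda pair: pair[1])
    return ranked[-1] if random.random() < 0.5 else random.choice(ranked)

# Example: out of 5 sampled responses per prompt, BW always trains on an extreme.
best, reward = bw_select([f"resp_{i}" for i in range(5)], [0.1, 0.45, 0.5, 0.55, 0.9])
```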
Why use the coefficient of determination (R²), why does it measure reliability, and how is it calculated in the experiments?
R² measures how well the reward model ranks samples by their true scores/quality. A filtration strategy with a high R² ensures that the remaining samples provide reliable learning signals (i.e., their rewards closely indicate the true quality of the responses). Calculation steps (a short sketch follows the list):
- Generate multiple responses per prompt using the SFT policy.
- Filter responses using the given strategy (e.g., BW).
- Compute R² between reward scores and ground-truth scores.
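A minimal sketch of one straightforward way to compute this R², with hypothetical toy numbers; scikit-learn's `r2_score` is assumed, and the exact fitting procedure in the paper may differ.

```python
from sklearn.metrics import r2_score

def strategy_r2(ground_truth_scores, reward_model_scores):
    """R^2 between ground-truth scores and reward-model scores on the samples
    retained by a filtration strategy; higher values indicate that the retained
    rewards track true response quality more reliably."""
    return r2_score(ground_truth_scores, reward_model_scores)

# Toy example: true pass/fail outcomes vs. reward-model scores of filtered responses.
print(strategy_r2([1.0, 0.0, 1.0, 0.0], [0.92, 0.08, 0.85, 0.11]))
```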
Is the reward model mentioned in line 144 the same as the reward model used elsewhere in the paper?
This was a typo: the reward model symbol written in line 144 is incorrect, and we will correct it in the revised manuscript.
This paper introduces a novel method, Policy Filtration for Proximal Policy Optimization (PF-PPO), to address the challenge of reward model inaccuracy in Reinforcement Learning from Human Feedback (RLHF). The authors propose a filtering mechanism to select samples with more reliable rewards, thereby improving the signal-to-noise ratio during policy learning. Motivated by empirical observations on the reward models, PF-PPO achieves this goal by excluding samples with moderate rewards. The method is validated on code generation and math reasoning tasks, demonstrating significant performance improvements.
Questions for the Authors
1. The method is proposed to address the inaccurate-reward issue. Why does this method also improve results when the reward is given by the oracle (cf. Table 4)?
2. Regarding computational costs, the authors state that PF-PPO and standard PPO are evaluated in a fair manner using approximately the same number of samples (including the filtered samples). Since performance is reported as the best score over the whole training process, how does PF-PPO converge compared to standard PPO? It is possible that PF-PPO converges slowly and therefore has poorer computational efficiency.
3. Are there other metrics besides R² that could be used to select filtering strategies? How do these metrics compare in terms of predicting performance?
Claims and Evidence
The claims are supported by the provided evidence.
1. The authors claim that the reward model is less reliable for responses with moderate rewards compared to those with high or low rewards. This claim is supported by empirical evidence from experiments on code and math tasks, where the reward model's reliability is analyzed across different reward regions.
2. The authors also claim that the filtering strategies can be selected by measuring the R² between the true rewards and the predicted rewards on the filtered samples. This claim is supported by monitoring this measure and the final performance across different strategies.
Methods and Evaluation Criteria
The proposed methods (PPO plus a sample filter) make sense in general, and the evaluation criteria (pass@1 and accuracy) follow the standard for the code and math tasks.
Theoretical Claims
The paper does not present explicit theoretical claims.
Experimental Design and Analysis
I have checked the experimental designs and gone through the analysis on the experimental results and computational costs.
Supplementary Material
I have gone through the supplementary material.
Relation to Prior Literature
The paper builds on the foundation of RLHF and addresses the critical issue of reward model inaccuracy. It relates to prior work on improving reward model accuracy and methods to mitigate reward over-optimization. The proposed method follows a different path to address the issue of reward model inaccuracy, i.e., to analyze which samples are associated with inaccurate rewards and filter them out. This path is interesting but seems too empirical.
Essential References Not Discussed
The paper includes a comprehensive survey of related literature. Although Deepseek-math and Qwen-math employ a combination of a dense reward model and a rule-based reward model, claiming advantages for this configuration, I want to highlight the ongoing debate spurred by the success of DeepSeek-R1. Specifically, the debate centers on whether code or math tasks truly require a learned reward model. By leveraging long-chain-of-thought (CoT) reasoning, the DeepSeek team employs only the sparse true reward for these tasks. It would greatly benefit the authors to include a discussion on this debate in their paper.
Other Strengths and Weaknesses
- S1: The idea of filtering samples based on reward reliability is easy to implement and addresses the reward inaccuracy issue in RLHF.
- S2: The authors provide extensive experiments on multiple benchmarks and different tasks to demonstrate the effectiveness of their method.
- S3: The findings are applicable to a wide range of tasks, including code generation and math reasoning.
- W1: This paper is mostly motivated by empirical observations and lacks theoretical analysis of why the method works.
- W2: The effectiveness of PF-PPO is highly dependent on the quality or the properties of the reward model. Although the paper provides experimental results showing that the property holds for reward models trained in different domains, there is no strong evidence for the effectiveness of this method in other scenarios, e.g., where the reward model is poorly trained or biased.
Other Comments or Suggestions
See the questions below.
Author Response to Reviewer DFFh
We thank the reviewer for the positive and constructive feedback.
W1: Theoretical Motivation
While PF-PPO is empirically driven, we ground our approach in the signal-to-noise ratio principle: ambiguous samples (mid-reward) introduce conflicting gradients during PPO updates, destabilizing training. By filtering these, PF-PPO reduces variance in policy updates. We will formalize this intuition in future work by analyzing gradient variance under different filtration strategies.
W2: Dependency on Reward Model Quality
We believe that our algorithm will yield improvements with different reward signals (either generated by reward models or provided by oracles), since filtering out less reliable samples should help across settings. For example, even with perfect rewards (Table 4), PF-PPO improves performance by filtering suboptimal yet correct samples (e.g., verbose code), allowing the policy to prioritize concise, high-quality solutions (see Appendix B). Please also refer to the new set of experimental results in the post “Response to All Reviewers”, which indicate that our method is robust to reward signals from different domains.
The authors introduce Policy Filtration for Proximal Policy Optimization (PF-PPO). Their key insight is that the reward signal is more useful in cases of high reward or low reward, and they design an algorithm that exploits this by filtering the samples used in PPO based on their quality as measured by some heuristic (R-squared). They also construct the LeetCode Contest benchmark, consisting of 160 weekly LeetCode problems from July 2022 to January 2024.
They evaluate their algorithm on the HumanEval, MBPP, Ape210K and CMATH benchmarks as well as a LeetCode benchmark, testing against a range of SFT, DPO and PPO methods, demonstrating strong results over their chosen baselines.
Questions for the Authors
Did the authors consider using a benchmark with less verifiable rewards as an ablation? In such a case, where the policy filtration may be less accurate, how does the method perform?
Claims and Evidence
The primary claims of the paper are the following:
- The reward model used in RLHF is primarily more reliable for exemplar samples (samples given a high score/reward).
- Because of the former, a model trained with RLHF filtering, such that the primary samples used are exemplar positive or negative samples, should yield higher performance.
These claims are supported by the following:
- Across a large number of experiments (10 outputs for 160 prompts across 10 trials), the authors show that the actual reward aligns most closely with the predicted score at both extremes of the predicted score.
- The authors demonstrate that PF-PPO outperforms the other training methods.
Methods and Evaluation Criteria
The authors apply three different filtration methods: 'best-of-N' (BoN), always selecting the top response; 'best-of-random' (BR), a 50% chance to select the best response and a 50% chance to select any other response; and 'best-worst' (BW), a coin flip on whether the best or the worst response is selected. They use the R-squared between the predicted and the actual scores and show that BR and BW yield a high R-squared, even higher than the R-squared with no filter.
The evaluated benchmarks consist of a range of math and coding benchmarks. These are fitting evaluation benchmarks, as both math and coding reward models tend to be easier to train due to how verifiable the rewards are. This attribute makes these benchmarks ideal for the authors' method, as a messy reward signal could call into question the source of a decreased performance.
Theoretical Claims
No proofs were presented in the paper; support for the claims was provided empirically by the authors (see the R-squared verification above).
Experimental Design and Analysis
I believe the experimental design with respect to the baselines and evaluation benchmarks is sound (Please see 'Methods And Evaluation Criteria')
I agree with the range of methods covered as baselines. They all focus around training an LLM to be better at a specific domain, and I do not believe other zero-shot or in-context learning examples are relevant. The authors primary contribution is an improvement over PPO and this is compared against. They also demonstrate that their approach is able to raise vanilla PPO over other methods such as DPO or BOND-SFT in, for example, HumanEval.
These results also correlate with their R-squared results, with BR and BW improving performance while BoN harms it.
Supplementary Material
I read through the appendix, but did not examine it as closely as the main paper.
Relation to Prior Literature
The key contributions of this paper are the insight that reward models tend to be most accurate at the extrema of the reward distribution, the filtration method developed to apply this insight to PPO, and the LeetCode benchmark.
Both are very relevant to RLHF and the latter will be a good, new environment for evaluating the new reasoning models.
Essential References Not Discussed
None that I am aware of.
Other Strengths and Weaknesses
Strengths:
- The method is relatively straightforward conceptually, but the authors both verify its correctness and show that leveraging it can lead to performance increases on relevant benchmarks.
- The authors verifiably demonstrate the correlation between the reward model predictions and the actual rewards in cases of high/low reward.
- A good range of benchmarks is selected, and the authors' method demonstrates gains on all of them where expected.
- The authors also introduce a new benchmark for evaluating their method.
Weaknesses:
- While testing was done primarily on benchmarks with easily verifiable rewards, it was not studied what happens when this is not the case.
Other Comments or Suggestions
N/A
Author Response to Reviewer Rkmi
We thank the reviewer for the careful investigation and fruitful discussion.
Generalization to Messy Reward Scenarios
While our experiments focus on code and math tasks with verifiable rewards, we acknowledge the importance of evaluating PF-PPO in scenarios with noisier or more subjective rewards (e.g., safety, helpfulness). Our core insight, filtering ambiguous samples to retain the high/low-reward extremes, is task-agnostic and relies only on the relative reliability of reward signals. Moreover, the additional experimental results shown in the post “Response to All Reviewers” also support this claim, since tasks such as language understanding and instruction following are associated with different levels of reward accuracy (e.g., involving subjective rewards).
The paper proposes Policy Filtration to deal with noise in reward models. Reviewers are split in their final recommendations, with the main concern being the lack of theoretical support and deeper analysis. Through discussion, reviewers who were on the fence are fine with a more positive outcome for the paper given the strong empirical evidence. The insight that reward models tend to be most accurate at the extremes of the reward distribution and the new evaluation benchmarks contribute to the LLM fine-tuning literature.