PaperHub
Score: 7.3 / 10
Poster · 4 reviewers
Ratings: 4, 5, 4, 5 (min 4, max 5, std 0.5)
Confidence: 3.0
Novelty: 3.5 · Quality: 3.0 · Clarity: 2.8 · Significance: 2.8
NeurIPS 2025

Information-Theoretic Reward Decomposition for Generalizable RLHF

Links: OpenReview · PDF
Submitted: 2025-05-06 · Updated: 2025-10-29
TL;DR

In this paper, we decompose the reward value into a prompt-free reward and a prompt-related reward from an information-theoretic perspective, and use the former to guide reward training.

Abstract

Keywords
Reinforcement Learning from Human Feedback · Reward Learning · Large Language Models

Reviews and Discussion

Review
Rating: 4

This work identifies the problem of generalization to unseen prompt-response pairs in reinforcement learning from human feedback (RLHF). They propose a decomposition of the reward: $r(x, y) = r_1(x, y) + r_2(y)$, where $r_1$ is the prompt-related reward and $r_2$ is the prompt-free reward. The authors use mutual information constraints to extract these components without requiring additional models. Using the difference in $r_2$, a new data prioritization method helps improve reward accuracy. The efficacy of this decomposition and learning algorithm is corroborated by both synthetic and practical experiments.

Strengths and Weaknesses

Strengths

  • Important Problem: Addresses a fundamental issue in RLHF - reward model generalization is crucial for practical deployment.

  • Novel Technical Approach: The decomposition method is supported by the information-theoretic formulation, and provides new insights into reward learning and evaluation. The proposed method doesn't require additional models and can be integrated into existing training pipelines.

  • Comprehensive Evaluation: Includes both synthetic datasets (length-biased and adversarial-prompt) and standard benchmarks, with comparisons to relevant baselines of vanilla Bradley-Terry (BT) training and RRM.

Weaknesses

  • Importance Sampling for Unknown Behavior Policy: As the authors mentioned in Appendix C.2, when the behavior policy for the dataset is unknown, they cannot estimate $P(y_1, y_2 \mid x)$, so they turn to a fixed weighted average over corresponding and non-corresponding prompt-response tuples. While the experimental results are good, this step lacks an intuitive explanation and theoretical guarantee. In fact, this is the most practical regime, where the behavior policy could be rule-based, human, or a large mixture of different LLMs, rather than a single LLM that allows easy probability calculation.

  • Lack of Discussion on Computational Overhead: The binary search decomposition and importance sampling add non-trivial computational costs that aren't thoroughly analyzed. The authors haven't provided a detailed analysis of the computation cost and the wall-clock time of the experiments. It is hard to tell whether, under the same training-time budget, the proposed method would still outperform the baselines.

Questions

My only question is about the computational cost compared to baselines. What is the running time of each method in the experiments shown in Table 2? If we control the running time, will your method still be the best?

Limitations

Aside from the computation cost issue, I didn't find significant limitations.

Final Justification

The authors' rebuttal addresses my concerns about the computation overhead. This issue is also raised by reviewers S9sS and rGtM, so I think the presentation of this work can be further improved. I'll keep my positive rating.

Formatting Issues

None

Author Response

Thank you very much for your time and effort in reviewing our paper. We present the responses as follows.

Importance Sampling for Unknown Behavior Policy: As the authors mentioned in Appendix C.2, when the behavior policy for the dataset is unknown, they cannot estimate $P(y_1, y_2 \mid x)$, so they turn to a fixed weighted average over corresponding and non-corresponding prompt-response tuples. While the experimental results are good, this step lacks an intuitive explanation and theoretical guarantee. In fact, this is the most practical regime, where the behavior policy could be rule-based, human, or a large mixture of different LLMs, rather than a single LLM that allows easy probability calculation.

Firstly, we want to clarify that the importance sampling technique can be used in many practical scenarios. In fact, for many online RL methods (online-DPO, online-PPO, ...), the reward model is updated based on the data generated by the trained policy and the external preference signal. This enables the reward model to provide more immediate (on-policy) preference signals, facilitating policy improvement. The importance sampling technique is a perfect fit for this setting.

Secondly, although the weighted average scheme may lead to an imprecise estimation of the expectation, the intuition of our method still holds. Either a large reward gap on a corresponding prompt or on a non-corresponding prompt will lead to a large value of $\Delta r_2$ (the prompt-free reward gap). Similar to the analysis in Section 3.3, by prioritizing samples with smaller values of $\Delta r_2$, we concentrate our updates on data that bring more preference information and are not dominated by spurious preferences, which is more advantageous for reward learning.

Lack of Discussion on Computational Overhead: The binary search decomposition and importance sampling add non-trivial computational costs that aren't thoroughly analyzed. The authors haven't provided a detailed analysis of the computation cost and the wall-clock time of the experiments. It is hard to tell whether, under the same training-time budget, the proposed method would still outperform the baselines.

My only question is about the computational cost compared to baselines. What is the running time of each method in the experiments shown in Table 2? If we control the running time, will your method still be the best?

In fact, estimating expectations over prompts with importance sampling does not incur significant computational overhead. The reasons are as follows.

  1. The importance weights are independent of the trained reward model and are an inherent property of the dataset distribution. Therefore, for each dataset, we only need to pre-sample prompts and compute probabilities once. This process does not need to be coupled with reward training, allowing experiments with different random seeds, hyperparameters, or LLM backbones to share the same importance weights.
  2. The calculation of importance weights is also time-efficient. Specifically, it only requires a constant number of forward passes of the LLM and is not autoregressive. This is significantly faster than reward training because there is no need to wait for or communicate parameter updates.

Since the importance weights are computed only once for each dataset and the overhead is not significant compared to training, we did not report a precise time comparison. In our experiments, calculating the importance weights for a dataset is faster than conducting a full naive BT reward model training on the same dataset. Amortized over different random seeds, hyperparameters, and LLM backbones, the average time cost is minimal.

If our responses have addressed your concerns, we would appreciate your consideration for a higher score. Thank you again for your time and effort in reviewing our paper.

Comment

The authors' rebuttal addresses my concerns about the computation overhead. This issue is also raised by reviewers S9sS and rGtM, so I think the presentation of this work can be further improved. I'll keep my positive rating.

Review
Rating: 5

The paper tackles a key problem in RLHF - training reward models that generalize well to unseen prompt-response pairs. The authors identify a key issue in standard reward modeling - models often overfit to response-only features, neglecting the prompt’s role and thus generalizing poorly. They propose decomposing the reward into prompt-free and prompt-related components using an information-theoretic formulation, without additional models. They then guide training by prioritizing samples with small prompt-free reward gaps, reducing spurious biases (e.g. in response length) and improving generalization. Experiments on crafted and real-world datasets show improved reward model accuracy and downstream policy performance.

Strengths and Weaknesses

Overall I think this is an excellent paper addressing a very important facet of RLHF that has not been suitably addressed by prior work, to my knowledge. Unless I have missed a major flaw in the empirical validation, I am enthusiastic about giving this paper an accept rating conditioned on explicitly adding their algorithm to the main paper, discussing its computational inefficiency, and also briefly elaborating on their “randomly sample prompts to compare reward gaps” experiment in the main paper.

Strengths:

  1. The paper clearly highlights a real weakness in current RLHF pipelines: failure to account for prompt information during reward training, leading to poor generalization. They also perform experiments on two datasets to demonstrate this, randomly sampling prompts and comparing the resulting reward gaps to those under the original prompt.
  2. The paper proposes a creative, theoretically grounded solution to the problem - introducing an information-theoretic decomposition of reward gaps into prompt-free and prompt-related components.
  3. The paper has provable results establishing the feasibility of their optimization problem and discusses a natural optimization procedure (binary search) for solving the problem without requiring extra models.
  4. The data prioritization mechanism based on prompt-free reward gaps is an easy modification of existing RLHF pipelines, and some of the practical considerations (e.g. binary clustering with EMA threshold) demonstrate thoughtful engineering.
  5. Experiments on both synthetic (length bias, adversarial prompts) and standard datasets show meaningful improvements in generalization and policy performance on multiple benchmarks (RewardBench, AlpacaEval-2, MT-Bench).
  6. By focusing reward learning on prompt-conditioned information, the paper proposes a direction likely to influence future RLHF reward model design.

Weaknesses:

  1. I think the paper does not explicitly lay out their algorithm for decomposing the reward into prompt-free and prompt-related components. While it is discussed in words, it would be very helpful if the algorithm was also explicitly laid out in the main paper to avoid ambiguity or misinterpretation.
  2. While the algorithm avoids extra models, estimating expectations over prompts with importance sampling can be computationally intensive, especially on large datasets. The paper does not report run times or efficiency comparisons with some other natural options for the step of computing the prompt-free components of the reward.
  3. (Minor) Experiments mostly revolve around length bias. I think the work would be strengthened by showing mitigation of other known biases (e.g. sentiment, verbosity, style) to confirm the breadth of the applicability of the paper’s method. I suppose that many of these are hard to quantify properly, but I hope that reasonable classifiers for these features could be good enough. I think this could also help us better understand the overly positive tone of LLMs.
  4. While improved generalization is shown empirically, the paper doesn’t analyze how the learned prompt-related rewards align with human-interpretable preferences, which would validate that the method captures meaningful prompt-response relationships. Both a quantitative interpretability study of the prompt-free and prompt-related reward models as well as a qualitative analysis of reward gaps for a few examples would be very helpful.
  5. The paper briefly shows an image illustrating an experiment comparing reward gaps between responses in a dataset under both the original prompt and randomly sampled prompts. It would be good to discuss the experiment in the main paper a bit more. Perhaps it will also help to provide some statistical significance numbers to make sure that the picture is not deluding us, since we are working in the space of the logarithms of the true binary probabilities.

Questions

Could you elaborate on the run-time and efficiency of your reward decomposition algorithm? What are some more efficient versions of the algorithm - either by approximating the expectations involved more quickly or estimating the prompt-free reward in an entirely different way?

Limitations

I think the paper’s reward decomposition algorithm is computationally intensive, and this needs to be acknowledged and discussed.

Final Justification

I am largely satisfied with the authors’ rebuttal and will maintain my score of 5. However, I still think that the authors should consider adding the algorithm to the main paper in the camera-ready version to improve clarity of exposition.

Formatting Issues

None

Author Response

Thank you very much for your time and effort in reviewing our paper. We present the responses as follows.

I think the paper does not explicitly lay out their algorithm for decomposing the reward into prompt-free and prompt-related components. While it is discussed in words, it would be very helpful if the algorithm was also explicitly laid out in the main paper to avoid ambiguity or misinterpretation.

We apologize that, due to the space limit, the pseudo-code for extracting prompt-free reward gaps is presented in Appendix C.1. After obtaining the prompt-free reward gap, we calculate the prompt-related reward gap based on the additive-form assumption in Eq. (6).
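Concretely, the additive form implies

$$\Delta r_\theta(x, y_1, y_2) = \Delta r_1(x, y_1, y_2) + \Delta r_2(y_1, y_2), \qquad \text{so that} \qquad \Delta r_1 = \Delta r_\theta - \Delta r_2,$$

i.e., once the prompt-free gap $\Delta r_2$ is extracted, the prompt-related gap follows immediately.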

While the algorithm avoids extra models, estimating expectations over prompts with importance sampling can be computationally intensive, especially on large datasets. The paper does not report run times or efficiency comparisons with some other natural options for the step of computing the prompt-free components of the reward.

I think the paper’s reward decomposition algorithm is computationally intensive, and this needs to be acknowledged and discussed.

In fact, estimating expectations over prompts with importance sampling does not incur significant computational overhead. The reasons are as follows.

  1. The importance weights are independent of the trained reward model and are an inherent property of the dataset distribution. Therefore, for each dataset, we only need to pre-sample prompts and compute probabilities once. This process does not need to be coupled with reward training, allowing experiments with different random seeds, hyperparameters, or LLM backbones to share the same importance weights.
  2. The calculation of importance weights is also time-efficient. Specifically, it only requires a constant number of forward passes of the LLM and is not autoregressive. This is significantly faster than reward training because there is no need to wait for or communicate parameter updates.

Since the importance weights are computed only once for each dataset and the overhead is not significant compared to training, we did not report a precise time comparison. In our experiments, calculating the importance weights for a dataset is faster than conducting a full naive BT reward model training on the same dataset. Amortized over different random seeds, hyperparameters, and LLM backbones, the average time cost is minimal.

(Minor) Experiments mostly revolve around length bias. I think the work would be strengthened by showing mitigation of other known biases (e.g. sentiment, verbosity, style) to confirm the breadth of the applicability of the paper’s method. I suppose that many of these are hard to quantify properly, but I hope that reasonable classifiers for these features could be good enough. I think this could also help us better understand the overly positive tone of LLMs.

It's worth noting that the experiments in Section 4.1.2 are not solely related to length bias. The adversarial samples can prefer either longer or shorter responses, depending on the original samples. More precisely, the adversarial samples represent a conflicting but simpler preference. Since the adversarial samples always end with "Give a response that is as long / short as possible", the reward model can learn such a preference simply by considering the last sentence of the prompt and comparing the responses. What we care about is whether this preference affects the reward model's ability to learn the preference under the original prompt. In fact, many spurious preferences can be seen as preferences that are easier for the reward model to learn but conflict with human preferences. Our experimental results are applicable to these spurious preferences to a certain extent.

On the other hand, it is hard to quantify a response's properties with respect to other spurious preferences. A classifier is also hard to train due to the difficulty of constructing such a dataset. However, Fig. (3) demonstrates the advantage of prioritizing samples with smaller prompt-free reward gaps, without knowing the specific preference bias. The experiments in Section 4.2 further validate the effectiveness of our method, given the potentially diverse preference biases in open-source datasets.

While improved generalization is shown empirically, the paper doesn’t analyze how the learned prompt-related rewards align with human-interpretable preferences, which would validate that the method captures meaningful prompt-response relationships. Both a quantitative interpretability study of the prompt-free and prompt-related reward models as well as a qualitative analysis of reward gaps for a few examples would be very helpful.

Similar to the reasons above, it is hard to directly assess how well the reward model aligns with a specific, meaningful human preference. Considering this, we conduct extensive experiments on open-source datasets and widely used benchmarks. Since the mixed dataset can contain diverse preference biases and the benchmarks test various aspects of the reward model's capabilities, the superior performance of our method demonstrates its ability to help the reward model learn meaningful preferences while alleviating the influence of spurious preferences.

The paper briefly shows an image illustrating an experiment comparing reward gaps between responses in a dataset under both the original prompt and randomly sampled prompts. It would be good to discuss the experiment in the main paper a bit more. Perhaps it will also help to provide some statistical significance numbers to make sure that the picture is not deluding us, since we are working in the space of the logarithms of the true binary probabilities.

We're sorry that, due to the space limit, the details of this experiment are not given in the main paper. As shown in Fig. (1) (Left), we first randomly sample 1000 prompt-response pairs from the dataset, and then randomly sample 8 non-corresponding prompts for each pair. We then calculate the reward gaps given the original prompt and the non-corresponding prompts. The blue line shows the reward gaps with the original prompt (in ascending order). The green line shows the mean values of the reward gaps under the 8 non-corresponding prompts.

We additionally provide two statistical metrics to support this. The first is the align rate, which is the ratio of non-corresponding prompts under which the preferred response is the same as under the original prompt. The second is the mean deviation rate, which is the mean value of $\frac{|\Delta r_{\text{original}} - \Delta r_{\text{non-corresponding}}|}{|\Delta r_{\text{original}}|}$. The results are as follows:

Dataset | Align rate | Mean deviation rate
SHP dataset | 0.87 | 0.28
Ultrafeedback dataset | 0.83 | 0.55
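For clarity, a minimal sketch of how these two metrics can be computed from the per-sample reward gaps is shown below (array names are hypothetical; aggregating over the 8 non-corresponding prompts per pair is one plausible reading of the definitions above):

```python
import numpy as np

def align_and_deviation(gap_original, gap_noncorresponding):
    """Compute align rate and mean deviation rate from reward gaps.

    gap_original: shape (N,), reward gap of each response pair under its original prompt.
    gap_noncorresponding: shape (N, K), reward gaps of the same pair under K randomly
    sampled non-corresponding prompts (K = 8 in the experiment above).
    """
    g_orig = np.asarray(gap_original, dtype=float)
    g_nc = np.asarray(gap_noncorresponding, dtype=float)
    # Align rate: fraction of (pair, non-corresponding prompt) combinations whose
    # preferred response matches the preference under the original prompt.
    align_rate = np.mean(np.sign(g_nc) == np.sign(g_orig)[:, None])
    # Mean deviation rate: mean of |dr_original - dr_non-corresponding| / |dr_original|.
    deviation_rate = np.mean(np.abs(g_nc - g_orig[:, None]) / np.abs(g_orig)[:, None])
    return float(align_rate), float(deviation_rate)
```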

If our responses have addressed your concerns, we would appreciate your consideration for a higher score. Thank you again for your time and effort in reviewing our paper.

Review
Rating: 4

The paper proposes to decompose the reward value into two independent components, a prompt-free reward and a prompt-related reward, so as to improve the generalizability of the reward model and its robustness to shortcuts such as length bias. Through curated datasets, they explicitly show the efficacy of the method. They also demonstrate that the proposed method achieves better accuracy on reward benchmarks and in inference-time compute settings.

Strengths and Weaknesses

Strengths

  • The paper is well-written, with a rigorous and straightforward derivation of its methods.
  • The use of a prompt-free reward for priority sampling is a well-motivated and sensible approach.
  • The authors explicitly demonstrate the algorithm's effects on a curated dataset, which is persuasive and aligns with the algorithm's design.
  • Experimental results on reward benchmarks and best-of-N (BoN) inference are also provided.

Weaknesses

  • The calculation of the prompt-free reward relies on a binary search, which involves an expectation over $x \sim P(x \mid y_1, y_2)$. Given that the reward is typically continuous, the time complexity of this approach is a concern. More details on the algorithm and its time complexity would be beneficial. Moreover, the number of samples required to accurately estimate this expectation is not specified.

  • The paper is missing a comparison with some relevant baselines. For instance, the Generalizable Reward Model (GRM) is another line of research that aims to improve the generalizability of reward models by leveraging the generative power of text. Including a comparison to GRM would provide a more comprehensive evaluation of the proposed method.

Questions

Please refer to the 'Strengths And Weaknesses' part.

Limitations

yes

Final Justification

The authors have addressed my main concern by providing an additional comparison to GRM in their rebuttal, but I tend to maintain my current score. This is solid work on removing bias in reward models, but the method's complexity may limit its practical application, especially in today's long-CoT settings.

Formatting Issues

None

Author Response

Thank you very much for your time and effort in reviewing our paper. We present the responses as follows.

The calculation of the prompt-free reward relies on a binary search, which involves an expectation over $x \sim P(x \mid y_1, y_2)$. Given that the reward is typically continuous, the time complexity of this approach is a concern. More details on the algorithm and its time complexity would be beneficial. Moreover, the number of samples required to accurately estimate this expectation is not specified.

In fact, estimating expectations over prompts with importance sampling does not incur significant computational overhead. The reasons are as follows.

  1. The importance weights are independent of the trained reward model and are an inherent property of the dataset distribution. Therefore, for each dataset, we only need to pre-sample prompts and compute probabilities once. This process does not need to be coupled with reward training, allowing experiments with different random seeds, hyperparameters, or LLM backbones to share the same importance weights.
  2. The calculation of importance weights is also time-efficient. Specifically, it only requires a constant number of forward passes of the LLM and is not autoregressive. This is significantly faster than reward training because there is no need to wait for or communicate parameter updates.

Since the importance weights are computed only once for each dataset and the overhead is not significant compared to training, we did not report a precise time comparison. In our experiments, calculating the importance weights for a dataset is faster than conducting a full naive BT reward model training on the same dataset. Amortized over different random seeds, hyperparameters, and LLM backbones, the average time cost is minimal.

In practice, we first randomly sample 16 prompts for each response pair and calculate their importance weights. However, we found that estimating the expectation with 8 prompts can achieve the same performance. We suggest that the number of prompts does not need to be large, as long as they effectively reflect prompt-free preferences. This will be added to the experimental details in the updated version.
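As a concrete illustration of this precompute-and-cache step, below is a minimal sketch under simplifying assumptions (hypothetical function and field names; prompt scores are approximated with teacher-forced log-likelihoods from a frozen causal LM, and the exact weight formula based on $P(y_1, y_2 \mid x)$ in Appendix C.2 is not reproduced here):

```python
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

@torch.no_grad()
def response_logprob(model, tokenizer, prompt, response, device):
    """log P(response | prompt) under a frozen causal LM via one teacher-forced
    forward pass (no autoregressive generation). Token-boundary handling is
    approximate and for illustration only."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids.to(device)
    logits = model(full_ids).logits[:, :-1]                      # next-token predictions
    targets = full_ids[:, 1:]
    token_logprobs = torch.log_softmax(logits, dim=-1).gather(
        -1, targets.unsqueeze(-1)).squeeze(-1)
    return token_logprobs[:, prompt_len - 1:].sum().item()       # response tokens only

def precompute_importance_weights(dataset, sampled_prompts, model_name, out_path):
    """Cache per-sample prompt weights once per dataset so that runs with different
    seeds, hyperparameters, or reward-model backbones can reuse them (sketch only)."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).to(device).eval()
    all_weights = []
    for sample, prompts in zip(dataset, sampled_prompts):        # prompts: K sampled prompts
        scores = [
            response_logprob(model, tokenizer, x, sample["y1"], device)
            + response_logprob(model, tokenizer, x, sample["y2"], device)
            for x in prompts
        ]
        weights = torch.softmax(torch.tensor(scores), dim=0).tolist()  # self-normalized
        all_weights.append(weights)
    with open(out_path, "w") as f:
        json.dump(all_weights, f)
```

The cached weights can then be shared across seeds, hyperparameters, and reward-model backbones, which is why the cost is amortized.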

The paper is missing a comparison with some relevant baselines. For instance, the Generalizable Reward Model (GRM) is another line of research that aims to improve the generalizability of reward models by leveraging the generative power of text. Including a comparison to GRM would provide a more comprehensive evaluation of the proposed method.

We choose the experiment in Section 4.1.1 and the experiment on RewardBench in Section 4.2 to compare our method with the Generalizable Reward Model (GRM). To ensure a fair comparison, we only compare with the GRM variant that has a non-linear reward head and is fully finetuned. We follow the hyperparameter settings in its original paper. The results are as follows:

Results on RewardBench with the length-biased dataset in Section 4.1.1:

Method | Chat | Chat Hard | Safety | Reasoning | Average
vanilla | 78.0 | 29.8 | 36.4 | 58.2 | 50.6
GRM w/ dpo noref | 80.1 | 29.7 | 37.2 | 58.5 | 51.3
GRM w/ sft | 80.3 | 29.2 | 37.6 | 59.3 | 51.6
Ours | 86.8 | 31.1 | 45.1 | 60.3 | 55.8

Results on RewardBench with the RLHFlow pair-preference dataset (mixed dataset, 300K samples):

Method | Chat | Chat Hard | Safety | Reasoning | Average
vanilla-8B | 0.93±0.01 | 0.50±0.02 | 0.67±0.02 | 0.78±0.03 | 0.72±0.02
GRM w/ dpo noref-8B | 0.96±0.02 | 0.57±0.01 | 0.73±0.01 | 0.85±0.03 | 0.78±0.01
GRM w/ sft-8B | 0.95±0.01 | 0.55±0.01 | 0.76±0.03 | 0.82±0.01 | 0.77±0.01
Ours-8B | 0.96±0.02 | 0.59±0.01 | 0.80±0.01 | 0.89±0.02 | 0.81±0.01
vanilla-7B | 0.90±0.02 | 0.49±0.03 | 0.60±0.01 | 0.69±0.05 | 0.66±0.03
GRM w/ dpo noref-7B | 0.94±0.01 | 0.54±0.02 | 0.67±0.02 | 0.74±0.04 | 0.72±0.02
GRM w/ sft-7B | 0.93±0.02 | 0.55±0.01 | 0.65±0.01 | 0.75±0.02 | 0.72±0.01
Ours-7B | 0.94±0.01 | 0.56±0.02 | 0.69±0.02 | 0.81±0.04 | 0.75±0.02

It's clear from the results that our method still outperforms GRM. Interestingly, GRM achieves comparable performance in the experiments with the RLHFlow pair-preference dataset (mixed dataset, 300K samples), but only minor improvements on the length-biased dataset. We suggest the reason is that GRM only enforces the language consistency of the LLM backbone but cannot handle the inherent preference bias in the dataset. For the length-biased dataset, where most of the chosen responses are longer than the rejected responses, both the SFT regularization and the DPO regularization may further exacerbate the length bias of the reward model.

If our responses have addressed your concerns, we would appreciate your consideration for a higher score. Thank you again for your time and effort in reviewing our paper.

Comment

I appreciate the response and the additional comparison to the GRM, which have addressed my main concern. After careful consideration, I tend to maintain my current score.

Review
Rating: 5

The paper proposes a reward-function decomposition that handles prompt-free and prompt-based rewards separately and independently, to increase the generalization of the reward model and improve alignment. These two independent components are extracted using information-theoretic measures. The effectiveness of the decomposed reward model is shown through a toy example and evaluations on standard open-source datasets.

优缺点分析

Strengths:

  1. The problem is interesting and novel to the best of my knowledge; generalization of the reward function in LLMs is an important research direction given the current AI landscape.
  2. Experiments on open-source datasets show superior performance of the proposed work compared to other baselines.
  3. Experiments on manually crafted datasets, including the adversarial-prompt dataset, are interesting and show the utility of prompt-free reward values.

Weaknesses:

  1. While I think Fig 1 is useful in understanding the motivation, the right-hand-side explanation is confusing and not clear to me.
  2. The interpretations of the random variables (equation 7) are not clear to me. How can you get the preference label over the prompt-free reward component when the overall preference is over the responses to a prompt inducing an unknown reward function $r_\theta$ (which is eventually learnt in preference-based learning)?
  3. I think the intuition of the proposed idea is not clearly conveyed; it would have been nice to explain the intuition before modelling the 4 random variables introduced in Eqn 7.
  4. The idea of just training the samples with small prompt-free reward gap might lead to overfitting. What happens to the samples with small prompt-dependent reward gap? Whether they are trained or not is not clear.
  5. Fig 4(a) and 4(b) are not clear to me from the visualization plots; I am not sure how to interpret them.
  6. Overall, though I like the high-level idea of training a generalizable reward function, I could not understand a lot of technical details from the paper.

Questions

  1. Can the authors please clarify the highlighted issue in Fig 1 (right) in the context of the specific example? If I understand correctly, all three prompts are alike and have the same preferred response; but does the fact that they are not explicitly seen during training still lead to lower reward gap values?
  2. In Eqn 4, is $r_1(x, y)$ the prompt-dependent reward? I believe $r_1(x, y)$ hasn't been defined anywhere.
  3. Why are the random variables in Eqn 7 modelled using Bernoulli distributions? It isn't clear to me what these variables mean (I did take a look at the appendix).
  4. In Eqn 5, does $y_1$ refer to the preferred response and $y_2$ to the non-preferred response, or do both of these refer to different preferred responses for the same prompt?

Limitations

Yes

Final Justification

I have read the rebuttal and understood the technical concerns that I had. The responses were detailed, which helped in clarifying other aspects of the work.

I have likewise increased my score.

Formatting Issues

None

Author Response

Thank you very much for your time and effort in reviewing our paper. We present the responses as follows.

While I think Fig 1 is useful in understanding the motivation, the right-hand-side explanation is confusing and not clear to me.

Can the authors please clarify the highlighted issue in Fig 1 (right) in the context of the specific example? If I understand correctly, all three prompts are alike and have the same preferred response; but does the fact that they are not explicitly seen during training still lead to lower reward gap values?

In the example provided in Fig. 1 (right), the 3 prompts are different, and their corresponding response pairs are also different. Moreover, only in-distribution prompt-response pairs (connected with solid lines) are used to train the reward model, while the user might query the preference on out-of-distribution prompt-response pairs (connected with dashed lines) in the subsequent alignment stage.

The chosen response in the blue box lists the main applications of LLMs in bullet points, while the rejected response in the blue box describes them in a single paragraph. It's clear that the former response is more preferred given the prompt in the blue box, and the latter is more preferred given the prompt in the yellow box. As the prompt in the green box asks for the procedure of RLHF, the two responses in the blue box can be seen as equally bad.

When the reward gap depends overly on the responses, changing the prompt will only lead to minor differences (for the response pair in the blue box, $\Delta r = 3.2$ given the blue prompt, $\Delta r = 2.5$ given the yellow prompt, and $\Delta r = 2.7$ given the green prompt). This conflicts with the true preference mentioned before. More importantly, such mistakes are not caused by unseen prompts. The reward model is well-behaved on the yellow and green prompt-response pairs (drawn on the left, whose responses are different), but loses the ability to generalize the information in the prompt to non-corresponding prompt-response pairs. This catastrophic result is due to overdependence on the responses.

The interpretations of the random variables (equation 7) are not clear to me. How can you get the preference label over the prompt-free reward component when the overall preference is over the responses to a prompt inducing an unknown reward function $r_\theta$ (which is eventually learnt in preference-based learning)?

We can't get the preference labels for the prompt-free reward; otherwise, there would be no need to extract it using the information-theoretic approach based on Theorem 1 and Theorem 2. We could simply fit a Bradley-Terry reward model using its preference labels.

In fact, our core contribution is that, after defining prompt-free preference in a way that aligns with our intuition (in Eq. (7)), we find a tractable and lightweight method to extract the prompt-free reward. Such a method only requires the value of $\Delta r_\theta$.

I think the intuition of the proposed idea is not clearly conveyed; it would have been nice to explain the intuition before modelling the 4 random variables introduced in Eqn 7.

For the Bradley-Terry model, given a prompt-response tuple $(x, y_1, y_2)$, it will prefer $y_1$ with probability $p = \sigma(r(x, y_1) - r(x, y_2))$, and prefer $y_2$ with probability $\sigma(r(x, y_2) - r(x, y_1))$ (which is $1 - p$). Such a preference label exactly matches the definition of a Bernoulli random variable, and is all we can get from the preference data.
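As a tiny illustration of this modelling choice (a sketch with a hypothetical reward function `r`, not the paper's implementation):

```python
import numpy as np

def sample_bt_preference(r, x, y1, y2, rng=None):
    """Sample a Bradley-Terry preference label for a fixed (x, y1, y2).

    r(x, y) is a hypothetical scalar reward function. Returns 1 when y1 is
    preferred, which happens with probability sigma(r(x, y1) - r(x, y2)).
    """
    rng = rng or np.random.default_rng()
    p = 1.0 / (1.0 + np.exp(-(r(x, y1) - r(x, y2))))   # sigma(Delta r)
    return int(rng.random() < p)                        # Bernoulli(p) draw
```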

Clearly, our goal is to decompose $r_\theta$ into $r_1$ and $r_2$. To achieve this, we need to identify the connection between their preferences. Mathematically, this means finding the relation between the Bernoulli random variables inside the preference labels.

Intuitively, the prompt-free preference is independent of any specific prompt and only tells which response is preferred in general. For example, "the longer response is better" is a kind of prompt-free preference. This is because, for any given prompt, the longer response is preferred and therefore considered generally better. Therefore, we incorporate the conditional expectation (based on $P(X \mid Y_1, Y_2)$) in the definition of $\tilde{W}$ in Eq. (7), in order to characterize the general probability that $y_1$ is preferred. The definitions of the other three Bernoulli random variables are natural.
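One plausible way to read this construction (a paraphrase for intuition only, not necessarily the exact statement of Eq. (7) in the paper) is

$$\tilde{W} \mid (y_1, y_2) \;\sim\; \mathrm{Bernoulli}\!\left(\mathbb{E}_{x \sim P(x \mid y_1, y_2)}\!\left[\sigma\big(r_\theta(x, y_1) - r_\theta(x, y_2)\big)\right]\right),$$

so that $\tilde{W}$ depends only on the response pair, with the prompt averaged out under $P(X \mid Y_1, Y_2)$.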

One can expect that these four random variables, each representing a different kind of preference, are deeply connected from an information-theoretic perspective. We provide an intuitive illustration of these connections in Fig. (2), and formalize this intuition using the constrained optimization objective in Eq. (9). After that, we propose a lightweight method to solve this objective and achieve our goal.

The idea of just training the samples with small prompt-free reward gap might lead to overfitting. What happens to the samples with small prompt-dependent reward gap? Whether they are trained or not is not clear.

Such overfitting is unlikely to happen. The reason is that a sample is not updated more than once in a single iteration. At each step, we continuously sample data and retain those with $\Delta r_2$ below the threshold until the number of retained samples reaches the batch size. We then use these retained samples for a one-step update. Only the unretained samples are reinserted into the dataset for future sampling. This process effectively reorders the sampled data so that each step prioritizes updates with data having smaller $\Delta r_2$, which we consider more valuable.

For samples with small prompt-related reward gaps, if their prompt-free reward gaps are also small, they will be used for the update. If their prompt-free reward gaps are large, they will not be used for the update (in this step), since $r_\theta$ may have overfit to spurious preferences on these samples.
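A minimal sketch of this prioritized sampling loop (the helper `prompt_free_gap` and the EMA-style threshold update are illustrative assumptions, not the exact implementation in Appendix C.1):

```python
import random

def prioritized_batches(samples, prompt_free_gap, batch_size, ema_beta=0.9):
    """Yield training batches that prioritize samples with small prompt-free gaps |dr2|.

    prompt_free_gap(sample) is a hypothetical helper returning the current |dr2|
    estimate for one preference sample. Each sample is updated at most once per pass.
    """
    pool = list(samples)
    random.shuffle(pool)
    threshold = None                                       # EMA over observed gaps
    while pool:
        retained, deferred = [], []
        for i, sample in enumerate(pool):
            gap = abs(prompt_free_gap(sample))
            threshold = gap if threshold is None else ema_beta * threshold + (1.0 - ema_beta) * gap
            (retained if gap <= threshold else deferred).append(sample)
            if len(retained) == batch_size:                # enough low-gap samples
                deferred.extend(pool[i + 1:])              # unevaluated samples stay pooled
                break
        if not retained:                                   # guard (sketch-only): avoid stalling
            deferred.sort(key=lambda s: abs(prompt_free_gap(s)))
            retained, deferred = deferred[:batch_size], deferred[batch_size:]
        pool = deferred                                    # deferred samples are reinserted
        yield retained                                     # caller runs one gradient step
```

Each yielded batch corresponds to the one-step update described above; deferred samples simply re-enter the pool for later steps.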

Fig 4(a) and 4(b) are not clear to me from the visualization plots; I am not sure how to interpret them.

Take Fig 4(a) as an example. The four subfigures represent four equally spaced training steps. All the points plotted in each subfigure represent randomly sampled prompt-response pairs, characterized by their corresponding $\Delta r_1$ and $\Delta r_2$ values. The data points that satisfy $|y_w| > |y_l|$ are colored in red and the ones that satisfy $|y_w| \le |y_l|$ in blue.

It's clear that for the naive training procedure, when we consider the prompt-free preference of $r_\theta$ on the data samples (you can think of this as projecting the points onto the $\Delta r_2$ axis), longer responses are preferred. Specifically, chosen-longer samples have significantly larger $\Delta r_2$ and chosen-shorter samples have small $\Delta r_2$, which demonstrates a clear preference towards longer responses. This is not surprising since the training dataset is inherently biased towards longer responses. On the other hand, the reward model trained with our method is only slightly biased towards longer responses, making it focus more on the prompt-related preference (which is more valuable). Similar interpretations of Fig. 4(b) are given in Section 4.1.2.
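For completeness, a minimal sketch of how such a scatter plot can be reproduced (array names and the axis assignment are assumptions; the per-sample $\Delta r_1$, $\Delta r_2$, and response lengths are assumed to be precomputed):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_gap_scatter(delta_r1, delta_r2, chosen_longer, title):
    """Scatter of prompt-related vs. prompt-free reward gaps for sampled pairs,
    colored by whether the chosen response is longer than the rejected one."""
    delta_r1 = np.asarray(delta_r1)
    delta_r2 = np.asarray(delta_r2)
    longer = np.asarray(chosen_longer, dtype=bool)
    plt.scatter(delta_r1[longer], delta_r2[longer], s=8, c="red", label="|y_w| > |y_l|")
    plt.scatter(delta_r1[~longer], delta_r2[~longer], s=8, c="blue", label="|y_w| <= |y_l|")
    plt.xlabel(r"$\Delta r_1$ (prompt-related gap)")
    plt.ylabel(r"$\Delta r_2$ (prompt-free gap)")
    plt.title(title)
    plt.legend()
    plt.show()
```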

In Eqn 4, is $r_1(x, y)$ the prompt-dependent reward? I believe $r_1(x, y)$ hasn't been defined anywhere.

Sorry for the unclear writing. In line 93, we define the prompt-related reward as $r_1$.

Why are the random variables in Eqn 7 modelled using Bernoulli distributions? It isn't clear to me what these variables mean (I did take a look at the appendix).

The reason we use a Bernoulli random variable to model the random preference label is given in the third response. When the prompt and the two responses are fixed, the Bradley-Terry model will randomly prefer one response with probability $p$, and prefer the other with probability $1 - p$. However, considering the entire dataset distribution, the probability $p$ is determined by the specified prompt and responses (which are themselves random). So the random preference label of a Bradley-Terry reward model can be formalized as a Bernoulli random variable, whose randomness comes from two sources: the randomness of the prompt and the responses, and the randomness of the Bernoulli distribution.

In Eqn 5, does $y_1$ refer to the preferred response and $y_2$ to the non-preferred response, or do both of these refer to different preferred responses for the same prompt?

In the subsequent theoretical analysis, we need to use the preferences under non-corresponding prompt-response pairs. Therefore, we do not simply use $y_+$ and $y_-$, as the preference between responses may reverse under different prompts. Here, we generally define how to express the preference between two responses for a given prompt. In practice, $y_1$ and $y_2$ are the response pair in a data sample, and which one is preferred depends on the specific prompt.

If our responses have addressed your concerns, we would appreciate your consideration for a higher score. Thank you again for your time and effort in reviewing our paper.

Comment

I have read the rebuttal and understood the technical concerns that I had. The responses were detailed, which helped in clarifying other aspects of the work.

I have likewise increased my score.

Final Decision

Paper summary

The submitted paper addresses the issue of RLHF's reward model generalization to unseen prompt-response pairs. The authors propose a decomposition of the reward function into a prompt-free reward and a prompt-related reward component, and introduce a data prioritization mechanism based on the prompt-free reward gap to improve reward model training. The proposed method is evaluated on synthetic datasets (e.g., length-biased and adversarial prompts) and standard benchmarks, demonstrating improved generalization and robustness.

Strengths of the paper

  1. Novelty and Importance: The paper tackles an important problem in RLHF and proposes a theoretically grounded solution. The decomposition of reward into prompt-free and prompt-related components is sensible and provides relevant insights into reward learning.
  2. Technical Rigor: The method builds on an information-theoretic formulation, and the learning problem is well-defined, admitting some theoretical analysis.
  3. Practicality: The proposed method integrates into existing RLHF pipelines without requiring additional models.
  4. Evaluation: The paper includes experiments on both synthetic datasets and real-world benchmarks, showing meaningful improvements in reward model accuracy and downstream performance.

Weaknesses of the paper

  1. Clarity:
    • The algorithm for reward decomposition is not explicitly laid out in the main paper, which could lead to ambiguity. While it is included in the appendix, its inclusion in the main text would improve clarity.
    • Some visualizations (e.g., Figures 1 and 4) and equations (e.g., Eq. 7) are not intuitive and require better explanations.
  2. Computational Overhead:
    • The binary search decomposition and importance sampling add computational costs, which are not thoroughly analyzed. While the authors argue that the overhead is minimal, detailed runtime comparisons with baselines are missing.
  3. Limited Scope of Experiments:
    • Most experiments focus on length bias, with limited exploration of other biases (e.g., sentiment, verbosity, style). This raises questions about the general applicability of the method.
    • The paper does not analyze how the learned prompt-related rewards align with human-interpretable preferences, which would validate the method's ability to capture meaningful prompt-response relationships.
  4. Baseline Comparisons:
    • While the authors include comparisons with the Generalizable Reward Model (GRM) in the rebuttal, this was missing in the initial submission. Including such comparisons earlier would have strengthened the paper.

Discussion and rebuttal summary

  • Reviewer sjU2: Initially raised concerns about clarity and technical details but increased their score after the rebuttal clarified these issues.
  • Reviewer S9sS: Appreciated the additional baseline comparisons but maintained their score after “careful consideration”.
  • Reviewer rGtM: Appreciated the paper's novelty and potential impact but suggested including the algorithm in the main paper and addressing details regarding computational efficiency.
  • Reviewer ZzmL: Acknowledged the authors' rebuttal on computational overhead but emphasized the need for better presentation and runtime analysis.

The authors provided detailed and thoughtful responses to the reviewers' concerns, addressing most of the raised issues:

  • Clarity: The authors clarified the algorithm for reward decomposition and provided additional explanations for figures and equations.
  • Computational Overhead: The authors argued that the computational cost of importance sampling is minimal and provided justifications for this claim. However, detailed runtime comparisons are still missing.
  • Baseline Comparisons: The authors included comparisons with GRM in their rebuttal, showing that their method outperforms GRM on multiple benchmarks, although these improvements are not always significant. Including these results will enable readers to better understand the paper in the context of other approaches.
  • Experimental Scope: The authors explained the challenges of addressing other biases and argued that their method is broadly applicable to spurious preferences.
  • While the rebuttal addressed many concerns, some reviewers maintained their original scores due to unresolved issues, such as the lack of runtime analysis and limited experimental scope.

Recommendation

The paper is technically solid and addresses an important problem in RLHF. The proposed method is novel, well-motivated, and demonstrates relevant improvements in generalization and robustness. However, weaknesses include the lack of clarity in some parts of the paper, the limited discussion of computational overhead, and some limitations of the experiments - but these were mainly well addressed in the rebuttal. Despite these issues, the paper's contributions outweigh its shortcomings - hence, in line with the reviewers' suggestions, I am recommending the acceptance of the paper.

For the camera ready paper, the authors are strongly encouraged to:

  1. Include the algorithm for reward decomposition in the main paper for clarity.
  2. Provide a detailed runtime analysis.
  3. Expand the experimental scope to address other biases and validate the interpretability of the learned rewards. Furthermore, the additional baseline results should be included.