PaperHub
4.5
/10
Rejected4 位审稿人
最低3最高7标准差1.5
4
4
3
7
3.3
置信度
正确性2.5
贡献度2.5
表达2.3
NeurIPS 2024

Removing Length Bias in RLHF is not Enough

OpenReviewPDF
提交: 2024-05-14更新: 2024-11-06

摘要

关键词
LLMRLHFPrompt Bias

评审与讨论

审稿意见
4

The authors considers methods for removing bias in RMs, specifically the bias towards long responses and the bias certain prompts might have to generate high rewards (this stems from the BardleyTerry model being underspecified.). For the second problem the authors proposed PBC which adds a linear layer to the last token of the prompt, the output of which predicts the average reward of completions from the prompt. For the first problem the authors propose to combine PBC with existing length bias correction methods which adds a correlation term to the loss. For experimental results the authors considers RLHF training LLama-7B on the RM-static dataset. They find that their method outperforms baselines on academic benchmarks (Table 2) and in head-to-head comparisons (Fig 4). They also consider hyperparameter stability and ablations in Fig 5.

优点

  1. Bias in RLHF can potentially have a large impact if addressed correctly.

缺点

  1. Academic metrics like MMLU are not a good fit for RLHF. MT-bench is better.
  2. There are no error bars, unclear how strong the signal is.
  3. The writing is rather handwavy at times, e.g. the motivation in section 3.1. is very qualitative.
  4. The novelty is low.

问题

  1. Is the method in section 3.3 novel? It seems like not. The intro states “We show that the developed PBC method can be flexibly combined with existing methods 53 of removing length bias”. Please clarify this.
  2. The Bradley–Terry model is indeed underspecified as in eq (3). Do you have any quantitative evidence that this is a problem in practice? Section 3.1. is rather handwavy.
  3. What prompt is used for GPT4 evaluation is Figure 4? Please clarify in paper.
  4. Can you add error bars to Table 2 and Figure 4?

局限性

na

作者回复

Thank you for recognizing the impact of our research direction.

W1

As your suggestion, we have compared our method with other baselines on MT-bench. The results have been exhibited in the following.

MT-BenchTurn 1Turn 2Average Score
RLHF3.952.223.09
ODIN3.982.263.12
PBC3.612.352.98
ODIN+PBC4.222.203.21
LPBC4.532.813.67

From the results, we can find that our developed LPBC method still ourperforms other baselines on the evaluation of MT-bench.

W2&Q4

Thanks for your suggestion, we have reported error bars in Tables 2 and Fig. 4.

MethodMMLUDROPBBHTQA
RLHF43.82 ± 0.6329.53 ± 0.3931.65 ± 0.0836.57 ± 0.17
ODIN42.29 ± 0.1529.82 ± 0.3732.01 ± 0.5239.43 ± 0.66
PBC43.84 ± 0.2831.61 ± 0.0230.99 ± 0.0138.50 ± 0.22
ODIN+PBC45.56 ± 0.1432.04 ± 0.3331.32 ± 0.3340.80 ± 0.72
LBPC45.94 ± 0.4831.57 ± 0.2632.04 ± 0.1038.75 ± 0.12
LBPC v.s.v.s.PBCODINODIN+PBC
Win45.33 ± 2.4336.67 ± 1.0227.33 ± 2.50
Tie52.67 ± 1.9056.00 ± 1.7357.67 ± 1.95
Loss2.00 ± 1.517.33 ± 1.8915.00 ± 1.42

W3

Sorry for the inconvenience caused in reading. We will improve the writing quality to ensure better understanding.

The brief statement of the motivation is that the prompt-template bias learned by RM will result in LLMs preferring to generate responses in a specific format after RLHF fine-tuning, regardless of the format requested in the prompt. Thus, we develop a method to estimate the prompt-template bias so that we can remove it in the following RLHF process.

W4&Q1

Thank you for giving us the opportunity to clarify the novelty of our work.

The main contribution of our work is revealing the cause of prompt-template bias in reward modeling through theoretical analysis in Section 3.1 and propose PBC method to address this issue in Section 3.2.

In Section 3.3, we propose a novel method to simultaneously estimate prompt-template bias with PBC method and length bias with existing method, e.g. ODIN. We did not simply stack these two methods, that is, we did not just remove the prompt-template bias on an RM that had already been adjusted for length bias (ODIN+PBC in our paper). We chose a more refined modeling approach by decomposing the prompt-template bias into quality and length components for separate estimation, as shown in Eq.(14) (LPBC in our paper). Note that we have inlcuded the comparison beween ODIN+PBC and LPBC in our experiments.

For novelty in the aspect of technical, we admit the bias estimation and combination method is a little straightforward, but the benefit is that it won’t bring too much computation burdens and can be easily deployed on the original implementation of RLHF, which is particularly important given the significant computational resources required for LLM training today.

Q2

As mentioned in W3, our work mainly focuses on addressing the issue of prompt-template bias. As shown in Fig.3, the reward distributions on different categories show that the RM trained with vanilla preference loss tend to assign higher reward scores on the response in a specific format, e.g. tech article.

The quantitative experimental results shown in Table 2 show that our developed PBC method for removing prompt-template bias can lead to significant performance improvements compared to the original implementation of RLHF.

Q3

The prompt we used for GPT-4 evaluation is listed below, following the same experimental settings as in ODIN for a fair comparison.

"[System Prompt] You are a helpful and precise assistant for checking the quality of the answers.

[User Prompt] {prompt}

[The Start of Assistant1’s Answer] {response_a} [The End of Assistant1’s Answer]

[The Start of Assistant2’s Answer] {response_b} [The End of Assistant2’s Answer]

We would like to request your feedback on the performance of two AI assistants in response to the user question displayed above. Please rate the helpfulness, relevance, accuracy, level of details of their responses. Each assistant receives an overall score on a scale of 1 to 10, where a higher score indicates better overall performance. Please first output a single line containing only two values indicating the scores for Assistant 1 and 2, respectively. The two scores are separated by a space. In the subsequent line, please provide a comprehensive explanation of your evaluation, avoiding any potential bias and ensuring that the order in which the responses were presented does not affect your judgment."

We wish the reivewer prioritize RLHF techniques that are truly practical and implementable, rather than those that may appear fancy but remain confined to papers. Our work is driven by real-world problems, identified the underlying causes through theoretical analysis, and effectively addressed the issue in practice.

评论

Dear reviewer

Sorry to bother you, as the discussion period is nearing its end, we hope that our response has adequately addressed your concerns regarding our paper.

If not, we kindly ask you to list your remaining concerns, so that we can improve the quality of our paper in the next round.

Your comments on our paper will be extremely important to us. Thanks.

Best wishes

审稿意见
4

This paper studies the prompt bias in RLHF, especially the reward modeling --- beyond the length bias that might exist. Alleviating reward hacking is an important topic in RLHF, however, with the current paper, some details or contributions are not very clear. I'll elaborate in the following sections.

优点

The problem studied is important. The illustrative figures are helpful.

缺点

Some notations do not make sense, for example, in Equation (5), averaging over y does not make sense. Would not it be better to average over the C rather than y.

The presentation of the problem itself is not yet clear to me. Although the authors keep using examples in the context to anchor their ideas (which I appreciate), it is still unclear what is the problem this work aims to solve. I like the general idea of Figure 1, however, what does the red color highlighting mean? This figure makes a good contrast between your RM and conventional RM, yet it fails to illustrate the problem your RM aims to solve.

The experimental results are not supportive enough.

问题

In Equation (6) the authors compare reward values between two different prompts. What is the motivation for making such comparisons --- as RLHF only ranks/compares responses within a template?

In experiments, what is the performance of different reward models? The authors didn't report the accuracy or other quantitative performance information on the RMs.

Real-world case studies would be very helpful in understanding the paper's contribution: would it be possible for the authors to show some reward hacking examples (not the ones generated by GPT4)?

The proposed algorithm seems fragile w.r.t. its hyper-parameters. There is no clear trend in the heat maps of Figure 5. Could the authors also report the standard deviation of each evaluation?

Error bars are missing in the reported results of Table 2.

局限性

Please see weakness

作者回复

Thank you for recognizing the impact of our research direction.

W1

In tems of notation, we assume it is correct to perform the averaging operation on the variable yy rather than the function symbol CC.

W2

Thanks for your suggestion. We admit that Fig.1 is used to illustrate the difference between our method and vanilla RM training, where the word highlighted in red color indicates the template request in the prompt.

The statement of the problem is that the prompt-template bias learned by RM will result in LLMs preferring to generate responses in a specific format after RLHF fine-tuning, regardless of the format requested in the prompt. Thus, we develop a method to estimate the prompt-template bias so that we can remove it in the following RLHF process.

The cause of prompt-template bias is that the datasets for RM training usually only collect responses that satisfy the template/format requests in the prompt. Because it will be time-consuming and expensive to construct responses in various formats for each prompt in practice.

Q1

As stated in W2, the main cause of prompt-template bias issue is that the datasets for RM training usually only collect responses that satisfy the template/format request in the prompt. Our method developed in this paper can allevaite this issue.

The comparison in Eq.(6) is used to illustate that there is a chance that C(xa,ya)>>C(xb,yb)C(x_a, \overline{y}_a)>>C(x_b, \overline{y}_b) will lead to C(xb,yab)>C(xb,yb)C(x_b, y_{ab}) > C(x_b, y_b), leading to the issue in Eq.(7).

Q2

Actually, we have compares the accuracy of various RMs on Fig.5(a). From the results, we can find that the constraint terms introduced by our method won't significantly affect the RM's accuracy.

Moreover, we have included a comparison of other quantitative performance metrics on the RM benchmark, which is likely the most popular benchmark for evaluating trained RMs.

MetricChatChat HardSafetyReasoning
Vanila RM89.66 ± 0.6041.89 ± 0.1831.34 ± 0.0052.16 ± 1.30
ODIN85.20 ± 0.1337.94 ± 0.2730.96 ± 0.2047.94 ± 1.59
PBC73.97 ± 1.1834.43 ± 1.1934.40 ± 2.1955.35 ± 3.10
PBC+ODIN89.11 ± 0.2340.35 ± 0.5230.60 ± 0.2649.39 ± 0.89
LPBC(ηl=ηc=0.01\eta_l=\eta_c=0.01)90.50 ± 0.2642.54 ± 0.3628.79 ± 0.3245.80 ± 1.20
LPBC(ηl=ηc=0.05\eta_l=\eta_c=0.05)88.24 ± 1.5045.39 ± 1.0728.69 ± 0.2551.30 ± 1.70
LPBC(ηl=ηc=0.10\eta_l=\eta_c=0.10)85.94 ± 0.3945.83 ± 0.4527.76 ± 0.6749.80 ± 1.09

Based on the RM benchmark comparison, our method (LPBC) does not significantly impact RM performance and even enhances it in certain areas, such as Chat and Chat Hard, where the focus is on the quality of generated responses.

Q3

Due to the page limitation, we have attached showcases of reward hacking in the pdf file, sorry for the inconvenience.

Q4

Thanks. Actually, Fig.5 mainly aims to illustrate that the performance of our method is not sensitive to the selection of hyperparameters, as there is no clear trend in the heat maps, as you mentioned.

As you suggested, we also report the standard deviation of each evaluation in the table below

MMLUηc=0\eta_c=0ηc=0.01\eta_c=0.01ηc=0.05\eta_c=0.05ηl=0.1\eta_l=0.1
ηl=0\eta_l=043.82 ± 0.63
ηl=0.01\eta_l=0.0140.30 ± 0.6545.47 ± 0.5245.91 ± 0.55
ηl=0.05\eta_l=0.0544.57 ± 0.5245.94 ± 0.4843.89 ± 0.67
ηl=0.1\eta_l=0.143.93 ± 0.7742.25 ± 0.6736.54 ± 0.98
DROPηc=0\eta_c=0ηc=0.01\eta_c=0.01ηc=0.05\eta_c=0.05ηc=0.1\eta_c=0.1
ηl=0\eta_l=029.53 ± 0.39
ηl=0.01\eta_l=0.0130.63 ± 0.1931.57 ± 0.2231.47 ± 0.25
ηl=0.05\eta_l=0.0532.52 ± 0.1531.57 ± 0.2627.41 ± 0.41
ηl=0.1\eta_l=0.131.06 ± 0.2832.60 ± 0.3230.96 ± 0.33

Q5

Thanks for your suggestion, we have also included error bars in Table.2.

MethodMMLUDROPBBHTQA
RLHF43.82 ± 0.6329.53 ± 0.3931.65 ± 0.0836.57 ± 0.17
ODIN42.29 ± 0.1529.82 ± 0.3732.01 ± 0.5239.43 ± 0.66
PBC43.84 ± 0.2831.61 ± 0.0230.99 ± 0.0138.50 ± 0.22
ODIN+PBC45.56 ± 0.1432.04 ± 0.3331.32 ± 0.3340.80 ± 0.72
LBPC45.94 ± 0.4831.57 ± 0.2632.04 ± 0.1038.75 ± 0.12
评论

Dear reviewer

Sorry to bother you, as the discussion period is nearing its end, we hope that our response has adequately addressed your concerns regarding our paper.

If not, we kindly ask you to list your remaining concerns, so that we can improve the quality of our paper in the next round.

Your comments on our paper will be extremely important to us. Thanks.

Best wishes

审稿意见
3

This paper introduces the Prompt Bias Calibration (PBC) method to address prompt-template bias in reward training of RLHF. The proposed PBC method is validated through extensive empirical results and mathematical analysis, showing its effectiveness in combination with existing length bias removal methods.

优点

  1. Good Writing: The paper is well-written and easy to follow.
  2. Innovative Methodology: Introduces Prompt Bias Calibration (PBC) to address prompt-template bias in RLHF.
  3. Strong Empirical Evidence: Demonstrates significant performance improvements through comprehensive evaluations.

缺点

see questions

问题

  1. Why Eq (6) would happen? In general training, C(xa,yˉa)C(x_a, \bar{y}_a) and C(xb,yˉb)C(x_b, \bar{y}_b) may exhibit some gaps but should be in a reasonable area as there is no Bellman process in the training process.

  2. For Eq (7), suppose there are two elements in the prompt: a (paper type) and b (themes). The original sample in the dataset is (b, b) = (brief on theme b) and (a, a) = (academic paper on theme a). The authors mentioned the margin sample of yab=(a,b)y_{ab}=(a, b) = (academic paper on theme b). There are two fundamental assumptions: (1) rθ(xb,yab)=rθ(xb,yb)r_{\theta'}(x_b, y_{ab}) = r_{\theta'}(x_b, y_b) for the same theme (b). (2) C(xb,yˉa)C(xa,yˉa)C(x_b, \bar{y}_a) \approx C(x_a, \bar{y}_a) due to formulate preference.

2.1 Further clarification is necessary for these two assumptions: (1) The first assumption is too strict focusing on text creation, where the reward function only depends on bb rather than aa. For example, a (code languages) b=(Neural networks), the yaby_{ab} is heavily dependent on (a). (2) I think the second assumption is wrong as the prompt bias is heavily associated with xx and we cannot make this assumption. The following algorithm in Eq (8) aligns with my intuition.

2.2. There are also two more comments on this: (1). joint distribution is approximated as the margin distribution for the reward function is generally unacceptable for me, which means the x1x_1 is not important. So, the optimization function could be: r(x2,y)+C(x2,yˉ)r(x_2, y) + C(x_2, \bar{y}) in Eq 5. If this is true, CC will not be associated with x1x_1, i.e., template in this work. (2) The xa,bx_{a, b}, as well as y(a,b)y_(a, b) is the out-of-distribution sample, and the accuracy for the reward estimation should exhibit high bias.

局限性

see the above

作者回复

First of all, we greatly appreciate your responsible review of our theoretical analysis on the issue of “prompt-template bias” and also thanks for acknowledging the performance of our method. We assume that our greatest disagreement lies in the theoretical analysis part, so we will try to address your concerns one by one in the following.

Q1

The training of reward modeling is not related to RL, nor is it related to the Bellman process. C(xa,ya)C(x_a, \overline{y}_a) and C(xb,yb)C(x_b, \overline{y}_b) exhibits gaps due to the absence of constraints addressing prompt-template bias in the original preference loss, typically the value of C(xa,ya)C(x_a, \overline{y}_a) won't affect the preference order within the set of prompt-response pairs with the same prompt xax_a and response format ya\overline{y}_a.

We assume that you argee with the point that C(xa,ya)C(x_a, \overline{y}_a) and C(xb,yb)C(x_b, \overline{y}_b) will exhibit some gaps, but believe it should be within a reasonable range.

However, it is very difficult to define what is a reasonable range in practice. From the experimental results shown in Fig.3(a), the RM trained with original preference loss does assign higher scores on responses formatted as Tech Article, leading to the misordering of prompt-response pairs exhibited in Table. 1. As long as this gap causes the RM to assign higher reward scores to certain marginal samples (possibly OOD samples), like rθ(xb,yab)>rθ(xb,yb)r_\theta(x_b,y_{ab}) > r_\theta(x_b, y_b) in Eq.(7), it will guide the LLM to generate responses in a specific format after RLHF fine-tuning.

Q2.1

Thank you for your insightful comments on Eq.(7). However, we believe there may be significant misunderstandings regarding the assumptions underlying Eq.(7).

To assumption 1, we only claim rθ(xb,yab)rθ(xb,yb)r_{\theta'}(x_b, y_{ab}) \approx r_{\theta'}(x_b, y_b) in our paper rather than strictly constraining rθ(xb,yab)=rθ(xb,yb)r_{\theta'}(x_b, y_{ab}) = r_{\theta'}(x_b, y_{b}). We absolutely agree with your comment that the reward function should depend on the whole response yaby_{ab} rather than only its template yb\overline{y}_b.

It is also the reason why we only assume rθ(xb,yab)rθ(xb,yb)r_{\theta'}(x_b, y_{ab}) \approx r_{\theta'}(x_b, y_b) because it can be possibly achieved on the reward distribuion rθ(x,y)r_{\theta'}(x,y) learned to approximate "gold standard" reward model rθ(x,y)r_{\theta*}(x,y)

To assumption 2, we didn't assume C(xb,ya)C(xa,ya)C(x_b, \overline{y}_{a}) \approx C(x_a, \overline{y}_a), but only demonstrate there is a chance that

C(xa,ya)>>C(xb,yb)C(x_a, \overline{y}_a) >> C(x_b, \overline{y}_b) leads to C(xb,yab)>C(xb,yb)C(x_b, y_ab) > C(x_b, y_b) .

Please notice the difference. C(xa,ya)>>C(xb,yb)C(x_a, \overline{y}_a) >> C(x_b, \overline{y}_b) is achievable because the original preference loss does not impose any constraints on prompt-template bias and this gap can cause trained RM to prefer either prompt xax_a over xbx_b or format ya\overline{y}_a over yb\overline{y}_b.

As long as there is a tendency that RM prefers format ya\overline{y}_a over yb\overline{y}_b, we can have

C(xb,yab)>C(xb,yb)C(x_b, y_{ab}) > C(x_b, y_b) to lead to rθ(xb,yab)>rθ(xb,yb)r_\theta(x_b, y_{ab}) > r_\theta(x_b, y_b) in Eq.(7)

Q2.2

Firstly, there are two templates in your comment: 1) template request in the prompt, denoted as x1x_1; 2) response templates, denoted as y\overline{y}. In our understanding, your question is why the generated response yy will be unrelated to the template request x1x_1 during RM training.

Actually, it is the shortcoming of the original preference loss that we want to highlight. Assume there is a set of prompt-response pairs with the sample prompt xax_a and responses in the same template ya\overline{y}_a. Notably, during training RM with the preference loss, these characteristics (or elements) shared across all prompt-responses pairs won't affect the preference order, e.g. prompt xax_a, response template ya\overline{y}_a and the correlation between xax_a and ya\overline{y}_a. The most effective way to address this issue is to construct responses in various formats for the prompt xax_a, helping the reward model to distinguish whether the response template meet the requirement of the prompt xax_a. However, constructing responses in various formats for each prompt will be time-consuming and expensive in practice.

For OOD sample, it seems that we have reached a consensus on this point that reward estimation should exhibit high bias on OOD samples, no matter it is xabx_{ab} or yaby_{ab}. So, we wonder if you agree with the fact that Eq.(7) can be probably achieved if (xb,yab)(x_b, y_{ab}) is an OOD sample?

Summary

At last, we thank you for your effort in improving the quality of our paper, even if it is a rejected score. This issue actually stems from the process of deploying RLHF technology in our text creation product, so there is no doubt that it will occur in practice (even if we used nearly 190k human-annotated preference pairs for RM training). In this paper, we aim to analyze and explain this phenomenon, and propose a solution that can be practically deployed in the product, rather than a flashy but impractical algorithm.

评论

Thanks for your response! I would appreciate the authors providing further clarifications for 2.1 and 2.2.

  1. For Q1, my point is that it it reasonable to assume C(xa,yˉa)>C(xb,yˉb)C(x_a, \bar{y}_a)>C(x_b, \bar{y}_b) rather than C\(x_a, \bar{y}_a)>>C(x_b, \bar{y}_b). I think this would not influence further analysis as >>>> is generally a stronger condition compared to >>.

2.1 Assumption 1: rθ(xb,yab)rθ(xb,yb)r_{\theta^{\prime}}(x_b, y_{a b}) \approx r_{\theta^{\prime}}(x_b, y_b) is not a claimed assumption for me and the reason is explained above.

Without Assumption 2: C(xb,yˉa)C(xa,yˉa)C(x_b, \bar{y}_a) \approx C(x_a, \bar{y}_a), why there is a chance for:

C(xa,yˉa)>C(xb,yˉb) leads to C(xb,yab)>C(xb,yb)C\left(x_a, \bar{y}_a\right) \gg>C\left(x_b, \bar{y}_b\right) \text { leads to } C\left(x_b, y_a b\right)>C\left(x_b, y_b\right)

2.2 I think the response should be highly related to prompt xax_a or xbx_b. OOD scenarios should be considered in Eq (7) but still, Eq (7) is not convincing for me for the above reasons.

评论

Dear reviewer

Sorry to bother you, as the discussion period is nearing its end, we hope that our response has adequately addressed your concerns regarding the theoretical analysis section of our paper.

We greatly appreciate your efforts on reviewing our paper and also responsible comments. Since your concerns about our paper mainly stem from your belief that the theory is wrong, we believe there is a misunderstanding, which we feel we have already clarified. Therefore, we kindly ask if you could reconsider your evaluation of our paper.

If not, we kindly ask you to list the issues in our theoretical proof in mathematical form, so that we can improve the quality of our paper in the next round.

Your comments on our paper will be extremely important to us. Thanks.

Best wishes

评论

Thank you for your responsible comments and your willingness to discuss with us.

First, please allow us to emphasize the consensus we have reached:

  1. There is indeed hacking of the current RM through specific response formats, which causes the LLM after RLHF finetuning to tend to generate responses in a specific format (we have indeed encountered this issue in product implementation).

  2. The theoretical analysis in this paper is only intended to confirm that prompt-template bias in some cases of preference pairs may also lead to reward hacking, but not all preference pairs will result in this.

  3. As long as the prompt-template bias in certain preference pairs leads to reward hacking, the LLM aligned with this reward model may overlook the template requirements in prompts and tend to generate responses in a few specific formats.

Q1

Thanks for your patience in carefully reading our response. We are pleased to see that we have reached a consensus that there is a possibility of achieving C(xa,ya)>C(xb,yb)C(x_a, \overline{y}_a) > C(x_b, \overline{y}_b) in certain prompt-response pairs.

The reason why we strictly assume C(xa,ya)>>C(xb,yb)C(x_a, \overline{y}_a) >> C(x_b, \overline{y}_b) is because we wish the prompt-template bias term to play a dominant role in the comparison of prompt-response pairs from different sets, leading to

rθ(xa,ya)>>rθ(xb,yb)r_\theta(x_a, y_a) >> r_\theta(x_b, y_b) where

rθ(xa,ya)=rθ(xa,ya)+C(xa,ya)r_\theta(x_a, y_a) = r_{\theta'}(x_a, y_a) + C(x_a, \overline{y}_a),

rθ(xb,yb)=rθ(xb,yb)+C(xb,yb)r_\theta(x_b, y_b) = r_{\theta'}(x_b, y_b) + C(x_b, \overline{y}_b)

We agree with your statement that C(xa,ya)>>C(xb,yb)C(x_a, \overline{y}_a) >> C(x_b, \overline{y}_b) is a stronger condition, but you cannot deny that there are prompt-response pairs that satisfy this condition, which could lead to reward hacking.

Assumption 1

Sorry for not explaining it clearly in the first response. This time, we will provide an explanation based on the examples you provided.

Given a response yaby_{ab} where aa indicates the template of code language and bb denots the theme on neural networks, the trained reward model will assign a reward score to the concatenation of the prompt xbx_b and the response yaby_{ab} as rθ(xb,yab)=rθ(xb,yab)+C(xb,ya)r_\theta(x_b, y_{ab}) = r_{\theta'}(x_b, y_{ab}) + C(x_b, \overline{y}_a).

We fully agree with your comment that the reward function rθ(xb,yab)r_\theta(x_b, y_{ab}) should depend on the entire response yaby_{ab}, not just on elements aa or bb individually. However, please note that we have split rθ(xb,yab)r_\theta(x_b, y_{ab}) into two components:

  1. C(xb,ya)C(x_b, \overline{y}_a), which models the RM's scoring of a prompt xbx_b when the response follows format aa,

  2. rθ(xb,yab)r_{\theta’}(x_b, y_{ab}), which is intended to model the remaining part of the reward score, such as whether the response’s theme satisfies with the prompt's request.

Thus, we can assume rθ(xb,yab)rθ(xb,yb)r_{\theta'}(x_b, y_{ab}) \approx r_{\theta'}(x_b, y_b) because both yaby_{ab} and yby_b follow the same theme bb after discarding the impact of response template. For a response yaby_{ab} that heavily depends on aa, its reward score will be dominated by the term C(xb,ya)C(x_b, \overline{y}_a).

Moreover, as stated in our reply to Q.1, we assume the prompt-template bias term to play a dominant role in the comparsion, and thus the gap between rθ(xb,yab)r_{\theta}'(x_b, y_{ab}) and rθ(xb,yb)r_{\theta}'(x_b, y_b) will be relative small when compared to the gap between C(xb,ya)C(x_b, \overline{y}_a) and C(xb,yb)C(x_b, \overline{y}_b).

Assumption 2

Actually, there is no need to assume C(xb,ya)C(xa,ya)C(x_b, \overline{y}_a) \approx C(x_a, \overline{y}_a).

Given C(xa,ya)>>C(xb,yb)C(x_a, \overline{y}_a) >> C(x_b, \overline{y}_b), we may have C(xa,ya)>>/>C(xb,ya)>C(xb,yb)C(x_a, \overline{y}_a) >>/> C(x_b, \overline{y}_a) > C(x_b, \overline{y}_b).

This can be achieved when there is a tendency that RM perfers format ya\overline{y}_a over yb\overline{y}_b,

we can have C(xb,ya)>C(xb,yb)C(x_b, \overline{y}_a) > C(x_b, \overline{y}_b) to lead to

rθ(xb,yab)>rθ(xb,yb)r_\theta(x_b, y_{ab}) > r_\theta(x_b, y_b) in Eq.(7)

Q2.2

The reason for the prompt-template bias lies in what you mentioned: the responses collected for each prompt are too closely aligned with the prompt's requirements. As a result, the RM trained these preference pairs has not encountered responses with different templates and cannot distinguish whether a response's template meets the requirement in the prompt.

Yeah, we are trying to explain why an OOD sample, such as (xb,yab)(x_b, y_{ab}), might receive an overly high bias estimate, e.g. Eq.(7).

Additional Comments

We still wish our responsible reviewer to prioritize RLHF techniques that are truly practical and implementable, rather than those that may appear fancy but remain confined to papers. Our work is driven by real-world problems, identified the underlying causes through theoretical analysis, and effectively addressed the issue in practice. Moreover, this technique has been proven effective in our industry and hope you can provide a fair judgment.

审稿意见
7

The paper addresses the issue of reward hacking in RLHF training, superficially, identifying prompt-template bias, defined as when a reward model (RM) develops a preference for responses that adhere to specific formats or templates, even when these formats are not explicitly specified or desired in the prompt and proposes Prompt Bias Calibration (PBC) method that successfully tackles this issue. PBC can also be combined with existing length debiasing methods like ODIN to mitigate both hacks in the reward signal.

优点

  • The paper identifies and analyzes "prompt-template bias" in RLHF, a potentially impactful issue.
  • PBC is easy to implement and as shown can be combined with existing approaches.
  • Strong empirical validation with good coverage in the experiments and ablation.

缺点

  • Choosing one specific bias - while the title claims that removing the length bias is not enough, it seems to change the scene to potentially removing length and prompt-template bias not being enough. Leading to concerns about needing to combine many methods, one for each mitigation.

问题

Any insights on whether the approach would work for the larger models beyond 7B? Could the approach be generalized to address other instances of reward hacking?

局限性

The limitations are covered.

作者回复

Thank you for recognizing our work. We believe that you are a reviewer with genuine experience in implementing RLHF and are well aware of the current shortcomings of RLHF.

W1

Thanks for your suggestion. The orginal title aims to emphasize that existing RLHF research mainly focuses only on length bias, while overlooking other potential biases, e.g. prompt-template bias in our paper. We also argee with your point and will try to come up with a new title for our article.

Q1

Thanks for your insightful questions.

Actually, we have evaluated our method on an LLM with 13B parameters, which we pretrained ourselves and intend to deploy in text creation products, and it consistently outperforms the original RLHF pipeline. Notably, the evaluation metrics on the industrial side are more stringent, but our developed method still performs well.

For generalization, the developed PBC is not limited to removing template bias in responses, but can address any characteristics shared among responses to a specific prompt, e.g. the language of the response (english or chinese). We constructed the concept of prompt-template bias in the text creation scenario purely for the ease of understanding. In real-world applications, the situation is more complex, but this does not undermine the effectiveness of our method, as it remains challenging to achieve diversity in responses to each prompt.

Overall, we have great respect for you because you are a researcher who genuinely focuses on the practical implementation of RLHF, regardless of whether this paper is accepted or not.

评论

Dear reviewer

Sorry to bother you, as the discussion period is nearing its end, we hope that our response has adequately addressed your concerns regarding our paper.

If not, we kindly ask you to list your remaining concerns, so that we can improve the quality of our paper in the next round.

Your comments on our paper will be extremely important to us. Thanks.

Best wishes

评论

Thank you for providing further explanations on the generalizability of the method. I have read all responses and maintain my rating.

作者回复

We sincerely appreciate the reviewers' efforts and valuable feedback in helping to enhance the quality of our paper. Here, we would like to highlight the motivation and key contributions of our work in the following:

Motivation

The motivation for our work stems from the process of deploying RLHF to enhance the performance of LLM-based content creation products. We find that these LLMs after RLHF fine-tuning prefer to generate responses in a specifc format, even regardless of the format requested in the prompt. This observation motivates us to uncover the source of this phenomenon. Through therotical analysis, we attribute it to the issue of prompt-template bias and propose a novel method that imposes almost no additional computational burden to address this issue in RLHF.

Contribution

  1. We reveal that reward models with prompt-template bias tend to assign higher reward scores to responses in specific formats, such as the responses in technical articles shown in Table 1.
  2. Through theoretical analysis, we reveal that the reward model learns prompt-template bias because the dataset typically only includes responses that adhere to the format specified by the prompt.
  3. Without introducing too much additional computational burden to RLHF pipeline, we develop PCB method to estimate prompt-template bias during RM training, so that we can remove the bias term in the following PPO process
  4. We demonstrate that the developed PCB method can be integrated with existing algorithms to simultaneously eliminate prompt-template and length biases during RM training, further improving the quality of responses generated by LLMs.

Additional Comments

  1. We admit the most effective solution to address the issue of prompt-template bias is to contruct responses in various formats for each prompt. However, manually constructing responses in specific formats and annotating their preference order is extremely time-consuming and costly in practice. Most publicly available datasets for RM training even provide only a single preference pair per prompt.

  2. We acknowledge that the developed method is somewhat straightforward, but it has proven to be an effective solution for alleviating prompt-template bias in our product without introducing additional annotated preference pairs.

We wish reviewers and ACs to prioritize RLHF techniques that are truly practical and implementable, rather than those that may appear fancy but remain confined to papers. Our work is driven by real-world problems, identified the underlying causes through theoretical analysis, and effectively addressed the issue in practice. We also hope that the issues we’ve uncovered and the methods we’ve proposed will assist the community in deploying RLHF more effectively.

评论

Dear Reviewers,

Please read authors' responses carefully and provide your answers.

Thanks, AC

最终决定

This paper proposes a new method to mitigate the reward hacking for RLHF by introducing a new bias, prompt-template bias, during the reward modeling. By properly estimating this bias during the reward modeling and removing it from reward scores during the RLHF fine-tuning, the proposed method can improve performances of RLHF, especially along with a previous length bias calibration. Experimental results with LLaMA-2-7B show that the proposed PBC (prompt bias calibration) reduces the reward variation according to formats of output responses on a manual template dataset and improve performances on benchmark understanding tasks such as MMLU, DROP, and BBH.

First, the authors address many concerns and issues raised by the reviewers, including the explanation of Eq. (7) and the reason why such a prompt-template bias can occur during the reward modeling from general preference datasets.

However, the overall descriptions on the main problem, prompt-template bias problem, and the main algorithm to mitigate it in Section 3 need to be clarified and improved, as the reviewers said. For example, the statements in Section 3 are lengthy and difficult to clearly understand while the descriptions for Figure 1,2 are very lacking. There is also a typo in Eq. (4) (-r) and the sum over y in Eq. (5) is not clear. In addition, the prompt-bias term C(x,y_a) depends on the format (a) of the output response, but its estimation is u_c(x) and obtained only from the input prompt (x). The derivation of this main algorithm is also unclear.

Moreover, empirical validation is limited and insufficient. Especially, the proposed method targets for generation tasks, but main benchmark results are obtained only for understanding tasks, even though the authors show MT-bench scores and some chat scores in the rebuttal. Also, it seems that the performances are highly dependent on the hyperparameters (eta_c, eta_l), and the use of main bias calibration PBC alone significantly decreases generation performances in comparison to the baseline that does not use the bias calibration.

Therefore, I think the paper is insufficient to be accepted by NeurIPS.