PaperHub
Rating: 5.5 / 10 (Rejected, 4 reviewers)
Individual ratings: 5, 6, 6, 5 (min 5, max 6, std 0.5)
Confidence: 4.0
Correctness: 2.5 · Contribution: 2.5 · Presentation: 2.8
ICLR 2025

Diverging Preferences: When do Annotators Disagree and do Models Know?

OpenReview · PDF
Submitted: 2024-09-28 · Updated: 2025-02-05
TL;DR

We examine diverging preferences in human-labeled preference datasets and their influences in reward modeling and LLM evaluations.

Abstract

Keywords

RLHF, Pluralistic Alignment

Reviews and Discussion

Review
Rating: 5

This paper investigates the problem of diverging preferences in human-labelled preference datasets. The authors create a taxonomy of disagreement sources and test RLHF reward models on disagreement data. They show that a reward model trained on majority votes will clearly prefer one of the responses when presented with examples with diverging preferences. The authors further propose a new reward model that predicts the mean and variance of the reward distribution. The proposed reward model achieves better performance in distinguishing diverging from non-diverging preference examples.

Strengths

  1. The problem of annotator disagreement is an important one, as current model training neglects the inherent differences between annotators, which can lead to misalignment of the model. Pluralistic alignment of the reward model in RLHF has great potential.

  2. The authors not only reveal the misalignment of the reward models but also propose a new training objective to mitigate the problem. Experimental results show that the reward model trained with the new objective can better identify disagreement in the data.

Weaknesses

One problem with the paper is that details about the implementation and the motivation of the experiment design are either missing or relegated to the appendix, which makes the paper hard to follow. For example, it is not clear why the authors split the level of disagreement into High-Agreement Prefs, High-Agreement Ties, and so on.

  1. Why do High-Agreement Prefs require no rejection of the majority vote, but the High-Agreement Ties allow it?

  2. In lines 319-323, why is the mapping interval set like this? I believe the intervals could have a great influence on the reward model.

  3. The CDF estimation is an important detail for training and evaluating the reward model, which I think should be discussed in the main text.

  4. In line 348, I don't fully understand what you mean by "use the predicted joint probability of annotators labelling the response as a 1 or 5".

  5. In line 361, "using smaller differences as a predictor" is not informative. What do "smaller differences" mean exactly?

Questions

Please refer to the questions above.

Comment

Thank you for your notes and the great questions. We address each question individually below.

Q1: Why do High-Agreement Prefs require no rejection of the majority vote, but the High-Agreement Ties allow it?

  • We define High-Agreement preferences as those where no annotators rejected the majority-preferred response to distinguish them from Diverging Preferences where at least one annotator disagrees with the majority-preferred response. Our reward modeling experiments and evaluation metrics in Tables 2 and 3 are based around testing reward models on their ability to distinguish between these two cases.

  • In Table 2, we include High-Agreement Ties as a reference point to demonstrate that reward models predict significantly smaller differences in reward for instances where the majority of annotators labeled the instance as a tie, compared to examples with High-Agreement or Diverging Preferences. While we could compute the mean difference in reward only on examples that annotators unanimously agreed were tied to get a similar point of reference, we chose to consider instances with a majority-tie label due to the scarcity of unanimous ties (they occur in only 21 HelpSteer2 and 16 MultiPref test examples in total).

Q2: In lines 319-323, why is the mapping interval set like this? I believe the intervals could have a great influence on the reward model.

  • We map intervals this way following the intuition that response pairs whose difference in reward falls within some fixed range around 0 represent "ties" or "slight preferences", while larger differences constitute "significant preferences" (see the illustrative sketch below). We selected the specific range such that "ties" and each "slight preference" label cover equal-sized ranges of reward-difference values. We do not experiment with multiple interval mappings; however, we agree further tuning of these intervals may indeed yield greater performance, particularly if the dataset contains more preference-strength labels (e.g., slight/moderate/significant rather than simply slight/significant). We leave further investigation into the impact of setting these intervals and such additional preference-labeling settings to future work.
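For concreteness, here is a minimal sketch of what such an interval mapping could look like, assuming a five-way preference-strength label set and equal-width inner bins; the boundary values and function name are illustrative only, not the exact values used in the paper:

```python
def label_from_reward_gap(gap: float, width: float = 1.0) -> str:
    """Map a reward difference r_A - r_B to a coarse preference-strength label.

    The "tie" and each "slight preference" label cover equal-sized ranges of
    reward-difference values; anything beyond those inner bins counts as a
    "significant" preference. Boundaries here are illustrative only.
    """
    if gap > 1.5 * width:
        return "A significantly better"
    if gap > 0.5 * width:
        return "A slightly better"
    if gap >= -0.5 * width:
        return "tie"
    if gap >= -1.5 * width:
        return "B slightly better"
    return "B significantly better"


print(label_from_reward_gap(0.2))   # -> "tie"
print(label_from_reward_gap(-2.0))  # -> "B significantly better"
```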

Q3: The CDF estimation is an important detail for training and evaluating the reward model, which I think should be discussed in the main text.

  • Thank you for these notes. We incorporate these additional details into our main text.

Q4: In line 348, I don't fully understand what you mean by "use the predicted joint probability of annotators labeling the response as a 1 or 5".

  • Here, we compute the product of the probability assigned to the “1” and “5” labels. We will clarify this in our revisions.
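For concreteness, a minimal sketch of this computation with a hypothetical classification head over a 1-5 rating scale (the logits below are made up for illustration, not model outputs):

```python
import torch

# Hypothetical classification-head logits over the rating labels {1,...,5}
# for a single response.
logits = torch.tensor([1.2, -0.6, 0.1, -0.4, 1.0])
probs = torch.softmax(logits, dim=-1)

# Divergence signal: the product of the probability mass assigned to the
# two extreme ratings, p(rating = 1) * p(rating = 5).
divergence_score = probs[0] * probs[4]
print(float(divergence_score))
```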

Q5: In line 361, "using smaller differences as a predictor" is not informative. What do "smaller differences" mean exactly?

  • Following the equation earlier in that sentence, we compute the absolute difference in rewards for each response pair, |r_A − r_B|. When evaluating AUROC, we are evaluating the binary classification performance of using this value as a classifier over multiple threshold values t: if |r_A − r_B| < t, the example is classified as having diverging preferences, and it is classified as a high-agreement preference instance otherwise. We will further clarify this in our revisions.
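As a minimal sketch of this evaluation (the per-example rewards and divergence labels below are made up for illustration, not from the paper's data):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical rewards for each response in a pair, plus a binary label
# marking whether the annotators' preferences diverged on that pair.
r_a = np.array([1.8, 0.3, 2.5, 0.9])
r_b = np.array([0.2, 0.1, 2.4, 2.1])
is_diverging = np.array([0, 1, 1, 0])

# Smaller |r_A - r_B| should indicate divergence, so negate the gap to use
# it as the score for the positive (diverging) class. roc_auc_score sweeps
# over all thresholds t implicitly; no single t needs to be chosen.
print(roc_auc_score(is_diverging, -np.abs(r_a - r_b)))
```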
Comment

Dear Reviewer, the discussion period is coming to a close soon. We wanted to check if we have addressed your concerns, especially regarding the high-agreement ties and the mapping interval. We would be keen to use the remaining time to discuss improvements!

Comment

Thank you for your reply. I don't see my questions being fully addressed.

  1. High-Agreement Ties: I was asking why the High-Agreement Ties allow the rejection of the majority. Your answer is that this is done due to the scarcity of data, which in my opinion is not a good reason for the experiment design.

  2. Mapping Interval: Your answer is basically that you set this by intuition and didn't test other choices.

  3. Smaller difference: How did you decide the threshold t and did you discuss it in the paper?

In general, I do see the potential of the paper, but there are many unclear experiment details, the text descriptions are overly complex without clear meaning, and the experiments are not comprehensive enough.

Comment

Thank you for your response. We address each point individually below.

1. High-Agreement Ties:

We would like to clarify that the primary purpose of our experiments in Section 3 is to demonstrate that models assign similar rewards to examples with High-Agreement and Diverging Preferences. Only providing the results for High-Agreement Preferences and Diverging Preferences, however, leaves us with two natural follow-up questions: (1) While the predicted rewards for examples with High-Agreement and Diverging Preferences appear similar, what differences constitute a "similar" versus a "different" predicted reward? (2) What are the predicted rewards for examples that have neither Diverging nor High-Agreement preferences?

We include High-Agreement Ties and define them in this manner to answer both of these questions. (2) is answered because our "High-Agreement Ties" category comprises the bulk of the examples that have neither Diverging nor High-Agreement preferences. Our results on this category also answer (1) by demonstrating that the differences between model predictions on examples with High-Agreement Preferences vs. Diverging Preferences are minimal in comparison to model predictions on examples with High-Agreement Ties. While we expect that changing this category to one with stricter criteria (e.g., redefining it as "unanimous ties") would further exaggerate the difference between this category and High-Agreement + Diverging Preferences, this does not directly serve either of the two goals highlighted above.

2. Mapping Interval

Regarding the comment that "we set this by intuition", we would like to clarify that the idea of mapping preference strength (e.g., "significant" vs. "slight") to different reward-difference thresholds has been established in prior work exploring margin-based losses (Dubey et al., 2024b). When it comes to selecting the particular threshold hyperparameter values, we note in our response above that, just like with any other hyperparameter, we expect performance to improve with further tuning. Furthermore, these threshold hyperparameters will also affect additional hyperparameters (similar to how manipulating/normalizing the output scale affects learning rate, max epochs, etc.). Overall, as we note in Appendix A, we do minimal hyperparameter tuning for all systems, ensuring fair comparisons between systems. We will make note of these discussions in our revisions and leave further investigation of tuning these intervals to future work.

3. Smaller difference

We do not set a single threshold t. The AUROC (Area Under the Receiver Operating Characteristic curve) is computed by measuring the true positive rate versus the false positive rate of the classifier over all possible thresholds. For a more thorough description of this metric, see [1] and [2].

[1] https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc

[2] Plex: Towards Reliability using Pretrained Large Model Extensions

Dustin Tran, Jeremiah Liu, Michael W. Dusenberry, Du Phan, Mark Collier, Jie Ren, Kehang Han, Zi Wang, Zelda Mariet, Huiyi Hu, Neil Band, Tim G. J. Rudner, Karan Singhal, Zachary Nado, Joost van Amersfoort, Andreas Kirsch, Rodolphe Jenatton, Nithum Thain, Honglin Yuan, Kelly Buchanan, Kevin Murphy, D. Sculley, Yarin Gal, Zoubin Ghahramani, Jasper Snoek, Balaji Lakshminarayanan

arXiv, 2022

Review
Rating: 6

The paper studies disagreement in preference datasets and provides annotator-level annotations of two existing human preference datasets, HelpSteer2 and MultiPref. The paper first derives a taxonomy of causes for disagreement (Table 1) on a sample of 100 items. Then, the authors train two separate standard reward models using the majority vote preferences (Bradley-Terry and MSE Regression) and find that on examples with diverging preferences the predictions of reward models are biased towards high-agreement preferences. To address this gap, a model with distributional rewards (Mean-Variance Reward Model, with KL) is presented which uses a KL-divergence loss. The results show that the KL-based distributional model outperforms a Mean-Variance baseline model and better aligns with human preferences. Finally, the paper presents experiments in LLMs-as-a-judge evaluation, finding that they promote majority preferences.

Strengths

Positive aspects of the paper:

  • Releasing unaggregated annotator labels is of increasing importance, as there is increasing evidence that modeling the aggregate leaves out important human preference information. This supports similar earlier calls to do so - see Basile et al., 2021, Prabhakaran et al., 2021, Plank, 2022.
  • Studying reasons for disagreement is similarly important, the derived taxonomy is insightful. It extends prior taxonomies by providing categorizations for LLMs for refusal behavior, which is novel. The taxonomy further supports prior findings on reasons for disagreement Basile et al., 2021 and taxonomies of disagreement in NLP, which were developed by Jiang and de Marneffe, 2021 for NLI, and extended to other tasks, for example, subjective language Sandri et al., 2023 and law applications Xu et al., 2023.
  • Distributional rewards are timely. The paper presents a simple and concrete implementation. (The question remains whether code will be released upon publication).
  • The impact of diverging preferences on LLMs as judges is, to the best of my knowledge, novel. This is an important study showing that neglecting divergences propagates majority views and thus is in competition with pluralistic alignment (Sorensen et al., 2024).

Weaknesses

The paper's weaknesses are:

  • Evidence (lines 265-266): The claim that "reward models predict differences in rewards that resemble high-agreement preferences, even when trained on all annotator labels" is not convincingly supported. The scores for models trained on all labels vs. aggregated labels (All vs. Agg) are often similar. To substantiate this claim, the authors should extend Figure 2 and compare models trained on majority labels vs. all annotator labels on both datasets. Currently, Figure 2 only presents results for the model trained on aggregated labels and for a single dataset, illustrating that diverging preferences align with high-agreement items. For a stronger argument, similar plots should be included for both models and across datasets and discussed in the text.

  • Related Work: The field of disagreement in NLP has a substantial history, with early contributions such as de Marneffe et al., 2012, Poesio and Artstein, 2005 and more recent surveys like Uma et al., 2021. This paper could be improved by citing more of this foundational literature, including key work on developing taxonomies and understanding the underlying reasons for disagreement (suggested references below).

    • Reasons for disagreement in NLP and computer vision: see Uma et al., 2021 and references therein. Moreover, see further references on calls to release unaggregated labels in the first point under Strengths.
    • Taxonomies of disagreement: There exists seminal work by Jiang and de Marneffe, 2021, I wonder whether this paper was inspired by their work? It was taken up by several others, see further references in second point in Strengths.
  • Table 1: Do the frequencies in the two datasets sum up to 1? Is this per subcategory, or what is the overall frequency for each of the four top categories on MP and HS2?

  • Code release is not mentioned. Releasing the code would make some study design choices clearer (like the mapping above) and enable better replication of the results in the paper.

  • The paper could have included more recent and larger language models. For example, results for the LLama model family over different scales would be interesting. I invite the authors to discuss any potential challenges or limitations in applying the method to larger model families, or to explain why you chose to focus on this specific model (llama 8b instruct).

Questions

  • Unclear Mapping Explanation in the Mean-Var Reward Model (lines 319-323): The rationale for mapping the labels back to specific ranges is unclear—why is this mapping necessary? Would it not be possible to train directly on the distribution? Section 4 is quite dense, and additional explanation for this mapping would help clarify the distributional reward model.

  • Figure 1 disagreement analysis: The left plot for MultiPref seems to suggest a possible bias in the annotation interface, as there is a tendency of annotators to prefer B annotations over A (non-symmetric matrix, skewed histogram, and darker areas in the upper-right corner). What is your explanation for this? The HelpSteer dataset does not seem to show similar annotation behavior. I would like to hear your thoughts - it could be interesting to add this to the paper discussion. This also connects to my questions below asking for more information on the annotation setup.

  • Annotators of datasets: "MultiPref has 10k preference pairs with four annotators, HelpSteer2 has 12K with 3-5 different annotators." Can you say more about the identity of the annotators? Are the four in MultiPref the same individuals? If not, how many individual annotators are there in both datasets? How many annotations did each annotator provide on average? Do you release annotator IDs? Did you collect any annotator social information?

  • Refusal vs Refusal: Can you provide more detail on the original annotation tasks? For example, were annotators instructed to take a forced choice? Or was a "I don't know" option allowed?

  • Results in Table 4 LLMs-as-Judge: What are the scores in the table and what do they mean? Do they only compare to the majority preference ("winning response")? If so, I think it would be more interesting to compare to the human preference distribution. Thank you for clarifications.

  • What was the motivation for using an 8B Llama-Instruct model? Were there restrictions preventing the use of a larger model (70B)? Would you expect similar findings with a larger model? Which exact Llama model was used? (There exist by now several versions: 3, 3.1, 3.2.)

  • Will you release the code for the distributional preference models and the trained reward models?

Overall, I like the paper a lot and I am willing to raise my overall score. However, the results are dense and I have questions I would like the authors to answer. I look forward to hearing the answers to my questions above.

Typos:

  • "singe-value" in several places
Comment

Thank you for your thoughtful and thorough review! Due to space constraints, we split up our discussion of each individual comment and include answers to questions in the following response.

W1: Evidence for the claim that "reward models predict differences in rewards that resemble high-agreement preferences, even when trained on all annotator labels" (lines 265-266)

  • Our claims and analysis are based primarily on the results in Table 2, where we perform the suggested experiments by training Bradley-Terry and MSE-Regression reward models on both the HelpSteer2 and MultiPref datasets in two settings: (1) training with all annotator labels and (2) training with only the aggregated annotator label. In this table, we report the average difference in rewards predicted by each system on test instances with high annotator agreement and with diverging preferences. In Figure 2, for illustrative purposes, we provide the full histogram of differences in reward for a single trained reward model; however, our claims and analysis are based on comparing the mean reward differences in Table 2 rather than the visual similarity of the histograms in Figure 2. That said, we agree that having the full histograms for all reward models may be of interest to some readers and plan to include them in the appendix in our revisions.

W2: Missing Citations of Related Work

  • In our Related Works section under the “Annotator Disagreement in NLP” heading, we cite and discuss both of the suggested references listed: Uma et al., 2021 and Jiang and de Marneffe, 2021. Regarding the earlier works mentioned, thank you for pointing these out! In our revisions, we have added references to both de Marneffe et al., 2012 and Poesio and Artstein, 2005 to our Related Works discussion.

W3: Table 1: Do the frequencies in the two datasets sum up to 1?

  • The frequencies of each category do not sum to one, nor do the frequencies in each of the 4 high-level classes. As noted in Section 2.1, there are often multiple possible causes of diverging preferences for a given example, and each example is labeled with all categories that apply. As such, the frequencies in Section 2 sum to greater than 1, and we evaluate both Cohen's kappa (comparing full label-set equivalence) and Krippendorff's alpha with MASI distance when evaluating annotator agreement.
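For readers unfamiliar with this agreement measure over set-valued (multi-label) annotations, here is a minimal sketch using NLTK; the annotator IDs, item IDs, and category names are hypothetical and only loosely based on the taxonomy:

```python
from nltk.metrics.agreement import AnnotationTask
from nltk.metrics.distance import masi_distance

# Each triple is (annotator, item, label set). Labels must be frozensets
# so that MASI distance can reward partial overlap between annotators' sets.
data = [
    ("ann1", "ex1", frozenset({"task_underspecification"})),
    ("ann2", "ex1", frozenset({"task_underspecification", "response_style"})),
    ("ann1", "ex2", frozenset({"refusal"})),
    ("ann2", "ex2", frozenset({"refusal"})),
]

task = AnnotationTask(data=data, distance=masi_distance)
print(task.alpha())  # Krippendorff's alpha computed with MASI distance
```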

W4: Code release

  • We will release all code upon acceptance. We also plan to release the HelpSteer2-Disagreements annotations so that all datasets used in this work are publicly available.

W5: Concerns with extrapolation to recent, larger reward models

  • In our common reviewer response above, we supplement our findings by adding two SOTA reward models (the 2 best performing systems on RewardBench) to our reward modeling experiments. In these experiments, we demonstrate that our observations and findings are consistent across these additional SOTA reward models. For the full experiments and results, see our common response above.
Comment

As per the above comment, we address each question individually below.

Q1: Unclear Mapping Explanation in the Mean-Var Reward Model (lines 319-323)

  • We map intervals this way following the intuition that response pairs whose difference in reward falls within some fixed range around 0 represent "ties" or "slight preferences", while larger differences constitute "significant preferences". We selected the specific range such that "ties" and each "slight preference" label cover equal-sized ranges of reward-difference values. We do not experiment with multiple interval mappings; however, we agree further tuning of these intervals may indeed yield greater performance, particularly if the dataset contains more preference-strength labels (e.g., slight/moderate/significant rather than simply slight/significant). We leave further investigation into the impact of setting these intervals and such additional preference-labeling settings to future work.

Q2: Figure 1 disagreement analysis: Possible bias in the MultiPref annotation interface

  • This is an interesting point, and we agree that this is possibly due to biases in the annotation interface. Upon further inspection, it seems the primary cause of this trend is the significant difference in which LMs are sampled to generate the "A" responses versus the "B" responses during annotation. For example, "allenai/tulu-2-70b" comprises 24% of "A" responses, but only 10% of "B" responses. We will include this discussion and these additional details explaining these trends in our revisions.

Q3: Annotation details

  • A3: Annotator IDs are not released in either dataset. Both datasets recruit annotators who are fluent in English, and HelpSteer2 additionally requires that all crowdworkers are US-based. MultiPref also collects information regarding each annotator's education (i.e., have they obtained a bachelor's/graduate degree?) to determine worker expertise and to qualify workers. In total, MultiPref was annotated by 189 annotators recruited via Prolific, meaning that each annotator labeled an average of ~225 examples. HelpSteer2, in contrast, was annotated by a total of 1,000 different crowdworkers recruited via Scale AI, meaning annotators, on average, annotated ~75 examples each. We will include such details in our revisions.

Q4: Refusal vs Refusal: Can you provide more detail on the original annotation tasks? Was a "I don't know" option allowed?

  • Examples of diverging preferences in Refusal vs. Refusal responses are annotated following the standard preference-labeling annotation procedure for each dataset: annotators are presented with a prompt and two responses and are tasked with scoring/ranking their preference between the two responses. Examples of diverging preferences in Refusal vs. Refusal responses are identified as all instances where annotator preferences between the two responses diverged and both responses contained refusals to comply with the request in the prompt. Below we include one such Refusal vs. Refusal example from MultiPref where annotator preferences diverged:

    • Prompt: tell me why mike enzi hates people
    • Response A: I am not aware of any specific information indicating that Mike Enzi hates people. It is important to be mindful of assumptions and speculation when making statements about individuals. It is best to rely on factual information when forming opinions about someone's character or actions.
    • Response B: I cannot provide a response to that question as it is not appropriate to make assumptions or generalizations about individuals or their motivations. Mike Enzi, like any person, is a complex individual with his own unique experiences, beliefs, and values. It is not productive or fair to make sweeping statements about someone's character or motivations, especially when there is no evidence to support such claims. It's important to remember that people are complex and multifaceted, and it's not accurate or fair to make judgments about them based on limited information or personal opinions. It's important to treat others with respect and dignity, and to avoid making assumptions or generalizations about them. Instead of focusing on negative assumptions or stereotypes, it's more productive to engage with others in a respectful and open-minded way. This can involve learning about their experiences, perspectives, and values, and working to understand their point of view. By engaging in respectful dialogue and seeking to understand others, we can build stronger relationships and foster greater empathy and understanding.
  • Annotators for both the Helpsteer2 and Multipref datasets are instructed to skip examples where they are unsure, or where the instance should be flagged (e.g., it contains personal information). These skips, however, are not collected and we agree they could be potentially interesting as another source of possible supervision for future work.

Comment

Here, we address the remaining questions (5 & 6).

Q5: Results in Table 4 LLMs-as-Judge: What are the scores in the table and what do they mean?

  • Because many instances of diverging preference lack a majority preference (annotations are evenly split between either option), we do not compare LLMs-as-Judges against the majority. Instead, we simply measure how frequently the LLM-as-Judge determines that either response is better than the other (i.e., identifies a winning response) rather than predicting that they are tied. We evaluate that way to simply demonstrate that LLM-as-Judges are consistently identifying winning responses in cases of diverging preferences. In the remainder of Section 5, we examine what factors influence LLM-as-Judge decisions in such cases.

Q6: What was the motivation for using an 8B Llama-Instruct model? Could we have used a larger one?

  • Due to computational restrictions, we do not experiment with training our own 70B+ models. In our common response above, however, we supplement our reward modeling experiments with two SOTA reward models (the 2 best-performing systems on RewardBench). In these experiments, we demonstrate that our observations and findings are consistent across these additional SOTA reward models. See our common response above for details and results from these experiments.
Comment

Dear Reviewer, the discussion period is coming to a close soon. We wanted to check if we have addressed your concerns and questions, especially those regarding the use of larger models (see new results in general response) and clarifications regarding the annotation task. We would be keen to use the remaining time to discuss improvements!

Comment

Q1. Interesting. I agree that an investigation into the impact of setting these intervals is an interesting future work direction. My intuition is that it could be task specific.

Q2. Possible bias in the MultiPref annotation interface. Looks like you uncovered an LLM answer option bias. Good to see you will add a discussion on this in the manuscript.

Q3 and Q4. Thank you for the detailed response. I agree these are interesting other sources and details.

Comment

Thank you for your response and the additional results. I have increased the presentation score.

Comment

Thanks for your prompt response.

The full histograms (W1) have been added to the updated manuscript in Figures 4 and 5. We have also added our discussion of the apparent bias in MultiPref annotations (Q2) in the caption to Figure 1.

Regarding the missing citations (W2), apologies for the confusion. We have added discussions of Uma et al., 2021 as well as the other related works (Basile et al., 2021; Poesio and Artstein, 2005; de Marneffe et al., 2012) in Lines 519-524 of the updated manuscript.

Comment

Dear authors,

I respond to one block at a time.

I see that W3, W4 and W5 have been addressed. Thank you for the response.

W1: The full histograms will strengthen the paper.

W2: There is a discrepancy in what you wrote "we cite and discuss both" and the submission. Correct, Jiang and de Marneffe, 2021 is already cited -- I was wrong. However, Uma et al. 2021 is not cited.

Review
Rating: 6

This paper leverages the MultiPref dataset and the HelpSteer2 dataset to study the behavior of RMs/LLM-as-Judges on instances with diverging human preferences. The authors observe that traditional methods of training RMs (Bradley-Terry or MSE-Regression) fail to produce RMs that represent diverging preferences. Hence, they propose alternative methodologies, Mean-Variance Reward Models and Classification-based Reward Models, to train RMs that learn the distribution of preferences over responses instead of a single value. The presented methodologies show about a 10% improvement over past methods using AUROC as the metric.
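To make the summarized approach concrete, here is a minimal sketch of what a mean-variance reward head with a Gaussian KL objective could look like; this is an illustrative reconstruction under stated assumptions (class names, the KL direction, and the target construction are hypothetical), not the authors' actual implementation:

```python
import torch
import torch.nn as nn


class MeanVarianceRewardHead(nn.Module):
    """Predicts a Gaussian reward distribution instead of a single scalar."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, 2)  # -> (mean, log-variance)

    def forward(self, pooled_hidden: torch.Tensor):
        mean, log_var = self.proj(pooled_hidden).unbind(dim=-1)
        return mean, log_var


def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ), computed elementwise."""
    var_q, var_p = logvar_q.exp(), logvar_p.exp()
    return 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)


# Hypothetical usage: the targets are the mean and variance of the
# per-annotator ratings collected for each response; `pooled` stands in
# for the LM's pooled hidden states for a batch of four responses.
head = MeanVarianceRewardHead(hidden_size=16)
pooled = torch.randn(4, 16)
target_mean = torch.tensor([4.0, 2.5, 3.0, 1.5])
target_logvar = torch.tensor([0.1, 1.2, 0.4, 0.3]).log()

pred_mean, pred_logvar = head(pooled)
loss = gaussian_kl(pred_mean, pred_logvar, target_mean, target_logvar).mean()
loss.backward()
print(float(loss))
```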

Strengths

  1. While it is often intuitively accepted that RMs and LLM-as-Judges may exhibit biases and fail to reflect the diverse preferences of humans, this paper offers a systematic approach to identify and quantify these errors. Additionally, through a qualitative study, the paper provides a taxonomy to categorize the primary causes of preference divergence.

  2. The paper goes beyond just pointing out the problem to present two training methodologies to train models that better represent diverging preferences. The two methods aim to model the preference distribution instead of singular values and achieve a 10% performance improvement.

  3. The writing is clear and easy to understand.

Weaknesses

Please see the questions section.

Questions

  1. RMs and LLM-as-Judges are mostly integrated into training and evaluation pipelines as proxies for human preferences (they are rarely used alone). While the paper demonstrates that its proposed training methodologies improve RMs' ability to distinguish between instances with diverging preferences, it lacks discussion on the potential downstream impacts. Ultimately, can these new RMs create models better aligned with human preferences? Are they more effective evaluators for leaderboards that aim to reflect genuine human preferences?

  2. The paper lacks experimental evidence in defining the problem. While I agree with the importance of developing smaller, high-quality RMs, recent studies have shown that scaling up RMs yields better evaluators. Does the issue of failing to detect diverging preferences persist even with larger RMs? If the issue goes away with scaling, it might not be an issue for long once better and cheaper hardware becomes available.

  3. Section 5.1 highlights that LLM-as-Judges also struggle to identify instances with multiple preferences. However, could this issue stem from the prompting approach? The referenced LLMs rely on straightforward prompting techniques for judgments, which do not inherently account for multi-preference scenarios. Could more sophisticated prompting methods or multiple sampling iterations help address this limitation?

Comment

Thank you for your thoughtful review and comments! We address each of your questions individually below.

Q1: Downstream Impact:

  • In Section 5.1, we discuss how practitioners may desire different pluralistically-aligned behaviors from their LLMs. For example, some practitioners may want their language model to refuse to respond to prompts that a minority of annotators believe the system should refuse. Likewise, practitioners may want their LLM to ask users clarifying questions in cases of diverging preferences due to task ambiguity. In this section, we then demonstrate how existing LLM-as-Judge methods are biased against systems that exhibit such pluralistically-aligned behaviors.

  • To improve LLM evaluations, in Section 5.3 we suggest using our distributional reward models to improve existing LLM-as-Judge benchmarks by identifying and removing divisive prompts from their evaluation sets. The resulting benchmarks are, therefore, less influenced by practitioners' pluralistically-aligned design decisions. In Section 5.3, we also apply this to one existing benchmark, WildBench, and demonstrate that our distributional reward models effectively identify such divisive examples, and that systems generating pluralistically-aligned responses (e.g., asking to clarify an underspecified prompt, refusing to respond to a borderline unsafe/toxic prompt) are otherwise punished by receiving low scores for their responses.

Q2: How does scale impact our findings?

  • In our common reviewer response above, we supplement our findings by adding two SOTA reward models (the 2 best performing systems on RewardBench) to our reward modeling experiments. In these experiments, we demonstrate that our observations and findings are consistent across these additional SOTA reward models.

Q3: Alternative prompting methods for LLM-as-Judges for recognizing diverging preferences.

  • We do not experiment with novel prompting approaches for creating LLM-as-Judges that can recognize and appropriately evaluate examples with diverging preferences; however, we do provide experiments with the prompting methods from two different LLM-as-Judge benchmarks: MT-Bench/Arena-Hard (which uses a single, static prompt for all examples) and WildBench (where the authors curated unique prompts for each example). We found that both methods identify clear winners in cases where preferences diverge. While we propose that removing such divisive comparisons from these benchmarks is one way to improve these evaluations, improved prompting methods represent an exciting alternative approach for future work.
Comment

Thanks for the response. I see that Q1 and Q2 have been addressed and resolved my confusion. I have updated my scores accordingly.

Regarding Q3, I don't think the default prompts for MT-Bench/Arena-Hard fully embrace diverging preferences, so they won't work as a fair baseline. Also, considering that prompting an LLM is not a difficult, resource-intensive experiment, I think it would have been better to include a simple prompting-based experiment as a baseline.

Review
Rating: 5

The paper discusses a proposed distributional reward model aimed at addressing the issues of distinguishing between divided preferences and high-agreement preferences in reward modeling for language models (LLMs). It points out that standard reward modeling approaches, such as Bradley-Terry and MSE regression, fail to differentiate between these two types of preferences, leading to similar reward distributions and potential problems in multi-dimensional alignment when using Reinforcement Learning from Human Feedback (RLHF).

The authors outline two main objectives for their model: (1) identifying preferred responses and (2) detecting responses that may exhibit divided preferences. By achieving these objectives, the model aims to prevent the system from learning responses that reflect only a single user perspective. The authors argue that training this reward model is more cost-effective and efficient compared to obtaining multiple annotations for every data point.

To evaluate the model's performance, two metrics are introduced: Preference Accuracy, which assesses the model's ability to assign higher rewards to responses selected by human annotators, and Diverging ID AUROC, which measures the model's effectiveness in identifying divided preferences within response pairs.

The results, based on training and evaluation with the HelpSteer2 and Multipref datasets, indicate that the proposed distributional reward model performs effectively, consistently exceeding the baseline metrics for both Preference Accuracy and Diverging ID AUROC. This demonstrates that the proposed model can predict expected rewards while reflecting the controversy of responses as assessed by different annotators.

In the latter sections, the paper explores biases inherent in the evaluation of LLMs using the LLM-judge method, particularly when preferences are divided. It discusses how the LLM-judge’s assessment may unfairly penalize systems that reflect less popular opinions or that are trained with consistent policies in ambiguous scenarios.

Strengths

  • Identification of Problems: The proposed distributional reward model clearly identifies existing issues in the current methodologies.
  • Experimental Evidence: Strong experimental evidence is provided to support the effectiveness of the proposed model.
  • Well-Organized Structure: The paper has a well-organized structure, making it easy to follow.
  • Effective Use of Visuals: Tables and figures are effectively utilized to present experimental results.
  • Contributions to Multi-Dimensional Alignment: The research offers a new methodology for addressing the problem of multi-dimensional alignment in LLMs through the distributional reward model.

Weaknesses

  • Lack of Discussion on Limitations: There is insufficient discussion regarding the limitations and potential issues of the proposed method, particularly concerning the outdated model.
  • Complex Technical Details: Some technical details are explained in a complex manner, which may hinder understanding for non-experts.
  • Need for Practical Applicability Discussion: The paper lacks a thorough discussion on the practical applicability and limitations of the proposed approach, which could enhance its relevance and usability.

Questions

  • Discussion on Limitations: What specific limitations of the proposed distributional reward model should be addressed in future research?
  • Complexity of Technical Details: Which technical details were particularly complex, and how might they be simplified for better understanding?
  • Practical Applicability: What are the potential real-world applications of the proposed approach, and how could its limitations affect these applications?
  • Outdated Model Concerns: How do the findings with the Llama-3-8B Instruct model impact recent research?

Comment

Thank you for your thoughtful comments! We agree that our explanations and discussions of limitations and future work could be expanded, and we address each comment below and will include these additions in our revisions. Below, we also address "Outdated Model Concerns" by including results from current SOTA reward models in our analysis.

Q1: “Discussion on Limitations: What specific limitations of the proposed distributional reward model should be addressed in future research?”

  • A1: In this work, we propose a novel method for training and evaluating distributional reward models, and we demonstrate a use-case by using them to identify divisive examples in LLM-as-Judge benchmarks. Future research might explore alternative methods for using distributional reward models to not only detect divisive responses, but also train LLMs to generate pluralistically-aligned responses in such cases (discussed in Section 5). In our revisions, we will add such further discussions of future works and limitations.

Q2: “Complexity of Technical Details: Which technical details were particularly complex, and how might they be simplified for better understanding?”

  • A2: Several other reviewers have noted areas in which our writing can be simplified and improved. In particular, reviewers have suggested numerous ways we can improve our explanations of our distributional reward models and baseline methods in Section 4. We will incorporate such changes and more in our revisions. Furthermore, to support future efforts building upon our work, we plan to release our code upon acceptance.

Q3: “Practical Applicability: What are the potential real-world applications of the proposed approach, and how could its limitations affect these applications?”

  • A3: In Section 5.3, we demonstrate one method in which distributional reward models can be utilized to improve LLM-as-Judge benchmarks by identifying divisive prompts from these evaluation sets. Under the “Results and Recommendations” heading of Section 5.3, we suggest that such examples can be removed from these benchmarks to diminish the bias LLM-as-Judge exhibit against pluralistically-aligned responses to divisive prompts.

Q4: “Outdated Model Concerns: How does the findings with Llama-3-8B Instruction model impact recent research?”

  • A4: We address these concerns by adding two SOTA reward models (the 2 best performing systems on RewardBench) to these experiments in our common reviewer response above. In these experiments, we find that our findings remain consistent across these additional SOTA reward models. For the full results and descriptions of our experiments, see our general response above.
Comment

Dear Reviewer, the discussion period is coming to a close soon. We wanted to check if we have addressed your concerns, especially regarding your outdated model concerns, for which we ran additional experiments and provide new results. We would be keen to use the remaining time to discuss improvements!

Comment

How does reward model scale impact our findings?

We thank reviewers for their feedback and suggestions. We address individual reviewer’s comments separately, and here report additional baselines which were suggested by multiple reviewers.

In Sections 3 and 4, we demonstrate that standard single-value reward models fail to distinguish between high-agreement and diverging preferences, learning to predict similar reward distributions in both cases. Multiple reviewers, however, asked whether increasing the scale of standard single-value reward models might impact these findings. To address these questions, we repeat our reward modeling experiments in Tables 2 and 3 using two SOTA reward models, which achieve the two best scores on RewardBench (Described below). With these additional baselines, we demonstrate that our findings hold true for these large-scale, SOTA reward models as well. We further describe these additional baselines and results below.

Additional Large-Scale, SOTA Reward Models:

We supplement our single-value reward modeling baselines with the following SOTA reward models:

  • Skywork-Reward-Gemma-2-27B-v0.2: A Bradley-Terry reward model built on Gemma-2-27b-it that has been trained on a collection of publicly available datasets, including the aggregated labels from HelpSteer2.
  • Llama-3.1-Nemotron-70B-Reward: A reward model based on Llama-3.1-70B-Instruct that utilizes a novel approach combining standard Bradley-Terry and MSE-regression training methods, trained on the aggregated labels from HelpSteer2.

Due to computational constraints, we do not re-train these systems. Furthermore, as both systems are trained on different splits of HelpSteer2, we avoid test-train overlap by only evaluating on MultiPref.

[Section 3] SOTA Reward models make Decisive Decisions over Divisive Preferences

Repeating our experiments from Table 2, we find that the average predicted reward difference for each model on examples from each agreement split is:

  • Skywork-Reward-Gemma-2-27B-v0.2
    • High-Agreement Prefs: 0.840
    • High-Agreement Ties: 0.756
    • Diverging Prefs (All): 0.841
    • Diverging Prefs (Substantial): 0.832
    • All Examples: 0.821
  • Llama-3.1-Nemotron-70B-Reward:
    • High-Agreement Prefs: 7.330
    • High-Agreement Ties: 3.477
    • Diverging Prefs (All): 6.900
    • Diverging Prefs (Substantial): 8.026
    • All Examples: 6.149

Here, we see that our claims hold true for these significantly larger SOTA reward models, which predict similar differences in rewards in cases of High-Agreement and Diverging Preferences.

[Section 4] SOTA Single-Value Reward models fail to distinguish between Diverging and High-Agreement Preferences

Repeating our experiments from Table 3, we evaluate the accuracy and Diverging ID AUROC of each of our additional baselines on the MultiPref test set.

  • Skywork-Reward-Gemma-2-27B-v0.2:
    • Accuracy: 0.651
    • Diverging ID AUROC: 0.494
  • Llama-3.1-Nemotron-70B-Reward:
    • Accuracy: 0.638
    • Diverging ID AUROC: 0.400

We find that both SOTA reward models, despite not being trained on any in-domain data from MultiPref, are able to achieve comparable accuracy to the prior single-value reward model baselines we trained ourselves. Furthermore, we find that the Diverging ID AUROC performance also echoes the performance of our prior single-value reward model baselines, where systems are performing slightly worse than random chance.

AC Meta-Review

This work examines diverging preferences in human-labeled preference datasets used for training language models. The authors claim that current reward modeling approaches fail to distinguish between high-agreement and diverging preferences, and then propose a taxonomy of disagreement sources and analyze two datasets (HelpSteer2 and MultiPref). LLM-as-judge evaluations show bias toward majority preferences.

Strengths:

  1. The problem is important and timely and deserves more research attention
  2. Provides systematic analysis backed by empirical evidence

Weaknesses:

  1. Implementation details are sometimes unclear (or buried in appendices)
  2. Limited exploration of how findings generalize to larger models (though partially addressed in rebuttal)
  3. Some methodological choices lack thorough justification

The core ideas are sound and the problem addressed is important. But several aspects still need strengthening before publication: the methodological choices require more rigorous justification, and technical details need a clearer presentation in the main text. Therefore, while this paper addresses an important problem and also shows promise, I recommend revision before acceptance (which is rejection for ICLR).

Additional Comments on the Reviewer Discussion

The discussion period was quite active with substantive exchanges. The authors made good faith efforts to address concerns and provided valuable additional experiments, but some core methodological questions remain inadequately addressed. The paper would benefit from revision to incorporate these improvements before publication.

Final Decision

Reject