Reward Modeling with Ordinal Feedback: Wisdom of the Crowd
How to deal with different levels of preferences in RLHF: use quantitative labels.
Abstract
Reviews and Discussion
The paper extends the canonical setup of binary-feedback RM to ordinal feedback.
Questions For Authors
N/A
Claims And Evidence
See Other Strengths And Weaknesses.
Methods And Evaluation Criteria
I think more experimental results are needed to prove the superiority of the proposed method. For example, the authors could leverage the method in [1], setting up a separate RM as the oracle and testing the performance of the proposed method by evaluating its accuracy.
[1] Gao, L., Schulman, J., and Hilton, J. Scaling Laws for Reward Model Overoptimization. In International Conference on Machine Learning, PMLR, 2023, pp. 10835-10866.
Theoretical Claims
I didn't check all the proofs in detail, but the theorems provided seem reasonable and sound.
Experimental Design And Analyses
See Methods And Evaluation Criteria.
Supplementary Material
I didn't check all the supplementary material in detail.
Relation To Broader Scientific Literature
The paper provides a good vision for the current RM training.
Essential References Not Discussed
N/A
Other Strengths And Weaknesses
The theory seems fascinating. However, the empirical superiority of the proposed method over the binary baseline does not seem obvious from Table 2 and Figure 2.
Other Comments Or Suggestions
N/A
We thank the reviewer for all the comments. Here are our responses to the questions.
- “I think more experimental results are needed to prove the superiority of the proposed method. For example, the authors could leverage the method in [1], setting up a separate RM as the oracle and testing the performance of the proposed method by evaluating its accuracy.”
Thanks for the reviewer’s kind reminder. Actually, that is exactly what we have done in our work: setting up a trained RM as the gold-standard model to simulate human preference data. We apologize for the confusion caused, and we’ll improve our presentation by emphasizing that point in our future versions.
- “The theory seems fascinating. However, the empirical superiority of the proposed method over the binary baseline does not seem obvious from Table 2 and Figure 2.”
We’re happy that the reviewer appreciates our theory. For the empirical effectiveness of our method, we want to make some clarifications. The motivation behind our work is that (some of) the current practice of RM training simply discards all the fine-grained labeled samples (e.g., “slightly better”), even though many companies have already launched fine-grained feedback systems for human annotators. This waste of precious human-annotated data motivates us to ask whether there is a better way to use that part of the data than simply discarding it. As for Figure 2, our comparisons of different levels of feedback (binary, 3-level, 5-level, and oracle) are made under the same amount of data. However, we can use more data than the current throw-away solutions. For example, if 80% of the labels are binary and 20% are fine-grained (e.g., “tied”), the current practice uses only the 80%, while our approach uses all of the data. We further show via Table 2 that incorporating some “tied” samples doesn’t harm (and even benefits) RM learning. In short, our approach makes improvements in two ways: decreasing the noise and increasing the effective sample size.
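To make the mechanism concrete, here is a minimal sketch of how an ordinal label can enter the standard pairwise cross-entropy loss as a soft target (this is our illustration for the rebuttal, not the exact code of the paper; in particular, the numeric values attached to each ordinal level below are only one possible choice):

```python
import torch
import torch.nn.functional as F

def ordinal_rm_loss(reward_a: torch.Tensor,
                    reward_b: torch.Tensor,
                    soft_label: torch.Tensor) -> torch.Tensor:
    """Pairwise reward-model loss with ordinal (soft) targets.

    soft_label is the target probability that response A is preferred,
    e.g. 1.0 / 0.75 / 0.5 / 0.25 / 0.0 under a 5-level scheme,
    with 0.5 encoding a tie. With labels restricted to {0, 1} this
    reduces to the usual binary Bradley-Terry cross-entropy.
    """
    logits = reward_a - reward_b  # Bradley-Terry logit r(A) - r(B)
    return F.binary_cross_entropy_with_logits(logits, soft_label)

# Example: one "A slightly better" pair and one "tied" pair.
r_a = torch.tensor([1.2, 0.3])
r_b = torch.tensor([0.9, 0.8])
labels = torch.tensor([0.75, 0.5])
loss = ordinal_rm_loss(r_a, r_b, labels)
```

Under this view, the “tied” samples that current pipelines discard simply become training pairs with target 0.5 rather than wasted annotations.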
We thank the reviewer again for the time, and we’re glad to refine our draft according to these suggestions. Please let us know if any further questions arise in the following week; we will get back to you promptly.
The paper examines reward learning in scenarios where annotators provide ordinal feedback. Specifically, annotators select an option from a predefined feedback set, which, in the common case of binary feedback, contains just the two options “A is preferred” and “B is preferred”. The authors first offer a statistical justification for the benefits of this extension, arguing that compared to binary feedback, more granular feedback can reduce the Rademacher complexity when using the cross-entropy loss. They then support this claim with empirical evidence, using simulated fine-grained feedback to demonstrate the advantage.
Questions For Authors
Please discuss the weaknesses I mentioned. I'll update my score if I find your response convincing.
Claims And Evidence
I'm particularly concerned about the following claims:
- The authors assert that Assumption 3.1 (wisdom of the crowd) is sufficient for obtaining meaningful comparisons. This is certainly not the case. I'll elaborate further when discussing the theoretical claims, but intuitively, one can always expand the set of possible feedback options by introducing new options that the user never selects. This expansion has no impact on anything and still satisfies Assumption 3.1. Similarly, I can construct a larger set of options that contains less information and still meets Assumption 3.1.
- Theorem 3.2 is initially presented to establish that a feedback system satisfying Assumption 3.1 exists. While this is true (and fairly straightforward), the authors later use the construction in this theorem to simulate ordinal feedback. I found no justification for this step, which I will further discuss under experimental design.
- A minor point, but I believe the authors are overinterpreting Theorem 4.6 and Corollary 4.7. What these results suggest is a possible reduction in Rademacher complexity when transitioning from binary to ordinal feedback. However, the extent of this reduction remains unknown; it could even be zero. I think it has never been in doubt that more granular feedback can make learning easier; the key question is by how much, and these theoretical results do not provide an answer.
Methods And Evaluation Criteria
I believe the choice of datasets, models, and evaluation criteria was appropriate.
Theoretical Claims
I checked the main theoretical results, including Proposition 4.3, Corollary 4.4, Theorem 4.6, and Corollary 4.7. While the proofs appear sound, I have doubts about their usefulness.
In particular, the argument for hierarchical expectations in Proposition 4.3 raises concerns. To establish the existence of hierarchical expectations, we need to specify the coupling parameters. Condition (a) provides one set of linear equations, while Condition (b) contributes another (with one redundancy, since the probabilities sum to one and each row of the coupling matrix also sums to one). Moreover, at least one further equation is dependent: combining (a) and (b), we see that the expectation is the same under both feedback systems. Consequently, the number of independent linear constraints is smaller than the number of free parameters, implying that a suitable coupling can always be found for any feedback distribution. I might be missing something here, but this suggests the result is somewhat trivial.
A more fundamental issue with the theoretical results is that we can arbitrarily expand the feedback set by introducing new options that are chosen with zero probability. Such an expansion preserves Assumption 3.1 and implies that the size of the feedback set should not inherently matter in comparisons. Given this, I do not see how the authors justify their claim in Section 5.2 that 5-level feedback is necessarily better than 3-level feedback.
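Concretely (in my notation, not necessarily the paper's): if the feedback variable $y$ takes values $o_1,\dots,o_m$ with probabilities $p_1,\dots,p_m$ and Assumption 3.1 amounts to the marginal-unbiasedness condition $\mathbb{E}[y]=p^{*}$, then appending an option $o_{m+1}$ that is chosen with probability zero changes nothing:

$$
\mathbb{E}[y'] \;=\; \sum_{j=1}^{m} p_j\, o_j \;+\; 0\cdot o_{m+1} \;=\; \mathbb{E}[y] \;=\; p^{*},
$$

so the enlarged feedback set satisfies the assumption just as well, regardless of its size.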
Experimental Design And Analyses
My primary concern with the experimental results is the construction of ordinal feedback. As mentioned earlier, Theorem 3.2 establishes the existence of a feedback system, but I see no justification for using it to simulate ordinal feedback. In fact, I believe this is a poor choice, as users almost deterministically select one of the closest feedback options.
A more reasonable approach would be to use an established ordinal feedback model. For instance, the Luce-Shepard model, which generalizes the Bradley-Terry model mentioned in the paper, seems like a natural alternative. It would be helpful to understand why the authors did not consider such a model for their simulations.
I'm also a bit concerned that the results do not show a significant difference in-domain. Maybe having more fine-grained feedback is not always as helpful as we thought.
Supplementary Material
I did not.
Relation To Broader Scientific Literature
I believe the paper closely relates to soft labeling, and the authors have effectively clarified that connection.
Essential References Not Discussed
The authors have covered related work to the best of my knowledge.
As a minor point regarding references, I wouldn't call DPO and direct alignment methods RLHF, as the introduction calls them.
Other Strengths And Weaknesses
The question of whether more fine-grained feedback is necessary, and to what extent, has been on my mind for some time. I believe this paper addresses an important problem. However, I would have liked to see stronger justifications or a clearer quantification of the value of ordinal feedback, particularly through a more well-grounded model of human annotators.
Other Comments Or Suggestions
Section 4.2 can be removed without changing the story.
There are multiple typos in the paper, such as "Omega" appearing where the symbol Ω is intended, and lowercase notation used where a random variable is meant. I encourage the authors to do another pass on the writing.
Thanks for the valuable questions. Before we dive into the rebuttal, we’d like to summarize the reviewer’s major concerns.
- The construction of ordinal feedback via Theorem 3.2 is not justified; other reasonable options (e.g., the Luce-Shepard model) have not been considered.
- Assumption 3.1 alone is insufficient to guarantee that expanding the feedback set is meaningful, since one can always introduce vague options that are never chosen.
- Theorem 4.6 and Corollary 4.7 prove that ordinal feedback may bring some benefits, but they do not fully characterize how large those benefits are.
- The linear system defining the coupling parameters is underdetermined: there are more free variables than independent linear constraints, which makes Proposition 4.3 seemingly trivial.
For Point 1, the answer is that the Luce-Shepard model is not considered because it cannot induce a natural reward function (RM). Algorithms like RLHF need a reward score for every prompt-response pair in the subsequent policy-gradient steps. That reward function is obtained via pairwise comparison in the BT model: two reward scores determine the two preference probabilities. But once multiple levels of feedback are considered, those probabilities no longer correspond naturally to two reward scores. For example, the Luce-Shepard model assigns a separate score to each feedback option and expresses the option probabilities as a softmax over those scores; how to recover two response-level reward scores from three option-level scores remains unclear, as the additional “tied” term cannot be neglected. Our work constructs a simple plug-in solution for such multi-level feedback systems while keeping only two score functions. Due to practical limitations, we have to turn to RMs rather than real human annotators to simulate the labeling process; Theorem 3.2 only shows how this simulation should be conducted rather than modeling a single annotator’s behavior (e.g., whether s/he chooses the closest option or not). We admit that the simulation process may not fully prove the effectiveness of our method in the real world, but it is the best one can do without actually hiring annotators.
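For concreteness, here is a schematic contrast (the notation below is ours for this response, not necessarily that of the paper): the BT model ties the pairwise probability directly to two response-level rewards, whereas a Luce-Shepard-type model assigns one score per feedback option, including the tie:

$$
\text{BT:}\quad \mathbb{P}(A \succ B) \;=\; \frac{e^{r(A)}}{e^{r(A)}+e^{r(B)}},
\qquad
\text{Luce-Shepard:}\quad \mathbb{P}(\text{option } k) \;=\; \frac{e^{s_k}}{\sum_{j} e^{s_j}},\quad k \in \{A\succ B,\ \text{tie},\ B\succ A\}.
$$

It is the step from the three option-level scores $s_k$ back to the two response-level rewards $r(A)$ and $r(B)$ that has no canonical answer.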
Points 2 and 3 are about the same question: how large is the benefit of ordinal feedback? We thank the reviewer for encouraging us to think deeper. The reduction of the Rademacher complexity (Theorem 4.6 and Corollary 4.7) is a direct result of Jensen’s inequality. The answer to the question is: the reduced quantity is approximately of the same order as the conditional variance in the hierarchical coupling. For example, suppose there are two ordinal feedback variables $y_1$ and $y_2$, where $y_1$ is a hierarchical expectation of $y_2$ under the coupling $(y_1, y_2)$, i.e., $y_1 = \mathbb{E}[y_2 \mid y_1]$. Then the Jensen gap is of the same order as the conditional variance $\mathbb{E}[\mathrm{Var}(y_2 \mid y_1)]$ (see https://arxiv.org/pdf/1707.08644 for example), assuming the supremum in the Rademacher complexity, viewed as a function of the feedback $y_1$ (or $y_2$), is twice differentiable, $\mu$-strongly convex, and $L$-smooth (which can be satisfied in some non-restrictive cases). We are glad to include more details in our future versions. This conditional variance plays a critical role in the reduction of noise: if the hierarchical expectation is constructed by introducing options that would never be chosen, then the conditional variance is zero (since $y_2 = y_1$ almost surely). But it is very unlikely that all annotators would avoid a “tied” option for all samples, and hence the conditional variance is always non-zero in practice, indicating a strict benefit of ordinal feedback.
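For reference, here is the standard form of the bound we are invoking, written in our own notation (the constants depend on the strong-convexity parameter $\mu$ and the smoothness parameter $L$ of the function $\varphi$ inside the expectation): with $y_1 = \mathbb{E}[y_2 \mid y_1]$,

$$
\frac{\mu}{2}\,\mathbb{E}\big[\mathrm{Var}(y_2 \mid y_1)\big]
\;\le\;
\mathbb{E}\big[\varphi(y_2)\big] - \mathbb{E}\big[\varphi(y_1)\big]
\;\le\;
\frac{L}{2}\,\mathbb{E}\big[\mathrm{Var}(y_2 \mid y_1)\big],
$$

so the Jensen gap vanishes exactly when the conditional variance does, i.e., when the finer options carry no extra information.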
The answer to the last question is that there are limitations on the range of the coupling parameters. The feasible region of each parameter is $[0,1]$, as it stands for a conditional probability. The existence of solutions to the underdetermined linear system may be trivial over the reals, but the system can become infeasible once the parameters are restricted to $[0,1]$. For example, not every binary feedback distribution can arise as a hierarchical expectation of a given 3-level feedback distribution.
For the minor point about the significance of the improvement, an important point is that we compare different levels of feedback under the same amount of data. However, many current pipelines still throw away all the “tied” samples (e.g., Llama 2/Llama 3). We believe those thrown-away “tied” samples would benefit the learning process even within the same amount of data (Table 2), not to mention that including them would further increase the data volume.
We thank the reviewer again for the suggestions and for carefully reading our manuscript. We look forward to further discussions in the following week.
Thanks for the clarifications. I now see your point about why Theorem 3.2 is quite different from Luce-Shepard. That being said, I’m still not satisfied with using the construction in Theorem 3.2 for simulations—it feels like a very unnatural plug-in. The new discussion on the magnitude of the benefits in Theorem 4.6 is interesting. I was also expecting strong convexity to be necessary for such results. Overall, the authors have addressed my concerns to some extent, but I still tend to lean toward rejection, primarily because I’m not convinced by the simulation results. I would also like to see the new theoretical results presented in the paper so I can read and verify them.
Many thanks for the follow-up comments. We’d like to make a few more clarifications.
“I’m not convinced by the simulation results.”
We believe this is a minor point, but we’d still like to quickly explain why we designed the simulation this way; in fact, such a simulation setup is inevitable not only for us but also for any future work studying the ordinal feedback system for reward learning. We’d also like to clarify the contributions of our work in the following two aspects:
- To show the benefits of finer feedback. This part is supported theoretically by the reduction in Rademacher complexity and empirically by comparing how efficiently the reward model is learned under different levels of feedback. The empirical part rests on a comparison across feedback levels, and we note that this has to be done through simulation rather than real datasets, for us or for anyone else: although there are various preference datasets of different feedback granularities (e.g., binary, or 3-level with ties), each dataset comes in only one granularity, so there is no way to compare different feedback levels within the same dataset. One therefore has to use simulation to sample two parallel datasets simultaneously -- one under finer feedback and the other under coarser feedback. In addition, we show in Theorem 3.2 how the simulation should be conducted (see the sketch after this list for a concrete illustration).
- To provide an easy way to fully use the current fine-grained datasets and a first framework for understanding reward learning under such finer granularity. There are other works that focus on finer feedback (e.g., Reward Learning From Preference With Ties and A Statistical Framework for Ranking LLM-Based Chatbots), but they are quite different from ours. The first difference is that they aim to model each option’s probability, obtaining improvements when predicting the chances of a tie and so on. However, the ultimate goal of learning a reward model is to guide the alignment of LLMs, and for the methods proposed in those works, incorporating that knowledge into the downstream alignment task may require further effort (e.g., how to best tune the tied ratio in the BTT model). Comparatively, our goal is to provide a theoretically supported way to handle fine-grained feedback without introducing further ingredients (e.g., using 0.5 in the cross-entropy loss when meeting a tie). Our framework is easily compatible with current reward-learning codebases and with the downstream alignment of LLMs, such as RLHF, rejection sampling-based SFT, or DPO. In this respect, we hope this work can serve as a plug-in solution for the industry.
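Below is a small, purely illustrative sketch of the parallel sampling mentioned in the first point above (the function name and the two-point construction are ours for this discussion; the paper's Theorem 3.2 specifies the actual scheme): given an oracle preference probability, we draw one coarser (binary) and one finer (5-level) label whose marginal expectations both match the oracle.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_ordinal_label(p_oracle: float, levels: np.ndarray) -> float:
    """Draw one ordinal label from `levels` (sorted, in [0, 1]) whose
    expectation equals the oracle preference probability p_oracle.

    A simple two-point mixture between the two levels bracketing
    p_oracle; an illustrative construction only.
    """
    idx = int(np.searchsorted(levels, p_oracle))
    hi = levels[min(idx, len(levels) - 1)]
    lo = levels[max(idx - 1, 0)]
    if hi == lo:
        return float(hi)
    w = (p_oracle - lo) / (hi - lo)  # P(choose hi) so that E[label] = p_oracle
    return float(hi if rng.random() < w else lo)

# Parallel datasets for the same pair: coarser (binary) vs. finer (5-level).
p = 0.62  # oracle P(A preferred over B), e.g. from the gold-standard RM
binary_label = sample_ordinal_label(p, np.array([0.0, 1.0]))
five_level_label = sample_ordinal_label(p, np.array([0.0, 0.25, 0.5, 0.75, 1.0]))
```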
“I would also like to see the new theoretical results presented in the paper so I can read and verify them.”
- We have uploaded a 4-page proof of the variance-based Jensen’s gap result via this link (https://anonymous.4open.science/r/icml25-anonymous-4r32/rebuttal_theory.pdf). We’d be happy and grateful if you could take a look at these new results. We’d also like to summarize our theoretical contribution again: we show that a more fine-grained feedback system can reduce the “variance”. With the generalized version of the BT model’s assumption (Assumption 3.1), we can get rid of the “bias” term, and the benefit is exactly the reduced “variance” (thanks again for encouraging us to think one step further). This analysis is novel, and it also provides new and independent insights into the seemingly unrelated methods of knowledge distillation/soft labelling by giving concrete proofs on the reduction of variance.
This paper proposes a new framework for reward learning with ordinal feedback, which uses more fine-grained preference pairs generalized from binary preferences. Using a generalized probability model, the authors prove that the ordinal feedback framework reduces the Rademacher complexity. The authors also use this theoretical argument to render a new bias-variance tradeoff for soft labels in knowledge distillation. Empirically, it is shown that fine-grained feedback improves the reward model’s accuracy both in distribution and out of distribution.
Update after rebuttal
Thanks to the authors for their detailed responses. My questions and concerns are mostly addressed, and I will keep my original rating.
Questions For Authors
- Apart from being in the domain of RLHF, what is the difference between your framework and training on soft labels in knowledge distillation?
- How does upsampling the conflicting labels in the preference dataset relate to this ordinal method?
Claims And Evidence
Yes. The paper provides detailed explanation on the problem formulation, theoretical statements, and numerical experiments.
Methods And Evaluation Criteria
Using fine-grained feedback pairs for reward learning is a reasonable approach to the problem.
Theoretical Claims
Yes, I checked the proofs for Theorem 3.2 and Theorem 4.6 and did not find any major issues.
Experimental Design And Analyses
Yes, the experiments are done with the standard reward learning pipeline and seem fine to me.
Supplementary Material
Partially. Only the aforementioned proofs.
Relation To Broader Scientific Literature
Improving granularity of feedback pairs seems a reasonable approach for addressing the shortcomings of current reward learning frameworks. The authors provide both theoretical and empirical justifications for the benefits of their ordinal feedback framework.
Essential References Not Discussed
No.
Other Strengths And Weaknesses
Regarding weaknesses of the paper, one problematic aspect is the way the data are collected. The ordinal feedback is supposed to be obtained from human labeling, but in the empirical simulations it is in fact generated from a teacher reward model. In this case, it is hard to distinguish the proposed framework from knowledge distillation.
Other Comments Or Suggestions
Minor typos:
Line 223: uncompiled latex control sequence Omega
Line 366: “tide” should be “tied”.
We thank the reviewer for appreciating our work and the interesting questions.
- Apart from being in the domain of RLHF, what is the difference between your framework and training on soft labels in knowledge distillation?
The major difference between our work and knowledge distillation from trained RMs is the role of those RMs. In our work, we use the trained RMs’ soft labels to simulate the underlying belief of a population of annotators. Ideally, we would like the data to be collected from real human annotators, but current open-sourced human preference datasets do not include multiple levels of feedback simultaneously, which makes it hard to show the benefits of more fine-grained feedback numerically. We have to resort to oracle RMs to simulate those multiple levels of feedback; in practice, for LLM companies, such ordinal feedback should be collected from real human annotators. Using RMs to simulate human annotators’ feedback is perhaps not the best way to convince the audience, yet it is probably the only feasible method and has been applied in many papers (e.g., “Scaling Laws for Reward Model Overoptimization”). We are not hoping to improve over those trained RMs via knowledge distillation. Instead, we view those RMs as gold-standard models (“oracles”) with which to evaluate the benefits of fine-grained feedback. In contrast, knowledge distillation aims to obtain a better student model via the soft labels of the teacher model. Such a process, although closely related to our theoretical results (Theorem 4.9), has a different goal from our work. Our work aims to show that (1) fine-grained ordinal feedback can be put to good use by endowing it with a numerical meaning rather than just throwing it away, and (2) by providing the population with ordinal feedback options, there are some provable benefits (under the “wisdom of the crowd” assumption).
- How does upsampling the conflicting labels in the preference dataset relate to this ordinal method?
We thank the reviewer for this question. In our opinion, the samples with conflicting labels are those with stronger noise or larger variance (that is, a preference probability close to 0.5 in our framework). If we treat preference learning as a binary classification problem (deciding which response is preferred), the noisy parts are those close to the underlying decision boundary. Upsampling that part is a way to improve learning on those “hard cases”, reducing the noise by averaging. In that sense, our ordinal feedback method shares the same spirit of decreasing the noise but through a different approach: our goal is to prevent the noise caused by the annotation system itself. For example, if the annotation system only contains binary options and annotators are faced with a sample whose two responses are of very similar quality, then the probability of each response being chosen is close to one half, leading to conflicting labels in the dataset. Many of these conflicts could be avoided if a third option, “tied”, were provided. Instead of upsampling those “hard cases” under the same feedback system, we point out that learning can also be improved by designing a better feedback system.
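As a back-of-the-envelope illustration (our own arithmetic, not a result from the paper): if the two responses are of essentially equal quality, each binary annotation lands on either side with probability $p \approx 1/2$, so

$$
\mathbb{P}(\text{two independent binary labels disagree}) \;=\; 2p(1-p) \;\approx\; \tfrac{1}{2},
$$

and offering a “tied” option removes this forced disagreement at the source instead of averaging it out afterwards.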
We thank the reviewer again for the questions and suggestions. Also, we have corrected those typos. We are happy to have further discussions in the following week.
This paper proposes Reward Modeling with Ordinal Feedback as an alternative to traditional binary feedback in reward modeling. The authors argue that binary preference data discards valuable information, such as subtle distinctions between choices and tied responses. They introduce an ordinal feedback framework that incorporates a marginal unbiasedness condition, justified by the sociological concept of wisdom of the crowd. The paper provides a theoretical foundation demonstrating that ordinal feedback reduces Rademacher complexity, improving generalization. Empirical experiments validate that fine-grained feedback enhances reward model performance in both in-distribution (ID) and out-of-distribution (OOD) settings, and incorporating a certain proportion of tied responses further improves model training.
Questions For Authors
- Considering that the RM is a core component of RLHF, what is the ultimate goal of generalizing RMs to cover 'hard cases' and improving the expressiveness of 'rewards'?
- How does the theoretical work make a wider impact on optimization algorithms and human-in-the-loop operations, e.g., in an active learning setting?
Claims And Evidence
The major claims below are supported through theory and experiments.
- Ordinal feedback improves reward model learning: this is supported by theoretical results showing reduced Rademacher complexity and empirical experiments demonstrating better accuracy than binary feedback.
- Ordinal feedback connects to knowledge distillation: a bias-variance trade-off analysis links the benefits of soft labeling to ordinal feedback.
Methods And Evaluation Criteria
The methods and evaluations are solid, but comparisons to other preference modeling techniques (e.g., DPO variants) would strengthen the findings.
Theoretical Claims
The proofs appear solid, leveraging convex analysis and coupling arguments.
- The generalization issue is justified by Rademacher complexity.
- The connection to knowledge distillation is justified by a bias-variance trade-off analysis.
Experimental Design And Analyses
The authors conduct ablation studies with 2 base models and 4 feedback types (binary, 3-level, 5-level, oracle). There is sufficient evidence and analysis supporting that finer-grained labels enhance learning. However, using only 2 base models appears not very solid for the LLM community, where more types of foundation models exist.
Supplementary Material
Yes, I have reviewed the supplementary material where source code is presented. It seems that the code repo is clean and complete, however, I'm not in a position to reproduce the experiments and verify its correctness.
Relation To Broader Scientific Literature
The paper extends Bradley-Terry models and connects to RLHF reward modeling. It (re)-addresses the issues in BT models and the importance of ordinal feedbacks, soft labeling and knowledge distillation techniques in the proposed framework.
To my knowledge, there is another line of work searching for generalized RM models. In a broader landscape, RLHF is exposed to an 'intransitivity' risk, because it relies on the Bradley-Terry model as the preference function, where all preferences are transitive by assumption.
- Some literature has shown that such a 'transitive' relationship between preference annotations may not always hold; some techniques explored there are not mentioned in this paper.
- https://arxiv.org/abs/2409.19325 (Duan et al, 2017) presented some evidence and can be of interest for future work.
Essential References Not Discussed
This paper covered its essential references.
Other Strengths And Weaknesses
See the discussion above.
Other Comments Or Suggestions
See the discussion above.
We thank the reviewer for the kind comments on our work. Here are our responses to the reviewer’s concerns and questions.
- The experiments only contain 2 base models, which appears not very solid for the LLM community.
Thanks for pointing out that issue. We have conducted some further experiments on Qwen 2.5 base models, and the updated results can be found in this anonymous link: https://anonymous.4open.science/r/icml25-anonymous-4r32. We apologize for not including further foundation models due to the limitations of time and computational resources. We are glad to add more results to our future versions.
- The literature on the “intransitivity” risk of the BT model during RLHF is not mentioned in this paper.
Thanks for the reviewer’s notice. The term “intransitivity” describes cases with cyclic relations in the pairwise preference comparisons. The literature on the intransitivity risk of RLHF (e.g., the paper the reviewer mentioned) mainly provides alternative ways to model the preference probabilities (e.g., the Blade-Chest model) rather than a scalar reward score as in the BT model. However, our paper aims to provide a plug-in solution to fully use fine-grained feedback within the same framework of a scalar reward function. In that sense, the intransitivity of RLHF is somewhat “orthogonal” to our work. But we are glad to include more discussion of the intransitive case of preference learning in our future versions, and we thank the reviewer for pointing it out.
- Considering that the RM is a core component of RLHF, what is the ultimate goal of generalizing RMs to cover “hard cases” and improving the expressiveness of “rewards”?
Thanks for the question. As we understand it, the “hard cases” refer to those samples whose preference probability is close to 0.5. In our interpretation, those “hard cases” are discarded by (some of) the current practice even when options such as “tied” are introduced, while such cases do have a positive chance of appearing in the subsequent RLHF step. For example, LLMs’ responses to some prompts may be intrinsically hard to distinguish, making almost all the pairwise comparisons for those prompts impossible to learn. Discarding those “hard cases” might cause the reward function landscape to be unpredictable and undesirable on those prompts, deteriorating the RLHF quality (e.g., the over-optimization issue). For the second part (improving the expressiveness of “rewards”), we believe it is a promising direction for future LLM alignment studies, considering the subtlety and diversity of human preferences. Researchers have made several attempts to handle preferences beyond a scalar reward function, including multi-dimensional scores (e.g., HelpSteer2 includes “helpfulness”, “correctness”, and so on) and the intransitive model mentioned by the reviewer. We believe the ultimate goal of such research is to train LLMs to better align with subtle and diverse human values.
- How does the theoretical work make a wider impact on optimization algorithms and human-in-the-loop operations, e.g., in an active learning setting?
It’s an interesting question to see how this work is linked to active learning algorithms. For example, the uncertainty sampling algorithm originates from the idea that “samples close to the decision boundary are more valuable for learning the decision boundary”. In that sense, when training RMs, uncertainty sampling would require the learner to query more samples “near the boundary”, that is, with preference probabilities close to 0.5, and expect a better outcome. However, that might not be the case in RM training, as we have shown that when the tied ratio increases to 100%, the training result becomes worse (e.g., Table 2 and Figure 3). Another interesting family of active learning algorithms is the representation-based ones (e.g., CoreSet or MaxHerding), which require the queried samples to cover the feature space as much as possible. In that sense, our results imply that covering some samples close to the boundary (the “tied” ones) would benefit the learning (Table 2 and Figure 3) compared to the current practice of simply discarding these samples. We believe there are also other possibilities for improving the learning of RMs once all the fine-grained labels are included.
We are grateful for the questions and suggestions. We are glad to discuss with you in the following week if there are any further questions/comments.
The paper presents a practical, theoretically grounded, and easy-to-integrate framework for improving reward modeling by using ordinal feedback. While the experimental setup relies on simulation due to current data limitations, the justification is reasonable, and the results are clear. Given its relevance to RLHF and LLM alignment, along with a well-supported rebuttal, this paper provides valuable insights for both theory and practice.