General Preference Modeling with Preference Representations for Aligning Language Models
Summary
Reviews and Discussion
The paper proposes a novel approach to modeling human preferences for enhancing the alignment of foundation models with human values. Traditional methods, such as the Bradley-Terry (BT) reward model and supervised pair preference models (PairPM), have limitations in expressiveness, consistency, and computational efficiency. The authors introduce preference representation learning, which embeds responses into a latent space to capture complex preference structures, achieving linear query complexity. They also propose a preference score-based General Preference Optimization (GPO) method, extending reward-based reinforcement learning from human feedback. The experimental results demonstrate that the proposed General Preference Model (GPM) outperforms the BT reward model on the RewardBench benchmark by up to 5.6% and effectively models cyclic preferences. Additionally, evaluations on downstream tasks like AlpacaEval2.0 and MT-Bench show significant performance improvements of up to 9.3% after post-training with GPO and GPM.
Strengths
- The preference representation learning approach can capture intricate preference structures, addressing the limitations of traditional methods in handling intransitive preferences.
- By embedding responses into a latent space, the method achieves linear query complexity, making it computationally more efficient than PairPM, which has quadratic query complexity.
- The proposed method yields consistent preference probabilities for compared pairs regardless of input order, reducing the ad-hoc nature of PairPM implementations.
- Extensive experiments on benchmarks and downstream tasks demonstrate the superiority of GPM over existing methods, providing strong empirical evidence for its effectiveness.
Weaknesses
- The introduction of a latent space and preference representation learning adds complexity to the model, which might require more sophisticated training and tuning processes.
- The latent space embeddings and preference scores might be less interpretable than those of simpler models, making it harder to understand why certain preferences are modeled in specific ways.
- While the paper provides comparisons with the BT reward model and PairPM, a more comprehensive comparison with other state-of-the-art methods would strengthen the claims about the superiority of GPM.
Questions
I don't understand why the accuracy of GPM is 100% on cyclic preference data. Does the experiment involve any information or data leakage?
Q1: Accuracy of GPM on Cyclic Preference Data
Thanks for your feedback. We appreciate your concern and have thoroughly investigated our experimental setup. The high accuracy is due to the GPM's ability to fully capture cyclic preferences by design, as it models preferences in a more expressive latent space. The BT model, on the other hand, cannot model cyclic preferences effectively, resulting in lower accuracy. We have ensured that there is no information or data leakage in our experiments. The dataset used for testing was kept separate from the training data. However, we acknowledge that the cyclic preference dataset used is relatively small (around 200 samples), which might contribute to high accuracy. To address this, we plan to construct a larger and more diverse cyclic preference dataset to further evaluate and validate the model's performance.
Thank you again for your valuable feedback and for recognizing the contributions of our work. We appreciate your consideration and hope that our clarifications might encourage you to view our submission even more favorably.
Dear reviewer, thank you for your thoughtful review and for recognizing the strengths of our work. We appreciate your constructive feedback and the opportunity to address your concerns.
- Complexity of Training and Tuning Processes
Thanks for your feedback. We understand your concern that introducing a latent space and preference representation learning might add complexity to the model. However, in practice, the changes to the training procedure are minimal. Our General Preference Model (GPM) replaces the scalar output in the reward model's (RM) training with a vector output. When the embedding vector dimension is small (e.g., 2, 4, 6, or 8, as used in our experiments), the training process remains very similar to standard RM training.
The added components, such as the eigenvalue scale gate and the eigenvector embedding head, do not significantly increase computational complexity or require extensive tuning. The model can be trained using the same frameworks and optimization techniques as traditional RMs, ensuring ease of implementation and scalability.
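To make this concrete, below is a minimal sketch of such a head (the layer names, the softplus gate, and the 2-D block layout are our illustrative choices here, not necessarily the exact implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GPMHead(nn.Module):
    """Sketch of a general-preference head replacing a scalar reward head.

    - embed_head: "eigenvector embedding head", emits an l2-normalized vector in R^{2k}
    - scale_gate: "eigenvalue scale gate", emits k prompt-dependent non-negative scales
    """
    def __init__(self, hidden_size: int, k: int = 4):
        super().__init__()
        self.k = k
        self.embed_head = nn.Linear(hidden_size, 2 * k)
        self.scale_gate = nn.Linear(hidden_size, k)

    def forward(self, prompt_hidden: torch.Tensor, response_hidden: torch.Tensor):
        v = F.normalize(self.embed_head(response_hidden), dim=-1)   # preference representation
        lam = F.softplus(self.scale_gate(prompt_hidden))            # context-dependent scales
        return v, lam

def preference_score(v_i: torch.Tensor, v_j: torch.Tensor, lam: torch.Tensor) -> torch.Tensor:
    """Skew-symmetric score s(y_i > y_j | x): sum_l lam_l * <R v_i^(l), v_j^(l)>,
    pairing each 2-D block with the rotation R = [[0, -1], [1, 0]].
    Swapping v_i and v_j flips the sign, so sigmoid(score) gives consistent probabilities."""
    vi = v_i.view(*v_i.shape[:-1], -1, 2)
    vj = v_j.view(*v_j.shape[:-1], -1, 2)
    block = vi[..., 0] * vj[..., 1] - vi[..., 1] * vj[..., 0]       # <R a, b> = a_x b_y - a_y b_x
    return (lam * block).sum(dim=-1)

# Training then mirrors standard RM training: minimize -log sigmoid(score(chosen, rejected)).
```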
- Interpretability of Latent Space Embeddings and Preference Scores
We appreciate your point regarding the interpretability of the latent space embeddings. On the contrary, our model provides opportunities to understand and analyze the preference structures more deeply. By embedding responses into a latent space, the GPM allows the preference model to automatically discover different components within human preferences.
As detailed in Section 4.2 (originally presented in Appendix A.2) of our paper, the eigenvalue scale gate computes context-dependent scaling factors based on the prompt, adjusting the importance of various aspects such as helpfulness, instruction-following, and creativity. The eigenvector embedding head generates embeddings that capture nuanced characteristics of the responses relevant to human preferences.
This structure enables the model to adapt dynamically to varying user preferences and makes it possible to analyze the embeddings to understand which aspects are being prioritized in different contexts. Therefore, the GPM enhances interpretability by providing a richer representation of preferences.
- Comparison with Other State-of-the-Art Methods
Thank you for suggesting a more comprehensive comparison. We have extended our experiments to include additional state-of-the-art (SOTA) models to demonstrate the effectiveness of GPM across different architectures and sizes. Below is the updated table comparing our GPM with the Bradley-Terry (BT) models and other SOTA models on RewardBench.
Table: Comparison between BT Models, GPM, and Other SOTA Models on RewardBench
| Model | Type | Chat (%) | Chat-Hard (%) | Safety (%) | Reasoning (%) | Overall (%) |
|---|---|---|---|---|---|---|
| Base Model: Gemma-2B-it | ||||||
| BT RM | Custom Classifier | 71.51 | 64.69 | 75.00 | 61.90 | 68.27 |
| GPM (Embed Dim. 8) | Custom Classifier | 78.49 | 66.23 | 84.32 | 80.47 | 77.38 (+9.11) |
| Base Model: Llama-3.1-8B-Instruct | ||||||
| BT RM | Custom Classifier | 88.55 | 85.75 | 91.49 | 96.47 | 90.56 |
| GPM (Embed Dim. 8) | Custom Classifier | 93.58 | 87.50 | 91.08 | 95.44 | 91.90 (+1.34) |
| Base Model: Gemma-2-9B-it | ||||||
| BT RM | Custom Classifier | 91.62 | 85.96 | 92.70 | 95.55 | 91.46 |
| GPM (Embed Dim. 4) | Custom Classifier | 93.58 | 87.72 | 92.30 | 95.71 | 92.33 (+0.87) |
| Other Models for Comparisons | ||||||
| GPT-4 | Generative | 95.30 | 74.30 | 87.60 | 86.90 | 86.00 |
| GPT-4o | Generative | 96.10 | 76.10 | 88.10 | 86.60 | 86.70 |
| Gemini-1.5 | Custom Classifier | 92.30 | 80.60 | 87.90 | 92.00 | 88.20 |
| RLHFlow/pair-pm-8B | Custom Classifier | 92.30 | 80.60 | 89.70 | 94.70 | 87.10 |
| ArmoRM-8B | Custom Classifier | 98.30 | 65.80 | 90.50 | 97.30 | 90.40 |
| Nemotron-4-340B | Custom Classifier | 95.80 | 87.10 | 91.50 | 93.60 | 92.00 |
| Llama-3.1-Nemotron-70B | Seq. Classifier | 97.5 | 85.7 | 95.1 | 98.1 | 94.10 |
| Skywork-Gemma-2-27B-v0.2 | Seq. Classifier | 96.1 | 89.9 | 93.0 | 98.1 | 94.30 |
- Across all base models, GPM consistently outperforms BT RM. For example, with Gemma-2B-it, GPM achieves an overall score of 77.38%, which is a 9.11% improvement over BT RM.
- In addition, our model with sizes 8B and 9B surpassed previous SOTA methods including RLHFlow/ArmoRM-Llama3-8B-v0.1, Nvidia/Nemotron-4-340B-Reward, OpenAI/GPT-4-0125-preview and Google/Gemini-1.5-pro-0514.
- Our Model's Strengths:
- Our GPM achieves competitive performance using smaller, more efficient models (e.g., 2B and 8B parameters), highlighting its effectiveness in resource-constrained settings.
- Our work specifically addresses the challenge of modeling cyclic and complex preferences, which is less explored in existing models. This focus contributes to advancements in understanding and handling intricate human preferences.
- The ability of GPM to improve performance across various model sizes demonstrates its scalability and adaptability.
Thank you for your response. I have reread the article and believe this work has value for the research community. I will keep my score.
Dear Reviewer,
Thank you for taking the time to reread our paper and for your encouraging comment. We are pleased to hear that you believe our work has value for the research community.
If there are any remaining concerns or questions that we can address to improve our paper further, please let us know. We are committed to refining our work based on your feedback.
We also invite you to review our responses to the other reviewers' comments, where we have provided additional clarifications and results that might address some of your earlier points.
We sincerely hope that with these revisions and clarifications, you might consider raising your score for our submission.
Thank you again for your time and thoughtful consideration.
Sincerely,
The authors
- In terms of methods proposed:
- This paper proposes the GPM approach to model general preferences. GPM computes preference probabilities by multiplying a skew-symmetric block-diagonal matrix (referred to as the "Skew-symmetric Preference Operator") with learned embeddings (referred to as "preference representations"). This approach has several advantages: (1) it guarantees consistent probabilities, i.e., P(y1 ≻ y2) = 1 − P(y2 ≻ y1), by design regardless of input positions; (2) it has the capacity to model cyclic preferences (e.g., A>B>C>A), which the Bradley-Terry model fails at; (3) it has linear complexity O(K) in computing preferences among a pool of K candidates, compared to previous approaches like PairRM, which have quadratic complexity O(K^2).
- This paper proposes the GPO objective to align LLMs with preference models. GPO is adapted from the SPPO loss by Wu et al. (2024b) for iterative preference optimization, except that the preference score is used in place of the preference probability in the loss.
- In terms of experiments conducted:
- (1) Train and evaluate Reward models on a cyclic preference dataset constructed from UltraFeedback. The aim is to show GPM can model cyclic preferences while BT cannot.
- (2) Train GPMs using the Skywork Reward Data Collection and evaluate on Reward Bench with two base models (2B & 8B). The aim is to show that GPM outperforms BT-based RM in terms of reward modeling.
- (3) Finetune a 2B and a 8B model with GPO and SPPO using the BT-based Reward model and the GPM reward model. Evaluate on AlpacaEval 2.0 and MT-Bench. The aim is to show GPO+GPM yields better downstream LLMs.
- In terms of experimental results:
- (1) They show that GPM has the capacity to model cyclic preferences as intended.
- (2) They show that GPM attains overall higher scores on reward bench compared to BT-based RM. The gain with 2B model is substantial (+5.6), while the gain with 8B model is marginal (+1.0).
- (3) They show substantial gain with GPO+GPM compared to SPPO+BT RM on AlpacaEval 2.0. The results for MT-Bench warrant more discussion than there is in the paper (see Question 5).
Strengths
- The proposed GPM approach enjoys several desirable theoretical properties as explained in the Summary. This is a good contribution for preference modeling.
- The three sets of experiments are well designed. They cover the main research questions nicely formulated on Lines 384-389 that are warranted by the proposed GPM and GPO approaches.
- Experimental results of GPO+GPM as evaluated on AlpacaEval 2.0 shows substantial gain compared to previous SPPO+BT RM approach.
Weaknesses
- Presentation of methods:
- The von Neumann winner is introduced in Sec.5 where GPO is proposed, but it is not clear how this concept relates to the proposed GPO.
- The current presentation of Sec. 5 is confusing and redundant. It starts with a general review of previous work, repeats the points on computational efficiency that are already made in the introduction, and finally introduces the GPO method in Sec. 5.2. It seems more appropriate to move the text up to Sec. 5.1 into Background/Related Work to make a clear distinction between previous work and this paper's contribution.
- Key modeling techniques (Eigenvalue Scale Gate; Eigenvector Embedding Head) are in the Appendix rather than the main text.
- Clarity in experimental setup and reporting:
- It seems like the RMs in Table 1 are trained and evaluated on the same dataset. If that's the case, it should be made clear in the main text.
- It is not clear in the main text how the BT RM is trained (see Question 4).
- The "1st" and "2nd" columns in Table 4 lacks explicitly explanation in the caption.
- Under-discussed experimental results:
- Table 2, 3 and 4 warrant more thorough discussion. For example, Line 481 reads: “These results suggest that integrating the General Preference representation model (GPM) into policy optimization can enhance the downstream performance of language models”, but Table 4 for MT-Bench shows GPM+GPO yielding marginal gain in MT-Bench score compared to SPPO+BT RM. The results do not support this general claim.
- Overall, while the paper has good novelty and presents a good set of experimental results, a more detailed, methodical discussion of results is in order. Presentation and clarity in methods could also be improved. The paper would make a much stronger case if these issues are addressed.
Questions
- How is the von Neumann winner related to GPO?
- I am not sure what the bold font in Table 4 represents. Could you clarify?
- How is the BT RM trained? From appendix it looks like BT RM refers to the reward model formed by the DPO-trained policy and the reference model, where rewards are calculated as the log-likelihood ratio multiplied by beta. If so, it should be made clear in the main text.
- Line 1307 (Appendix B.2): "For the Bradley-Terry (BT) model, the temperature parameter β was set to 1, following standard practice (Rafailov et al., 2024)". Can you clarify what "standard practice" refers to? It is not standard to set beta = 1 in Rafailov et al. In fact, beta is an important hyper-parameter that needs sweeping.
- Could you explain the experimental results on MT-Bench in Table 4 in more detail? Specifically, it seems like the improvement is within the margin from sampling compared to the baseline even with the proposed GPO+GPM.
2.a. Training and evaluation datasets for RMs in Table 1
A: Thank you for your feedback. Indeed, the Reward Models (RMs) in Table 1 were both trained and evaluated on the same cyclic preference dataset constructed from Ultrafeedback. We have updated the main text to explicitly state this, ensuring transparency in our experimental setup. This clarification helps in understanding the context and significance of the results presented.
2.b. Training of BT RM
A: Thanks for your attention. The BT RM was trained using standard reward model training configurations, employing the pairwise logistic (negative log-sigmoid) loss $\mathcal{L}_{\mathrm{BT}} = -\log \sigma(r_w - r_l)$, where $r_w$ and $r_l$ are the scalar rewards for the winning and losing responses, respectively. This method allows the BT RM to learn scalar rewards that model transitive preferences. We have revised our paper to include a detailed description of how the BT RM is trained, including the loss function and training settings.
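For reference, a minimal sketch of this standard pairwise loss (assuming a PyTorch-style setup in which the reward head emits one scalar per response):

```python
import torch
import torch.nn.functional as F

def bt_loss(r_w: torch.Tensor, r_l: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry reward-model loss: -log sigmoid(r_w - r_l), averaged over the batch.
    r_w / r_l are the scalar rewards for the chosen and rejected responses in each pair."""
    return -F.logsigmoid(r_w - r_l).mean()
```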
2.c. Explanation of ''1st'' and ''2nd'' columns in Table 4
A: Thank you for bringing this to our attention. The “1st” and “2nd” columns in Table 4 represent training using SPPO and GPO methods, respectively. We have revised the caption of Table 4 to explicitly explain this, making the table self-contained and more understandable.
- Under-discussed Experimental Results
A: We appreciate your feedback on the need for a more detailed analysis of our experimental results. Regarding the MT-Bench results in Table 4, we acknowledge that the gains achieved by GPM+GPO over SPPO+BT RM are marginal. This suggests that while our method shows significant improvements in certain benchmarks (e.g., AlpacaEval 2.0), its impact on MT-Bench is less pronounced. We now think that MT-Bench may not be the most suitable benchmark for highlighting the strengths of our approach, as it focuses on general language capabilities rather than alignment with complex human preferences. Therefore, we have moved the MT-Bench results to the Appendix as supplementary material. In the main text, we have revised our claims to more accurately reflect the experimental outcomes, emphasizing the gains observed in AlpacaEval 2.0.
Additionally, we have expanded our discussion of Tables 2 and 3 in the main text, providing a more thorough analysis of the results. We delve into why the GPM outperforms the BT RM in modeling general preferences (both the transitive and cyclic components) and how the embedding dimension affects performance. This enhanced discussion helps to contextualize our findings and reinforces the contributions of our work.
Q1. How is the von Neumann winner related to GPO?
A: As mentioned earlier, we realize that the introduction of the von Neumann winner may not clearly aid the understanding of GPO. Our GPO method builds upon the iterative framework using multiplicative weight updates, similar to the approach in Freund and Schapire (1999), of the form $\pi_{t+1}(y \mid x) \propto \pi_t(y \mid x)\,\exp\big(\eta\, s(y \succ \pi_t \mid x)\big)$.
This iterative process allows our method to converge asymptotically, akin to Theorem 4.1 in Wu et al. (2024). We have moved the mention of the von Neumann winner, together with the explanation of GPO's convergence properties and their connection to established iterative optimization frameworks, to Appendix C.
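As a rough, self-contained illustration of one such exponential-weights step over a finite candidate pool (the function and variable names are ours; GPO itself amortizes this update into the policy parameters through its regression loss):

```python
import numpy as np

def exponential_weights_step(pi_t: np.ndarray, scores: np.ndarray, eta: float) -> np.ndarray:
    """One multiplicative-weight update: pi_{t+1}(y) proportional to pi_t(y) * exp(eta * s(y))."""
    w = pi_t * np.exp(eta * scores)
    return w / w.sum()

# Toy usage over 4 candidate responses; in the actual iterative scheme the preference
# scores would be recomputed against the updated policy at every round.
pi = np.full(4, 0.25)
scores = np.array([0.3, -0.1, 0.0, 0.2])
print(exponential_weights_step(pi, scores, eta=1.0))  # mass shifts toward preferred candidates
```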
Q2. I am not sure what the bold font in Table 4 represents. Could you clarify?
A: Certainly. In Table 4, the bold entries indicate that the GPM outperforms the corresponding BT RM under the same training settings. This formatting is consistent with Tables 2 and 3. We have revised the caption of Table 4 to clearly explain the meaning of the bold font, ensuring that the table is self-explanatory.
Q3. How is BT RM trained?
A: The BT RM is trained using standard reward model training configurations. Specifically, we use the pairwise logistic loss $-\log \sigma(r_w - r_l)$ to learn scalar rewards based on pairwise preferences, where $r_w$ and $r_l$ represent the rewards for the preferred and less preferred responses, respectively. This approach enables the BT RM to model transitive preferences effectively. We have added this explanation to the main text, and we hope it provides readers with a clearer understanding of the training methodology for BT RM.
Q4. Line 1307 (Appendix B.2): “For the Bradley-Terry (BT) model, the temperature parameter β was set to 1, following standard practice (Rafailov et al., 2024)”. Can you clarify what “standard practice” refers to?
A: Thanks for your feedback. We have revised the citation and now reference Ouyang et al. (2022) ("Training language models to follow instructions with human feedback") and Christiano et al. (2017) ("Deep reinforcement learning from human preferences"), for standard reward model training practices.
Q5. Could you explain the experimental results on MT-Bench in Table 4 in more detail?
A: We appreciate your request for a deeper analysis of the MT-Bench results. As noted earlier, the improvements achieved by GPM+GPO over SPPO+BT RM in MT-Bench are marginal and within the margin of sampling variability. This suggests that our method may not offer significant advantages in this particular benchmark.
MT-Bench is a multi-turn conversation benchmark, while our training data for both the preference model and the LLM policy are primarily single-turn question answering or instruction following. Performance on these two different task types may not be highly correlated. We believe this phenomenon is common among language model alignment methods.
Since we moved the key modeling technique parts (the scale gate and the embedding head), which take up about one page, into the main text, we have moved the MT-Bench results to the Appendix. We revised our claims in the main text to reflect the experimental findings accurately.
Once again, we sincerely thank you for your constructive feedback. Your insights have been instrumental in helping us improve the quality and clarity of our work.
Thanks for the response and for your efforts in revising the paper. I think the revised version improves clarity and presentation substantially.
There remain some points made in the analyses that are not supported by experimental results. Specifically:
[Line 453] Table 2 presents the results. The GPM consistently outperforms the BT model for both base models on RewardBench, with notable improvements in tasks involving complex reasoning (e.g., Chat Hard and Reasoning). These results highlight the superior expressiveness of the GPM in preference modeling.
Table 2 shows that there is marginal or no gains at all from GPM compared to BT on Reasoning for Llama-3.1-8B-Instruct and Gemma-2-9B-it base models. It does not follow that GPM is more expressive than BT in preference modeling.
[Line 487] Table 2 shows that increasing the embedding dimension generally improves performance.
Again, the Table does not show that. On the contrary, it seems like there are sweet spots of embedding dimensions for a particular base model and for particular domains (e.g., Dim<=4 under Llama-3.1-8B-Instruct for Chat and Dim=8 for Chat-Hard). There seems to be trade-off between different domains. These are interesting findings, but are not described correctly in the current form.
[Table 2] Other state-of-the-art models
There are more performant models on the RewardBench leaderboard (https://huggingface.co/spaces/allenai/reward-bench) than the ones cited in Table 2. In particular, the 8B model from Skywork achieves an overall score of 93.1 and seems to me more comparable to the reward models reported in the table in terms of size.
I would not use the headline "Other state-of-the-art models" because it is inaccurate. What about something like "Other models reported on Reward Bench"?
Are there particular reasons why these models are more suitable for comparison? If so, they could be included in the caption.
To sum up, the revised paper has a clear presentation and results are solid. However, some results remain incorrectly analyzed. I will keep my scores for now. I think the paper will benefit from a round of close examination of its analyses and discussion of results.
Dear reviewer, thank you for your thoughtful and detailed review of our paper. We appreciate your recognition of the strengths and contributions of our work, as well as your valuable feedback on areas that need improvement. We have carefully considered your comments and have made substantial revisions to address your concerns. Below, we provide detailed responses to each of the points you raised.
- a. The von Neumann winner and its relation to GPO
A: Thanks for your feedback. To clarify, the von Neumann winner concept arises from game theory and relates to finding optimal strategies in zero-sum games. In our context, the policy optimization problem can be framed as a two-player zero-sum game between policies, and the von Neumann winner corresponds to a policy that performs optimally against any opponent.
However, upon reflection, we realize that introducing the von Neumann winner may have caused unnecessary confusion in understanding GPO within a limited page. Therefore, we have moved the mention of the von Neumann winner from Section 5 to Appendix C, which now focuses on the iterative framework of GPO, explaining how it builds upon existing methods like multiplicative weight updates and how it converges asymptotically. Here is our newly added theorem (in Appendix C):
Consider the optimization problem defined by the GPO loss (Equation 12) and assume it is realizable. Let $\{\pi_t\}_{t=1}^{T}$ denote the sequence of policies generated by GPO, and define $\bar{\pi}_T = \frac{1}{T}\sum_{t=1}^{T}\pi_t$ as the average policy. Given that the preference score is bounded within $[-1, 1]$, by setting $\eta = \Theta(1/\sqrt{T})$, we have that the duality gap of $\bar{\pi}_T$ converges to zero at a rate of $O(1/\sqrt{T})$.
b. Structure and clarity of Section 5
A: Thank you for pointing this out. We have restructured Section 5 to improve clarity and focus. Specifically, we have moved the general review of previous work and discussions on computational efficiency to the Related Work section. This allows us to dedicate Section 5 entirely to introducing and explaining the GPO method. The revised Section 5 now begins with a clear formulation of the policy optimization problem and proceeds to detail our proposed solution without redundancy.
c. Key modeling techniques (Eigenvalue Scale Gate; Eigenvector Embedding Head) are in the Appendix rather than in the main text.
A: We agree that these components are crucial to understanding our General Preference Model (GPM) and should be presented prominently. We have moved the descriptions of the Eigenvalue Scale Gate and the Eigenvector Embedding Head from the Appendix into Section 3 of the main text. This includes detailed explanations of how these components function and their roles in enhancing the expressiveness and computational efficiency of the GPM. By bringing these details into the main text, we aim to provide readers with a complete understanding of our model’s architecture and capabilities.
Dear Reviewer,
Thank you for your follow-up comments and for acknowledging the improvements in clarity and presentation in our revised paper. We greatly appreciate your constructive feedback, which has helped us further refine our work.
We have carefully considered the points you raised and have made the following revisions to address your concerns:
- Analysis of Results in Table 2 (Line 453).
A: We appreciate you bringing this to our attention. We have revised the text to more accurately reflect the experimental results. The updated passage now reads:
“The results are presented in Table 2. On RewardBench, using the Gemma-2B-it base model, GPM achieves an average score of 77.38%, which is an improvement of 9.11% over the BT model’s average score of 68.27%. Specifically, in the Chat task, GPM improves performance from 71.51% (BT RM) to 78.49%, and in the Chat-Hard task, from 64.69% to 66.23%.
For the Llama-3.1-8B-Instruct base model, GPM achieves an average score of 91.86% (embedding dimension 6), representing a 1.30% improvement over the BT model’s average score of 90.56%. In the Chat task, GPM improves from 88.55% (BT RM) to 93.30%, and in the Chat-Hard task, from 85.75% to 86.18%.
Using the Gemma-2-9B-it base model, GPM achieves an average score of 92.33% (embedding dimension 4), showing an improvement of 0.87% over the BT model’s average score of 91.46%. In the Chat task, GPM boosts performance from 91.62% (BT RM) to 93.58%, and in Chat-Hard, from 85.96% to 87.72%.
These results indicate that GPM outperforms the BT model across various base models and tasks, particularly in the Chat and Chat-Hard tasks. Note that BT RM is a special case of GPM when the embedding dimension d = 1 (see Section 4).”
This revision acknowledges that while GPM generally outperforms BT, the improvements are more pronounced in certain tasks and base models, particularly for smaller models like Gemma-2B-it.
- Discussion on Embedding Dimension Effects (Line 487).
A: We agree with your assessment and have revised the text to accurately describe the nuanced effects of embedding dimensions. The updated passage now reads:
“We conducted ablation studies to assess the impact of varying the embedding dimension in GPM. As shown in Table 2, the performance of GPM varies with the embedding dimension. For the Llama-3.1-8B-Instruct base model, an embedding dimension of 6 achieves the highest average score of 91.86%, compared to 91.37% with dimensions 2 and 8. In the Chat-Hard task with the same base model, the highest score of 87.50% is achieved with an embedding dimension of 6, compared to 87.06% with an embedding dimension of 8.
Similarly, for the Gemma-2B-it base model, the highest average score of 77.38% is achieved with an embedding dimension of 8, showing an improvement over lower dimensions, such as 74.94% with dimension 4. Nevertheless, for some tasks and models, increasing the embedding dimension beyond a certain point does not yield additional benefits and may even lead to slight performance declines. These results suggest that the optimal embedding dimensions vary across different base models and tasks.”
This revision reflects the observed trends in the data.
- Inclusion of More Performant Models in Table 2.
A: Thank you for this valuable suggestion. We have updated Table 2 to include more performant models, such as Llama-3.1-Nemotron-70B and Skywork-Gemma-2-27B-v0.2 (Top 2 models), to provide a more comprehensive comparison. We have also changed the heading to “Other Models Reported on RewardBench” to accurately reflect the content.
We have chosen these models for comparison because they are among the publicly reported results on RewardBench and provide relevant points of reference across various model sizes and architectures, such as BT reward models (Skywork-Gemma-2-27B-v0.2, Llama-3.1-Nemotron-70B), pairwise preference models (RLHFlow/pair-pm-8B), regression loss-based reward models (ArmoRM-8B), and generative reward models (GPT-4o, Gemini-1.5).
Furthermore, we have added a footnote to clarify our reproduced results:
“We report our reproduced results, which may be different from the officially reported results of Skywork-Reward-Llama-3.1-8B-v0.2.”
This transparency ensures that readers are aware of potential discrepancies between our reproduced results and the original reports, acknowledging that exact replication can be challenging due to differences in experimental setups (using different training codebases and different GPUs).
We hope that the updated manuscript meets your expectations and contributes valuable knowledge to the field.
Thank you once again for your thoughtful feedback. We hope our responses fully address your concerns. If so, we wonder if you could kindly consider raising your rating score? We will also be happy to answer any further questions you may have. Thank you very much!
The paper introduces the GPM with preference representation learning for efficiently capturing complex human preferences. Traditional models, such as the BT model, struggle to represent intransitive preferences and suffer from computational inefficiencies, particularly for tasks with large response sets. The GPM addresses these issues by embedding responses into a latent space, enabling more expressive preference modeling with linear complexity. This paper also proposes GPO and shows gains on policy optimization.
Summary of contributions:
- This paper proposes GPM, which efficiently models complex, cyclic, and intransitive preferences with linear complexity.
- This paper demonstrates that GPM outperforms BT on various benchmarks, including RewardBench.
- Enhanced language model alignment on tasks such as AlpacaEval2.0 and MT-Bench when using GPO (w/ GPM).
Strengths
- This paper is well written and is very clear.
- This paper provides a novel approach by embedding responses into a latent space to capture intricate preference structures efficiently. The preference representation learning and skew-symmetric operators are innovative and well-suited to addressing the limitations of traditional methods.
Weaknesses
- It's very interesting to see that GPM can capture cyclic preferences that previous methods cannot. Further experiments on how capturing cyclic preferences helps existing LLMs produce better results, or shows improvement on downstream applications, would demonstrate the true value of this.
- As the authors already mentioned in the limitations section, this paper would benefit from more discussion and analysis of the representation vector (v) generation (model architecture choice). Without a solid ablation study, it is hard to judge whether this method can be generalized. The performance pattern differences between the 2B and 8B models also suggest that this method may require a specific recipe for specific use cases.
Questions
- Can the authors elaborate a bit more on lines 474-475, "To avoid the preference bias when using GPT-4-turbo as the evaluator"? Why would GPT-4o-mini be a better evaluator for AlpacaEval 2.0 here?
- Are the results using GPM generally longer than the baseline? As AlpacaEval 2.0 introduced LC WR to mitigate length bias, it could be helpful for the authors to further elaborate on why they think WR is the better metric to use in this case (Table 3 in the main text). As shown in Appendix Table 6, the gain of the proposed method on LC WR is smaller.
Dear Reviewer, thank you for your thoughtful review and for recognizing the clarity and novelty of our work. We appreciate your constructive feedback and are glad to have the opportunity to address your concerns.
- Practical Benefits of Capturing Cyclic Preferences
We understand your interest in seeing how modeling cyclic preferences can lead to better performance in downstream applications. Cyclic preferences are indeed common in complex human judgments, especially when multiple evaluation criteria conflict with each other.
In real-world scenarios, users often have conflicting priorities such as instruction following, helpfulness, honesty, and creativity. For example, a response that strictly adheres to instructions might lack creativity, while a more creative response might deviate from the instructions. This inherent conflict can lead to cyclic preferences.
Theoretical work by Balduzzi et al. (2019) has shown that every finite function game decomposes into transitive and cyclic components, suggesting that cyclic preferences are an inherent part of multi-criteria evaluations (Open-ended Learning in Symmetric Zero-sum Games, ICML 2019). Empirical studies, like the one by Lee et al. (2023), have also observed that annotator disagreements often arise from differing priorities among correctness, helpfulness, and style (Common Law Annotations: Investigating the Stability of Dialog System Output Annotations, ACL 2023 Findings).
To demonstrate the practical benefits of capturing these preferences, we conducted experiments on downstream tasks such as AlpacaEval 2.0. Our General Preference Model (GPM) consistently outperformed the Bradley-Terry (BT) model with different base models. As shown in Tables 3, 4, 5, and 6 of our paper, GPM achieved higher win rates, indicating better alignment with complex human preferences.
By effectively modeling cyclic preferences, our GPM enhances language model alignment, leading to improved performance on tasks where such preferences play a significant role.
I think it is probably related to the way this paper is written. The paper shows results that GPM outperforms BT-type RM. To clarify my earlier suggestions: could you measure how much of the performance gains can be attributed to capturing cyclic preferences? If quantifying this isn't feasible, some qualitative analysis would still be valuable.
"By effectively modeling cyclic preferences, our GPM enhances language model alignment, leading to improved performance on tasks where such preferences play a significant role." ---> Similar to the statement provided here, the paper seems to suggest a causal relationship for this (correct me if I am wrong); however, it would be beneficial to include evidence that extends beyond mere demonstrations of correlation.
Dear Reviewer e2YS,
Thank you for your follow-up comment and for giving us the opportunity to clarify how capturing cyclic preferences can contribute to the performance gains of our General Preference Model (GPM) over the Bradley-Terry (BT) type Reward Model. Here's how we address this:
- Theoretical Connection Between GPM and BT RM:
- As detailed in our paper (Line 274), the BT model is a special case of GPM when the embedding dimension is 1. In this case, GPM reduces to modeling only transitive preferences, equivalent to BT RM.
Relation to Bradley-Terry Model.
If we set the number of $2 \times 2$ blocks to one, take $v_y = [r(y),\, c]^\top$, where $c$ is a constant and $c > 0$ (e.g., $c = 1$), and use $R = \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix}$, then the preference score reduces to:
$s(y_i \succ y_j \mid x) = \langle R\, v_{y_i},\, v_{y_j} \rangle = c\,\big(r(y_i) - r(y_j)\big),$
and the preference probability becomes:
$\mathbb{P}(y_i \succ y_j \mid x) = \sigma\big(c\,(r(y_i) - r(y_j))\big),$
which, with $c = 1$, is exactly the Bradley-Terry (BT) model.
- By increasing the embedding dimension (i.e., using higher-dimensional embeddings), GPM can capture cyclic preferences that BT RM cannot.
- According to Theorem 1 by Balduzzi et al. (2019), “every finite function game decomposes into a sum of transitive and cyclic components.” This underscores the importance of modeling both components to fully capture general preferences.
- Controlled Experimental Setup:
- We trained both BT RM and GPM under identical conditions, using the same models and hyperparameters. The key difference is GPM's ability to model cyclic preferences due to its higher embedding dimension.
- The consistent performance improvements of GPM over BT RM suggest that the additional capacity to model cyclic preferences contributes to these gains.
- Quantitative Analysis:
- We already conducted ablation studies varying the embedding dimension in GPM in Table 2 and Table 7 of our paper.
- When the embedding dimension is 1, GPM's performance matches that of BT RM (BT RM is a special case of GPM). As we increase the dimension to 2 and 4, we observe measurable performance gains compared to dimension 1.
- This indicates that the ability to capture cyclic preferences (enabled by higher embedding dimensions) leads to improved results (see the toy sketch below).
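The toy sketch below (illustration only, not our trained model) shows how 2-D preference embeddings realize a cycle that no scalar reward can reproduce exactly:

```python
import numpy as np

def score(v_i, v_j):
    """2-D skew-symmetric preference score <R v_i, v_j> with R = [[0, -1], [1, 0]]."""
    return v_i[0] * v_j[1] - v_i[1] * v_j[0]

# Three responses embedded at 120-degree intervals on the unit circle.
angles = [0.0, 2 * np.pi / 3, 4 * np.pi / 3]
A, B, C = [np.array([np.cos(a), np.sin(a)]) for a in angles]

sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))
for name, (x, y) in [("A > B", (A, B)), ("B > C", (B, C)), ("C > A", (C, A))]:
    print(name, round(sigmoid(score(x, y)), 3))  # all > 0.5: the cycle A > B > C > A is represented

# A single scalar reward cannot satisfy r(A) > r(B) > r(C) > r(A),
# so the BT special case (embedding dimension 1) cannot fit such data exactly.
```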
Thank you again for your valuable feedback. Please let us know if there are any further questions or if there's any other aspect you'd like us to address.
- Discussion and Analysis on Representation Vector Generation and Generalization
Thanks for your feedback. We agree that providing more discussion and analysis on the generation and generalization of the representation vector would strengthen our paper.
We already conducted ablation studies in the initial submission to assess the impact of key architectural components, particularly the eigenvalue scale gate and the L2 normalization in the embedding head. Our findings indicate that, in general, removing the L2 normalization or the eigenvalue scale gate leads to performance degradation with the Llama-3.1-8B-Instruct base model.
We restate the results of ablation studies here for convenience:
| Embedding Type | Dim. | Chat | Chat-Hard | Safety | Reasoning | Average |
|---|---|---|---|---|---|---|
| Base Model: Llama-3.1-8B-Instruct | ||||||
| w. scale gate w. l2 | 2 | 91.62 | 88.38 | 90.68 | 94.82 | 91.37 |
| w. scale gate w.o. l2 | 2 | 93.85 | 86.84 | 90.68 | 91.60 | 90.74 |
| w. o. scale gate w. l2 | 2 | 92.18 | 86.18 | 91.89 | 94.05 | 91.08 |
| w. o. scale gate w.o. l2 | 2 | 93.30 | 87.94 | 91.22 | 93.55 | 91.50 |
| w. scale gate w. l2 | 4 | 93.30 | 86.18 | 91.22 | 95.69 | 91.60 |
| w. scale gate w.o. l2 | 4 | 94.13 | 86.18 | 89.86 | 90.55 | 90.18 |
| w. o. scale gate w. l2 | 4 | 92.46 | 87.28 | 91.76 | 93.19 | 91.17 |
| w. o. scale gate w.o. l2 | 4 | 93.58 | 86.40 | 90.95 | 95.33 | 91.56 |
| w. scale gate w. l2 | 6 | 91.90 | 87.50 | 91.62 | 96.40 | 91.86 |
| w. scale gate w.o. l2 | 6 | 93.02 | 85.75 | 91.08 | 91.31 | 90.29 |
| w. o. scale gate w. l2 | 6 | 92.18 | 85.53 | 90.81 | 94.20 | 90.68 |
| w. o. scale gate w.o. l2 | 6 | 93.30 | 87.94 | 90.95 | 90.90 | 90.77 |
| w. scale gate w. l2 | 8 | 93.58 | 87.50 | 91.08 | 95.44 | 91.90 |
| w. scale gate w.o. l2 | 8 | 93.02 | 87.06 | 90.81 | 92.20 | 90.77 |
| w. o. scale gate w. l2 | 8 | 91.90 | 86.62 | 91.22 | 92.63 | 90.59 |
| w. o. scale gate w.o. l2 | 8 | 93.02 | 87.72 | 90.90 | 90.16 | 90.39 |
Reward normalization, achieved through L2 normalization, is crucial for stable and effective learning. This aligns with findings in recent works, such as "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (https://arxiv.org/abs/2402.03300).
Regarding generalization, our method shows consistent improvements across both 2B and 8B models. While hyperparameter tuning can optimize performance, the underlying approach does not require specific recipes for different use cases, indicating good generalization capabilities.
To further demonstrate the robustness and generalizability of GPM, we conducted experiments using two additional base models: Gemma-2-2B-it and Gemma-2-9B-it. These experiments aim to show that our approach is effective across different model architectures and sizes.
| Model | Embed Dim. | Chat | Chat-Hard | Safety | Reasoning | Average |
|---|---|---|---|---|---|---|
| Base Model: Gemma-2-2B-it | ||||||
| BT RM | 1 | 80.73 | 71.27 | 85.54 | 80.26 | 79.45 |
| GPM | 2 | 81.84 | 72.81 | 87.97 | 84.42 | 81.76 (+2.31) |
| | 4 | 80.73 | 73.46 | 87.84 | 84.28 | 81.58 |
| | 6 | 80.17 | 71.71 | 86.49 | 84.83 | 80.80 |
| | 8 | 82.40 | 71.27 | 85.68 | 81.29 | 80.16 |
| Base Model: Gemma-2-9B-it | ||||||
| BT RM | 1 | 91.62 | 85.96 | 92.70 | 95.55 | 91.46 |
| GPM | 2 | 92.46 | 85.96 | 92.30 | 94.56 | 91.32 |
| | 4 | 93.58 | 87.72 | 92.30 | 95.71 | 92.33 (+0.87) |
| | 6 | 92.46 | 86.18 | 92.43 | 95.67 | 91.69 |
| | 8 | 91.62 | 85.96 | 92.43 | 95.89 | 91.48 |
Q1. Clarification on Preference Bias and Use of GPT-4o-mini
Thanks for your feedback. You asked why we used GPT-4o-mini as the evaluator to avoid preference bias when using GPT-4-turbo. Recent studies have identified that language models can exhibit self-bias when evaluating similar models. For instance, the paper "Pride and Prejudice: LLM Amplifies Self-Bias in Self-Refinement" (https://arxiv.org/abs/2402.11436) discusses how models can favor responses similar to their own outputs.
To mitigate this bias, we used GPT-4o-mini, which is less similar to the models being evaluated, providing a more neutral and fair assessment. Additionally, we provided results using GPT-4-turbo and DeepSeek-v2 evaluators, which show consistent improvements, reinforcing the robustness of our findings.
Q2. Results Length and Choice of Metrics
Thanks for your feedback. You inquired whether our results using GPM are generally longer than the baseline and why we focused on Win Rate (WR) over Length-Controlled Win Rate (LC WR) in Table 3.
Our GPM may produce slightly longer responses, but the performance gains are not solely due to increased length. The additional length often reflects more comprehensive and informative content, enhancing the overall quality of the responses.
Reward normalization, through L2 normalization in the embedding head, helps control output length and ensures stable training. This is consistent with findings in recent research emphasizing the role of reward normalization in improving performance.
Regarding the choice of metrics, both Win Rate (WR) and Length-Controlled Win Rate (LC WR) are important. WR captures overall preference, including content quality, coherence, and relevance, while LC WR mitigates length bias by controlling for response length. Our results show consistent gains in both WR and LC WR. The gains in LC WR, though slightly smaller, are still significant, demonstrating that our model’s improvements are due to better content quality, not just longer outputs.
We conducted further experiments to control for length bias by adjusting hyperparameters, such as the L2 normalization in the embedding head and the beta coefficient in the GPO loss function. The results, presented in the tables below, indicate that with proper tuning, we can manage output length without compromising performance.
Table 1: AlpacaEval 2.0 Results with Scale Gate and Embedding Head L2 Normalization
| Size | Type | Iter | SPPO Win Rate | SPPO Avg. Len | GPO Win Rate | GPO Avg. Len |
|---|---|---|---|---|---|---|
| base | | | 32.26 | 1959 | 32.26 | 1959 |
| 2B | BT RM | 1 | 46.09 | 1939 | 49.94 | 1929 |
| | | 2 | 58.41 | 2032 | 64.88 | 2049 |
| | | 3 | 67.14 | 2136 | 71.68 | 2151 |
| | GPM | 1 | 49.15 (+3.06) | 2066 | 57.12 (+7.18) | 2102 |
| | | 2 | 63.53 (+5.12) | 2301 | 67.78 (+2.90) | 2343 |
| | | 3 | 70.91 (+3.77) | 2498 | 74.78 (+3.10) | 2582 |
| 8B | BT RM | 1 | 36.95 | 1740 | 40.26 | 1702 |
| | | 2 | 50.36 | 1868 | 56.30 | 1933 |
| | | 3 | 58.38 | 1948 | 59.17 | 1969 |
| | GPM | 1 | 41.42 (+4.47) | 1861 | 46.64 (+6.38) | 1850 |
| | | 2 | 56.07 (+5.71) | 2029 | 60.37 (+4.07) | 2115 |
| | | 3 | 63.42 (+5.04) | 2385 | 67.48 (+8.31) | 3249 |
Thanks for the clarification and additional experiments.
wrt Q2. Results Length and Choice of Metrics
In the paper on "length-controlled AlpacaEval" (https://arxiv.org/pdf/2404.04475), the authors reduce length bias for AlpacaEval by introducing LC WR, which they demonstrate is better than WR. In addition, as shown on the AlpacaEval leaderboard, rankings by LC WR and WR can be quite different. I think it is important to demonstrate gains using LC WR.
In Table 3, for LC WR, is BT RM + GPO better than GPM + GPO most of the time?
Table 2: AlpacaEval 2.0 Results Without Scale Gate and Embedding Head L2 Normalization
| Size | Type | Iter | SPPO Win Rate | SPPO Avg. Len | GPO Win Rate | GPO Avg. Len |
|---|---|---|---|---|---|---|
| base | | | 32.26 | 1959 | 32.26 | 1959 |
| 2B | BT RM | 1 | 46.09 | 1939 | 49.94 | 1929 |
| | | 2 | 58.41 | 2032 | 64.88 | 2049 |
| | | 3 | 67.14 | 2136 | 71.68 | 2151 |
| | GPM | 1 | 49.75 | 2032 | 53.23 | 1975 |
| | | 2 | 60.76 | 2112 | 66.32 | 2155 |
| | | 3 | 68.76 | 2201 | 74.50 | 2318 |
| 8B | BT RM | 1 | 36.95 | 1740 | 40.26 | 1702 |
| | | 2 | 50.36 | 1868 | 56.30 | 1933 |
| | | 3 | 58.38 | 1948 | 59.17 | 1969 |
| | GPM | 1 | 46.39 | 1920 | 48.51 | 1900 |
| | | 2 | 54.37 | 2082 | 58.56 | 2258 |
| | | 3 | 60.13 | 2021 | 59.48 | 3985 |
Table 3: Ablation Study of Coefficient Beta on GPT-4o-mini Evaluator
| Model | Win Rate (%) | Avg. Length |
|---|---|---|
| GPO-Llama-3-8B-Instruct-Iter3_gp_2b_ww_beta0.001 | 74.78 | 2582 |
| GPO-Llama-3-8B-Instruct-Iter3_gp_2b_ww_beta0.002 | 73.88 | 2568 |
| GPO-Llama-3-8B-Instruct-Iter2_gp_2b_ww_beta0.001 | 67.78 | 2343 |
| GPO-Llama-3-8B-Instruct-Iter2_gp_2b_ww_beta0.002 | 67.59 | 2337 |
| GPO-Llama-3-8B-Instruct-Iter1_gp_2b_ww_beta0.001 | 57.12 | 2102 |
| GPO-Llama-3-8B-Instruct-Iter1_gp_2b_ww_beta0.002 | 56.22 | 2097 |
These tables show that both with and without L2 normalization, our GPM still outperforms the BT model. In addition, we can adjust the hyperparameter β to control the alignment strength and the average output length. We can control the output length through proper configuration, and the performance improvements are not solely due to longer responses.
We appreciate your valuable feedback, which has helped us enhance our paper. We hope that our responses have addressed your concerns comprehensively. Given the strengths you’ve acknowledged in our work, we kindly ask you to reconsider your rating.
Dear Reviewer e2YS, we wanted to express our appreciation once more for your thoughtful review of our paper.
We have provided detailed responses to your comments and questions, and we hope they have addressed all your concerns. As there are still a few days left in the discussion period, we are eager to engage in any further dialogue that might help clarify our work.
Please feel free to let us know if you have any additional questions or if there are points you’d like to discuss further. Your insights are invaluable to us, and we are committed to ensuring that all your concerns are fully addressed.
Thank you for your time and consideration.
Dear Reviewer e2YS,
Thank you for your continued engagement. We have conducted additional experiments to improve the LC WR performance of GPM + GPO.
As presented in the previous response, we varied the hyperparameter β in the GPO objective function to control the alignment strength and the average output length. By tuning β, we can balance the trade-off between model performance and response length.
Implementing a Length-Normalized GPO Objective:
To further address length bias, we conducted experiments using a length-normalized GPO (LN-GPO) objective, inspired by the length normalization term introduced in SimPO (https://arxiv.org/abs/2405.14734), which normalizes the policy log-probability by the response length.
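For illustration, one hypothetical way to write such an objective, assuming the SimPO-style length normalization is applied to the log-probability ratio inside the squared-regression form of the GPO loss (here $s(y \succ \pi_t \mid x)$ denotes the preference score against the current policy; the exact form used in our revision may differ):

$$
\mathcal{L}_{\text{LN-GPO}}(\theta) = \mathbb{E}_{x,\; y \sim \pi_t(\cdot \mid x)}\left[\left(\frac{\beta}{|y|}\,\log\frac{\pi_\theta(y \mid x)}{\pi_t(y \mid x)} - s\big(y \succ \pi_t \mid x\big)\right)^{2}\right]
$$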
Here are the preliminary results with β = 5:
| Model | Win Rate (%) | Avg. Length | LC WR (%) |
|---|---|---|---|
| LN-GPO-Llama-3-8B-Instruct-Iter1_gp_2b | 48.31 | 2112 | 45.55 |
| LN-GPO-Llama-3-8B-Instruct-Iter1_bt_2b | 43.38 | 1951 | 45.51 |
These results indicate that with length normalization, GPM + GPO achieves higher Win Rates and comparable LC WR scores relative to BT RM + GPO.
We emphasize that length bias can be effectively controlled by tuning hyperparameters like β and incorporating length normalization into the GPO objective. By doing so, we can improve LC WR for GPM + GPO, demonstrating gains in both metrics.
We have included these updated results and analyses in the revised version of our paper to provide a comprehensive evaluation of our model under LC WR metrics.
We hope these updates address your concerns. Given your earlier positive remarks on the clarity and innovation of our approach, we kindly request that you reconsider your rating. Specifically, we feel the current rating of 3 may not accurately reflect the strengths of our work, as indicated in your comment:
Strengths:
This paper is well written and is very clear.
This paper provides a novel approach by embedding responses into a latent space to capture intricate preference structures efficiently. The preference representation learning and skew-symmetric operators are innovative and well-suited to addressing the limitations of traditional methods.
Presentation: 3 good
We will also be happy to answer any further questions you may have. Thank you very much!
Dear Reviewer e2YS,
Thank you for your continued engagement and valuable feedback. We have carefully addressed your concerns and made substantial revisions to the manuscript. Here is how we have addressed your specific points:
- Attributing Performance Gains to Capturing Cyclic Preferences:
  - We conducted ablation studies varying the embedding dimension k in our General Preference Model (GPM). The Bradley-Terry (BT) model is a special case of GPM when k = 1, modeling only transitive preferences.
  - By increasing k to higher dimensions, GPM captures cyclic preferences. We observed measurable performance gains as k increased from 1 to higher values, suggesting that the ability to model cyclic preferences contributes to improved performance.
  - This provides both quantitative and qualitative evidence supporting the causal relationship between modeling cyclic preferences and performance gains.
- Demonstrating Gains Using Length-Controlled Win Rate (LC WR):
- We implemented length normalization in our GPO objective (LN-GPO) to mitigate length bias. Our newly added experiments show that GPM + LN-GPO achieves higher LC WR and higher WR relative to BT RM + LN-GPO, demonstrating gains even when controlling for response length. We have included these results and analyses in the revised manuscript.
We believe these revisions comprehensively address your concerns. We kindly invite you to review our updated manuscript.
If you find that your concerns have been satisfactorily addressed, we would greatly appreciate it if you could consider adjusting your rating.
Thank you again for your time and thoughtful consideration.
Best regards,
The Authors
Thanks for the additional results and the discussion. The gains on LC WR are very small. This has been my primary concern with this paper. I would like to maintain the current rating due to this reason.
Dear Reviewer e2YS,
Thank you for your additional comments and for taking the time to thoroughly review our paper. We would like to address your concern by providing additional context and clarification regarding our contributions and experimental results.
1. Scope of Our Research and Key Questions
Our paper focuses on reward modeling in language models, and we have structured our work around three primary research questions:
- Q1: Can the General Preference Model (GPM) effectively capture and model cyclic and intransitive preferences, where traditional models like the Bradley-Terry (BT) reward model struggle?
- Q2: How does GPM perform on standard preference modeling benchmarks (RewardBench) compared to the BT model?
- Q3: How does using GPM for downstream policy optimization impact language model performance on real-world tasks such as AlpacaEval compared to reward-based approaches?
2. Performance on Q1 and Q2
In addressing Q1 and Q2, our experiments demonstrate that GPM consistently outperforms the BT reward model:
- Cyclic and Intransitive Preferences: GPM effectively captures and models cyclic and intransitive preferences, a significant limitation in traditional reward models like BT RM. This capability is crucial for accurately reflecting complex human judgments in language models.
- Benchmark Results: On standard preference modeling benchmarks such as RewardBench, GPM shows superior performance compared to the BT model, validating its effectiveness in various settings.
3. Performance on Q3
For Q3, we evaluated the impact of GPM on downstream policy optimization:
- Win Rate Improvements: Our experiments indicate that GPM outperforms BT RM under the standard Win Rate metric across different evaluators and learning objectives, including SPPO, GPO, and LN-GPO.
- Alignment Performance: While the gains in LC WR are modest, GPM achieves comparable or better alignment performance without sacrificing computational efficiency.
We acknowledge that the improvements in LC WR are small. However, several factors contribute to this outcome:
- In our experiments, we used the same hyperparameter β for both BT RM and GPM. Upon further analysis, we discovered that the reward scores produced by BT RM and GPM are not in the same range (the average BT RM reward is approximately 5.769, while GPM’s is about 7.462). This discrepancy suggests that different optimal β values may exist for each model.
- Due to limitations in computational resources and budget for extensive evaluations using AlpacaEval 2.0 (which relies on the OpenAI GPT-4 API), we were unable to perform comprehensive hyperparameter tuning within the rebuttal period.
- We are conducting additional experiments with optimized hyperparameters for both BT RM and GPM. We believe that appropriate tuning could lead to more significant gains for GPM. We will include these updated results in the final version of the paper.
We appreciate your recognition of our additional results and the discussion provided. Our primary aim is to present a more general and expressive reward modeling approach that can handle complex preference structures, which we believe is a meaningful advancement in the field.
Thank you again for your valuable feedback. We remain open to any further questions or discussions you may have.
Best regards,
The Authors
This paper studies the intransitive (cyclic) preference problem in reward modeling. Traditional reward models such as the BT model and PairRM cannot handle cyclic preference scenarios. To address this issue, this paper proposes the General Preference Model (GPM), which embeds the preference information in a latent space (called preference representation learning). The proposed reward model can be applied to common direct preference alignment methods such as DPO and SPPO. The experimental results validate the effectiveness of GPM in handling cyclic preference scenarios.
Strengths
- This paper studies the intransitive (cyclic) preference problem and analyzes the weaknesses of the traditional BT and PairRM reward models. This problem in preference reward modeling is very interesting and needs more effort from the community.
- The proposed GPM method is well-motivated; it keeps the computational complexity of the BT model while being able to handle cyclic preferences at the same time. Besides, this method can be adapted into direct preference alignment methods (though it may not be directly usable in an RLHF pipeline with policy-gradient-based methods, e.g., PPO, since it is pair-based optimization; please correct me if I understand this wrong).
- The experimental results demonstrate the effectiveness of the proposed method in handling the cyclic preferences.
缺点
- It may be not appropriate to use Games such as Tic-Tac-Toe, Go, starcraft Elo system to establish the motivation of the cyclic preference. This paper mainly studied the language model preference, i.e., the generated response for the given question. Generally, for the given user question, the preferences of candidate responses satisfies total ordering, which means there will be nearly no cyclic preference cases. Think about in the preference annotation workflow, for the same question, it rarely has the case Response A > Response B > Response C while Response A < Response C, except changing the criteria of evaluating the response. Also, the success of Lmsys Chatbot Arenas system can also validate this point.
- The strength of GPM in terms of computational complexity is overclaimed. Starting at L324, the authors claim that they have advantage in computational efficiency. However, the most commonly used BT reward model only has O(K) complexity (K times inference forward), and does require the embedding computation, which means it has better efficiency than the proposed GPM. I hope the authors can revise this paragraph to make it clear.
- Actually, cyclic preferences are very rare in chatbot (LLM generation) scenarios, so the authors had to construct a specially designed cycle "instruction following ≻ honesty ≻ truthfulness ≻ helpfulness ≻ instruction following" for the UltraFeedback dataset. The construction of this preference criterion needs to be explained and justified. How well does it align with the overall score? It is also suggested to provide the accuracy in terms of the overall score in Table 1. As it stands, the results may not directly validate the effectiveness of the proposed method, since the constructed criterion is not well justified.
- The alignment performance of LLMs trained with GPM is somewhat moderate, with no surprises. As shown in Table 3 and Table 4, the performance of GPM at iteration 3 is only marginally better than methods using the BT model. Notably, for a fairer comparison, length bias in the preferences should be controlled. The length-controlled win-rate results are shown in Table 6 (Appendix), which show that GPM has no significant advantage over the BT model in aligning the LLM. I suggest the authors report the LC win rate in the main body, as the LC win rate is fairer.
Questions
Please see the weaknesses. My major concerns are the prevalence and generality of cyclic preferences in LLM alignment (chatbot scenarios), and the relatively weak alignment performance when using GPM.
Dear reviewer, thank you for your thoughtful and detailed feedback on our submission. We appreciate the opportunity to address your concerns and clarify several points in our work.
- Relevance of Cyclic Preferences in LLM Alignment
Thanks for your feedback. While annotation processes aim for consistency, real-world applications often involve multiple, conflicting criteria that can lead to intransitive (cyclic) preferences. For instance, consider the interplay between instruction following, honesty, truthfulness, and helpfulness:
- A response that strictly follows instructions might sacrifice helpfulness if the instructions are incomplete or misleading.
- Prioritizing honesty might lead a model to admit uncertainty, which could reduce perceived helpfulness or lead to a failure to follow instructions.
- Emphasizing truthfulness might result in blunt or overly detailed responses that are less engaging or helpful.
These conflicts aren't merely hypothetical. They reflect inherent challenges in aligning models with complex human values. Importantly, this phenomenon is theoretically underpinned by game theory. According to Theorem 1 by Balduzzi et al. (2019), "every finite function game decomposes into a sum of transitive and cyclic components." This means that in most cases, cyclic preferences are not just possible but inevitable (Open-ended Learning in Symmetric Zero-sum Games, ICML 2019).
Building on this, our own Theorems 4.4 and 4.5 (Theorem A.3) in the paper demonstrate that our method can effectively model both the transitive and cyclic components of preferences. By doing so, our approach is equipped to handle almost all possible preference structures, ensuring robustness in a wide range of scenarios.
Moreover, recent empirical studies support the presence of cyclic preferences in LLM outputs:
- Lee et al. (2023) found that annotator disagreements often stem from differing priorities among correctness, helpfulness, and style (Common Law Annotations: Investigating the Stability of Dialog System Output Annotations, ACL 2023 Findings).
- In large-scale deployments, aggregating feedback from users with varying backgrounds and needs can lead to non-transitive preferences.
Non-transitive preference probabilities can also arise from a mixture of user groups. Say half of the users have transitive preferences P(A>B)=1, P(B>C)=2/3, P(A>C)=2/3, and the other half have transitive preferences P(B>A)=2/3, P(B>C)=2/3, P(C>A)=1. The aggregated preference probabilities are then P(A>B)=2/3, P(B>C)=2/3, P(C>A)=2/3, which form a cycle.
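To make the arithmetic concrete, here is a minimal sketch in plain Python (no dependencies; the numbers simply restate the example above) that averages the two groups and confirms the aggregate cycle:

```python
# Two user groups, each with (weakly) transitive pairwise preference probabilities.
group1 = {("A", "B"): 1.0, ("B", "C"): 2 / 3, ("A", "C"): 2 / 3}
group2 = {("B", "A"): 2 / 3, ("B", "C"): 2 / 3, ("C", "A"): 1.0}

def prob(group, x, y):
    """P(x > y) for a group, using the complement when only (y, x) is stored."""
    if (x, y) in group:
        return group[(x, y)]
    return 1.0 - group[(y, x)]

# Aggregate over an equal mixture of the two groups.
for x, y in [("A", "B"), ("B", "C"), ("C", "A")]:
    p = 0.5 * prob(group1, x, y) + 0.5 * prob(group2, x, y)
    print(f"P({x} > {y}) = {p:.3f}")
# Each printed probability is 0.667, i.e., A > B > C > A: a preference cycle.
```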
By explicitly modeling cyclic preferences, our method enhances the alignment of language models with nuanced and sometimes conflicting values inherent in human judgments.
- Computational Complexities
Thank you for highlighting the need for precision regarding computational efficiency. You're correct that the BT model achieves O(K) complexity when evaluating K responses and doesn't require embedding computations. What we intended to emphasize is that our General Preference Model (GPM) maintains this O(K) complexity while also accommodating cyclic preferences, a capability the standard BT model lacks, and does so without meaningful additional computational overhead. To clarify:
- BT Model: Efficient with O(K) complexity but limited to modeling transitive preferences.
- Pairwise Preference Models: Can handle cyclic preferences but typically require O(K²) complexity due to the need to evaluate all pairs.
- Our GPM: Achieves O(K) complexity, similar to BT, while extending expressiveness to handle both transitive and cyclic preferences. The compute overhead beyond the K forward passes is a pairwise vector inner product between embeddings, which is negligible (a rough sketch is given below).

We will revise the manuscript to accurately reflect this comparison, ensuring that we present a fair and balanced view of GPM's computational advantages without overclaiming.
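As a rough, schematic sketch of why the overhead beyond the K forward passes is negligible (this is not the paper's exact architecture; the embedding dimension, the fixed block-diagonal skew-symmetric operator, and the function names are illustrative assumptions, and the random embeddings simply stand in for a trained model): each response is embedded once, and any pairwise preference probability is then a cheap bilinear form of two embeddings, so ranking K responses costs K embedding passes plus vector operations.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # illustrative embedding dimension (even, so it splits into 2x2 blocks)

# Skew-symmetric operator R built from 2x2 blocks, so that
# score(e_i, e_j) = e_i @ R @ e_j = -(e_j @ R @ e_i), allowing cyclic preferences.
R = np.zeros((d, d))
for k in range(0, d, 2):
    R[k, k + 1], R[k + 1, k] = -1.0, 1.0

def embed(responses):
    """Stand-in for one forward pass per response through a preference
    representation model; here just random unit vectors for illustration."""
    E = rng.normal(size=(len(responses), d))
    return E / np.linalg.norm(E, axis=1, keepdims=True)

def preference_prob(e_i, e_j):
    """P(response i preferred over response j) from the two embeddings."""
    score = e_i @ R @ e_j  # skew-symmetric, so P(i > j) + P(j > i) = 1
    return 1.0 / (1.0 + np.exp(-score))

responses = [f"response_{k}" for k in range(5)]  # K = 5
E = embed(responses)  # K forward passes: O(K)
# Any of the K(K-1)/2 pairwise probabilities is now a cheap inner product:
print(preference_prob(E[0], E[1]), preference_prob(E[1], E[0]))  # sums to 1
```

Because the scoring operator is skew-symmetric, the two directions of each comparison are automatically consistent (they sum to one), and the bilinear form can represent cyclic structures that a single scalar reward cannot.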
- Cyclic Criteria in the Ultrafeedback Dataset
We appreciate your request for a more detailed explanation of the cyclic preference criteria used in our ultrafeedback dataset. The cycle "instruction following ≻ honesty ≻ truthfulness ≻ helpfulness ≻ instruction following" was crafted to simulate realistic trade-offs that occur in human evaluations:
- Instruction Following vs. Helpfulness: A model may strictly follow instructions but provide unhelpful responses if the instructions are flawed or incomplete.
- Honesty vs. Truthfulness: A response might be honest about the model's limitations but not entirely truthful if it withholds certain information, and vice versa.
These criteria are not artificially constructed but mirror genuine conflicts that arise in complex tasks. By incorporating them into our dataset, we aim to evaluate how well models can navigate these trade-offs. To strengthen our justification, we will include an analysis of how these criteria align with overall user satisfaction scores.
- Alignment Performance and Length-Controlled Win Rate
Thank you for emphasizing the importance of controlling length bias in evaluation metrics. We agree that the length-controlled (LC) win rate provides a more equitable assessment of model performance. In response to your suggestion, we will move the LC win rate results from Appendix Table 6 into the main body of the paper. We will update our discussion to highlight these results, acknowledging that while the performance gains of GPM over the BT model may be modest in some cases, the key advantage of GPM lies in its ability to model general preferences without sacrificing alignment performance.
We are grateful for your constructive feedback, which has helped us identify areas for improvement in our manuscript. Thank you again for your valuable insights and for helping us improve our contribution to the field.
I thank the authors for their detailed responses, which I have read through. However, I am not satisfied with some points of the response, and I hope the authors can carefully revise the paper.
- It would be good to show some real examples demonstrating the existence of cyclic preferences in chatbot scenarios. In my opinion, given the same conditions (criterion, question, and response candidates), it is hard to find cases where response A is preferred over B, and B over C, while C is preferred over A. This situation may only arise if the conditions change (e.g., the user changes their requirement and wants a safer response instead of a more helpful one).
- In terms of computational cost, the revised manuscript still emphasizes that previous reward models incur O(N²) complexity, which implies that the proposed method has an advantage; this is overclaimed.
- Could you give me some examples from the UltraFeedback dataset? I would also like to see the correlation of each aspect with the overall score.
- Since the general alignment performance of GPM is not clearly better than that of existing reward models, can we infer that the cyclic preference issue is not prevalent in general scenarios? Can the authors provide more explanation or insights?
Consider the following situation: there are 3 different actions {a, b, c}, and we have a population composed of 3 types of humans with respective preferences P_1, P_2, P_3, defined in the following way: P_i(x ≻ y) = 1/2 for all pairs x, y, except for the following cases: P_1(a ≻ b) = 1 (thus P_1(b ≻ a) = 0), P_2(b ≻ c) = 1 (thus P_2(c ≻ b) = 0), and P_3(c ≻ a) = 1 (thus P_3(a ≻ c) = 0).
Now, let us assume these 3 types form a near-uniform distribution over humans, for example, Type 1 with probability 1/3 - ε, and Types 2 and 3 with probability 1/3 + ε/2 each. The corresponding population preference is thus P = (1/3 - ε) P_1 + (1/3 + ε/2) P_2 + (1/3 + ε/2) P_3. In the case ε > 0 (so Type 1 is slightly less frequent than the other types), a reward model will assign a slightly better reward (assuming a Bradley-Terry model) to action b, thus optimizing the expected reward (the RLHF solution) will produce a deterministic policy choosing exclusively b.
However, here we are in a situation where the preferences are not uniformly aligned across humans (Moskovitz et al., 2023). In the case of uniform sampling of humans (i.e., ε = 0), the Nash equilibrium of P is a uniform mixture over the 3 actions.
Actually, the preference model corresponding to any ε is given by:
P(a ≻ b) = (1/3 - ε) · 1 + (2/3 + ε) · 1/2 = 2/3 - ε/2,
P(b ≻ c) = (1/3 + ε/2) · 1 + (2/3 - ε/2) · 1/2 = 2/3 + ε/4,
P(c ≻ a) = (1/3 + ε/2) · 1 + (2/3 - ε/2) · 1/2 = 2/3 + ε/4,
and P(x ≻ y) = 1 - P(y ≻ x) for the remaining pairs.
By a simple calculation, we deduce that for any 0 < ε < 1/3, the Nash equilibrium of this preference model consists in selecting a and b with probability 1/3 + ε/2 each, and c with probability 1/3 - ε.
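For completeness, here is a small numerical check (a sketch assuming the parameterization above, with an illustrative ε = 0.1) that the stated mixture is indeed a Nash equilibrium: because the payoff matrix P(x ≻ y) - 1/2 is antisymmetric, the game value is 0, and M p = 0 means no pure action improves on the mixture.

```python
import numpy as np

eps = 0.1  # any 0 < eps < 1/3 works; illustrative value

# Aggregate preference probabilities under the mixture of the three human types
# (Type 1 with prob. 1/3 - eps, Types 2 and 3 with prob. 1/3 + eps/2 each):
P_ab = 2 / 3 - eps / 2
P_bc = 2 / 3 + eps / 4
P_ca = 2 / 3 + eps / 4

# Antisymmetric payoff matrix M[x][y] = P(x > y) - 1/2, with action order (a, b, c).
M = np.array([
    [0.0, P_ab - 0.5, -(P_ca - 0.5)],
    [-(P_ab - 0.5), 0.0, P_bc - 0.5],
    [P_ca - 0.5, -(P_bc - 0.5), 0.0],
])

# Claimed Nash mixture: a and b with probability 1/3 + eps/2 each, c with 1/3 - eps.
p = np.array([1 / 3 + eps / 2, 1 / 3 + eps / 2, 1 / 3 - eps])

print(M @ p)  # ~[0, 0, 0]: every pure action earns the game value 0 against p,
              # so (p, p) is a Nash equilibrium of this symmetric zero-sum game.
```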
We believe our work aligns with these findings, emphasizing the importance of modeling general preferences rather than relying solely on scalar reward models. We'd be interested to hear your thoughts on this perspective and how it relates to our approach.
We will incorporate these clarifications and examples into our revised manuscript. We hope this addresses your concerns. Thank you again for your thoughtful review.
(continued discussion on NLHF)
Here we show that even if each human has transitive individual preferences, the resulting average preference model may not be transitive. Let us consider a specific case of a preference model defined as the probability (under a random human z ∼ ν) that s(x, y, z) ≥ s(x, y', z), where s is a (deterministic) absolute scoring function:
P(y ≻ y' | x) = E_{z∼ν}[ I(s(x, y, z), s(x, y', z)) ],
where we define the function I(u, v), which behaves as an indicator for the event u > v, assigning a value of 1/2 in the case where u = v. For example, this could represent the probability that a randomly selected human z prefers choice y over choice y' in a context x. Consider the following example, where there are 3 possible responses {y_1, y_2, y_3} and 3 possible humans chosen uniformly at random: ν = Uniform({z_1, z_2, z_3}). Define the scoring function as follows (one consistent choice; only the per-human rankings matter):
- human z_1: s(x, y_1, z_1) = 3, s(x, y_2, z_1) = 2, s(x, y_3, z_1) = 1;
- human z_2: s(x, y_1, z_2) = 1, s(x, y_2, z_2) = 3, s(x, y_3, z_2) = 2;
- human z_3: s(x, y_1, z_3) = 2, s(x, y_2, z_3) = 1, s(x, y_3, z_3) = 3.
Notice that this defines a transitive preference model for each individual human z ∈ {z_1, z_2, z_3}. However, when aggregated, the preference model satisfies P(y_1 ≻ y_2 | x) = P(y_2 ≻ y_3 | x) = P(y_3 ≻ y_1 | x) = 2/3. This example thus illustrates that even if, for each individual, preferences are totally ordered, when averaged over humans the resulting preference model may be non-transitive.
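A short Python check of this aggregation, restating the score table above (the tie-handling function returns 1/2 on equal scores):

```python
# Deterministic scores s(x, y, z) for each human z over responses y1, y2, y3
# (the same Condorcet-style table as above; the context x is fixed and omitted).
scores = {
    "z1": {"y1": 3, "y2": 2, "y3": 1},
    "z2": {"y1": 1, "y2": 3, "y3": 2},
    "z3": {"y1": 2, "y2": 1, "y3": 3},
}

def indicator(u, v):
    """1 if u > v, 0 if u < v, and 1/2 on ties."""
    return 1.0 if u > v else (0.0 if u < v else 0.5)

def aggregate_pref(y, y_prime):
    """P(y > y') averaged over humans drawn uniformly from {z1, z2, z3}."""
    return sum(indicator(s[y], s[y_prime]) for s in scores.values()) / len(scores)

for y, y_prime in [("y1", "y2"), ("y2", "y3"), ("y3", "y1")]:
    print(f"P({y} > {y_prime}) = {aggregate_pref(y, y_prime):.3f}")
# Each individual ranking is a total order, yet the averaged model cycles:
# P(y1 > y2) = P(y2 > y3) = P(y3 > y1) = 2/3.
```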
- Computational Cost Claims
Thanks for your feedback. Our goal was to clarify that while traditional reward models like the Bradley-Terry (BT) model handle transitive preferences efficiently (O(N) complexity), they cannot model cyclic preferences without modification. Models that handle cyclic preferences often rely on pairwise comparisons among all responses, leading to O(N²) complexity. Our Clarification:
- BT Model: O(N) complexity but limited to transitive preferences.
- Pairwise Models Handling Cyclic Preferences: O(N²) complexity due to exhaustive comparisons.
- Our GPM: Handles both transitive and cyclic preferences with O(N) complexity by embedding responses into a latent space and computing preferences through inner products, avoiding exhaustive pairwise computations (a rough count of model calls is given below). We'll revise the manuscript to make this distinction clearer and ensure we're not overclaiming.
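As a back-of-the-envelope illustration of these counts (model forward passes only; the per-call cost is assumed comparable across models, which is a simplification):

```python
def model_calls(K: int) -> dict:
    """Forward passes needed to obtain all pairwise preferences among K responses."""
    return {
        "BT reward model": K,                           # one scalar reward per response
        "pairwise preference model": K * (K - 1) // 2,  # one call per unordered pair
        "GPM": K,                                       # one embedding per response;
                                                        # pair scores are inner products
    }

for K in (4, 16, 64):
    print(K, model_calls(K))
# K = 64: BT needs 64 calls, a pairwise model needs 2016, GPM needs 64.
```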
- Alignment Performance
While GPM's overall alignment performance may not vastly surpass existing models in all metrics, its ability to handle cyclic preferences offers benefits:
- Robustness: GPM performs consistently even when cyclic preferences are present, which traditional models might mishandle.
- Expressiveness: By modeling a wider range of preference structures, GPM captures diverse user feedback more effectively. We believe that addressing cyclic preferences is important even if they are not prevalent in every scenario: doing so improves the reliability and fairness of language models, helps them align with complex human judgments, and enhances robustness without additional computational cost.
As we mentioned before, recent research, such as "Nash Learning from Human Feedback" (NLHF) by Munos et al. (ICML 2024, Google DeepMind), highlights similar limitations of traditional reward models in capturing non-transitive preferences. They argue that preference models offer a more flexible and nuanced framework, capable of modeling the diversity of human preferences, including non-transitive cases.
Here, we illustrate that in some situations, the solution offered by the Nash equilibrium of the preference model (which we refer to as the NLHF solution) is more aligned with the diversity of human preferences than the optimum of the reward model (which we refer to as the RLHF solution).
Dear Reviewer pX7m,
Thank you for your continued engagement with our work and for your valuable feedback. We have carefully addressed all your concerns in our previous responses and have made significant revisions to the manuscript. We would like to summarize how we have addressed your specific points:
- Real Examples Demonstrating Cyclic Preferences: We provided detailed examples from the Ultrafeedback dataset to illustrate that cyclic preferences can occur in chatbot scenarios under consistent conditions; these are now presented in Appendix E of our paper.
- Computational Cost Clarification: We clarified the comparison of computational complexities in the revised manuscript. We acknowledged that while the Bradley-Terry (BT) model has O(N) complexity and handles transitive preferences efficiently, it cannot model cyclic preferences without modifications. We emphasized that our General Preference Model (GPM) also has O(N) complexity but can handle both transitive and cyclic preferences efficiently.
- Prevalence of Intransitive Preferences in General Scenarios: We discussed both theoretical and empirical evidence supporting the prevalence of cyclic preferences in language model outputs. This underscores the importance of modeling intransitive (e.g., cyclic) preferences to capture the complexity of human judgments.
We believe that these revisions and explanations address your concerns comprehensively. We kindly invite you to review our updated manuscript.
If you find that your concerns have been satisfactorily addressed, we would greatly appreciate it if you could consider adjusting your rating. Thank you again for your time and thoughtful consideration.
Best regards,
The Authors
Dear reviewer, thank you for your continued engagement with our work. We appreciate the opportunity to have a constructive discussion here.
- Real Examples Demonstrating Cyclic Preferences in Chatbot Scenarios
We understand your concerns about the presence of cyclic preferences in chatbot settings under consistent conditions. To illustrate that cyclic preferences can occur even without changing the criteria, we'd like to share examples from the Ultrafeedback dataset (https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized).
Example 1:
Prompt: "Provide a detailed history and significance of the Mississippi Arts Center in Jackson". We have three responses within the Ultrafeedback dataset:
Response A:
The Mississippi Arts Center, located in Jackson, Mississippi, is a vital cultural institution that has played a significant role in the state's arts scene for decades. It is a facility of the Mississippi Department of Archives and History and is home to the Mississippi Arts Commission, Mississippi Craft Center, Mississippi History Museum, and the Eudora Welty House Museum.
The Mississippi Arts Center was established in 1968 as the Mississippi Arts Commission. Its primary purpose is to support and develop the arts in Mississippi through a variety of programs and initiatives. Over the years, it has provided grants to arts organizations, schools, and individual artists; managed a state artist roster; developed arts education programs; and promoted cultural tourism in Mississippi.
The physical Mississippi Arts Center building opened in 2004 and houses various facilities and spaces for arts exhibitions, education, and programs. The Mississippi Craft Center is a featured part of the arts center, displaying and selling the works of local artisans and craftsmen. It also offers educational programs, classes, and workshops for both children and adults.
The Mississippi History Museum is another important component of the arts center. The museum explores the state's history from prehistoric times up to the present day, featuring exhibits on Mississippi's Native American culture, the Civil War, and the Civil Rights Movement.
In addition to these facilities, the Mississippi Arts Center is also home to the Eudora Welty House Museum, which is dedicated to the life and work of Eudora Welty, a renowned short-story writer and photographer from Mississippi. The museum features tours of her home, which has been restored to its original condition.
The Mississippi Arts Center in Jackson holds significant importance as a vital cultural institution and resource for both the local community and the state of Mississippi. It plays a crucial role in promoting the arts, preserving history, and providing educational opportunities related to Mississippi's rich cultural heritage. As a result, it has become an essential part of Jackson's cultural landscape and a must-visit for those interested in exploring Mississippi's unique history and arts scene.
Response A provides extensive information about the Mississippi Arts Center but mixes in details about other institutions like the Mississippi Arts Commission and the Mississippi History Museum, leading to partial inaccuracies.
- Instruction Following: 2 (Partially addresses the instruction but includes unrelated institutions)
- Honesty: 5 (Confident and detailed)
- Truthfulness: 3 (Some inaccuracies due to mixing institutions)
- Helpfulness: 4 (Informative but somewhat confusing)
Response B:
The Mississippi Arts Center (MAC) is a state-funded, non-profit art museum located in Jackson, Mississippi. It is the largest art museum in the state, and its mission is to "inspire creativity, enrich lives, and connect communities through the arts."
The MAC was founded in 1938 as the Mississippi Fine Arts Commission. It was originally housed in the Old Capitol building in Jackson. In 1976, the MAC moved to its current location in the former Woolworth's building on Capitol Street.
The MAC's collection includes over 15,000 works of art, including paintings, sculptures, drawings, prints, photographs, and decorative arts. The collection represents a wide range of artists from Mississippi and the American South.
The MAC also offers a variety of educational programs, including art classes, workshops, lectures, and tours. The MAC's Education Department serves over 100,000 people each year through its programs and outreach activities.
The MAC is a significant cultural institution in Mississippi. It is a place where people can come to learn about the arts, to create art, and to experience the power of art to enrich lives.
Here are some additional facts about the Mississippi Arts Center:
The MAC's building was originally a Woolworth's department store. It was designed by architect James Gamble Rogers, who also designed the Woolworth's buildings in New York City and Washington, D.C.
The MAC's collection includes works by some of the most important artists ...
Response B offers detailed facts about the Mississippi Arts Center but includes significant errors, such as incorrect founding year and location, despite being confident.
- Instruction Following: 3 (Addresses some aspects but makes errors)
- Honesty: 2 (Confident but includes significant mistakes)
- Truthfulness: 5 (Avoids fabrication but contains factual errors)
- Helpfulness: 3 (Detailed but confusing key institutions)
Response C:
The Mississippi Arts Center, also known as the Mississippi Museum of Art in Jackson, has a rich history and significance in the state's cultural and artistic landscape. Founded in 1911 as the Mississippi Art Association, its primary mission was to promote the understanding, appreciation, and enjoyment of the visual arts in Mississippi.
In the beginning, the association held exhibitions in various locations across the state. However, in 1978, the Mississippi Museum of Art at the Mississippi Agency building in Downtown Jackson was established, providing a permanent home where the association could showcase its growing collection. Over time, the museum has evolved and expanded.
In 2007, the museum relocated to a new, state-of-the-art facility at 380 South Lamar Street in Downtown Jackson. This move allowed for the enlargement of galleries, versatile exhibition spaces, and the incorporation of educational programs and events.
The Mississippi Museum of Art houses an extensive collection of over 5,800 works spanning from the early 19th century to contemporary times. Notably, the collection includes pieces from renowned Mississippi artists such as Walter Anderson and Marie Hull, as well as internationally recognized names like John Singer Sargent and Andy Warhol.
The significance of the Mississippi Museum of Art lies in its vital role in promoting the rich artistic heritage of Mississippi. It serves as a cultural hub that provides opportunities for artists, educators, and visitors to engage with and enjoy the visual arts. By offering guided tours, workshops, educational programs, and special events, the museum fosters creativity, sparks dialogue, and enriches the lives of its visitors.
In summary, the Mississippi Museum of Art holds an important place in Jackson's cultural landscape as a premier institution for showcasing regional, national, and international artworks, as well as fostering an understanding and appreciation of visual art forms.
Response C delivers a comprehensive history of the Mississippi Museum of Art, mistakenly equating it with the Mississippi Arts Center, but avoids fabrication.
- Instruction Following: 5 (Comprehensive but misinterprets the prompt)
- Honesty: 4 (Accurate information but about a different institution)
- Truthfulness: 3 (Misunderstands the instruction)
- Helpfulness: 5 (Thorough and informative)
Analysis
- Response A preferred over Response B: A is more accurate than B, which contains significant errors.
- Response B preferred over Response C: Some evaluators prefer B over C because B discusses the correct institution (despite errors), while C focuses on a different one.
- Response C preferred over Response A: Others prefer C over A because C provides a thorough and helpful response, even if it's about a different institution, whereas A includes confusing mixed references. This forms a cycle: A > B > C > A, even though the question and evaluation criteria remain the same.
Even under identical conditions, evaluators may prioritize aspects differently:
- Accuracy vs. Helpfulness: Some value factual accuracy over completeness, while others prefer a comprehensive answer even with minor errors.
- Interpretation of Instructions: Evaluators differ in how strictly they expect adherence to the prompt versus how they weigh the overall quality of the response. These subjective judgments naturally lead to cyclic preferences among responses, as the sketch below illustrates using the aspect ratings above.
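To make this concrete with the aspect ratings listed above, here is a small sketch with three hypothetical evaluator weightings (the weight values are our illustrative assumptions, not annotator data from Ultrafeedback): each evaluator's own ranking is transitive, yet the pairwise majority over the three evaluators is A > B, B > C, and C > A.

```python
# Aspect ratings from the example above:
# (instruction following, honesty, truthfulness, helpfulness)
ratings = {"A": (2, 5, 3, 4), "B": (3, 2, 5, 3), "C": (5, 4, 3, 5)}

# Three hypothetical evaluators who weight the aspects differently (illustrative values).
evaluators = {
    "values honesty and truthfulness": (0.0, 1.0, 1.4, 0.0),
    "values truthfulness and instruction following": (1.0, 0.0, 2.0, 0.0),
    "downweights truthfulness": (1.0, 1.0, 0.0, 1.0),
}

def score(weights, response):
    return sum(w * r for w, r in zip(weights, ratings[response]))

# Each evaluator's own ranking is a total order (transitive):
for name, w in evaluators.items():
    order = sorted(ratings, key=lambda resp: score(w, resp), reverse=True)
    print(f"{name}: {' > '.join(order)}")

# Majority vote over evaluators on each pair:
for x, y in [("A", "B"), ("B", "C"), ("C", "A")]:
    votes = sum(score(w, x) > score(w, y) for w in evaluators.values())
    print(f"{x} > {y}: preferred by {votes}/3 evaluators")
# Output: A > B (2/3), B > C (2/3), C > A (2/3) -- a cycle from transitive individuals.
```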
Recent studies also support the presence of cyclic preferences in LLM outputs:
- Lee et al. (2023) found that annotator disagreements often stem from differing priorities among correctness, helpfulness, and style.
- In large-scale deployments, aggregating feedback from users with varying backgrounds can lead to non-transitive preferences. Mixing preferences from different user groups can result in cycles. For example, if one third of users prefer A > B > C, one third prefer B > C > A, and one third prefer C > A > B, the aggregated pairwise majority preferences form a cycle: A > B > C > A.
"Nash Learning from Human Feedback" (NLHF) by Daskalakis et al. (ICML 2024, Google DeepMind) demonstrates that even when individuals have transitive preferences, aggregating them across a diverse population can lead to non-transitive, cyclic preferences due to differing priorities among user groups.
Dear reviewer pX7m,
We appreciate your engagement with our work and have provided additional responses to your follow-up questions. As the discussion period is ending soon, we wanted to check if there are any remaining concerns we can address. We sincerely hope that with these revisions and clarifications, you might consider raising your score for our submission.
We’re keen to have a constructive discussion and look forward to your reply.
This paper extends the BT model in a principled manner to capture preferences using a higher-dimensional embedding space (embedding dimension 1 recovers the BT model). The authors show that the extension can effectively capture cyclic preferences and corroborate this experimentally. When applied to RewardBench, the new extension leads to narrow gains over the BT model (which might be statistically insignificant given the lack of error bars). Overall, the conceptual extension presented in this paper is very interesting and was appreciated by all reviewers and the AC. However, several concerns remain after extensive discussions between authors and reviewers. In particular, the paper claims that the gains provided by the extended model are due to better modeling of cyclic preferences, but there isn't sufficient evidence (even after revisions and numerous responses) for that to be the case; the narrow gains might instead be due to a richer representation of preferences beyond scalar modeling. A more comprehensive qualitative and quantitative study of how the new model compares with BT (beyond performance numbers on benchmarks) would really help in understanding the proposal of this paper and when it might be useful. For example, the gains are much narrower in some cases and larger for other models and datasets. Finally, the results need to be presented with a statistical error analysis, as some interpretations of the results might be due to noise. Given this, the paper unfortunately needs to be rejected at this time.
Additional Comments from Reviewer Discussion
The paper has led to significant discussions between authors and reviewers, and the reviewers were eventually still unconvinced of the value of the proposed method. After reading the paper and the discussion, and also the comments that the authors had shared with the AC, I have to side with the reviewers on the raised issues.
Reject