PaperHub
Overall Rating: 5.3/10 · Decision: Rejected · 4 reviewers
Ratings: 3, 5, 5, 8 (min 3, max 8, std 1.8)
Average Confidence: 3.8
Correctness: 2.3 · Contribution: 2.5 · Presentation: 2.5
ICLR 2025

Data-Centric Human Preference Optimization with Rationales

Submitted: 2024-09-27 · Updated: 2025-02-05
TL;DR

Human Preference Optimization with Rationales

Abstract

Keywords
dpo, preference learning, alignment

Reviews and Discussion

Official Review
Rating: 3

The paper presents a new direct preference optimization method that leverages preference rationales (natural language explaining the reasons why one response is better than another). The proposed method adds a supervised loss term to the DPO objective, jointly training/anchoring the model to generate a valid preference rationale. Each preference rationale is generated with an LLM-as-Judge, augmenting a conventional binary preference dataset.

The method can be seen as a form of hybrid distillation from both preference data (DPO) and from LLM-as-Judge rationales.
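
For concreteness, a minimal sketch of this kind of combined objective is shown below; the function and argument names are hypothetical, it assumes per-example sequence log-probabilities as inputs, and it illustrates the general idea rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def rationale_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                       ref_chosen_logps, ref_rejected_logps,
                       rationale_logps, beta=0.1, gamma=1.0):
    """Illustrative combined objective: DPO preference loss plus a
    supervised negative log-likelihood term on the rationale tokens.
    All inputs are per-example sequence log-probabilities (1-D tensors)."""
    # DPO term: -log sigmoid(beta * (policy log-ratio - reference log-ratio))
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    dpo_term = -F.logsigmoid(beta * (pi_logratios - ref_logratios))
    # Rationale term: maximize the likelihood of the judge-written rationale
    rationale_nll = -rationale_logps
    return (dpo_term + gamma * rationale_nll).mean()
```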

Strengths

  • The area of generative reward modeling is important and gaining traction.
  • Promising experimental results across two datasets, performing comparably to or better than DPO.

Weaknesses

  1. Limited novelty and poor positioning with respect to the growing literature on synthetic preference generation and generative reward modeling (see missing references below, to be discussed in the paper). In addition, the authors focus entirely on direct preference optimization as an alignment method, but reward modeling + reinforcement learning remain a major paradigm for LM alignment. How does this work translate to this setting and compare to the following baselines?

References RLCD: Reinforcement Learning from Contrastive Distillation for Language Model Alignment. Yang et al., 2024. West-of-N: Synthetic Preference Generation for Improved Reward Modeling. Pace et al., 2024. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Zheng et al., 2024. Self-Taught Evaluators. Wang et al., 2024.

  2. I found the theoretical analysis and motivation for the method unclear.
  • Equation 2 (L230) -> why is the joint probability decomposed in this way? Why doesn’t the preference label also depend on the rationale? Surely there isn’t a single ground-truth preference considering what is discussed in the intro (multiple valid preference labels based on different raters’ values)? In fact, does Section 5 not use the opposite formulation (“the preference inferred from the rationale”)?
  • Information-theoretic results may be interesting but are completely relegated to the appendix, so they cannot be counted as a contribution of the paper. Authors state “Our analysis demonstrates a closed-form relationship between rationale informativeness and its alignment with true preferences”, without including any explanation for this claim. What does this mean and what is the form of this relationship?
  3. Finally, the experimental setup is too weak to demonstrate the added value of the proposed method.
  • Is performance improvement statistically significant? Fig 2 suggests that DPO > RDPO with 1K Ultrafeedback data, but we obtain the opposite result in Fig 3. If the result is due to statistical uncertainty, this should be measured and shown on plots (RDPO outperforms DPO by a similar margin in Fig 2, which could therefore not be statistically significant).
  • Preference dataset sizes are typically >>11K (see top-performing RMs on RewardBench, for example). Why did the authors focus their analysis on such a small, non-representative dataset size? Also, why is there no improvement in performance with DPO beyond 1K preferences?
  • Related: L353, why pick the DPO model trained with 12K Ultrafeedback preferences as baseline, if its SFT performance is lower than that of models trained on less data?
  • Why not evaluate model performance on established benchmarks such as RewardBench/AlpacaEval?
  • How does RDPO with poor quality rationales (e.g. permuted / opposite) perform against standard DPO? I imagine much worse, since we are training on biased information. How can practitioners ensure that their rationales’ quality is sufficiently high to afford gains and not harm performance?
  • Why is RDPO performing similarly to DPO when trained on Llama-3-8B in Figure 5?

Questions

See weakness points above.

Some additional questions I had when reading the paper, that I believe should be clarified:

  • How are draws measured in Fig 2?
  • I don’t understand this sentence: “the drawing rate for the RDPO model is stable and low across different training data sizes, which shows that RDPO winning rate is higher not due to flipping the draw points but the losing points.” Can authors clarify?
  • Fig2 caption typo: Winrare
Comment

Dear Reviewer EsKw,

We want to thank you for your elaborate comments and suggestions, which led to valuable extensions of our work. We have carefully addressed your concerns and welcome any additional feedback. Please let us know if you have any further questions; we will be happy to address them.

Kind Regards,
Paper8553 Authors

Comment

We appreciate the references provided by the reviewer, as they represent significant contributions to the field. Our work primarily focuses on the DPO extension, highlighting its adaptability to other pairwise preference learning methods, such as ORPO, due to our formulation's ability to incorporate a rationale loss. Since RLCD shares similarities with DPO, it could potentially support contrastive prompts with the appropriate rationales. The same reasoning applies to the West-of-N method, which selects the best and worst responses from a pool for pairwise preference training. Our framework could extend this approach by including rationales for each response pair. Self-Taught Evaluators also share some similarities with our approach, as they create judgments to identify better responses. However, their judgments are typically limited to selection without providing detailed rationales. Moreover, they consider their method in the SFT setting, flattening all data into instructions, which differs from our preference learning objective. Additionally, in the synthesis generation process, they generate one general response and one clearly orthogonal response, which is not assumed in our work.

Comment

Joint probability decomposition

We do not assume the conditional independence of preference and rationale given the prompt x. Therefore, if we were to apply Bayes' rule to the joint probability in the reverse order, we would obtain $p^*(y_w \succ y_l, r \mid x) = p^*(y_w \succ y_l \mid x, r) \cdot p^*(r \mid x)$, which demonstrates that the preference also depends on the rationale. We choose to decompose the probability in the original way because it allows us to separate the common preference term from the rationale term, which can be easily integrated into most pairwise preference learning methods.
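
Written out, the two equivalent factorizations of the joint distribution are (the first corresponds to the decomposition described above, the second to the reverse order):

```latex
% Factorization separating the preference term from the rationale term:
p^*(y_w \succ y_l, r \mid x) = p^*(r \mid x, y_w \succ y_l)\, p^*(y_w \succ y_l \mid x)
% Equivalent reverse-order factorization (preference conditioned on the rationale):
p^*(y_w \succ y_l, r \mid x) = p^*(y_w \succ y_l \mid x, r)\, p^*(r \mid x)
```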

Information-theoretic results

Due to space constraints, we have moved our theoretical results to the Appendix and kept a general overview of our results in the main text. We apologize for any inconvenience or potential confusion this may cause. Our first result (Theorem 1) establishes the connection between the mutual information of the true preferences and rationale given the prompt, I(Z;R∣S), and the informativeness of the rationale. Specifically, this dependency indicates that as the informativeness of the rationale about the preference increases, the mutual information also increases. This suggests that higher-quality rationales can enhance the understanding of the true preferences. Our second result (Theorem 2) shows that training with rationales can reduce generalization error, especially when the rationale is useful for predicting the preference, which can eventually boost learning efficiency.
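
For reference, the conditional mutual information appearing in Theorem 1 is the standard quantity

```latex
% Conditional mutual information of preference Z and rationale R given prompt S:
I(Z; R \mid S) = H(Z \mid S) - H(Z \mid R, S)
```

so a more informative rationale reduces the residual uncertainty H(Z | R, S) about the true preference, which is the sense in which the mutual information increases.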

Comment

How are draws measured in Figure 2?

We use the LLM as a judge to determine which response is preferred between models. However, in cases where the judge cannot decide on a preferred response, we consider it a draw. Additionally, we shuffle the responses to prevent any ordering bias from influencing the judge’s decision.

Clarification of the drawing rate

We apologize for the confusing phrasing of our claim. In the experiment, we observed that as the data size increases, the RDPO winrate improves relative to DPO, while the draw rate remains constant. This indicates that the increase in RDPO's winrate over DPO is due to converting previously losing cases into winning ones. We hope this clarifies our claim.

Typo

Thank you for catching that.

Comment

Explanation of Figures 2 and 3

In Figure 2, we compare DPO and RDPO against the SFT model, but both DPO and RDPO involve varying data sizes. As the reviewer pointed out, DPO outperforms RDPO in the win rate against SFT at 1K data points. In Figure 3, we fixed the DPO model at 12K data points for Ultrafeedback and varied the RDPO training data size, which may have contributed to the potential confusion. We apologize for this oversight. For the plots, we conducted the winrate evaluation three times and observed consistent scores with negligible standard deviation. We apologize for any misunderstanding.

Why focus on a small dataset size?

While training on larger datasets is possible, our primary goal is to enhance data quality for preference learning by augmenting the dataset, as an alternative to simply increasing annotated data. We demonstrate data efficiency by reducing the number of annotated pairs needed—by 2 to 4 times compared to DPO—while achieving similar performance. Although increasing annotated data could potentially improve performance, this is not guaranteed. Many popular pairwise preference datasets are similar in scale to ours, such as the OpenHermes2.5-DPO dataset (https://huggingface.co/datasets/argilla/OpenHermes2.5-dpo-binarized-alpha), the multi-turn preference dataset Capybara Preferences (https://huggingface.co/datasets/argilla/Capybara-Preferences-Filtered), and the code preference dataset CodeUltraFeedback (https://huggingface.co/datasets/coseal/CodeUltraFeedback_binarized). Regarding the Ultrafeedback dataset (original version), it appears that DPO does not significantly improve performance over SFT, although DPO still achieves a majority winrate against the SFT model. However, for the ORCA dataset, we observe a performance increase with the DPO model, as shown in Figure 2.

Choice of DPO model

We selected the model that had converged on the winrate later in training, as it had been exposed to more preference data.

RewardBench/AlpacaEval Benchmarks

We evaluated our results on the AlpacaEval 2.0 dataset as a benchmark. In our experiments, we focused on the direct winrate between the DPO and RDPO models' own responses to assess performance, rather than comparing against a third-party model's responses (GPT-4 in the case of AlpacaEval). This approach allowed us to directly observe the impact of adding rationales to the training, providing a clearer comparison with models trained without them.

Poor quality rationales

When RDPO is trained with low-quality rationales, we observe a degradation in performance, with standard DPO (without rationales) performing better. To improve the quality of the rationales, we recommend using stronger models for generation (in our case, Llama-3-8B-Instruct is sufficient, though more powerful models are emerging with ongoing developments). Additionally, after generating the rationales, we suggest verifying their correctness by leveraging a mixture of verifiers for robust assessment. For high-stakes domains, we envision having experts supervise the rationale generation process to ensure correctness and maximize the informativeness of the rationales. Additionally, we can leverage multiple models to improve the quality of the rationale, which can boost the valuable cues for the model to learn.

RDPO performance in Figure 5

In Figure 5, we examine the effect of the source model on the generation of rationales and their impact on performance. We observe that RDPO still achieves a higher winrate over DPO when using Llama-3-8B-Instruct. However, when the rationales are generated using the weaker source model, Phi3-Mini-4K, the winrate gap narrows. This suggests that for the Llama-3-8B-Instruct model, the rationale should be generated by a stronger model than Phi3-Mini-4K, highlighting the importance of the source model. This finding aligns with our theoretical result, which indicates that more informative rationales lead to improved preference prediction.

Comment

Dear Reviewer EsKw,

We want to thank you for your helpful comments, which led to a number of interesting discussions. We have responded to each of your concerns and questions. Hopefully, you will find that they adequately address your concerns. Before the rebuttal phase is over, please let us know if you have any more questions or need any clarification. We would be happy to address them.

Best Wishes,
Paper8553 Authors

Comment

Dear Authors,

Thank you for your rebuttal. Unfortunately, I retain core concerns regarding experimental rigor and missing baselines.

I disagree that because the methods mentioned in my review are designed for preference modeling, they do not constitute important experimental baselines. You state that “our primary goal is to enhance data quality for preference learning by augmenting the dataset, as an alternative to simply increasing annotated data”. These works share the same motivation; it is therefore important for the reader to know whether to invest in generating rationales for RDPO or in one of these alternative augmentation strategies.

Regarding experiments, I still do not consider a 52% win rate to be evidence for the superiority of RDPO when trained on Llama-8B. This is a similar-sized, arguably better model than the Mistral 7B model on which most of your results are based. Since RDPO does not provide performance gains across different models, it is challenging to be convinced of its added value. I recommend authors investigate the reasons for this lack of performance. I also find it very surprising that you obtain confidence intervals of zero for evaluation results as noisy as preference judgments -- I think it would be worth discussing your uncertainty estimation framework in more detail.

Comment

We appreciate the reviewer’s emphasis on the importance of experimental baselines. However, it is crucial to clarify that our method and the referenced works ([1, 2, 3]) are not directly comparable as they address different problems in preference learning. The referenced methods focus on generating synthetic preference datasets. In contrast, our work tackles a distinct challenge: enhancing existing preference datasets with rationales to improve how models learn from human preferences. While both approaches contribute to preference learning, they serve different stages of the pipeline and target different use scenarios - data generation creates new annotated preference pairs, while our method enriches existing preference annotations with explanatory depth. This fundamental difference in objectives and application scenarios means direct comparisons between these approaches would not effectively evaluate our specific contribution.

Furthermore, including the referenced methods as baselines could create a misleading impression of direct competition, overlooking their complementary potential. For instance, preference data generation and rationale augmentation address separate challenges within the preference learning pipeline and can coexist synergistically.

Given these differences, we believe our evaluation strategy of focusing on how adding rationales can enhance existing preference learning frameworks better aligns with our core contribution. To address the reviewer’s concerns, we will revise the paper to further clarify the unique roles of data generation and rationale augmentation in preference learning. This discussion will also elaborate on the specific challenges associated with self-synthetic data generation and why our approach represents a distinct and complementary contribution to the field.

Comment

We thank the reviewer for their thoughtful attention to the Llama-3.1-8B-Instruct experiment presented in Figure 5. While this case exhibits a marginal performance gap, with a win rate of 52% vs. 45% (54% vs. 46% after normalization), we respectfully emphasize that our method consistently delivers significant improvements across multiple tested settings. Specifically:

  • Across datasets: Mistral-7B-Instruct-v0.2 trained with rationale-enhanced preference optimization surpasses its counterpart trained without rationales on both the Orca and Ultrafeedback datasets, achieving win rates exceeding 60%.
  • Across models: RDPO consistently outperforms DPO on Mistral-7B-Instruct-v0.1, Mistral-7B-Instruct-v0.2, and Zephyr-beta models, achieving win rates consistently above 55%.
  • Across methods: Both RDPO and RORPO demonstrate improvements over DPO and ORPO, respectively, as shown in evaluations on Mistral-7B-Instruct-v0.2 and Llama-3.1-8B-Instruct using the AlpacaEval 2.0 benchmark.

These findings reinforce RDPO’s value as a broadly applicable preference-learning framework that robustly improves existing approaches, such as DPO and ORPO, rather than being tailored to a specific model or dataset. Regarding the Llama-3.1-8B-Instruct experiment, the modest improvements can be attributed to two key factors. First, the inherent capability of Llama-3.1-8B-Instruct surpasses Mistral-7B-Instruct-v0.2, making substantial gains more challenging to achieve on this stronger baseline. Second, general preference datasets like Orca and Ultrafeedback, which include pre-existing responses, may not be fully optimized for Llama-3.1-8B-Instruct. For example, prior works [1-5] generate new response pairs dynamically during training, producing synthetic datasets that differ substantially from our setting, which improves pre-existing preference data in an offline manner. While their online generation strategy may explain their larger improvements, this observation reveals an exciting opportunity to extend our method: creating preference pairs and incorporating rationales in an online manner. We appreciate the reviewer's insight in highlighting this direction and have added it to our discussion of future work.

This broad pattern of improvement aligns with established practices in machine learning research, where techniques are evaluated based on their overall effectiveness across multiple scenarios rather than performance in any single setting [6-9]. We have expanded our discussion of the Llama-3.1-8B-Instruct experiment in the paper to provide deeper analysis. However, we respectfully suggest that this single case be viewed in the context of RDPO’s broader demonstrated effectiveness, which remains its key contribution to the field. We believe this comprehensive evaluation provides strong evidence for our method's contribution to preference learning.

[1] Yang et al. "RLCD: Reinforcement Learning from Contrastive Distillation for Language Model Alignment." 2024.
[2] Pace et al. "West-of-N: Synthetic Preference Generation for Improved Reward Modeling." 2024.
[3] Wang et al. "Self-Taught Evaluators." 2024.
[4] Meng, Xia, and Chen. "SimPO: Simple Preference Optimization with a Reference-Free Reward." 2024.
[5] Wu et al. "Self-Play Preference Optimization for Language Model Alignment." 2024.
[6] Ethayarajh et al. "KTO: Model Alignment as Prospect Theoretic Optimization." 2024.
[7] Park et al. "Disentangling Length from Quality in Direct Preference Optimization." 2024.
[8] Pal et al. "Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive." 2024.
[9] Guo et al. "Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment." 2024.

Comment

Dear Reviewer EsKw,

Thank you for your insightful feedback, which sparked valuable discussions about our work. We have thoroughly addressed your questions. As the deadline approaches, please do not hesitate to reach out with further questions, as we would be happy to address them.

Warm Regards,
Paper8553 Authors

Comment

Dear Reviewer EsKw,

As the discussion period wraps up today, we want to emphasize how much we value your feedback. Having addressed your questions, we now kindly ask for your insights. We sincerely appreciate you taking the time to share your thoughts.

Warm Regards, Paper8553 Authors

Official Review
Rating: 5

This paper investigates whether incorporating rationales along with binary preference data can help improve preference alignment. To this end, the authors propose rationale-DPO (RDPO), an extension to the popular alignment method DPO. They compare the two algorithms on different datasets (Orca, UltraFeedback) and for different models (Mistral-7B-v0.1, Zephyr-7B-Beta, etc.). The authors also propose a simplified information-theoretic analysis to better understand rationale-based preference modeling.

Strengths

This study is well motivated and adds to the literature on incorporating richer feedback (the addition of rationales in this case) for more efficient RLHF alignment.

Weaknesses


  1. While the problem is well motivated, the methodology of maximizing the likelihood of generating the given preference and the rationale is a very intuitive and simple method and is not significantly novel.
  2. Difficulty collecting data: procuring rationales can be more expensive than getting just binary feedback. In addition, in certain cases, such as comparing artwork, it might not be possible for humans to explain their choice. While using LLMs to generate rationales is an efficient way of scaling the method, there is a risk of getting a misaligned response if that model is misaligned (e.g., not harmless), and it may also lead to an echo chamber, as no new perspective beyond what the rationale-generating LLM believes is true will be in the dataset. How do you envision addressing these challenges?
  3. In Figure 2, it seems that the DPO win rate is only lagging behind RDPO by ~5-8% for the same amount of data points; however, RDPO requires a lot more text for a single data point.

Questions

  1. Instead of plotting data points w.r.t. performance metrics, it would be worthwhile to plot the total number of text tokens used for training w.r.t. the performance metrics. For example, if the rationale itself is considerably longer than the original texts being compared, it can contain a lot more information, which might explain the improvement in performance. Additionally, it is also worthwhile to report the average training time for both procedures.

  2. For the vs DPO and vs SFT section, can you please provide the exact formula you used to compute the win rates? Are there any tie-breaking rules?

Comment

The method is very intuitive and simple.

We present a novel, data-centric approach to advancing the field of preference learning. Our method’s simplicity enables seamless integration with existing techniques, such as DPO, ORPO, and potentially SimPO or SPPO, as discussed in our paper. By incorporating rationales, we demonstrate how models can more effectively and efficiently learn from preference datasets by explicitly understanding the reasoning behind human choices. This rationale-driven approach shifts the focus from algorithm-centric improvements to uncovering the underlying logic of preferences. Our findings reveal that providing sufficient explanations significantly enhances model performance, offering a fresh and previously underexplored perspective on preference learning.

Difficulty of collecting data

We appreciate the reviewer's perspective on data collection. As illustrated in Figure 2, simply increasing the number of preference pairs does not always lead to better performance. Our method prioritizes extracting deeper insights from existing preference data. In specialized domains, generating additional valid preference pairs can be particularly challenging due to limited domain expertise. Our approach seeks to maximize learning from expert-curated data while addressing these limitations. Additionally, obtaining high-quality human preference data is often expensive. By leveraging rationales, our method enhances the utility of existing data without requiring further collection efforts. We believe rationales help guide the model to adopt correct reasoning patterns, avoiding pitfalls like reliance on superficial features (e.g., response length), which are common in traditional preference learning approaches [1,2]. Although generating rationales introduces additional LLM inference costs, the expense for a 500K dataset can remain within $100. This cost is offset by the significant gains in learning efficiency, making it especially valuable when high-quality preference data is scarce or costly. Moreover, in cases where human annotators themselves struggle to determine a preferred answer due to unclear reasoning, it is inherently challenging for a model to discern the rationale behind a choice. These scenarios demonstrate the inherent difficulty of the task, regardless of the approach used. Finally, when addressing model misalignment, we recommend employing multiple diverse models to mitigate biases and outliers in the generated rationales. This strategy can improve the quality of the rationale signals provided to the model, further enhancing its learning outcomes.

[1] Azar, Mohammad Gheshlaghi, et al. "A general theoretical paradigm to understand learning from human preferences." International Conference on Artificial Intelligence and Statistics. PMLR, 2024.
[2] Park, Ryan, et al. "Disentangling length from quality in direct preference optimization." arXiv preprint arXiv:2403.19159 (2024).

Explanation of Figures 2 and 3

While Figure 2 shows that RDPO outperforms DPO by approximately 5-8 percentage points, our approach also achieves the same win rate against SFT as DPO, but with significantly fewer pairwise data points. By enhancing the data with rationales, we reduce the annotation effort by a factor of three. Additionally, Figure 2 highlights the win rate against the SFT model, serving as an intermediate comparison between DPO and RDPO. In the direct comparison, depicted in Figure 3, RDPO achieves a win rate of 60-65% against DPO across various training data sizes for both the Orca and Ultrafeedback datasets.

Comment

Runtime and average length

We appreciate the reviewer's suggestion. We report the runtime for RDPO and DPO for one epoch on Llama-3.1-8B-Instruct, using 12,000 Orca examples, as follows:

  • RDPO General: 6770 seconds
  • RDPO Specific: 6950 seconds
  • DPO: 3583 seconds

While processing additional tokens nearly doubles the runtime, RDPO compensates for this by requiring fewer annotations while achieving comparable or superior performance to DPO. Additionally, we report the average response lengths for the Orca dataset:

  • Chosen responses: 786
  • Rejected responses: 981
  • Rationale responses: 411

Winrate computation

To compute the winrates against DPO and SFT, we generated responses from each model for 512 fixed test samples from a given dataset. For each comparison (RDPO vs. DPO and RDPO vs. SFT), we used an LLM judge to select the better response. If the judge could not determine a preferred response, the comparison was marked as a draw. The win rate is calculated as the number of RDPO responses preferred, divided by the total of 512 samples.
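
A minimal sketch of this procedure is given below; the judge interface and verdict labels are hypothetical, and only the response shuffling and draw handling described above are assumed.

```python
import random

def judged_win_rate(prompts, responses_a, responses_b, judge, total=512):
    """Count wins of model A over model B with an LLM judge.
    `judge(prompt, first, second)` is assumed to return "first", "second",
    or "tie"; the response order is shuffled to avoid position bias."""
    wins = draws = 0
    for prompt, a, b in zip(prompts, responses_a, responses_b):
        a_first = random.random() < 0.5
        verdict = judge(prompt, a, b) if a_first else judge(prompt, b, a)
        if verdict == "tie":
            draws += 1
        elif (verdict == "first") == a_first:
            wins += 1
    return wins / total, draws / total
```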

Comment

Dear Reviewer 5bwS,

We want to thank you for your helpful comments, which led to valuable discussion about our work. We have carefully addressed your concerns and welcome any additional feedback. Please let us know if you have any further questions, we will be happy to address them.

Kind Regards,
Paper8553 Authors

Comment

Dear Reviewer 5bwS,

We would like to thank you for your comments and suggestions; we tried our best to address every question raised. We hope that our answers could resolve your concerns. We are happy to address additional suggestions. Since the rebuttal period is closing soon, we would love to be able to respond to any further questions.

Best Wishes,
Paper8553 Authors

Comment

Dear Reviewer 5bwS,

Thank you for your thoughtful feedback, which contributed to meaningful discussions about our work. We have thoroughly addressed your questions and updated our work to reflect the changes. With the deadline approaching, we welcome any additional input and are more than happy to address any further questions.

Best Regards,
Paper8553 Authors

Comment

Dear Reviewer 5bwS,

As the discussion period wraps up today, we want to emphasize how much we value your feedback. Having addressed your questions, we now kindly ask for your insights. We sincerely appreciate you taking the time to share your thoughts.

Warm Regards,
Paper8553 Authors

Official Review
Rating: 5

The paper introduces a new method for incorporating machine-generated rationales into preference fine-tuning, enhancing language models’ performance without extra human annotation. The authors demonstrate that maximizing rationale likelihood alongside preference loss improves model efficacy.

Strengths

  1. The paper proposes a new approach to integrate model-generated rationales in preference tuning, avoiding the need for additional human labels.
  2. Experimental results show that optimizing rationale likelihood alongside preference loss boosts model performance, reducing annotation needs and training data volume.

Weaknesses

  1. The proposed method can be seen as a combination of the preference loss such as DPO and the rationale log-likelihood. The paper lacks further exploration of how the two components contribute to improved performance. A few questions are:

    • a. In the ablation study on γ, it seems the scale of γ (from 1.0 to 10.0) does not matter at all. Did the authors try smaller γ or extremely large γ?
    • b. How does tuning solely on rationale likelihood without DPO loss affect performance? Will the performance increase?
    • c. Justification is needed for a variable γ given the theoretical suggestion of γ = 1.
  2. Experimentation lacks rigor and thoroughness:

    • a. Reporting win-rate against DPO alone does not fully capture the rationale’s benefit. It is hard to evaluate the absolute improvement brought by the rationale loss. It would be better to report win-rate against a fixed opponent such as GPT-4 on AlpacaEval 2.0. This can ensure that the baseline DPO model is properly trained to a satisfactory performance.
    • b. Another related question is that there is no evidence that the DPO model in this paper is fully optimized. One may question if the dataset is weak or if the hyperparameters are adequately explored. For example, Llama3-8b-instruct + Ultrafeedback annotated by PairRM (see SimPO’s GitHub page for their dataset and model performance) can achieve a 40% LC win-rate, and the LC win-rate reported in the appendix is below 25%. I understand that SimPO did not release their training configuration, but the point here is that one cannot effectively conclude that the rationale loss significantly improves the performance.
    • c. The length bias is a key issue in preference fine-tuning. In the main text, it is reported that RDPO can produce much shorter responses and maintain a higher win-rate against DPO. This is quite surprising and deserves more analysis or explanation from the authors. On the other hand, in section B.4, the length on the AlpacaEval 2.0 dataset remains close to DPO or the original model.

Questions

See Weaknesses.

Comment

Dear Reviewer jgVq,

We want to thank you for your helpful feedback, which led to valuable discussion about our work. We have carefully addressed your questions and welcome any additional feedback. Please let us know if you have any further questions, we will be happy to address them.

Kind Regards,
Paper8553 Authors

Comment

Winrate against a fixed opponent

We agree with the reviewer that using a fixed comparison would be easier to interpret than a relative one. To address this, we have used a fixed SFT model for comparison [Figure 2]. However, while RDPO and DPO achieving the same win rate against the fixed model may suggest similar performance, it does not necessarily imply that RDPO and DPO are equivalent. Therefore, we also conducted direct comparisons between the two models to obtain their head-to-head win rate [Figure 3]. Additionally, for some experiments, we evaluated the models using AlpacaEval 2.0 [Figures 9 and 10].

DPO is not fully optimized

We completely agree with the reviewer that DPO could potentially be further optimized with better hyperparameters. In our work, we used the hyperparameter settings commonly adopted for DPO and ORPO. Using the available dataset, we were able to achieve an improved LC Winrate on the AlpacaEval 2.0 for both models and datasets. The higher win rates observed in other models are likely due to their use of different datasets and preference learning objectives. For instance, both SimPO and SPPO use the Ultrafeedback dataset but generate responses based on scores from an external reward model. Therefore, we can only assess the benefit of rationales within the context of the given dataset and learning objectives we used, where we observed positive results. Additionally, the Ultrafeedback dataset, created by SimPO with assistance from PairRM, is highly optimized to score well on the AlpacaEval benchmark. However, a closer inspection reveals that some response pairs do not have clear preferences, such as:

A: "The Carr Fire is burning in Shasta and Trinity Counties in California, United States. It was reported at the intersection of Highway 299 and Carr Powerhouse Road in the Whiskeytown district of the Whiskeytown--Shasta--Trinity National Recreation Area."
B: "The Carr Fire is burning in Shasta and Trinity Counties in California, United States, specifically at the intersection of Highway 299 and Carr Powerhouse Road in the Whiskeytown district of the Whiskeytown--Shasta--Trinity National Recreation Area."

Or:

A: "According to the passage, Quinn would be described as a 'very good friend'."
B: "According to the paragraph, Quinn would be described as a 'very good friend'. This is a direct reference to the phrase 'because he was a very good friend'."

These examples suggest that the method may rely on a reward model tailored to excel on a specific benchmark, rather than providing the model with a true understanding of human preferences.

Dataset Links:
https://huggingface.co/datasets/princeton-nlp/llama3-ultrafeedback
https://huggingface.co/datasets/princeton-nlp/mistral-instruct-ultrafeedback

Produced Text Length

In our experiments, we observe that RDPO generates shorter responses compared to DPO for the test sets of their respective datasets. In the AlpacaEval 2.0 benchmark, both RDPO and RORPO produce shorter responses on average than the original model, and in most cases, shorter than DPO and ORPO. However, as the reviewer noted, the difference is minimal in some instances. This could be attributed to the style of questions in the AlpacaEval 2.0 benchmark, which often require longer responses to address open-ended queries. In contrast, the datasets evaluated in the main paper include a mix of both closed-ended and open-ended questions.

Comment

I thank the authors for their detailed response. My concerns about the experiments have been addressed.

However, I believe more effort should be devoted to examining the function of the two components: the pairwise alignment loss (e.g., DPO, IPO, etc.) and the rationale SFT loss. The core contribution of the paper lies in proposing the rationale SFT loss and demonstrating its benefit to alignment when used in combination with the pairwise alignment loss.

It is therefore important to evaluate how much the rationale SFT loss contributes on its own. As acknowledged by the authors and noted by other reviewers, the rationale SFT loss does not appear to be novel, at least in isolation. Nevertheless, the combination of these components is reasonable and valuable. For this paper to establish its position in the literature, a deeper examination of the interaction between the DPO loss and the rationale loss is necessary.

Comment

We sincerely appreciate the reviewer’s insightful feedback and fully agree on the importance of analyzing the contributions of the rationale SFT loss and the pairwise alignment loss (e.g., DPO) independently, as well as exploring their synergistic effects. To address this, we conducted a series of experiments to isolate the impact of each component and evaluate their combined effect. Specifically, we investigated an extreme case where the rationale loss alone drives preference optimization, with the DPO alignment loss set to zero. This approach was based on the hypothesis that rationales inherently encode preferences by combining preference-response pairs, the preferences themselves, and the associated reasoning processes, thereby providing a rich and effective training signal.

            RDPO (Preference + Rationale)   DPO (Preference-Only)   Rationale-Only
General     64.5                            59.1                    61.8
Detailed    64.4                            59.1                    61.3

For these experiments, we fine-tuned Mistral-7B-Instruct-v0.2 on the Orca dataset across three settings: RDPO (combining DPO and rationale loss), DPO (excluding rationale loss), and Rationale-Only (excluding DPO loss). The results, as shown in the table above, reveal that rationales alone can substantially improve model performance, achieving a high win rate of over 61% without explicit pairwise preference modeling. This improvement likely stems from the informational richness embedded in rationales, which compensates for the absence of pairwise alignment. While DPO also demonstrated a majority win rate against the SFT baseline (above 59%), training with both rationale and preference losses (RDPO) consistently achieved the highest win rate (64.5%) across both general and detailed settings. This highlights the benefit of integrating rationales into the preference objective, effectively leveraging the strengths of both losses to produce superior performance.

To further investigate how rationales enhance DPO preference learning, we examined the reward margin metrics. As shown in the table below, RDPO not only achieved higher reward margins between chosen and rejected responses but also demonstrated faster convergence compared to DPO. This can be explained through the following: while DPO explicitly aims to maximize reward margins, the inclusion of rationales provides an implicit quality signal, offering explanations for the differences between chosen and rejected responses. This signal reinforces the model's ability to improve reward margins by guiding it toward more informed preferences.

Training Points   0     1000  2000  3000  4000  5000  6000  7000  8000  9000  10000  11000
DPO               0.00  0.05  0.19  0.32  0.42  0.49  0.54  0.58  0.62  0.63  0.65   0.66
RDPO              0.00  0.10  0.25  0.46  0.67  0.76  0.83  0.85  0.86  0.87  0.89   0.91

These findings underscore the complementary nature of the rationale SFT loss and the pairwise alignment loss. While DPO explicitly optimizes reward margins, the rationale prediction loss provides supplementary supervision, enabling the model to learn the reasoning underlying response preferences. This integration not only strengthens the selection process but also accelerates training convergence. By combining these two approaches, RDPO amplifies their individual strengths, resulting in more efficient and effective preference learning.
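
For reference, the reward margin here is taken in the standard DPO sense, i.e., the gap between the implicit rewards of the chosen and rejected responses (β is the DPO temperature):

```latex
% DPO implicit reward and the chosen-vs-rejected reward margin tracked during training:
\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}, \qquad
\text{margin}(x, y_w, y_l) = \hat{r}_\theta(x, y_w) - \hat{r}_\theta(x, y_l)
```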

Comment

Exploration of components

While we agree with the reviewer that conducting more in-depth experiments to evaluate the interaction between the preference training objective and the rationale objective would be valuable, we have undertaken several analyses to explore this relationship. Specifically, we have studied the impact of incorporating rationales into the original preference learning objectives. Our investigations include assessing performance improvements by varying dataset sizes, evaluating rationales of different quality, conducting parameter ablation studies, and integrating rationales into various preference learning methods (e.g., DPO, ORPO) across different models and datasets. Notably, we observed that adding rationales designed to promote specific properties (e.g., conciseness) consistently imparts those traits across models and datasets—a behavior not evident in the original DPO approach. This suggests an intriguing avenue for future research: exploring whether rationales can transfer certain meta-properties to models.

Scale of gamma

As gamma decreases towards 0, we observe performance degrading to that of vanilla DPO/ORPO. Starting around a value of 0.2, we see the benefit of incorporating rationales into the preference dataset, which enhances the performance of the rationale-enhanced model compared to the vanilla preference-trained model. Performance then stabilizes as gamma approaches 10, with a slight decline in performance occurring as gamma increases beyond 10, up to 100. However, it is important to note that for different preference objectives, the optimal value of gamma may vary, highlighting the significance of gamma in achieving the best performance.

Rationale without DPO

We appreciate the reviewer’s suggestion. While our framework was introduced to complement existing preference learning objectives for more efficient data learning, we have not yet explored the use of rationales independently, without the preference objective. This suggestion is akin to the concept of Self-Taught Evaluators, where the preference learning objective is removed and models are supervised fine-tuned. We value this input and see it as a promising direction for future work.

Wang, Tianlu, et al. "Self-taught evaluators." arXiv preprint arXiv:2408.02666 (2024).

Gamma in theory and in practice

The gamma introduced in the theoretical formulation (β = 0.5 + γ) represents the informativeness or quality of the rationale, while the gamma used in our experiments refers to the influence of the rationale loss on the preference training, which is independent of the rationale's quality. We apologize for any confusion caused by the notation.

Comment

Dear Reviewer jgVq,

We want to thank you for your helpful comments, which led to a number of important extensions of our work. We have addressed each of your questions. Please let us know if you have any more questions; we would be happy to address them within the allowed period.

Best Wishes,
Paper8553 Authors

Comment

Dear Reviewer jgVq,

Thank you for your feedback! We have addressed your concerns and would greatly appreciate any additional feedback you may have. If there are further suggestions to improve our work, we would be happy to address them.

With Appreciation,
Paper8553 Authors

Comment

Dear Reviewer jgVq,

As the discussion period wraps up today, we want to emphasize how much we value your feedback. Having addressed your questions, we now kindly ask for your insights. We sincerely appreciate you taking the time to share your thoughts.

Warm Regards, Paper8553 Authors

Official Review
Rating: 8

This paper presents a data-centric approach to RLHF by enriching preference datasets with machine-generated rationales. These rationales offer explanations for choices between preferred and non-preferred responses, addressing ambiguity and enhancing the effectiveness of preference learning. The proposed framework integrates rationales into the training process, can save annotation costs by 3x, and leads the fine-tuned model to better performance. Extensive experiments demonstrate that rationale-enriched learning outperforms traditional methods, with benefits across various preference optimization algorithms.

This work underscores the potential of rationale-based data augmentation in preference learning, paving ways for more effective language model alignment and encouraging further exploration of unpaired preference learning scenarios.

Strengths

  1. This paper is well written. The notations are clear.

  2. It provides up-to-date literature on RLHF techniques. It underscores the potential of rationale-based data augmentation in preference learning, paving ways for more effective language model alignment and encouraging further exploration of unpaired preference learning scenarios.

  3. Among many lines of work addressing the economic utility of dataset design and construction in RLHF, mechanism design has recently been explored to enhance the overall economic utility of dataset construction. The merits of introducing mechanism design are well supported by game theory studies, both theoretically and practically:

Zhang, G., & Duan, J. (2024). VickreyFeedback: Cost-efficient Data Construction for Reinforcement Learning from Human Feedback. https://arxiv.org/abs/2409.18417

Matsushima, H., Noda, S.: Mechanism design with general ex-ante investments. Journal of Mathematical Economics 106, 102831 (2023)

  4. The experiments on Orca and UltraFeedback are convincing, with a rational theoretical analysis using mutual information as a tool and an in-depth ablation discussion in Appendix B.2.

Weaknesses

This paper underlines the impact of including rationale in the RLHF fine-tuning process. In other words, the proposed method generally leverages auxiliary data to enhance the model performance.

However, generating qualitative rationales alongside existing datasets might increase the annotation cost in dollar terms. Therefore, a breakeven analysis and operating guidance in dollar terms would make the method more straightforward for project owners with a limited annotation budget.

Questions

It would be great if the total cost (rationale annotation cost vs. fine-tuning performance) breakeven could be revealed in dollar terms, and the operating guidance could be discussed for project owners with a limited annotation budget.

One way could be to provide a detailed cost-benefit analysis, including estimated costs for generating rationales (e.g., API costs if using a language model) versus the potential savings from reduced annotation needs. This would give project owners more concrete information to assess the method's practicality within their budget constraints.

Comment

Dear Reviewer ZabL,

We want to thank you for your positive assessment. We appreciate the economic perspective on the utility of the dataset, which aligns with our motivation to improve the effectiveness of data. We also appreciate your suggestion on guiding project owners with a limited annotation budget to show the practicality of the method. Please let us know if you have further questions; we will be happy to address them.

Kind Regards,
Paper8553 Authors

Comment

As highlighted by the reviewer, our goal is to enhance the utility of the dataset by improving the information efficiency of the annotated pairs. By providing rationales alongside the annotations, we aim to not only enable more efficient learning from preferences—thus reducing the need for excessive annotation—but also to potentially improve the interpretability and understanding of these preferences.

We appreciate the reviewer for the suggestion. To assist project owners in evaluating the trade-offs, we present a cost-benefit analysis of the approach. The table outlines the cost of using the API to generate rationales for a given number of annotations. It also highlights the RDPO win rate compared to the SFT model for each data budget and estimates the number of annotations that could potentially be saved from using DPO to achieve the same level of performance as RDPO.

API Rationale Cost   $0.13   $0.19   $0.26   $0.32   $0.39
Annotations Used     1K      1.5K    2K      2.5K    3K
Annotations Saved    3K      6K      6.5K    6.8K    >10K
vs SFT Winrate       54%     56%     58%     60%     62%

While we used open-weight models to generate the rationales in our study, the table illustrates the associated costs when utilizing an API model, specifically gpt-4o-mini, which is priced at $0.150 per 1M input tokens and $0.600 per 1M output tokens. We show the results for Mistral-7B-Instruct-v0.2 trained on the Orca dataset.
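
As a rough illustration of how such per-budget costs can be estimated from the pricing above (the token counts in the example call are placeholders, not measurements from our pipeline):

```python
def rationale_generation_cost(n_pairs, input_tokens_per_pair, output_tokens_per_pair,
                              input_price_per_m=0.150, output_price_per_m=0.600):
    """Estimated API cost (USD) of generating rationales with gpt-4o-mini at the
    pricing quoted above; per-pair token counts are caller-supplied assumptions."""
    input_cost = n_pairs * input_tokens_per_pair / 1e6 * input_price_per_m
    output_cost = n_pairs * output_tokens_per_pair / 1e6 * output_price_per_m
    return input_cost + output_cost

# Illustrative only: 1K pairs with ~500 input and ~100 output tokens each.
print(round(rationale_generation_cost(1_000, 500, 100), 2))  # -> 0.14 (USD)
```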

Comment

Dear Reviewer ZabL,

We want to once again thank you for your helpful comments. We have addressed your questions. Please let us know if you have any more questions, we would be happy to address them before the rebuttal period ends.

Kind Regards, Paper8553 Authors

Comment

I thank the authors for their responses. Most of my concerns are addressed in meaningful detail. Therefore, I maintain my assessment of this paper.

Comment

We would like to thank you for your positive assessment!

Kind Regards,
Paper8553 Authors

Comment

We would like to thank the reviewers for their valuable feedback, which significantly helped us improve our work and better position it.

  • In response, we have incorporated related work on synthetic preference data generation to clarify the positioning of our contributions and highlight the specific problems addressed by our study (a concern raised by Reviewer EsKw).
  • We added an explanation regarding the performance of LLaMA 3.1 in Table 4 (Figure 5 in the original version), discussing potential reasons for its lower performance increase (a concern raised by Reviewer EsKw).
  • To enhance the understanding of rationale contributions to the preference learning objective, we have studied each component in isolation and how they contribute to preference learning (a concern raised by Reviewer jgVq).
  • Additionally, we included a detailed cost analysis, covering both runtime and token generation costs (a concern raised by Reviewer ZabL and Reviewer 5bwS).
  • Finally, we corrected typos throughout the manuscript.

We appreciate your comments and welcome further discussion until the discussion period closes.

AC Meta-Review

Thank you for your submission to ICLR. This paper aims to equip preference learning methods with machine-generated rationales, which explain the reason behind preference choices. The authors show that these rationales can increase learning efficiency.

The reviewers agree that the problem is well motivated, and that it is valuable to develop frameworks such as this for incorporating rationales into preference tuning procedures. However, there were still a number of concerns from the reviewers about this paper. In particular, there were concerns about the novelty of the presented methodology, experimental rigor/precision, and both the choice of and performance against baseline methods. In the end, a majority of the reviewers remained unconvinced that their concerns had been resolved.

Additional Comments on Reviewer Discussion

During the response period, there was some healthy discussion between the authors and most of the reviewers. In their rebuttal, the authors responded to the questions and comments given by reviewers. However, the majority of the reviewers remained unconvinced and did not convert to a positive score.

Final Decision

Reject