PaperHub
Score: 5.5 / 10
Poster · 4 reviewers
Ratings: 3, 3, 3, 3 (min 3, max 3, std 0.0)
ICML 2025

Reward-Augmented Data Enhances Direct Preference Alignment of LLMs

OpenReview · PDF
Submitted: 2025-01-23 · Updated: 2025-07-24

Abstract

Keywords
Alignment · RLHF · LLM

Reviews and Discussion

Review
Rating: 3

This paper addresses a common limitation in preference-based alignment methods, where only relative preferences are considered, while qualitative aspects of responses are overlooked. It introduces reward-conditioned LLM policies that are trained to generate responses conditioned on rewards. Leveraging a simple relabeling strategy, it constructs preference pairs based on the quality scores to train the reward-conditioned LLMs. Experimental results show that the approach consistently improves the performance of DPO across diverse models and benchmarks.

Questions for the Authors

Q. After the proposed reward-conditioned training, how do the models perform on the flip task, i.e., evaluating the score of a given response, similar to a generative verifier?

Claims and Evidence

Regarding the limitations of vanilla direct alignment, the paper highlights two key issues: (a) high-quality rejected responses may be unlearned, and (b) low-quality chosen responses may be reinforced. While these are indeed potential issues, the claims are made more at a conceptual level, based on possible language policies that can be learned. However, the extent to which these issues arise depends on the optimization and the composition of the preference dataset. For instance, assuming overfitting is avoided, the likelihood of a high-quality rejected response can remain unchanged while the likelihood of the chosen response increases to drive down the loss. Similarly, both low-quality chosen and rejected responses can have their likelihoods reduced, with the rejected response being penalized more to achieve a lower loss. While precisely describing all possible optimization outcomes is challenging, this part of the paper could be made more rigorous to strengthen the claims.

The claim that reward-conditioned policies learn from the full spectrum of responses is sound, as they are explicitly conditioned on the target score of the response to be generated.

Methods and Evaluation Criteria

Given quality scores of responses, the paper proposes a simple labeling strategy to implement reward-conditioned alignment. Specifically, given $(x, y_w, r_w, y_l, r_l)$, it proposes to create two new preference pairs, with each of the two responses serving as the target response to generate for its corresponding score. This is a simple, intuitive approach to utilizing quality scores of individual responses that allows direct application of existing alignment methods such as DPO.
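
A minimal sketch of this relabeling step, assuming scored pairs are stored as dictionaries with fields x, y_w, r_w, y_l, r_l; the conditioning template and field names below are illustrative, not the paper's exact prompt format.

```python
# Hypothetical sketch of the relabeling described above: one scored pair
# (x, y_w, r_w, y_l, r_l) becomes two reward-conditioned preference pairs.
# The conditioning template is assumed, not the paper's exact prompt.

def condition(prompt: str, goal: float) -> str:
    return f"[Target quality score: {goal}]\n{prompt}"

def augment_pair(x: str, y_w: str, r_w: float, y_l: str, r_l: float) -> list[dict]:
    return [
        # Conditioned on the chosen response's score, the original ordering is kept.
        {"prompt": condition(x, r_w), "chosen": y_w, "rejected": y_l},
        # Conditioned on the rejected response's score, the roles are swapped.
        {"prompt": condition(x, r_l), "chosen": y_l, "rejected": y_w},
    ]

scored_pairs = [{"x": "Explain recursion.", "y_w": "...", "r_w": 8.5, "y_l": "...", "r_l": 4.0}]
augmented = [ex for rec in scored_pairs for ex in augment_pair(**rec)]  # doubles the data
```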

For evaluation, the paper considers one preference dataset (UltraFeedback) and five different language models to assess the proposed method against vanilla DPO. While the method does not always outperform DPO, in cases where it does, the performance margin is sometimes large depending on the base model. The evaluation could have been strengthened by applying the reward-conditioned alignment to alternative preference-based alignment methods such as IPO, etc.

Theoretical Claims

Theorem 4.1 shows that under mild conditions, the proposed reward-augmented DPO converges to the optimal policy.

Experimental Design and Analysis

The paper adopts UltraFeedback for preference-based alignment and assesses five open language models across six academic benchmarks. Both the number of language models and benchmarks seem sufficient for a thorough evaluation. However, the experiments could have been strengthened if preference-based alignment methods other than DPO, which have been introduced to address different limitations of DPO, were also evaluated.

Supplementary Material

No supplementary material has been reviewed.

Relation to Prior Work

The paper proposes a simple data relabeling method that, in cases where individual quality scores are available, allows for better use of those scores to train models in a way that mitigates the limitations of vanilla DPO, such as unlearning high-quality responses and reinforcing low-quality responses. This is related to several prior works that attempt to address similar problems, such as IPO (reduces overfitting), conservative DPO (uses label smoothing), MMPO (considers relative quality differences), etc. While the paper demonstrates performance gains over vanilla DPO, additional comparisons with other closely related alignment methods could have further strengthened the study.

Essential References Not Discussed

Several prior works, such as [1] and [2], also study closely related problems, including incorporating qualitative aspects of responses into alignment and mitigating overfitting in DPO. These methods should ideally be evaluated alongside vanilla DPO or, at the very least, discussed.


[1] Kim et al., Margin Matching Preference Optimization: Enhanced Model Alignment with Granular Feedback.
[2] Park et al., Disentangling Length from Quality in Direct Preference Optimization.

Other Strengths and Weaknesses

While the proposed method is a simple approach to utilizing quality scores of individual responses, it is unclear whether it is the most effective. For example, given individual quality scores, responses from different original preference pairs could be combined to create a significantly larger set of preference pairs. I am curious if the authors have considered or evaluated alternative approaches to using quality scores.

Other Comments or Suggestions

None.

Author Response

Your valuable comments have greatly helped us improve our manuscript. Below are our specific responses to the raised questions:

Weakness 1: Analysis of possible optimization outcomes.

  • In Section 3, we analyzed categorical LLM policies, i.e., tabular stochastic policies without function approximation. In this setting, for any prompt $x$ with chosen and rejected responses $y_w$ and $y_l$, the optimal policy after RLHF must satisfy $\pi^*(y_w\mid x)=1$ and $\pi^*(y_l\mid x)=0$ in order to maximize the expected reward, i.e., $\max_\pi \mathbb{E}_{y\sim\pi(\cdot\mid x)}[r(x,y)]$, since $r(x,y_w)>r(x,y_l)$. When only slightly more annotators prefer $y_w$ over $y_l$, i.e., $r(x,y_w)$ is only slightly larger than $r(x,y_l)$, $\pi^*(y_l\mid x)=0$ causes the LLM to unnecessarily unlearn the high-quality $y_l$. Similar limitations are discussed in more detail in Section 3.1.
  • For function-approximated LLM policies, the analysis becomes significantly more complex. To demonstrate that these issues persist in practice, we provided empirical evidence in Figure 3 and will incorporate the newly conducted ablation on the O.O.D. dataset HelpSteer 2:
| Rejected score | 8 | 9 | 10 |
|---|---|---|---|
| Qwen2-7B-It | -416.7 | -356.5 | -334.8 |
| +DPO (UF) | -484.5 | -419.4 | -401.7 |
| +DPO (RA) | -438.6 | -366.4 | -341.1 |

Weakness 2: Missing related works including [1] and [2].

  • The comparison with R-DPO [2] was provided in Table 12. We thank the reviewer for pointing out the missing related work [1], which we empirically compared in the following table. We report the MT-Bench scores of performing our method on the Llama3-SFT checkpoint:
| | Ours | MMPO | DPO |
|---|---|---|---|
| MT-Bench | 7.66 | 7.58 | 7.41 |
  • We will add the above results to experiments and incorporate the following paragraph to related work:
    "Similar to our work, [1] also investigates how preference optimization can overlook qualitative aspects of responses. However, their focus is on overfitting to preference data, and they propose incorporating quality margins into the optimization objective. In contrast, our approach does not involve algorithmic modifications, but rather directly targets the limitations identified in Section 3. Our work also differs from [2], which introduces a constraint-based regularization term specifically aimed at mitigating verbosity bias."

Weakness 3: Lack of empirical comparisons with DPO variants such as IPO and alternative approaches that use quality scores.

In addition to DPO, we compared our method against 15 SOTA baselines. These include approaches that enhance DPO from various perspectives, such as IPO, as well as methods that incorporate quality scores during fine-tuning, including SteerLM, DPA, and MMPO. The results are presented in Figure 4 and Tables 10 and 12. For your convenience, we summarize the results below and will move Tables 10 and 12 to the main body of the manuscript.

| | Zephyr-SFT | DPO | DPA | SteerLM | NCA-P | NCA-R | INCA-P | INCA-R | Ours |
|---|---|---|---|---|---|---|---|---|---|
| LC Win Rate | 6.21 | 11.60 | 11.13 | - | 11.50 | 12.87 | 13.68 | 14.83 | 16.66 |
| Win Rate | 3.94 | 8.58 | 10.58 | 8.21 | 8.43 | 9.56 | 11.00 | 11.34 | 13.37 |

| | Llama-3-8B-It | SLiC-HF | ORPO | CPO | RRHF | KTO | IPO | RPO | R-DPO | SimPO | Ours |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LC WR | 22.92 | 26.9 | 28.5 | 28.9 | 31.3 | 33.1 | 35.6 | 40.8 | 41.1 | 44.7 | 48.2 |
| WR | 23.15 | 27.5 | 27.4 | 32.2 | 28.4 | 31.8 | 35.6 | 41.7 | 37.8 | 40.5 | 53.2 |

If the reviewer has other baseline methods in mind, please let us know and we will be happy to include them as comparisons.

Question 1: How do the models perform on the flip task, i.e., evaluating the score of a given response similar to a type of generative verifier?

Our method, which is designed for direct preference optimization, is not directly applicable to training generative verifiers, which typically involve predicting reward tokens during training. That said, our approach is not incompatible with generative verifiers. For instance, at inference time—when the model is prompted to generate high-reward responses—the probability assigned to a given response can serve as an approximate measure of its quality. More generally, for LLM-as-a-Judge settings, reward-conditioned training can be applied to preference data that reflects the quality of judgments. By conditioning on the highest quality scores, the model exhibits its best judgment capabilities, rather than simply assigning high scores indiscriminately.


We hope the reviewer could consider raising the score if we resolved the reviewer's concerns. We would be happy to have further discussions if the reviewer has any additional questions or comments.

[1] Kim et al., "Margin Matching Preference Optimization: Enhanced Model Alignment with Granular Feedback."
[2] Park et al., "Disentangling Length from Quality in Direct Preference Optimization."

Review
Rating: 3

The paper presents a data augmentation approach for learning on pairwise preference data that doubles the amount of data by modifying the prompt to include a description of the quality (a reward score) of the preferred response and treating each response as the chosen response. DPO is then used to update the parameters of the LLM using the newly constructed preference pairs. The authors cast the data augmentation in the framework of goal-conditioning. To motivate the data augmentation method the authors examine several limitations with DPO. The paper evaluates training with their augmented data on several LLMs, but different LLMs are used for different experiments. The method is compared to a couple of ablations or modifications of the proposed augmentation approach. The only baselines are training without the augmented data and using SPPO. On average the proposed augmentation improves performance according to AlpacaEval2.0 win rates. However, for experiments with results for multiple LLMs the magnitude of the gains is not consistent. Additionally, the gains are small for MTBench. There is a slight, potentially not meaningful, average performance gain on NLP benchmark tasks.

Questions for the Authors

I do not have questions for the authors.

Claims and Evidence

The authors make broad claims about the benefits of the data augmentation method. While the authors include a variety of LLMs in their experiments, very few experiments are conducted across LLMs. Therefore, it is difficult to understand how generally the claims apply across LLMs and their various training and data regimes. For example, the authors claim that their Half RA configuration has comparable performance to RA on the full dataset. However, the performance gaps vary by LLM (and by task), with differences of 6 and 11 points for AlpacaEval LC WR and WR, which would not be considered comparable. Therefore, it is important to point the reader to results that are reported for all LLMs. A version of Figure 2, but across all LLMs, is provided in Table 9 (Appendix B.2). However, this is not referenced anywhere in the main body of the paper. Such references must be made, and the analysis should reflect that the LLM used influences the performance gains. The analysis should also reflect that there are no meaningful performance gains on MTBench, along with an explanation or hypothesis for why this is the case when AlpacaEval 2.0 shows much larger gains.

The authors claim that DPO is limited in its ability to model preference data because of overfitting and unlearning of high-quality responses when they are the rejected response in a given pair; see, for example, the second paragraph of the introduction. Many claims are made about how DPO behaves, but no evidence or citation is provided. Additionally, it is not mentioned that some of these issues are tested for in the paper.

Methods and Evaluation Criteria

The proposed method makes sense for the problem at hand and the evaluation benchmarks are standard for alignment focused tasks.

Theoretical Claims

I did not assess the theoretical claims as they are all in the supplementary material.

Experimental Design and Analysis

  1. The authors motivate the structure of DPO and its offline nature as its key limiting factor. However, this is partially addressed by training with PPO+RM. The authors should include this as a baseline to compare against.
  2. The proposed approach is to modify the prompt to be more specific about the quality of the response. Therefore, SFT on the augmented prompt + response is an important baseline.
  3. The hyper-parameters used are listed in Appendix B.1. However, there is a lack of detail about how any parameters beyond the smoothing parameter were set. The strategy for selecting the hyper-parameters MUST be detailed.
  4. DPO can be sensitive to the exact hyper-parameters and the best hyper-parameters can vary across LLMs. If the hyper-parameters are not optimized per LLM, the true differences in performance may not be accurately reflected.
  5. The experiments section should discuss the impact of LLM and compare against more baselines (these are mentioned above and in other sections). While additional experiments are included in the Appendix, they are not referenced in the main body and their conclusions are not discussed in the main body.
  6. The paper states that the reward augmentation method improves performance of any direct alignment method. This is done in the appendix, but is not referenced at all in the main body of the paper.
  7. The standard error bars for AlpacaEval2.0 win rate should be reported in Figure 2, Table 9, etc.
  8. A main claim of the paper is that the dataset augmentation helps with generalization. However, all non-benchmark experiments use the same dataset used to train the model. This means that the prompts and responses are reasonably within the distribution of the training data, especially as the data comes from GPT. To fully support the generalization claims, the type of analysis reported in Figure 3 should additionally be reported on at least one dataset from a different distribution, e.g., HH-RLHF or OpenAssistant.

Supplementary Material

I did not read Appendix A.

As far as I can tell, Appendix B is not referenced in the main body of the text despite containing crucial information such as the prompts used during training and inferences, the hyper-parameters, the full results across all LLMs, comparisons to multiple baselines, as well as two additional experiments.

In the absence of the material in the appendix, this paper is incomplete, which is driving my current accept/reject recommendation. The authors MUST reference that this information is in the Appendix and include analysis of the complete results in the main body of the paper.

Relation to Prior Work

The paper has interesting learnings and takeaways for different strategies to steer LLM behavior.

There are contemporary papers that have explored similar methods, e.g. "Towards Aligning Language Models with Textual Feedback" (EMNLP 2024), and that attempt to solve similar issues with DPO, such as "Iterative Reasoning Preference Optimization" which uses a NLL-loss to help with unlearning. Any final version of the paper should discuss such contemporary literature and help readers to understand how they are distinct.

Essential References Not Discussed

The authors position the paper relative to the literature by pointing to a difference in goals. However, a difference in goals is not a strong distinction, as a method designed to address a different set of issues may additionally address the goals outlined in the paper. The authors do not go into detail about how or why their method is a better solution than the related work that is mentioned.

The method is similar to Decision Transformer ("Decision Transformer: Reinforcement Learning via Sequence Modeling"), where generated behaviors are controlled by conditioning on the desired reward the generated behaviors should receive. The similarity and the relationship to Decision Transformer should be discussed.

If the paper is accepted, the camera ready version should discuss "Towards Aligning Language Models with Textual Feedback" (EMNLP 2024) as contemporary work.

Other Strengths and Weaknesses

  1. The authors make many claims about the weaknesses of DPO and use these claims to motivate their data augmentation approach, however when making those claims, the authors do not present evidence to support them (neither a citation nor experimental result). It isn't until the experiments/results section that the authors begin to provide evidence for their motivating claims. The experiments should be pointed to earlier in the paper when the claims are first raised. Something as simple as "(see Section X)" would be sufficient.

Other Comments or Suggestions

All comments and suggestions are included in previous sections.

Author Response

The only baselines are DPO and SPPO.

In addition to DPO and SPPO, we compared with 15 baselines in Figure 4 and Table 12.

Gains across LLMs are inconsistent. Marginal improvements on NLP benchmarks.

  • The effectiveness of our method is demonstrated on 5 LLMs. It consistently offers improvements, with most gains substantial. Since our method's hyperparameters were not extensively tuned for each model, variability in performance gains is expected. Notably, even SOTA alignment methods [1, 2] reported inconsistent improvements across different models.
  • Alignment tax [1, 3] can reduce common-sense QA performance. So we primarily evaluated on instruction-following benchmarks, where our method yields strong improvements while avoiding the alignment tax.

Claim of comparable performance between Half RA and RA.

  • While we initially considered RA and Half RA comparable, each outperforming the other on at least one benchmark, we agree that this claim is not essential. We will remove it as the comparison with DPO already supports this ablation: fine-tuning on reward-augmented data yields better performance with half of the prompts and the same compute.
  • We conducted main experiments across five models and selected one or two for ablation studies. Due to resource constraints, we were unable to run all 10 ablations on all 5 models.
  • We extended this ablation to Llama, in addition to Qwen and Gemma. The results are consistent with our original findings.
| | LC WR | WR | MTB | Arena |
|---|---|---|---|---|
| Llama-3.1-8B-It | 24.79 | 27.38 | 8.44 | 26.9 |
| +DPO (UF) | 28.67 | 30.21 | 8.47 | 33.0 |
| +DPO (RA) | 31.20 | 35.93 | 8.47 | 34.4 |
| +DPO (Half RA) | 31.66 | 34.37 | 8.50 | 33.6 |

A version of Figure 2 across all LLMs is provided in Table 9 but not referenced.

Figure 2 and Table 9 present the same results, both across all LLMs.

The gains are small for MTBench.

  • Since we did not perform extensive hyperparameter tuning for each benchmark, it is expected that performance gains are modest on some benchmarks.
  • On MTBench, the average gains obtained by our method are ~1.55 more than DPO. In comparison, all four models fine-tuned with SimPO [1] fail to outperform DPO on MTBench.

The analysis in Figure 3 should also be reported on O.O.D data.

We conducted additional experiments on HelpSteer2 and had similar observations as in Fig. 3:

| Rejected score | 8 | 9 | 10 |
|---|---|---|---|
| Qwen2-7B-It | -416.7 | -356.5 | -334.8 |
| +DPO (UF) | -484.5 | -419.4 | -401.7 |
| +DPO (RA) | -438.6 | -366.4 | -341.1 |

No evidence or citation about how DPO behaves is provided.

  • In the introduction, we cited [4] as the first to identify the unlearning issue of DPO, and we compared with it in the related work and experiments.
  • We offered empirical evidence in Fig. 3. We will also include the new ablation on HelpSteer2 at L384 and add pointers to these results in Sec. 1 and 3.

Limitations are partially addressed by PPO.

Our primary motivation is to address the limitations of direct alignment, rather than those of PPO, which exhibits different limitations not covered here (see Sec. 5). We will include a comparison with Llama-3-PPO, which scores 21.27 on AlpacaEval.

SFT on the augmented prompt + response is an important baseline.

Please refer to Fig. 4, where we compared with SOTA conditional SFT baselines including DPA and SteerLM.

The strategy for selecting the hyperparameters must be detailed.

We will add the following paragraph:
"We tune β\beta within [0.001,0.01,0.1][0.001, 0.01, 0.1] and batch size within [64,128,256][64, 128, 256]. We find β=0.01\beta=0.01 and batch size 128128 yield the overall best performance for DPO across models. Our method uses the same hyperparameters as DPO."

Additional experiments in Appendix are not referenced.

We will add references to these ablation headers in the experiments and move most of them to the main body as space permits.

The error bars for AlpacaEval should be reported.

Reporting error bars requires training at least three times more models, which was not feasible given our resource constraints. For similar reasons, most prior alignment works [1, 2] also report results from single runs.

Missing related works.

We will add the following paragraph to L262:
"Pang et al. (2024) addressed DPO’s tendency to reduce the probability of the chosen response by incorporating an NLL loss. In contrast, our work focuses on a different limitation of DPO—its tendency to overlook qualitative aspects of responses—and proposes a data relabeling approach that requires no algorithm changes. It also differs from conditional sequence modeling based on SFT (Chen et al. 2021, Lloret et al. 2024). Due to the lack of textual feedback in UF, we empirically compare with the reward feedback variants of Lloret et al. (2024), including SteerLM and DPA."

[1] Meng et al., "SimPO."
[2] Wu et al., "SPPO."
[3] Askell et al., "A General Language Assistant as a Laboratory for Alignment."
[4] Adler et al., "Nemotron-4 340B."

Reviewer Comment

Sorry for the delay in this message. I posted it in the wrong spot.

Thank you for your responses. They have answered a number of my questions. I have an additional question.

Can you please elaborate on what is meant here, "Since we did not perform extensive hyperparameter tuning for each benchmark, it is expected that performance gains are modest on some benchmarks."? Which benchmarks were used to select the hyper-parameters?

Per this response, "Reporting error bars requires training at least three times more models, which was not feasible given our resource constraints. For similar reasons, most prior alignment works [1, 2] also report results from single runs." The AlpacaEval report includes a standard error score.

Author Comment

We thank the reviewer for the time and feedback. We are glad that our responses addressed your previous concerns, and we now address your remaining questions as follows.

1. Can you please elaborate on what is meant here, "Since we did not perform extensive hyperparameter tuning for each benchmark, it is expected that performance gains are modest on some benchmarks."? Which benchmarks were used to select the hyper-parameters?

We tune the KL regularization coefficient within [0.001, 0.01, 0.1] and batch size within [64, 128, 256]. Among these, $\beta=0.01$ and a batch size of 128 achieve the overall best performance for DPO (hyperparameters are not specifically tuned for our method), averaged across the LC win rate, MT-Bench average score, and Arena-Hard-Auto score. Specifically, $\beta=0.01$ consistently yields the best results across all models. While the optimal batch size varies slightly between models, 128 performs best for most models, and for models with other optimal batch sizes, the gaps are minor. Our method adopts the same hyperparameter settings as DPO. We will add the above paragraph to our experimental setups.

2. The AlpacaEval report includes a standard error score.

  • The standard error scores reported in the AlpacaEval GitHub repository and paper reflect variability across all instructions in the dataset and across different verbosity prompts, respectively. They are designed to assess the robustness of the leaderboard setup (specifically, the instruction designs and LC score calculation), rather than the robustness of the algorithms or the resulting models that the reviewer is concerned about.
  • To further address the reviewer’s concern, we report the average scores and their corresponding variances across three independent training runs in the table below:
| | Qwen-DPO | Qwen-Ours | Gemma-DPO | Gemma-Ours |
|---|---|---|---|---|
| LC WR | 21.39±0.39 | 31.10±0.33 | 50.49±0.35 | 59.06±0.16 |
| WR | 19.68±0.28 | 28.22±0.31 | 35.40±0.19 | 54.48±0.12 |
| MTB | 8.35±0.001 | 8.46±0.001 | 8.54±0.001 | 8.58±0.001 |

These results are consistent with the scores presented in the paper, which supports the statistical significance of our results and demonstrates the robustness of the proposed method and DPO. Due to time constraints, we were unable to retrain the other models across multiple runs. However, we will include a complete table with error bars for all models in the next version of the manuscript.


We hope these responses have fully addressed your concerns. We would be happy to have further discussions if the reviewer has any additional questions or comments.

Review
Rating: 3

This paper studies the preference alignment problem in Large Language Models (LLMs) and proposes a Reward-Augmented Data Relabeling method to improve Direct Preference Optimization (DPO). Traditional preference learning focuses only on relative preferences while ignoring the absolute quality scores of responses. This leads to unnecessary unlearning of high-quality rejected responses, indiscriminate reinforcement of low-quality chosen responses, and poor generalization to optimal responses, which are sparse in the preference dataset.

To address these issues, the authors introduce reward-conditioned policies, leveraging reward scores from AI feedback to relabel preference data and construct a reward-augmented dataset. Experiments on various instruction-following and academic benchmarks demonstrate that training LLMs with DPO on this enhanced dataset consistently improves performance. Additionally, results confirm that this method effectively mitigates the unlearning problem of high-quality rejected responses, making preference optimization more robust and generalizable.

Questions for the Authors

  • (1) Could you provide a more complete explanation of the relationship between the theoretical model you proposed in Section 3 and the experiments in Section 4? From my perspective, the experiments in Section 4 modify the data within the prompt to construct a dataset with reward value information, aligning the training objective with the preset reward goals. However, what is the specific connection to the goal involving R(x, y, g) introduced in Section 3? Alternatively, could you explain how the results from the experiments in Section 4 support the validity of the theory in Section 3?

  • (2) For the IRA experiments, since the implicit reward values provided by the model are unnormalized, how did you handle them and use them in your experiments? It might be helpful if you could provide the prompt settings for the IRA experiments.

  • (3) If possible, could you provide the code for your experiments? This would help me better understand your work.

  • (4) The experimental results shown in Figure 3 are quite striking, but they only demonstrate that the forgetting of high-quality rejected responses is alleviated. It does not show whether the log probability for low-quality rejected responses also increases. Could you provide more comprehensive experimental results? This would help present more convincing findings in your work.

  • (5) Regarding your experiments, I believe they can be viewed as augmenting the dataset by swapping the accepted and rejected data pairs and incorporating robust learning with confidence-based parameters. Is my understanding correct? Could you elaborate on the connection and advantages of your work compared to traditional data-driven robust training methods?

Claims and Evidence

The paper provides a detailed and convincing explanation of the issues in preference learning. In the experimental section, it conducts extensive testing across multiple models and explores various aspects, particularly dataset size impact (Half RA). The authors perform in-depth experiments on the UltraFeedback dataset, carefully controlling variables such as DPO hyperparameters and benchmark settings to ensure the reliability of results. Additionally, the paper validates the effectiveness of its method for learning from high-quality rejected responses from two perspectives: log probability of high-quality rejected responses and a test where low-quality rejected responses are filtered out before training (reported in the appendix). Overall, the paper presents a meaningful research problem, develops a model to address it, and conducts extensive experiments to verify its effectiveness.

However, the paper also has some limitations: In Section 3 (Method), the newly proposed training objective is not fully reflected in later experiments. Instead, in Section 4 (Experiments), the authors directly apply DPO on the re-labeled dataset, rather than using the training objective introduced in Section 3. As a result, the experiments only verify that re-labeling preference data with reward scores improves generalization performance (as the paper claims, allowing the model to “learn from the full spectrum of response quality”). However, the current setup does not directly validate the theoretical claims made in the paper.

If possible, please supplement the paper with additional experiments or provide more theoretical justification to strengthen the connection between the proposed training objective in Section 3 (Method) and the experimental setup in Section 4 (Experiments).

Methods and Evaluation Criteria

  • (1) The paper chooses well-established benchmarks (AlpacaEval 2.0, MT-Bench, TruthfulQA, GSM8K etc.), which are widely used to evaluate preference alignment in LLMs.
  • (2) The evaluation metrics (win rate, accuracy) align well with existing literature.
  • (3) Ablation studies effectively analyze different aspects of the method.

Theoretical Claims

In this paper, a convergence guarantee theorem (Theorem 4.1) is provided, proving that the method converges to the optimal policy under certain assumptions, with an error bound of $O(N^{-1/2})$. The logic and derivation of the proof are correct, clear in form, and consistent with previous DPO-related work.

Experimental Design and Analysis

The paper evaluates multiple LLM architectures (Mistral-7B, Qwen2-7B, Llama-3-8B, Gemma-2-9B, SPPO).

It controls variables properly, comparing DPO with and without reward augmentation.

Ablation studies effectively isolate the impact of reward augmentation.

Supplementary Material

I have read all sections of the appendix. Appendix A establishes the preference learning model under the goal reward conditions, presents the training algorithm, and proves its convergence. Appendix B provides detailed experimental settings and additional results from ablation experiments.

Relation to Prior Work

The experiments in this paper are closely related to prior work on Direct Preference Optimization (DPO), particularly the method proposed by Rafailov et al., 2024. The approach in this paper is similar to Reinforcement Learning from AI Feedback (RLAIF), such as LLM-as-Judge (Zheng et al., 2024; Cui et al., 2023) and Reward-Model-as-Judge (Adler et al., 2024; Dong et al., 2024).

Compared to traditional DPO, where the model learns directly from preference data, the method in this paper reconstructs the learning objectives by conditioning on goal rewards rather than directly learning from preference pairs. This reward-conditioned approach aims to optimize the model’s ability to learn from the entire spectrum of response quality.

The key distinction from RLAIF is that while RLAIF primarily leverages AI feedback preferences, the method in this paper goes further by incorporating AI feedback scores. This allows the model to learn from the full spectrum of response quality, rather than only relying on binary preference information. In other words, the paper’s approach uses the rating scores from the AI judge to enable more nuanced learning across a wider range of response qualities.

Essential References Not Discussed

All essential references have been carefully addressed.

Other Strengths and Weaknesses

Strengths

  • (1) The background of the problem proposed in the paper is very worthy of investigation. The study of the full spectrum of response quality is often overlooked in preference learning, and the reward-conditioned preference learning method proposed in the paper is an effective approach to address this issue. It also holds promise for generalizing LLMs to high-quality output distributions.
  • (2) The paper includes detailed ablation experiments that demonstrate the role of high-quality outputs in model training, which is an interesting finding.

Weaknesses

  • (1) The experimental setup section could be more detailed. (For example, the time spent on experiments, more complete parameter configurations, IRA prompt settings, etc.).

Other Comments or Suggestions

None

Author Response

We thank the reviewer for identifying our work's soundness and technical contributions. Your valuable comments have greatly helped us improve our manuscript. Below are our specific responses to the raised questions:

Weakness 1 and Question 1: The authors directly apply DPO on the re-labeled dataset, rather than using the training objective introduced in the method.

We will incorporate the following clarifications—extending lines 150–219—in the revised manuscript. The training objective introduced in our method is
$$\max_{\pi\in\Pi}\mathbb{E}_{x,g\sim\mathcal{D},\,y\sim\pi(\cdot\mid x,g)}\bigl[R(x,y,g)-\beta_0\,\text{KL}\bigl(\pi(\cdot\mid x,g)\,\|\,\pi_{\text{ref}}(\cdot\mid x,g)\bigr)\bigr],$$
where $\Pi$ denotes the class of all goal-conditioned policies $\pi(y\mid x,g)$ and the relabeling distribution for $g$ is $\mathbb{P}(g=r(x,y_w)\mid x,y_w,y_l)=\mathbb{P}(g=r(x,y_l)\mid x,y_w,y_l)=1/2$. It has the following closed-form solution: $\pi_R(y\mid x,g)\propto\pi_{\text{ref}}(y\mid x,g)\exp\bigl(\beta_0^{-1}R(x,y,g)\bigr)$. Using the reward reparameterization trick from DPO, the reward can be written as $R(x,y,g)=\beta_0\log\pi_R(y\mid x,g)-\beta_0\log\pi_{\text{ref}}(y\mid x,g)-Z_R(x,g)$. Based on the relabeling distribution, we define the augmented preference dataset $\overline{\mathcal{D}}$ as
$$\{(x^i,\tilde{y}_w^i=y_w^i,\;\tilde{y}_l^i=y_l^i,\;g^i=r(x^i,y_w^i))\}_{i\in[M]}\cup\{(x^i,\tilde{y}_w^i=y_l^i,\;\tilde{y}_l^i=y_w^i,\;g^i=r(x^i,y_l^i))\}_{i\in[M]},$$
which doubles the size of the original dataset $\mathcal{D}$. The resulting goal-conditioned DPO objective becomes:

$$\max_{R}\ \mathbb{E}_{x,\tilde{y}_w,\tilde{y}_l,g\sim\overline{\mathcal{D}}}\bigl[\sigma\bigl(R(x,\tilde{y}_w,g)-R(x,\tilde{y}_l,g)\bigr)\bigr]=\max_{\pi}\ \mathbb{E}_{x,\tilde{y}_w,\tilde{y}_l,g\sim\overline{\mathcal{D}}}\Bigl[\sigma\Bigl(\beta_0\log\frac{\pi(\tilde{y}_w\mid x,g)}{\pi_{\text{ref}}(\tilde{y}_w\mid x,g)}-\beta_0\log\frac{\pi(\tilde{y}_l\mid x,g)}{\pi_{\text{ref}}(\tilde{y}_l\mid x,g)}\Bigr)\Bigr].$$

This objective directly corresponds to our implementation, which applies DPO on the relabeled dataset.
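
As a concrete illustration, below is a minimal PyTorch sketch of this objective in the standard negative-log-sigmoid form used by DPO implementations, assuming the summed log-probabilities of each response (with the goal $g$ already embedded in the prompt) are precomputed under the policy and the reference model; the function and tensor names are illustrative, not the paper's code.

```python
import torch
import torch.nn.functional as F

def reward_augmented_dpo_loss(policy_logp_chosen: torch.Tensor,
                              policy_logp_rejected: torch.Tensor,
                              ref_logp_chosen: torch.Tensor,
                              ref_logp_rejected: torch.Tensor,
                              beta: float = 0.01) -> torch.Tensor:
    # Implicit reward margins of the chosen and rejected responses under the
    # goal-conditioned prompt (beta plays the role of beta_0 above).
    chosen_margin = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_margin = beta * (policy_logp_rejected - ref_logp_rejected)
    # Standard DPO negative-log-sigmoid loss over the reward-augmented pairs.
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()

# Toy usage with made-up summed log-probabilities for a batch of two pairs.
loss = reward_augmented_dpo_loss(torch.tensor([-120.0, -95.0]),
                                 torch.tensor([-150.0, -90.0]),
                                 torch.tensor([-125.0, -97.0]),
                                 torch.tensor([-148.0, -92.0]))
print(loss.item())
```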

Weakness 2: The experimental setups could be more detailed.

We have included our hyperparameters and prompts in Appendix B.1, which we list as follows. On 8 A100 GPUs, the training takes about 5-7 hours.

| KL regularization | batch size | learning rate | warmup ratio | max prompt length | max completion length | optimizer | lr_scheduler |
|---|---|---|---|---|---|---|---|
| 0.01 | 128 | 5e-7 | 0.1 | 512 | 512 | AdamW | cosine |

Question 2: Could you provide the prompt settings for the IRA experiments?

The prompts used in the IRA experiments are identical to those in the main experiments (as detailed in Appendix B.1). The only difference lies in the rescaling of reward values to the range [1, 10] using the following linear transformation: $\max(\min(10\cdot(\text{reward}-\text{low})/(\text{high}-\text{low}),\,10),\,1)$, where high and low denote the maximum and minimum implicit rewards computed from a small subset of the training data.
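
A direct transcription of this transformation into code (a sketch; `low` and `high` would be the minimum and maximum implicit rewards estimated from a small subset of the training data, as stated above, and the example values are made up):

```python
def rescale_implicit_reward(reward: float, low: float, high: float) -> float:
    # Linear map of an unnormalized implicit reward to [1, 10], clipped at both ends.
    return max(min(10 * (reward - low) / (high - low), 10), 1)

print(rescale_implicit_reward(0.3, low=-2.0, high=3.0))  # -> 4.6
```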

Question 3: If possible, could you provide the code for your experiments?

We provided the code in the anonymous link: https://anonymous.4open.science/r/anonymous-9208-id.

Question 4: Figure 3 is quite striking but does not show whether the log probability for low-quality rejected responses also increases.

We report the log probabilities for low-quality rejected responses (with scores less than 5) in the following table:

| Rejected score | 1.0 | 2.0 | 3.0 | 4.0 |
|---|---|---|---|---|
| Qwen2-7B-Instruct | -471.5 | -283.6 | -302.7 | -334.6 |
| +DPO (UF) | -673.8 | -488.5 | -513.9 | -517.6 |
| +DPO (RA) | -674.2 | -485.9 | -495.6 | -420.4 |

It can be observed that the log probability of low-quality rejected responses for our method DPO (RA) has a similar scale as the vanilla DPO (UF) and is smaller than that of Qwen2-7B-Instruct.

Question 5: Relationship to traditional data-driven robust training.

We will incorporate the following discussions to our manuscript:
"Both our method and robust training techniques are motivated by enhancing the model generalization by leveraging augmented datasets that capture alternative outcomes. However, robust training methods typically focus on generic uncertainty or perturbations, which are not involved in our method. Instead, it concerns the limitations in direct alignment algorithms, such as the unlearning issue, by explicitly conditioning on quality scores to steer the model toward learning patterns associated with varying levels of response quality. Moreover, robust methods often emphasize worst-case scenarios or boundary conditions, which can lead to conservative generalization. In contrast, our method promotes generalization toward sparse, high-quality responses."


We hope the reviewer could consider raising the score if we resolved the reviewer's concerns. We would be happy to have further discussions if the reviewer has any additional questions or comments.

Reviewer Comment

Having reviewed the authors' clarifications, I now understand the relationship between their theoretical model and experimental implementation, which is solid enough. Their code release enabled me to take a closer look at the method's details and confirm the reproducibility of the method.

However, the method relies heavily on the stability of pre-trained models used for scoring, particularly the implicit assumption that the AI judge’s reward scores are both accurate and consistent. This dependency—while common in RLAIF-inspired methods—introduces unquantified risks (e.g., potential error propagation due to bias or noise in the judge models) that may accumulate during training and affect the model’s robustness and its performance in real-world scenarios.

I will raise my score to 3 in recognition of the authors' detailed and rigorous technical clarifications and empirical validations. However, the dependency on pre-trained judge stability limits its broader impact and application.

Author Comment

We appreciate the reviewer for recognizing the rigor and reproducibility of our work and for raising the score to 3. We will address your remaining concern as follows.

The method relies heavily on the stability of pre-trained models used for scoring, particularly the implicit assumption that the AI judge’s reward scores are both accurate and consistent. This dependency—while common in RLAIF-inspired methods—introduces unquantified risks (e.g., potential error propagation due to bias or noise in the judge models).

  • Compared to direct preference optimization methods such as DPO, our approach is more robust to the bias or noise of scalar rewards. For instance, when only a sample-based estimate of the true preference is available or when using function approximators, it is common for high-quality rejected responses to be favored by more annotators. DPO, which strives to maximize the reparameterized reward gap, may degrade the model in such cases. In contrast, our method is aware of the quality levels and learns from the full spectrums.
  • The AI judge's reward scores have been shown to be accurate and consistent, conditioning on which leads to superior performance compared to DPO. In our experiments, we found that most conditional SFT methods, including DPA [1], SteerLM [2], NCA-P, NCA-R, INCA-P, and INCA-R [3], outperform DPO, which fails to account for the qualitative aspects of responses.
| | Zephyr | DPO | DPA | SteerLM | NCA-P | NCA-R | INCA-P | INCA-R | Ours |
|---|---|---|---|---|---|---|---|---|---|
| LC Win Rate | 6.21 | 11.60 | 11.13 | - | 11.50 | 12.87 | 13.68 | 14.83 | 16.66 |
| Win Rate | 3.94 | 8.58 | 10.58 | 8.21 | 8.43 | 9.56 | 11.00 | 11.34 | 13.37 |
  • We further performed an ablation study to assess our method's robustness to variations in the reward scales provided by the AI judges. Specifically, we utilized the UltraFeedback dataset, originally containing reward scores ranging from 1–10, and rescaled these scores to alternative ranges of 1–5 and 1–100 through linear transformations. The results are summarized in the following table:
| | Qwen2-7B-It | +DPO (UF) | +DPO (RA, 5) | +DPO (RA, 10) | +DPO (RA, 100) |
|---|---|---|---|---|---|
| LC Win Rate | 20.93 | 21.46 | 29.85 | 31.17 | 31.81 |
| Win Rate | 18.22 | 19.35 | 26.12 | 27.58 | 27.96 |

It can be observed that the performance of our method remains stable and robust across these varying reward scales, indicating resilience against potential biases or noise introduced by different reward scales.


We hope these responses have fully addressed your concerns. We would be happy to have further discussions if the reviewer has any additional questions or comments.

[1] Wang et al., "Arithmetic Control of LLMs for Diverse User Preferences: Directional Preference Alignment with Multi-Objective Rewards."
[2] Dong et al., "SteerLM: Attribute Conditioned SFT as an (User-Steerable) Alternative to RLHF."
[3] Chen et al., "Noise Contrastive Alignment of Language Models with Explicit Rewards."

Review
Rating: 3

Direct preference optimization (DPO) has shown great potential for fine-tuning language models with user preferences. However, DPO depends heavily on positive vs. negative samples, and therefore, if some relatively good samples are rejected by the evaluator model, the performance of DPO can worsen significantly. In this paper, the authors instead propose to generate a reward value (e.g., between 1 and 10) for each sample and fine-tune a reward-conditioned LM for alignment. Experimental results on several academic benchmarks (e.g., AlpacaEval and MT-Bench) demonstrate that such reward conditioning can improve fine-tuning performance over traditional DPO on UltraFeedback.

Questions for the Authors

  1. How do you address the errors in generating reward scores by the evaluator model? Typically, LLMs struggle to identify the right scalar rewards.
  2. If we can generate the reward values, then we can directly use RL for the alignment. However, I do not see any comparison with the wide variety of RLAIF works. Please explain why.
  3. How to identify the optimal goal reward value?

Claims and Evidence

The performance of DPO is very sensitive to the evaluator model's preferences, since it relies on a binary classification, whereas adding scalar reward metrics can alleviate this sensitivity issue. The authors claim that a simple tweak of the loss function, together with asking the evaluator model to generate a scalar goal-conditioned reward value, can significantly improve fine-tuning performance. Experimental results validate this claim and demonstrate superiority over the UltraFeedback DPO baseline.

Methods and Evaluation Criteria

The evaluation metrics and datasets chosen for the experiments make sense to me. While UltraFeedback DPO is a good benchmark to compare with, there are also recent SOTA works cited in the related work that are not considered as benchmark methods.

Theoretical Claims

Theorem 1 shows that the reward-conditioned formulation is guaranteed to reach optimality, and the detailed proofs are provided in the supplement.

Experimental Design and Analysis

Experiments are carefully designed. Performance is validated on well-known benchmark datasets. Several ablation studies demonstrate that reward conditioning can help improve the performance of DPO on standard Q/A tasks and mitigate the unlearning issues, and also that the proposed framework is highly dependent on the goal reward setting.

Supplementary Material

I have read the supplementary material at a high level and might have missed some of the mathematical proofs.

Relation to Prior Work

Preference optimization and alignment is an important problem of broad interest in the community. This work provides an alternative to DPO with reward conditioning for SFT, and will therefore be of interest to the broader academic community.

Essential References Not Discussed

NA

Other Strengths and Weaknesses

While DPO has been widely adopted, it fails in many problems due to its sensitivity to good samples being rejected. Therefore, reward-conditioned alignment offers a good alternative to DPO. Experimental results also validate that reward-conditioned alignment performs better than traditional DPO in many scenarios. Therefore, this solution would be of interest to the broader community. Having said that, I have a few concerns:

  1. LLM evaluators are typically not that stable when generating scalar rewards (as opposed to generating preferences), and thus many RLAIF works suffer. So, some analysis of how to address the errors or biases from evaluator models would have been great.
  2. The proposed method is only compared with UltraFeedback DPO, but there exist other recent works that improve on DPO, e.g., RPO (Adler et al., 2024). Also, some comparison with the RLAIF line of work is expected.

Other Comments or Suggestions

Majority of the space is allocated towards experimental results only. I would recommend expanding the technical method section with more proofs and guarantees on how the proposed direction can change the preference selection research direction.

Author Response

We thank the reviewer for identifying our work's soundness and technical contributions. The valuable comments have greatly helped us improve our manuscript. Below are our specific responses to the raised questions:

Weakness 1 and Question 2: SOTA works cited in the related work but not compared, such as recent works that improve on the DPO, e.g., RPO (Adler et al., 2024), and RLAIF line of works.

We have compared with 15 additional baselines in Figure 4 and Table 10, 12 in the Appendix, including RPO and other SOTA RLAIF methods. We listed the results below for your reference and will incorporate Tables 10 and 12 into the main body of the manuscript:

| | Zephyr-SFT | DPO | DPA | SteerLM | NCA-P | NCA-R | INCA-P | INCA-R | Ours |
|---|---|---|---|---|---|---|---|---|---|
| LC Win Rate | 6.21 | 11.60 | 11.13 | - | 11.50 | 12.87 | 13.68 | 14.83 | 16.66 |
| Win Rate | 3.94 | 8.58 | 10.58 | 8.21 | 8.43 | 9.56 | 11.00 | 11.34 | 13.37 |

| | Llama-3-8B-It | SLiC-HF | ORPO | CPO | RRHF | KTO | IPO | RPO | R-DPO | SimPO | Ours |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LC WR | 22.92 | 26.9 | 28.5 | 28.9 | 31.3 | 33.1 | 35.6 | 40.8 | 41.1 | 44.7 | 48.2 |
| WR | 23.15 | 27.5 | 27.4 | 32.2 | 28.4 | 31.8 | 35.6 | 41.7 | 37.8 | 40.5 | 53.2 |

If there are other baselines you would like us to consider, please let us know and we would be happy to include them in our comparisons.

Weakness 2 and Question 1: LLM evaluators are typically not that stable when generating scalar rewards.

  • The LLM or reward model (RM) judges are typically trained on human preference data, and the resulting reward scores reflect the expected preferences across annotators—akin to Elo scores in Bradley-Terry models.
  • Compared to direct preference optimization methods such as DPO, our approach is more robust to the instability of scalar rewards. For instance, when only a sample-based estimate of the true preference is available or when using function approximators, it is common for high-quality rejected responses to be favored by more annotators. DPO, which attempts to maximize the reparameterized reward gap, may degrade the model in such cases. In contrast, our method is sensitive to varying response quality and effectively learns from the full spectrum of rewards.
  • We further performed an ablation study to assess the robustness of our method to different scalar reward scales generated by LLM evaluators. Using the UltraFeedback dataset, which provides reward scores in the range of 1–10, we linearly rescaled the rewards to 1–5 and 1–100. We then applied our method to these modified datasets. We observed that our method is robust to the scale of the reward scores. The results are summarized in the following table:
| | Qwen2-7B-It | +DPO (UF) | +DPO (RA, 5) | +DPO (RA, 10) | +DPO (RA, 100) |
|---|---|---|---|---|---|
| LC Win Rate | 20.93 | 21.46 | 29.85 | 31.17 | 31.81 |
| Win Rate | 18.22 | 19.35 | 26.12 | 27.58 | 27.96 |

Suggestion 1: Expand the method section with more proofs and guarantees on how the proposed direction can change the preference selection research direction.

We thank the reviewer for the helpful suggestion. In the revised version, we will include a proof sketch in Section 4, incorporating the key analysis and lemmas from Appendices A.5 and A.6. Additionally, we will highlight the observed limitations of DPO—such as those illustrated in Figure 3 and similar trends reported on the O.O.D. data HelpSteer2 below—to provide a more comprehensive discussion.

| Rejected score | 8 | 9 | 10 |
|---|---|---|---|
| Qwen2-7B-It | -416.7 | -356.5 | -334.8 |
| +DPO (UF) | -484.5 | -419.4 | -401.7 |
| +DPO (RA) | -438.6 | -366.4 | -341.1 |

Question 3: How to identify the optimal goal reward value?

The optimal goal reward depends on the value range of the judge model and its scale as presented in the training prompt. For instance, the optimal goal is 1 when using sigmoid-based reward models, or 5 when using LLM judges that follow evaluation criteria with a maximum score of 5. Both types of reward values can be rescaled, e.g., via linear transformations in the training prompt, as demonstrated in Appendix B.3.
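
A minimal sketch of this selection rule, assuming the judge's score range is known in advance; the helper name and the mapping below are illustrative, not the paper's implementation.

```python
# Hypothetical helper: pick the goal reward as the top of the judge's score range,
# optionally rescaling it linearly to the scale used in the training prompt.
def optimal_goal(score_range: tuple[float, float],
                 prompt_scale: tuple[float, float] | None = None) -> float:
    low, high = score_range
    goal = high  # condition on the maximum of the judge's scale
    if prompt_scale is not None:
        p_low, p_high = prompt_scale
        goal = p_low + (goal - low) / (high - low) * (p_high - p_low)
    return goal

print(optimal_goal((0.0, 1.0)))                         # sigmoid-based reward model -> 1.0
print(optimal_goal((1.0, 5.0)))                         # 1-5 LLM judge -> 5.0
print(optimal_goal((1.0, 5.0), prompt_scale=(1, 10)))   # rescaled to a 1-10 prompt scale -> 10.0
```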


We hope the reviewer could consider raising the score if we resolved the reviewer's concerns. We would be happy to have further discussions if the reviewer has any additional questions or comments.

Final Decision

This paper proposes to improve DPO by making the policy conditioned on the target reward score, which in particular addresses an issue where DPO may "unlearn" good responses that are "rejected" in the preference data. Extensive empirical evaluations validate the benefits of the proposed method over both vanilla DPO and relevant "reward-aware" baselines.

All reviewers are advocating for weak acceptance after the discussion period, where the authors provided several clarifications and important additional experimental results. While there could always be room for more comparisons / ablations, and the improvements might not be consistently major across all benchmarks and models, I personally consider that the authors did a thorough enough work to justify their method and demonstrate its benefits to deserve being accepted for publication. I expect this work to be of significant interest to researchers working on preference-based post-training.