Risk-aware Direct Preference Optimization under Nested Risk Measure
In this paper, we introduce Risk-aware Direct Preference Optimization (Ra-DPO), a novel approach that incorporates risk awareness by employing a token-level objective function under a nested risk measure.
Abstract
Reviews and Discussion
This paper tackles token-level preference optimization for LLM alignment by making it "risk-aware." It modifies the usual Bradley-Terry setup to include a nested risk measure that accounts for potential variability in model updates. The authors define a token-level advantage function based on this risk measure, leading to a new loss (Ra-DPO). The main idea is to improve alignment performance while keeping the model from drifting too far from the reference. They test on IMDb, Anthropic HH, and AlpacaEval, showing that Ra-DPO often beats methods like DPO, PPO, and TDPO in preference accuracy while also maintaining lower sequential KL divergence.
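For orientation, the preference model referred to here can be sketched roughly as follows. This is an illustrative reconstruction, not a quotation of the paper's equations; A^ρ stands in for the paper's risk-aware token-level advantage under a nested risk measure.

```latex
% Illustrative sketch of a Bradley-Terry / regret-style preference probability
% built from token-level advantages (A^{\rho} = assumed risk-aware advantage):
P\big(y_w \succ y_l \mid x\big)
  = \sigma\!\Big(\textstyle\sum_{t} A^{\rho}\big([x,\,y_w^{<t}],\, y_w^{t}\big)
      - \sum_{t} A^{\rho}\big([x,\,y_l^{<t}],\, y_l^{t}\big)\Big)
```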
Questions For Authors
- Your reported improvements seem relatively modest (1-3%). Could you explain more clearly why even these small improvements are significant enough to justify using a more complex objective function?
- Would your risk-aware method work well for other generation tasks or datasets where the distribution shift or uncertainty is larger (e.g., harmful content moderation, toxicity detection)? Have you tried those settings?
Claims And Evidence
The main claims are that their risk-aware method (Ra-DPO) can effectively control model drift while still maintaining or improving preference accuracy. They support these claims with experiments on three standard datasets. The experiments show moderate improvements in accuracy and lower sequential KL divergence compared to baselines. They provide theoretical proofs (in Appendix B) showing that maximizing their proposed loss leads to policy improvements. Overall, their evidence supports the claims, though the practical advantage seems modest.
Methods And Evaluation Criteria
The authors adapt known ideas (nested risk measures like CVaR and Bradley-Terry models) to token-level language modeling. Their evaluation uses standard datasets (IMDb, Anthropic HH, AlpacaEval), measuring preference accuracy and KL divergence as indicators of alignment quality and model drift. These criteria make sense given their goal of improving alignment while controlling risk. Although the evaluation approach is typical and appropriate, the improvement in practical metrics (accuracy, drift) is somewhat minor.
Theoretical Claims
They include several lemmas and a theorem (Lemmas 3.1, 3.3, 3.4, 3.5, and Theorem 3.6) to justify their token-level risk-aware approach. I checked Lemma 3.1 and Lemma 3.3 in particular—they appear correct based on standard RL math, with straightforward algebra and clear steps. I didn’t find any obvious mistakes in the provided derivations.
Experimental Designs Or Analyses
I checked the experimental designs, especially the IMDb and Anthropic HH evaluations. The design looks sound: they clearly define the setup, use standard baselines (DPO, PPO, TDPO, KTO), and present results clearly (accuracy and KL divergence plots). One potential issue is that the performance gains, though consistent, seem quite small (around a 1-3% improvement), which might limit the practical value.
Supplementary Material
Mainly the experiments.
Relation To Broader Scientific Literature
This paper directly builds on recent token-level preference optimization methods like TDPO and classical risk-sensitive RL literature. The main novelty is combining nested risk measures (widely used in RL) with direct preference optimization methods (like DPO and TDPO). The paper cites existing literature clearly and fits reasonably well within current trends toward more risk-sensitive and token-level RLHF methods.
Essential References Not Discussed
Nothing noteworthy to me.
Other Strengths And Weaknesses
Strengths:
- The paper clearly explains how to integrate nested risk measures with token-level preference learning, combining known ideas from risk-sensitive RL and preference optimization.
- They present results across multiple datasets (IMDb, Anthropic HH, AlpacaEval), giving some confidence that the method works consistently.
- The theoretical part seems carefully done, with explicit derivations to support their method clearly laid out in the appendix.
Weaknesses:
- The improvements shown in experiments are modest, which makes it unclear whether the added complexity of their risk-aware objective is worthwhile in practice.
- The main idea is incremental—just applying existing risk-aware RL concepts to the token-level DPO setting. It doesn't introduce a significantly new concept or theory.
Other Comments Or Suggestions
The running title of the paper is still "Submission and Formatting Instructions for ICML 2025" from the template. Also in the proof of Theorem 3.6 in the appendix, should it be "Theorem 3.6 Restated"?
We sincerely appreciate the valuable comments from the reviewer. We hope our responses below provide further clarity.
Response to Other Comments Or Suggestions: The running …… "Theorem 3.6 Restated"?
We apologize for the confusion caused by our oversight. We will correct these errors in the next version.
Response to Question 1: Your reported …… objective function?
It should be clarified that our proposed method maintains both:
- A natural and simple loss function (the sum of the DPO loss and a negative sequence risk ratio), which can be seen in Figure 1 and is sketched schematically below this list;
- Comparable training efficiency, with no substantial increase in training time requirements.
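To make the first point concrete, a schematic form of such a loss is sketched below. This is our reading of "DPO loss plus a sequential risk term," assuming a TDPO-like placement of the sequential risk term D^ρ (computed against π_ref under a nested risk measure ρ and weighted by α) inside the sigmoid; the paper's own equation is authoritative.

```latex
% Schematic only (assumed TDPO-like structure; D^{\rho} denotes the sequential
% risk term against the reference model, \alpha its weight):
\mathcal{L}_{\text{Ra-DPO}}(\pi_\theta;\pi_{\text{ref}})
  \approx -\,\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}\Big[\log\sigma\Big(
      \beta\log\tfrac{\pi_\theta(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)}
    - \beta\log\tfrac{\pi_\theta(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)}
    - \alpha\big(D^{\rho}(x,y_l) - D^{\rho}(x,y_w)\big)\Big)\Big]
```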
Note: First, the lemmas and theorems presented in the paper serve exclusively to demonstrate the theoretical validity of our approach when accounting for risk considerations. Specifically, we introduce nested risk measures (a nonlinear function) into token-level policy optimization and then prove that maximizing the objective function results in policy improvements. This method is more likely to effectively balance alignment performance and model drift, thereby preventing model failure in certain respects and ultimately achieving an improvement in reward accuracy.
Additionally, as shown in Figure 6, our method achieves higher reward accuracy when implemented with Pythia-1.4B as the base model on the Anthropic HH dataset. We also provide several additional experimental results, including a numerical example, LLM-based evaluation results, results using the nested ERM (entropic risk measure) [1-3], and results with different seeds, to demonstrate the effectiveness of our algorithm. For details, please refer to the link 🔗Additional_Experiment_Results.
Response to Question 2: Would your …… those settings?
Thank you for the valuable question. We would like to provide an extensive discussion about risk-awareness and corresponding experiments.
Risk-Awareness: In this paper, we introduce nested risk measures to enhance risk-awareness in LLM alignment. This induces a conservative policy during the alignment process, which enables the model to remain closely aligned with the reference LLM, thereby preventing significant deviations and maintaining its superior decision-making and reasoning abilities. This is highly valuable in real-world applications, where the goal is to align general-purpose LLMs with human values and intentions (higher reward) without compromising their decision-making and reasoning abilities (lower KL divergence).
Experiments: Additionally, from the perspective of output verification (e.g., harmful content moderation, toxicity detection), we recommend a hybrid approach that combines Safe RLHF [4] with risk-sensitive measures. This approach independently models both cost and reward functions while accounting for cost distributions. However, it may require more computational resources due to the need to train additional models. Nonetheless, our method can serve as a foundational groundwork for such potential approaches. Moreover, we plan to conduct further research in this direction in the future.
It is noteworthy that our additional experiments also include examples of using LLMs (DeepSeek and GPT-4o) to evaluate the performance of the various algorithms. The experimental results can be found at the link 🔗Additional_Experiment_Results.
References:
[1] Föllmer, Hans, and Alexander Schied. Convex measures of risk and trading constraints. Finance and stochastics, 2002, 6: 429-447.
[2] Hau, Jia Lin, Marek Petrik, and Mohammad Ghavamzadeh. Entropic risk optimization in discounted MDPs. In AISTATS, 2023.
[3] Fei, Yingjie, Zhuoran Yang, Yudong Chen, Zhaoran Wang, and Qiaomin Xie. Risk-sensitive reinforcement learning: Near-optimal risk-sample tradeoff in regret. In NeurIPS, 2020.
[4] Dai, Josef, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, and Yaodong Yang. Safe RLHF: Safe reinforcement learning from human feedback. In ICLR, 2024.
This paper presents a risk-aware version of the direct preference optimization (DPO) algorithm. The key innovation is to employ a risk-aware objective that operates at the token level (which results in a different algorithm due to the presence of the KL divergence). The risk is calculated sequentially in terms of the deviation between the model and a reference model at the token level. The derivation shows that the notorious partition function cancels in this more general setting, akin to the more specific setting studied in the DPO paper. Experiments demonstrate that the algorithm maintains a similar level of preference accuracy relative to a recent benchmark, while having a lower divergence from the reference policy.
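For readers less familiar with this cancellation, the risk-neutral DPO identity it generalizes is recalled below in standard notation; per the summary above, the reviewed paper's contribution is that an analogous cancellation survives the general token-level, risk-aware setting.

```latex
% Standard DPO identity: the optimal policy of the KL-regularized objective
% satisfies r(x,y) = \beta \log( \pi^{*}(y|x) / \pi_{\text{ref}}(y|x) ) + \beta \log Z(x),
% so inside a Bradley-Terry comparison the partition function Z(x) cancels:
r(x,y_w) - r(x,y_l)
  = \beta\log\frac{\pi^{*}(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)}
  - \beta\log\frac{\pi^{*}(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)}
```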
Questions For Authors
My main concern with the paper is readability, or the lack thereof. I spent several hours trying to understand the notation used in the paper, and had to go back and forth between multiple papers to understand in general terms what the paper means. I don't want to be mean here, but I am genuinely concerned that the paper is not readable in its current state. Some examples below:
- What is the composition operator used in the set of equations (6)? The operator ∘ is not properly defined as far as I can tell. I think I intuitively know what it means, but a proper definition is lacking.
- In the set of equations (5), how come the Q function takes a probability distribution as its (second) input?! This is a really basic question that confuses me. In the same vein, in the first line there, on the right-hand side, why is the second input to V y<t and not y<t+1? Should we not consider the value function at the next state and therefore increment t from left to right? Again, this is a very basic question, so I leave open the possibility that I am missing something fundamental here.
Claims And Evidence
My main concern is regarding the notion of risk-awareness and how it is measured in the experiments. Why not focus on actual benchmarks that measure risk in LLM outputs, rather than looking at the KL divergence discrepancy?
For instance, could we evaluate the generated responses in terms of, say, the probability of something quite toxic being generated, and in doing so assess how well the newly proposed algorithm mitigates this phenomenon?
Methods And Evaluation Criteria
The paper's main selling point is that the new DPO-ish algorithm is risk-aware, but fundamentally, how can one demonstrate the risk-aware nature of the resulting policy? The current experimental design measures risk-awareness in terms of token-level deviation from a reference policy, but what if the reference policy itself is risk-unaware? In that case, measuring the discrepancy between the learned policy and the reference policy would not be indicative of risk-awareness.
Theoretical Claims
In terms of the theoretical derivation, I find a strong similarity between this paper and "Token-level Direct Preference Optimization", Zeng et al. 2024. While the citation is there, I still want to know in what sense the theoretical result goes beyond that paper. The systematic way of deriving the objective is very similar in the two papers, with both computing a sequential token-level KL term.
Another question pertains to the connection between this paper and "Entropic Risk Optimization in Discounted MDPs". The present paper augments the state space with historical trajectories to deal with the fact that nested risk measures are not law-invariant. But the paper mentioned above introduces a set of Bellman equations that are much easier to work with, and does so without state augmentation. Can the DPO-style derivation presented here also be applied to the setting discussed in that paper, or is it limited to the case where the state space must be augmented?
Experimental Designs Or Analyses
See the comment about measuring risk-awareness.
Also, increasing the number of random seeds used in the 4.1 experiment would be helpful to obtain more reliable error bars.
Supplementary Material
Yes
Relation To Broader Scientific Literature
NA
Essential References Not Discussed
NA
Other Strengths And Weaknesses
NA
Other Comments Or Suggestions
NA
We sincerely appreciate the valuable comments from the reviewer. We hope our responses below provide further clarity.
Response to Claims And Evidence:
Risk awareness: In this paper, risk awareness refers to the sensitivity to risks arising from deviations from the reference model. It induces a risk-averse policy to enable the model to remain closely aligned with a reference LLM and maintain its superior decision-making and reasoning abilities.
Experimental Benchmarks: From the perspective of output verification, we recommend a hybrid approach that combines Safe RLHF [1] with risk-sensitive measures, which independently models both cost and reward functions while accounting for cost distributions. However, it may require more computational resources due to the need to train additional models. Nonetheless, our method can serve as a foundational groundwork for such potential approaches.
Research significance: Notably, a critical consideration in alignment research involves balancing performance and model drift. Current approaches (e.g., DPO [2] and TDPO [3]) establish that optimal alignment should simultaneously maintain minimal deviation from the reference model (lower KL divergence) while aligning with human values (higher reward). F-DPO [4] studies this trade-off under varying divergence constraints.
Additional Experiment Results: We provide additional experimental results to further demonstrate the effectiveness of our approach at the link 🔗Additional_Experiment_Results.
Response to Methods And Evaluation Criteria:
The fine-tuning of LLMs typically involves two key stages: supervised fine-tuning (SFT) and preference alignment. During the alignment phase, the post-SFT model typically serves as the reference model (the most reliable model available), which generally demonstrates robust reasoning and decision-making capacities. Importantly, significant deviation from the reference model generally leads to capability degradation, which inherently constitutes substantial risk. Thus, the primary challenge in LLM alignment lies in balancing alignment performance and model drift (maintaining reasoning and decision-making abilities).
Response to Theoretical Claims:
Theoretical Breakthroughs. We introduce nested risk measures to enhance the model's risk sensitivity. Our theoretical advancements primarily include:
- Incorporating nested risk measures into token-level policy optimization and providing a closed-form solution. Importantly, our method maintains a natural and simple loss function shown in Figure 1.
- Establishing the connection between risk-aware value functions and optimal policies. The key technical contributions lie in:
- A risk-aware advantage function design under nested risk measures;
- Proof of Bellman-type model equivalence with the Regret Preference Model under nested risk measures.
State Augmentation. We argue that state augmentation is essential, particularly for token-level generation in LLM alignment. The inherent contextual dependencies of text tasks and the non-Markovian characteristics of risk sensitivity naturally necessitate this approach. Compared to DPO-style methods, it introduces negligible additional complexity and computational overhead.
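As a concrete illustration of what token-level deviation over augmented prefix states looks like operationally, here is a minimal PyTorch sketch (ours, not the authors' code) that accumulates a per-token KL against the reference model over prefixes [x, y<t]. The KL direction and any nested-risk transform applied on top of the per-token terms are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def sequential_kl(policy_logits, ref_logits, mask):
    """Accumulate a token-by-token KL(pi_ref || pi_theta) over a response.

    policy_logits, ref_logits: [B, T, V] logits at each prefix state [x, y_<t].
    mask: [B, T], 1.0 on response tokens, 0.0 elsewhere.
    (Direction of the KL and any risk weighting are assumptions, not the
    paper's exact definition.)
    """
    logp = F.log_softmax(policy_logits, dim=-1)    # log pi_theta(. | [x, y_<t])
    logp_ref = F.log_softmax(ref_logits, dim=-1)   # log pi_ref(. | [x, y_<t])
    per_token_kl = (logp_ref.exp() * (logp_ref - logp)).sum(dim=-1)  # [B, T]
    return (per_token_kl * mask).sum(dim=-1)       # one scalar per sequence

# Toy usage with random logits standing in for the two models' outputs.
if __name__ == "__main__":
    B, T, V = 2, 5, 11
    policy_logits = torch.randn(B, T, V)
    ref_logits = torch.randn(B, T, V)
    mask = torch.ones(B, T)
    print(sequential_kl(policy_logits, ref_logits, mask))
```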
Response to Experimental Designs Or Analyses:
To ensure clarity, we provide additional clarification as follows: In Figures 2 and 3, each algorithm is represented by two curves: the darker-colored one shows the raw curve, while the brighter-colored one displays the smoothed version. All curves share the same random seed, primarily because (1) LLM training requires substantial computational resources, and (2) we followed conventional practice by initializing the trained model with parameters from the reference model (the post-SFT model). We have also added experimental results under multiple random seeds to the "The results with different seeds" folder in link 🔗Additional_Experiment_Results.
Response to Questions For Authors:
We apologize for the confusion caused by our oversight. The operator "∘" denotes the concatenation of the state and the action at time step t. In the set of equations (5), the second input to V should indeed be y<t+1; we appreciate this correction of the writing error. π(y^t | [x, y<t]) represents the policy taking the action y^t given the state [x, y<t]. In the next version, we will correct these errors.
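For completeness, the corrected notation can be summarized compactly (our reconstruction, offered for the reviewer's convenience):

```latex
% Reconstructed notation (assumed): augmented state, action, and transition
s_t = [x,\, y^{<t}], \qquad a_t = y^{t}, \qquad
s_{t+1} = s_t \circ a_t = [x,\, y^{<t+1}], \qquad
\pi_\theta\!\big(y^{t} \mid [x,\, y^{<t}]\big),
% so the value function on the right-hand side of Eq. (5) is evaluated at the
% next augmented state, i.e. V([x, y^{<t+1}]).
```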
References:
[1] Safe RLHF: Safe reinforcement learning from human feedback. ICLR, 2024.
[2] Direct preference optimization: Your language model is secretly a reward model. NeurIPS, 2023.
[3] Token-level direct preference optimization. ICML, 2024.
[4] Beyond reverse KL: Generalizing direct preference optimization with diverse divergence constraints. ICLR, 2024.
The paper introduces Risk-aware Direct Preference Optimization (Ra-DPO), a new method for token-level fine-tuning of large language models (LLMs) under higher-order nested risk measures. Path dependency is handled via a Bellman-equation formulation. Comprehensive theoretical remarks and justification are provided regarding consistency with existing methods:
- DPO
- Bradley-Terry model for preference modeling
- Generalization of gradients, viewed through the lens of the gradient calculation
To keep the proposed method comparable with existing methods, the authors devise two strategies in experiments. Empirical results on IMDb, Anthropic HH, and AlpacaEval using GPT-2 Large, Pythia-1.4B, and Pythia-2.8B confirm superior alignment with reduced model drift compared to baselines.
Questions For Authors
Q1. In Line 378, Section 4.3, how does the sampling temperature coefficient impact performance and contribute to Ra-DPO_2?
- In my understanding, given the purpose of ensuring training stability, there is some contradiction between using this temperature setting and adopting the Ra-DPO_2 strategy in Eq. (19), Line 293.
Q2. Have you considered using human evaluators to assess alignment quality beyond automatic metrics?
Claims And Evidence
Yes, the major claims are made with clear evidence.
Methods And Evaluation Criteria
Yes, the methodology is well structured and appropriate for the problem.
A minor limitation of the experiments is that the authors do not conduct an ablation study on the individual impact of the risk control parameters across different datasets.
Theoretical Claims
Several theoretical results are included in this paper.
- (Lemma 3.5) Equivalence Between Ra-DPO and the Bradley-Terry Model
- (Lemma 3.4) Closed-form Solution for Risk-aware Policy Optimization
- The authors prove that the optimization problem has a tractable closed-form solution, enabling efficient implementation.
However, I was not in a position to check all the correctness, especially the details in the proof in the Appendix.
Experimental Designs Or Analyses
The proposed method and experiments are well defined and evaluated:
- Baselines: The authors compare against DPO, PPO, TDPO1, TDPO2, and KTO, providing a comprehensive benchmark.
- Risk control sensitivity: Different values of µ and α are tested in Ra-DPO1 and Ra-DPO2, allowing for a robust analysis of risk-aware preference optimization.
- Datasets: IMDb (sentiment alignment), Anthropic HH (dialogue alignment), and AlpacaEval (comparison-based evaluation) ensure diverse task coverage.
- Metrics: The authors use reward accuracy and KL divergence as key metrics, which are well aligned with the paper's objectives.
Supplementary Material
Yes, I went through the arguments or claims in the main article and found respective proofs or Algorithm procedure details in the Appendix, which was very useful.
However, I was not in a position to check all the correctness, especially the details in the proof in the Appendix.
Relation To Broader Scientific Literature
This paper properly cites prior work.
The key references are found below:
- Direct Preference Optimization (DPO) (Rafailov et al., 2023):
- Ra-DPO extends DPO by incorporating risk sensitivity at the token level.
- Token-level Direct Preference Optimization (TDPO) (Zeng et al., 2024):
- Ra-DPO improves upon TDPO by using nested risk measures for sequential risk control.
- Risk-aware RL (Bisi et al., 2022; Chen et al., 2024):
Overall, the paper makes meaningful claims and has clear notes.
Essential References Not Discussed
The essential references are discussed in this paper.
In a broader landscape, RLHF is exposed to 'intransitivity' risk, because it relies on the Bradley-Terry model as the preference function, where all preferences are transitive by assumption.
- Several works have shown that such a 'transitive' relationship between preference annotations may not always hold, and some techniques addressing this have been explored but are not mentioned in this paper.
- https://arxiv.org/abs/2409.19325 (Duan et al, 2017) presented some evidence and can be of interest for future work.
Other Strengths And Weaknesses
At a high level, this paper opens a door to higher-order risk measurement and control in RLHF.
Besides the completeness of the work, this high-level contribution is the highlight.
Other Comments Or Suggestions
N/A
We sincerely appreciate the valuable comments from the reviewer. We hope our responses below provide further clarity.
Response to Methods And Evaluation Criteria: Yes, the …… different datasets.
In Appendix Figures 6 and 7, we present experimental results conducted on the Anthropic HH dataset using Pythia-1.4B and Pythia-2.8B as base models, where we implemented our algorithms with different risk control parameters.
In addition, we provide several additional experimental results, including a numerical example, several evaluation results of LLMs, the results using nested ERM (entropic risk measure) [1-3] and the results with different seeds, to demonstrate the effectiveness of our algorithm. For details, please refer to the link 🔗Additional_Experiment_Results.
Response to Essential References Not Discussed: The essential …… future work.
We sincerely appreciate the reviewer's recognition of our work. In Appendix A, we examine key factors that introduce risk into the alignment of LLMs, where we highlight the crucial factor that "there exist conflicts and contradictions among human preferences (or choices)", which implies that the 'transitive' relationship between preference annotations may not always hold. Of course, we acknowledge that we inadvertently omitted a citation to the important reference https://arxiv.org/abs/2409.19325. In the new version, we will include the relevant citations.
Response to Question 1: In Line …… Line 293.
In Line 378, Section 4.3, the sampling temperature coefficient is a parameter of AlpacaEval [4], which is a tool to evaluate instruction-following language models based on human annotations. In this paper, we adhered to the default settings in the official AlpacaEval implementation. It should be explicitly noted that this parameter bears no direct relationship to the loss function of our proposed Ra-DPO algorithm.
Response to Question 2: Have you …… automatic metrics?
Thank you very much for your valuable suggestions. Using human evaluators to assess alignment quality beyond automatic metrics would indeed be very convincing. However, as is commonly known, this approach incurs substantial labor costs. For this reason, numerous research efforts are now exploring LLM-based evaluators as substitutes for human evaluators. Many studies [4-7] have shown that LLM-based auto-evaluators have become a key component of the LLM development process due to their cost-effectiveness and scalability compared to human evaluation. In this paper, we use AlpacaEval, a fast and affordable benchmark for chat LLMs that uses LLMs to estimate response quality.
References:
[1] Föllmer, Hans, and Alexander Schied. Convex measures of risk and trading constraints. Finance and stochastics, 2002, 6: 429-447.
[2] Hau, Jia Lin, Marek Petrik, and Mohammad Ghavamzadeh. Entropic risk optimization in discounted MDPs. In AISTATS, 2023.
[3] Fei, Yingjie, Zhuoran Yang, Yudong Chen, Zhaoran Wang, and Qiaomin Xie. Risk-sensitive reinforcement learning: Near-optimal risk-sample tradeoff in regret. In NeurIPS, 2020.
[4] Dubois, Yann, Balázs Galambosi, Percy Liang, and Tatsunori B. Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475, 2024.
[5] Zheng, Lianmin, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin et al. Judging llm-as-a-judge with mt-bench and chatbot arena. In NeurIPS, 2023.
[6] Li, Tianle, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Tianhao Wu, Banghua Zhu, Joseph E. Gonzalez, and Ion Stoica. From crowdsourced data to high-quality benchmarks: Arena-hard and benchbuilder pipeline. arXiv preprint arXiv:2406.11939, 2024.
[7] Lin, Bill Yuchen, Yuntian Deng, Khyathi Chandu, Faeze Brahman, Abhilasha Ravichander, Valentina Pyatkin, Nouha Dziri, Ronan Le Bras, and Yejin Choi. Wildbench: Benchmarking llms with challenging tasks from real users in the wild. arXiv preprint arXiv:2406.04770, 2024.
This paper introduces a risk-aware direct preference optimization method that incorporates a nested risk measure into a token-level objective function. The ultimate objective function maximizes the likelihood of the policy while suppressing the deviation between a training model and the reference model using a sequential risk ratio, thereby enhancing the model’s risk awareness during the process of aligning LLMs. The empirical results demonstrate the superior performance of the proposed method.
Questions For Authors
Please see the above sections.
Claims And Evidence
Please see the section of Other Strengths And Weaknesses.
Methods And Evaluation Criteria
Please see the section of Other Strengths And Weaknesses.
Theoretical Claims
I have checked all the theoretical claims.
Experimental Designs Or Analyses
Please see the section of Other Strengths And Weaknesses.
Supplementary Material
I have reviewed all sections of the supplementary material.
Relation To Broader Scientific Literature
N/A
Essential References Not Discussed
N/A
Other Strengths And Weaknesses
Strengths:
- This paper proposes a novel model by combining risk measures with the token-level preference optimization method, which is an interesting contribution to this field.
- The experiments also showcase the superior performance of the proposed method in balancing alignment performance and model drift.
- The paper is technically sound and well structured.
Weaknesses:
- The motivation of this work is still not entirely clear. As discussed in Lines 50-54, the authors mention that a risk-neutral criterion neglects the characteristics of the reward distribution beyond the mean. This is the primary motivation for risk-aware learning, as demonstrated in many prior works. For example, [1] states that iterated CVaR with parameter α focuses on optimizing the worst α-percent performance at each step and allows the agent to control the risk tightly throughout the decision process. On the other hand, the authors also mention that the proposed method with risk measures aims to achieve a better balance between alignment performance and model drift, where model drift is indicated by a lower KL divergence between policies. This latter motivation does not seem aligned with the initial focus on optimizing worst-case performance or capturing the characteristics of the reward distribution beyond the mean.
- Following the above discussion, the authors need to focus on designing experiments that validate the risk-averse properties of their approach.
- It would be interesting to discuss why risk-aware preference optimization can lead to higher reward accuracy after integrating the risk measure into preference optimization methods, since the primary goal of applying a risk measure is to optimize the worst α-percent performance.
- This work could be improved by evaluating the proposed approach on other risk measures to demonstrate the generality of the method.
Reference:
[1] Chen, Yu, et al. "Provably efficient iterated CVaR reinforcement learning with function approximation and human feedback."
Other Comments Or Suggestions
N/A
We sincerely appreciate the valuable comments from the reviewer. We hope our responses below provide further clarity.
Response to Weaknesses 1:
We apologize for the confusion caused by failing to give a clear explanation and would like to re-clarify our motivation.
Before restating our motivation, we first clarify the following facts:
- The reference model, typically a post-supervised fine-tuned model, demonstrates robust decision-making and reasoning capabilities. Current approaches, including DPO [1], KTO [2], and TDPO [3], establish that optimal alignment should simultaneously maintain minimal deviation from the reference model (lower KL divergence) while aligning with human values (higher reward).
- Experiments in TDPO [3] have demonstrated the advantages of examining divergence against a reference LLM on a more granular, token-by-token basis.
- TDPO [3] focuses only on the expected reward (a risk-neutral criterion), thereby neglecting the characteristics of the reward distribution beyond the mean.
Based on the aforementioned facts and corresponding experimental results, a critical conclusion emerges: significant deviation from the reference model typically indicates a heightened risk of degradation in decision-making and reasoning capabilities.
Motivated by the above facts and conclusion, we introduce nested risk measures to enhance risk-awareness in LLM alignment. Here, "risk" specifically denotes potential hazards arising from deviations relative to the reference model. This is highly valuable in real-world applications, where the goal is to align general-purpose LLMs with human values and intentions (higher reward) without compromising their decision-making and reasoning abilities (lower KL divergence). We hypothesize that risk-aware models employing nested risk measures (e.g., CVaR and ERM) will systematically reduce the probability of policy options with potentially catastrophic consequences (failure, harmful outputs, or deception) during policy optimization.
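For concreteness, the generic nested construction and the two one-step risk measures mentioned above can be written as follows. These are textbook forms in illustrative notation; sign conventions for risk aversion vary across the literature, and the paper's own definitions take precedence.

```latex
% Nested (iterated) risk recursion: a one-step risk measure \rho replaces the
% expectation at every step of the value recursion,
V^{\rho}_{t}(s_t) = \rho\big(\, r_t + V^{\rho}_{t+1}(s_{t+1}) \;\big|\; s_t \big),
% with, for example,
\mathrm{CVaR}_{\alpha}(Z) = \frac{1}{\alpha}\int_{0}^{\alpha} \mathrm{VaR}_{u}(Z)\, du
  \quad\text{(mean of the worst } \alpha\text{-fraction of outcomes)}, \qquad
\mathrm{ERM}_{\beta}(Z) = \frac{1}{\beta}\log \mathbb{E}\big[e^{\beta Z}\big].
```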
Response to Weaknesses 2:
We provide several additional experimental results, including a numerical example, several evaluation results of LLMs, the results using nested ERM (entropic risk measure) [4-6] and the results with different seeds, to demonstrate the effectiveness of our algorithm. For details, please refer to the link 🔗Additional_Experiment_Results.
Response to Weaknesses 3:
We provide discussion about why risk-aware preference optimization can lead to higher reward accuracy as follows:
As shown in Figures 6 and 7, we can observe that risk-aware preference optimization (Ra-DPO) achieves higher reward accuracy when implemented with Pythia-1.4B as the base model on the Anthropic HH dataset. This effect disappears when using Pythia-2.8B, which we attribute to the greater potential for reward accuracy improvement in smaller models. This is evident from several failed experiments we conducted: smaller models (Pythia-14m, Pythia-70m, and Pythia-160m) are more prone to model drift after thousands of iterations, resulting in empty outputs or invalid responses (extremely brief answers that fail to address the question). The proposed risk-sensitive method, incorporating risk-awareness into the token-level objective function, addresses this through risk-averse policy optimization. This method is more likely to effectively balance alignment performance and model drift, thereby preventing model failure in certain aspects and achieving an improvement in reward accuracy.
Response to Weaknesses 4:
Thank you for the valuable suggestions. We add the experimental results using nested ERM to demonstrate the effectiveness of our algorithm, which is conducted on Anthropic HH dataset with Pythia-1.4B serving as the base model. The experimental results can be found at the link 🔗Additional_Experiment_Results. Experimental results show that our algorithm (with Nested-ERM) also achieves consistently lower KL divergence and higher reward accuracy compared to baseline methods.
References:
[1] Rafailov, Rafael, Archit Sharma, Eric Mitchell, et al. Direct preference optimization: Your language model is secretly a reward model. In NeurIPS, 2023.
[2] Ethayarajh, Kawin, Winnie Xu, et al. Model alignment as prospect theoretic optimization. In ICML, 2024.
[3] Zeng, Yongcheng, Guoqing Liu, et al. Token-level direct preference optimization. In ICML, 2024.
[4] Föllmer, Hans, and Alexander Schied. Convex measures of risk and trading constraints. Finance and stochastics, 2002, 6: 429-447.
[5] Hau, Jia Lin, Marek Petrik, and Mohammad Ghavamzadeh. Entropic risk optimization in discounted MDPs. In AISTATS, 2023.
[6] Fei, Yingjie, Zhuoran Yang, et al. Risk-sensitive reinforcement learning: Near-optimal risk-sample tradeoff in regret. In NeurIPS, 2020.
The paper proposes a risk-sensitive variant of the direct preference optimization (DPO) algorithm that maximizes a risk measure, rather than the expected reward, while incorporating standard KL regularization. The results show that this approach outperforms the baseline in terms of expected reward, achieving higher performance with a lower KL divergence. While this improved performance may be of interest, the paper and the rebuttal do not clearly explain how optimizing the risk measure ultimately contributes to maximizing the expected reward. The paper could be made much stronger with further discussion, possibly including an ablation study, of where the performance gain stems from.