Rating: 6.1 / 10
Poster · 4 reviewers
Scores: 3, 2, 3, 5 (min 2, max 5, std dev 1.1)
ICML 2025

Design Considerations in Offline Preference-based RL

OpenReview · PDF
Submitted: 2025-01-23 · Updated: 2025-07-24
TL;DR

We study offline preference-based RLHF algorithms like DPO, and provide theoretical guarantees on the quality of the learned policy for a large class of methods.

Abstract

Keywords
Reinforcement Learning from Human Feedback · RLHF · DPO

Reviews and Discussion

Review (Rating: 3)

The paper provides a theoretical study of offline learning methods from human preferences. The authors first establish a unified framework and the relevant assumptions, which fit nearly all preference-based learning losses. They then propose a benchmark policy, which is used to measure the quality of a policy obtained by optimizing empirical losses on preference data. Theorem 3.6 bounds the closeness of these two policies. The findings suggest that the squared loss is the best choice, in that its optimizer is closest to the benchmark "ideal" policy among the considered losses. These findings are also supported by simple experimental results on language models.

Questions for Authors

  1. I understand that in practice, the reference policies and the policies to be optimized are often non-degenerate. However, they can have negligible entries, in the sense that $\mu(y|x) \approx 0$ or $\pi(y|x) \approx 0$ for some $x, y$. That makes the bounds on $\log\pi(y|x)$ or $\log(\pi(y|x)/\mu(y|x))$ extremely large, especially when taken over the data-generating distribution rather than only over the data itself. Moreover, the presence of such a bound as an exponent in the upper bounds of Theorem 3.6 seems concerning in such border cases. Can Assumption 3.1 be relaxed? If it does not hold, is there a way to obtain meaningful bounds? (A small numeric sketch after these questions illustrates the scale involved.)

  2. Although the squared loss appears superior to the other losses under the stated assumptions, it is not particularly clear how realistic it is. Specifically, the logistic loss, for instance, is not an arbitrary choice but results from collapsing the RLHF pipeline into a single step, so there is an inherent intuition behind its definition. Is there a practical intuition behind the squared loss? Under what preference generation model, if any, does it make sense?
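
A minimal numeric sketch of the blow-up described in question 1; the probability values below are hypothetical and chosen only for illustration:

    import math

    # Hypothetical probabilities for a single (x, y) pair -- not taken from the paper.
    pi = 1e-8   # learned policy assigns negligible mass to this response
    mu = 1e-1   # reference policy assigns moderate mass

    B = abs(math.log(pi / mu))   # magnitude of the log-ratio that Assumption 3.1 must bound
    print(f"|log(pi/mu)| = {B:.1f}, exp(B) = {math.exp(B):.1e}")
    # prints: |log(pi/mu)| = 16.1, exp(B) = 1.0e+07

Even a single response with probability on the order of 1e-8 under one policy pushes the log-ratio bound to roughly 16, and any term exponential in that bound to around ten million.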

Claims and Evidence

The claims made in the submission are supported by rigorous theoretical proof and experimental results.

Methods and Evaluation Criteria

The proposed models, dataset and losses are all relevant to the considered setting.

Theoretical Claims

I checked the correctness of the proof of Theorem 3.6 and, to the best of my understanding, it is clear and concise.

Experimental Design and Analysis

I did not replicate the experimental results of the paper.

Supplementary Material

I have not reviewed the supplementary material.

Relation to Prior Literature

The key contribution of the paper is the theoretical insight of the benefits of using squared losses when directly optimizing over a given preference dataset. These findings are definitely relevant to the literature of learning from human preferences.

Missing Essential References

There are no essential references missing, although the paper could benefit from a more comprehensive related-work section on the broader theoretical RLHF literature (as opposed to the narrower offline PbRL literature that is mentioned).

Other Strengths and Weaknesses

The paper is clear, concise and easy to follow. The contribution is original in that it unifies previously proposed loss functions under the assumptions that all of them appear to satisfy, and provides a general upper bound that depends on the parameters of interest. The implications of Theorem 3.6 point to the apparent superiority of IPO in producing a policy closer to the benchmark policy, which makes for an interesting finding.

Other Comments or Suggestions

N/A

Author Response

We thank you for your encouraging feedback on our work. We address your key questions below.

1. Relaxing Assumption 3.1: You are correct in noting that an exponential dependence on the size of the log-ratios is not ideal, but this limitation is shared by a line of prior works [1], [2], [3]. While it would be desirable to weaken this dependence, we also note that our experiments clearly show learning quality degrading as the log probability ratios grow large during training. As we see in Figures 1 and 2, the drop in log probabilities under the learned policy coincides with the degradation in the performance of DPO. Since the reference policy has well-behaved log probabilities, the probability ratios look nearly identical to the learned policy's log probabilities here. So we believe the bounds qualitatively capture a phenomenon actually realized in the experiments, even though improving the precise functional form to something milder than exponential remains a direction for future work.

2. Generating model underlying squared loss: We thank you for the insightful question about the generating model underlying the squared loss. We refer you to Appendix A, which provides this discussion for general link functions, and gives the specific preference generation model for the squared loss in lines 591-592: $P(\omega = 1 \mid x, y, y') = \frac{1}{2} + \frac{R^\star(x, y) - R^\star(x, y')}{2}$.
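
As a brief sketch of why the squared loss is proper under this link (in our notation here; the paper's exact parameterization may differ), write $f$ for the model's predicted preference probability and $p^\star = P(\omega = 1 \mid x, y, y')$ for the true one. Then

    \mathbb{E}_{\omega}\big[(f - \omega)^2\big] = (f - p^\star)^2 + p^\star(1 - p^\star),

which is minimized pointwise at $f = p^\star$; under the linear link above, this minimizer identifies the reward gap via $R^\star(x, y) - R^\star(x, y') = 2p^\star - 1$.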

We thank you again for your thoughtful feedback, and will add these discussions to the final version of the paper.

Review (Rating: 2)

This paper explores the theoretical aspects of offline preference-based reinforcement learning (PBRL). It examines a broad range of offline PBRL methods, including DPO and IPO, and establishes theoretical bounds on the sub-optimality of the policies learned by these methods. The analysis is based on specific assumptions regarding the loss functions and base policies, with the derived bounds also depending on certain quantities determined by the loss functions. Consequently, the theoretical findings offer some insights into selecting appropriate loss functions and base policies. Lastly, the paper presents experiments to validate the theoretical results.

Questions for Authors

  1. What is the meaning of the notation $\Pi_{\mathcal{R}}$ under Assumption 3.1?

Claims and Evidence

Some of the key claims made in this paper are not well supported by clear and convincing evidence.

  1. The paper claims that its theoretical results provide insights into design choices such as the selection of loss functions and base policies for offline RLHF. However, the provided evidence does not fully support this claim. Regarding the loss function, Remark 3.7 suggests that the squared loss is preferable due to its curvature properties but fails to satisfy the realizability assumption. However, the paper does not extend its theoretical analysis to scenarios where this assumption is violated. As a result, it remains unclear when one should choose a particular loss function in practice. Similarly, regarding the base policy, Remark 3.8 states that its selection affects the benchmark policy. This implies that different variants with distinct base policies are not evaluated against a common standard, reducing the meaningfulness of theoretical comparisons.
  2. Additionally, there is a significant gap between the theoretical analysis and the experimental validation. The primary discrepancy lies in the performance metrics used: the theoretical results rely on the KL divergence with respect to the benchmark policy, whereas the experiments evaluate performance using an evaluation preference metric and the log probabilities of the winning and losing responses. Since the theoretical and empirical evaluations employ fundamentally different measures, the experimental results do not convincingly support the proposed theory.

Methods and Evaluation Criteria

This paper primarily analyzes existing methods rather than proposing a new one. However, it introduces a benchmark policy specifically for the theoretical analysis of offline RLHF methods. This benchmark policy is defined as the policy that attains the pointwise minimum of the loss function. Under the Bradley-Terry (BT) model assumption, this benchmark policy coincides with the standard KL-regularized reward-maximizing policy. Given this alignment, I find the proposed benchmark policy to be a reasonable choice for studying offline RLHF.
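
For context, the standard closed form of this KL-regularized reward-maximizing policy (with $\beta$ the regularization strength and $\mu$ the base policy; the notation follows the DPO literature and may differ slightly from the paper's) is

    \pi^\star(y \mid x) \;\propto\; \mu(y \mid x)\, \exp\!\big( R^\star(x, y) / \beta \big).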

Theoretical Claims

I only checked the proof sketch in Section 4 and did not check the detailed proof. I feel the results are correct.

Experimental Design and Analysis

Please see my comments in the above Claims And Evidence part.

Supplementary Material

I had a rough look at the appendix, which includes a discussion of proper losses and experimental details.

Relation to Prior Literature

This paper studies offline RLHF for LLM alignment and may inspire advancements toward training better LLMs.

Missing Essential References

To the best of my knowledge, this paper provides a sufficiently thorough discussion of all closely related works.

Other Strengths and Weaknesses

Please see my comments in the above Claims And Evidence part.

Other Comments or Suggestions

  1. The notation $\ell$ is used inconsistently. Before line 255, the loss function $\ell$ is defined as taking a single variable as input. However, in line 255, $\ell$ is instead expressed as a function of two variables. This inconsistency may cause confusion and should be clarified.

Author Response

We thank you for your thoughtful feedback on our work. You are right in noting that the main contributions of our work are theoretical, and not in proposing new methods. We address your main concerns with our theoretical results below.

Squared loss and realizability: You are correct that the squared loss is not a clear win across all the theoretical criteria, and that it rests on a less natural realizability assumption. We note that standard arguments let us extend our analysis to the case where the realizability assumption holds only up to a uniform error $\epsilon$. We did not include this in the submission for clarity of exposition, but can easily add it to the final version. More generally, our view is that realizability is an unverifiable assumption in practice, but it is perhaps more likely to hold for rich model classes such as LLMs. At the same time, the curvature of the loss is something we can control much better. So an approach that favors the squared loss as the default objective based on its curvature properties is practically reasonable: it will yield good results unless realizability is violated, in which case a practitioner can switch to other losses, such as the logistic loss. This tradeoff between intuitive modeling fidelity and optimization and learning tractability is also often encountered in the literature on generalized linear models, and we will expand on this discussion in the final submission.

Choice of base policy affects benchmark policy: You are also correct that the choice of the base policy affects the benchmark policy. As we note on line 161 (right), the most natural benchmark, which would be independent of these choices, cannot be recovered by the class of offline methods studied here, including DPO and its variants. That said, the base policy is a part of the loss function, and it is quite standard, even in the literature on simpler classification and regression setups, that when the loss changes, the point of convergence changes accordingly; we can only analyze convergence to the optimal solution under the chosen loss. Indeed, as we note in Remark 3.8, the choice of base policy plays no explicit role in the convergence bound, but only a more subtle role in influencing the realizability assumption and the benchmark policy. We think it is still useful to uncover the different points of convergence that different practical choices correspond to, so that we can prefer methods which correspond to desirable points of convergence.

Gap between the performance measures in theory and experiments: We agree that it would have been great if we could just have a similar KL bound or an easy surrogate in our experiments as well. Unfortunately, without access to the unknown benchmark policy, this is an impossible task. Consequently, we measure the most natural quantity that we could think of empirically, and that is also consistent with the broader literature. While not perfect, we believe this is still insightful. If the reviewer has additional recommendations for performance measures that are measurable, and that would narrow the theory-practice gap, we would be happy to include them in the final submission.

Notation: We apologize for the inconsistency in the notation for the loss; we will fix it. The notation $\Pi_{\mathcal{R}}$ is a typo from an earlier version. It should just be $\Pi$.

Thanks again for your insightful comments. We hope that our responses address some of your concerns, and we would be happy to discuss further if you have additional feedback.

Review (Rating: 3)

The paper offers a theoretical analysis of offline preference-based RL algorithms such as DPO. Specifically, the authors investigate an empirical observation: offline preference-based RL is often worse than online RL, and faces some degeneracies during optimization.

To set up the problem, the authors first define a general policy class that accommodates different regularization methods, and then make five assumptions: bounded log probabilities and loss; realizability; proper loss; coverage; and curvature.

The main conclusion is Theorem 3.6, an error bound determined by $c_\mu$, $C$, and $\epsilon$. As a result, losses with higher curvature have a larger $c_\mu$ and therefore better error bounds. Moreover, the base policy and the constraints also play a role: different base policies may change $\pi^\star$, and different constraints make different assumptions on $\pi^\star$. Lastly, the authors show that DPO may have a large $R$ and a small $c_\mu$, causing the degeneracies observed in previous works.

Experiments show that the squared loss is superior to the logistic loss and that $\pi_{ref}$ performs better than a uniform base distribution.

Questions for Authors

See above. My main concern is that Assumption 3.5 seems a bit strong and may prevent analysis of other losses; I will be happy to raise my score if the authors can address these concerns.

Claims and Evidence

Square loss is better than logistic loss

Evidence:

  • The $c_\mu$ of the squared loss is ½, yielding a reasonable bound
  • The experiment shows that square loss is better than logistic loss

Question:

  • Should $c_\mu$ be 2 instead of ½?

Base policy matters, and $\pi_{ref}$ is a relatively good choice

Evidence:

  • Jointly supported by Section 3.1 and experiment

Different constraints play a role in the $\pi^\star$ assumption, which may affect results

Evidence:

  • Remark 3.9.

Question:

  • Is it possible to draw a plot of CPO to justify this assumption?

DPO faces degeneracies because of large $R$ and low $c_\mu$

Evidence:

  • Remark 3.10

Methods and Evaluation Criteria

The methods and evaluation make sense.

Theoretical Claims

Strengths:

  • I had a rough glance at Section 4 and did not find obvious mistakes.

Weakness:

  • Assumption 3.5 may be too strong and prevent analysis of many possible losses, for example, the sigmoid loss.

Experimental Design and Analysis

Strengths:

  • Experiments and analyses make sense.

Weakness:

  • As I mentioned earlier in “Claims and Evidence”, it would be better to add experiments such as CPO to justify the claim about constraints.

Supplementary Material

N/A

Relation to Prior Literature

Strengths:

  • Unlike other papers such as GPO, which give empirical analyses, this paper provides a unified theoretical analysis and understanding of offline preference-based RL.

Weakness:

  • Assumption 3.5 restricts the theoretical analysis to a limited class of losses. Specifically, the authors analyze the squared loss and show its superiority to the logistic loss. However, similar analyses and conclusions have appeared in prior work such as IPO.

Missing Essential References

It may be beneficial to discuss the differences between IPO’s theoretical analysis and this paper’s.

Other Strengths and Weaknesses

Other Strengths

  • It is good to see that the authors report error bars in the experiments; this should be encouraged.
  • I personally think the paper is clearly written overall.

Other Weaknesses

  • See above

Other Comments or Suggestions

  • Line 366 right side: “consistently” -> “consistently better”
  • Line 373 right side: “Figure 1(left)” -> “Figure 1 (right)”

Update after rebuttal

I keep my score.

Author Response

We thank you for your encouraging and thoughtful comments.

Severity of restrictions due to Assumption 3.5: You are correct in noting that Assumption 3.5 places some restrictions on the loss function, but we believe that these restrictions are relatively mild and standard as we discuss below:

  1. Please note that the curvature condition is not pointwise (i.e., per $(x, y, y')$ triple) but holds only in expectation, and it is not uniform but only local, around the optimal policy $\pi^\star$. So the required assumption is local curvature of the expected loss around the optimal solution, which is a very common assumption in optimization and convergence theory.
  2. The curvature assumption, while not explicitly stated, is responsible for the $e^B$ term in almost all prior analyses within the Bradley-Terry-Luce model with the logistic loss, such as [1], [2] and [3]. Please also see Remark 4.9 in [1] for an explicit mention of this aspect. In fact, even in simpler settings such as the analysis of generalized linear models, a similar curvature assumption appears in [4], Assumption 1.

While we agree that weakening such conditions further would be highly desirable, addressing this goes well beyond the study of preference-based learning methods, and would be a significant undertaking to be carried out in a separate work.
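
To make the curvature point concrete, here is a standard calculation (a sketch in our notation, with $z$ the margin passed to the logistic link and $\sigma$ the sigmoid) showing where the exponential factor originates:

    \ell_{\log}(z) = \log\big(1 + e^{-z}\big), \qquad
    \ell_{\log}''(z) = \sigma(z)\,\sigma(-z) \;\le\; e^{-|z|}.

If the margin can reach magnitude $B$, the local curvature of the logistic loss can be as small as $e^{-B}$, which surfaces as an $e^{B}$ factor in the resulting bounds; the squared loss, by contrast, has a constant second derivative, so its curvature constant does not degrade with $B$.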

Differences to the analysis in the IPO paper: We apologize for omitting from the submission a discussion of the differences from the analysis in the IPO paper. The IPO authors indeed note the issue of the curvature of the link function, and use it to motivate the IPO algorithm. However, there is no explicit quantitative analysis of the role of the link function beyond a simple example, since their focus is more on obtaining a practical algorithm for the identity link. We instead derive precise convergence guarantees, and also highlight the role of offline data coverage, which is not captured in the IPO paper.

We thank you again for the insightful comments, and we will update the final submission with the discussions above, as well as address your other suggestions.

References:

  • [1] Sharp Analysis for KL-Regularized Contextual Bandits and RLHF, arXiv:2411.04625.
  • [2] Principled Reinforcement Learning with Human Feedback from Pairwise or K-wise Comparisons, ICML 2023.
  • [3] Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-constraint, ICML 2024.
  • [4] Parametric Bandits: The Generalized Linear Case, NeurIPS 2010.
Reviewer Comment

Thanks to the authors for the rebuttal. After reading it, I have decided to maintain my inclination toward acceptance.

Review (Rating: 5)

The paper investigates offline reinforcement learning methods that use a fixed dataset of responses and human preference feedback to align language models. It analyzes how various design choices in methods like DPO, IPO, and SLiC affect the quality of the learned policy. The study provides a unified theoretical framework that bypasses traditional reparameterization arguments and includes an empirical validation on a standard summarization benchmark.

Questions for Authors

N/A

Claims and Evidence

The claims about the sub-optimality bound and other theoretical insights are well supported.

Methods and Evaluation Criteria

The benchmark DPO methods are appropriate and diverse enough. Multiple loss functions are also studied. Given the loss function and the data coverage assumption, the main bounds are derived.

Theoretical Claims

The analysis in Section 4 contains an accurate derivation of the proof of Theorem 3.6, and the argument is intuitive.

Experimental Design and Analysis

The experimental results might be limited in the variety of datasets and backbone models. However, the main ablation sufficiently demonstrates the theoretical point regarding the assumptions.

Supplementary Material

N/A

Relation to Prior Literature

The paper relates to major DPO methods [1, 2], focusing on various design choices and how they affect learning an optimal policy.

[1] Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., and Finn, C. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024b.

[2] Azar, M. G., Guo, Z. D., Piot, B., Munos, R., Rowland, M., Valko, M., and Calandriello, D. A general theoretical paradigm to understand learning from human preferences. In International Conference on Artificial Intelligence and Statistics, pp. 4447–4455. PMLR, 2024.

Missing Essential References

Multi-negative DPO [1] could be further discussed. In addition, group-based regularization [2] could also be discussed.

[1] Chen, Yuxin, Junfei Tan, An Zhang, Zhengyi Yang, Leheng Sheng, Enzhi Zhang, Xiang Wang, and Tat-Seng Chua. "On softmax direct preference optimization for recommendation." arXiv preprint arXiv:2406.09215 (2024).

[2] Ramesh, Shyam Sundhar, Yifan Hu, Iason Chaimalas, Viraj Mehta, Pier Giuseppe Sessa, Haitham Bou Ammar, and Ilija Bogunovic. "Group robust preference optimization in reward-free rlhf." Advances in Neural Information Processing Systems 37 (2024): 37100-37137.

Other Strengths and Weaknesses

The theoretical bound on sub-optimality is intuitive and useful.

Empirically, the setting could use more ablation studies, including various backbone models, datasets, etc.

Other Comments or Suggestions

See Strengths and Weaknesses

Ethics Review Issues

N/A

Author Response

Thank you for the encouraging feedback. We will add the suggested references [1, 2] to our discussion.

Final Decision

This paper presents a unified theoretical framework for RLHF algorithms. It provides several interesting and intuitive theoretical insights (e.g., squared vs. logistic loss, early stopping, etc.) and validates them in experiments. Although the paper is concisely written, it is quite informative and brings clarity to the community. Therefore, I recommend acceptance of this paper.