6.0

/10

Poster4 位审稿人

最低6最高6标准差0.0

3.3

置信度

正确性2.8

贡献度2.3

表达2.3

ICLR 2025

The Crucial Role of Samplers in Online Direct Preference Optimization

Ruizhe Shi,Runlong Zhou,Simon Shaolei Du

OpenReview PDF

提交: 2024-09-19更新: 2025-02-11

TL;DR

We study convergence rates of (online) DPO from optimization perspective, and show the impact of samplers through a theoretical separation and empirical experiments.

摘要

Direct Preference Optimization (DPO) has emerged as a stable, scalable, and efficient solution for language model alignment. Despite its empirical success, the optimization properties, particularly the impact of samplers on its convergence rates, remain under-explored. In this paper, we provide a rigorous analysis of DPO's convergence rates with different sampling strategies under the exact gradient setting, revealing a surprising separation: uniform sampling achieves $linear$ convergence, while our proposed online sampler achieves $quadratic$ convergence. We further adapt the sampler to practical settings by incorporating posterior distributions and logit mixing, demonstrating improvements over previous methods. For example, it outperforms vanilla DPO by over $7.4$% on Safe-RLHF dataset. Our results not only offer insights into the theoretical understanding of DPO but also pave the way for further algorithm designs.

关键词

direct preference optimizationonline DPOtabular softmax policy

评审与讨论

审稿意见

评分: 6置信度: 22024-10-23

This paper provides DPO's convergence rates with different sampling strategies under the exact gradient setting, and proves that uniform sampling achieves linear convergence while the proposed online sampler achieves quadratic convergence. Then this paper adapts the sampler to practical settings by incorporating posterior distributions and demonstrates significant improvements over previous approaches.

优点

DPO is a very popular and important topic. The perspective of sampling strategy looks novel. The claimed quadratic convergence looks significant and impressive.

缺点

The presentation is very unclear. The unclear points are listed below.

问题

(1) At the end of Section 3.1, should $\theta$ and $\pi_{\theta}$ also depend on $x$ ? In that case, we have $\theta\in\mathbb{R}^{\mathcal{X}\times\mathcal{Y}}$ with entries $\theta_{x,y}$ .

(2) In Definition (1), what is the expression of the stopping-gradient operator $sg$ ? Could you provide an intuitive explanation about why we use $\pi^s(y,y')$ ? "The sampling coefficient $\alpha$ is for the purpose of comparing different sampling regimes", do you mean to compare $\pi^{\rm s1}$ and $\pi^{\rm s2}$ ? Does Eq. (4) implicitly include expectation over prompt $x$ ?

(3) In Definition (2), does $G^{(t)}\in\mathbb{R}^{|\mathcal{Y}|}$ and is $G_y^{(t)}$ the $y$ -th entry of $G^{(t)}$ ? It is better to explain the distribution of $G_y^{(t)}$ . For example, is $G_y^{(t)}$ the true gradient plus sub-Gaussian noise scaled by $\beta A$ ? Why do you use sub-Gaussian noise instead of Gaussian noise?

(4) Could you provide an intuitive explanation why we select $\pi^{s1}$ and $\pi^{s2}$ in Definitions 4 and 5?

(5) The derivation of (6) looks non-trivial and thus could be proved in the main text or the appendix.

(6) What does parameter difference (y-axis) mean in Figure 1?

2024-11-14

Questions:

[A1] No, we have stated in the preliminary that "we will omit the prompts (contexts) and slightly abuse the notations throughout Sections 3 and 4," and thus, no $x$ needs to be involved. The results can be easily adapted to the contextual bandit setting (depending on $x$ ), so our assumption is without loss of generality.

[A2] (i) As a common notation in machine learning literature, $h(\theta) = \text{sg}(f(\theta)) \cdot g(\theta)$ leads to $\nabla_\theta h(\theta) = f(\theta) \cdot \nabla_\theta g(\theta)$ .

(ii) We use $\pi^s(y,y')$ to simplify the expression of $\pi^{s1}(y)\pi^{s2}(y') + \pi^{s1}(y')\pi^{s2}(y)$ , which is the probability that the pair $(y,y')$ or $(y',y)$ is sampled by the sampler pair $(\pi^{s1}, \pi^{s2})$ . If your question is about the design of $(\pi^{s1}, \pi^{s2})$ in different sampling regimes (uniform, with known reward, and practical setting), please refer to our response to all reviewers: "Explanation of sampler design."

(iii) No, we don't mean to compare $\pi^{s1}$ and $\pi^{s2}$ . $\pi^{s1}$ and $\pi^{s2}$ form a sampler pair, each for one response $y_1$ and $y_2$ in a data pair $(y_1,y_2)$ , respectively. When we say "sampling regimes," we refer to DPO-Unif, DPO-Mix-R, and DPO-Mix-P, corresponding to uniform sampling, sampling with known reward, and the practical setting, respectively. The reason why we use different $\alpha$ s can be found in our response to all reviewers.

(iv) Since we omit the prompts, Eq. (4) does not contain an expectation over $x$ . As we stated, the results can be easily extended to the case with $x$ , the complete form of Eq. (4) (and in practice) should contain such an expectation.

[A3] Regarding the notations, we will add explanations in the revision.

(i) Yes, $G^{(t)} \in \mathbb{R}^{|\mathcal{Y}|}$ and $G_y^{(t)}$ is the $y$ -th entry of $G^{(t)}$ . This can be inferred from Def. 2, which indicates that $G^{(t)}$ has the same shape as $\theta^{(t)}$ .

(ii) Yes, $G_y^{(t)}$ is the true gradient plus a sub-Gaussian noise scaled by $\beta A$ .

(iii) Gaussian noise is a special case of sub-Gaussian noise (please see https://en.wikipedia.org/wiki/Sub-Gaussian_distribution#Examples). We believe that it is better to have a more general result by modeling the noise under a broader class.

[A4] Please refer to our response to all reviewers: "Explanation of sampler design."

[A5] Thank you for this suggestion. We prove it here and will add this proof to the appendix in the revision.

From the rule of gradient descent Eq. (5), we know that $\theta^{(t+1)} = \theta^{(t)} - \eta\alpha\nabla_\theta\mathcal L(\theta^{(t)})$ (Eq. A). Note that $\delta(y,y';\theta^{(t+1)}) = r(y) - r(y') - \beta\theta_y^{(t+1)} + \beta\theta_y^{(0)} + \beta\theta_{y'}^{(t+1)} - \beta\theta_{y'}^{(0)}$ (Eq. B). Apply Eq. A to Eq. B, we get $\delta(y,y';\theta^{(t+1)}) = \delta(y,y';\theta^{(t)}) + \eta\beta\alpha\nabla_{\theta_y}\mathcal L(\theta^{(t)}) - \eta\beta\alpha\nabla_{\theta_{y'}}\mathcal L(\theta^{(t)})$ . Then apply $\nabla_{\theta_y}\mathcal L(\theta) = -\beta\sum_{y'}\pi^{s}(y,y')\Delta(y,y')$ , which is shown in the equation above Eq. (6), we finally get Eq. (6).

[A6] Thank you for pointing this out! We forgot to explain this in our main content. The $x$ -axis is the number of gradient updates, and the $y$ -axis is the total parameter difference $\sum_{y, y'} \delta(y, y'; \theta^{(t)})^2$ .

评论- sg operator still looks vague in Definition 1

2024-11-19

You said $h(\theta)=sg\big[f(\theta)\big]\cdot g(\theta)$ means $\nabla_{\theta}h(\theta)=f(\theta)\cdot \nabla_{\theta} g(\theta)$ .

Based on that, the definition of $\pi^s(y,y')={\rm sg}\big[\pi^{\rm s1}(y)\pi^{\rm s2}(y')+\pi^{\rm s1}(y')\pi^{\rm s2}(y)\big]$ in Definition (1) still looks vague.

I cannot find stopping gradient operator by both Google and AI search. They all refer to the criterion of when to stop (stochastic) gradient descent algorithm, which seems far from your definition.

Could you write down explicitly the definition of $\pi^s(y,y')={\rm sg}\big[\pi^{\rm s1}(y)\pi^{\rm s2}(y')+\pi^{\rm s1}(y')\pi^{\rm s2}(y)\big]$ in both comment and edited paper (can be uploaded now)? The most reasonable guess I can think of is $\pi^s(y,y')=\big[\pi^{\rm s1}(y)\pi^{\rm s2}(y')+\pi^{\rm s1}(y')\pi^{\rm s2}(y)\big]/2$ as the integral of each policy is 1, right? This is important as Definition 1 is the basis of this paper.

Thanks.

评论- Clarification on the stop gradient operator

2024-11-19

You can find the stopping gradient operator in various locations: TensorFlow (https://www.tensorflow.org/api_docs/python/tf/stop_gradient), JAX (https://jax.readthedocs.io/en/latest/_autosummary/jax.lax.stop_gradient.html), and (using a different name, called detach) PyTorch (https://pytorch.org/docs/stable/generated/torch.Tensor.detach.html). The definition of $\pi^s$ is to treat it as a static number instead of a function which might depend on $\theta$ , which will be used in calculating the gradient of Equation (4), leading to the results in Section 4.1. The step using the stopping gradient is: $\nabla_\theta \left\\{ -\sum_{y,y'\in \mathcal{Y}} \pi^s(y,y') p^\star(y\succ y')\log\sigma\left(\beta\log\frac{\pi_\theta(y)\pi_{\text{ref}}(y')}{\pi_{\text{ref}}(y)\pi_\theta(y')} \right) \right\\} = -\sum_{y,y'\in \mathcal{Y}} \pi^s(y,y') p^\star(y\succ y') \nabla_\theta \log\sigma\left(\beta\log\frac{\pi_\theta(y)\pi_{\text{ref}}(y')}{\pi_{\text{ref}}(y)\pi_\theta(y')} \right) ,$ as $\pi^s$ has its gradient stopped.

The reason that we use stopping gradient operator is that it is usually used in existing online DPO approaches: for each iteration, they first sample a new dataset and then train on it, thus there is no gradient required from the sampler.

We will update our paper shortly once our new experiments are finished.

评论- The paper looks clear now and I increased my rating to 6.

2024-11-19

Thank the authors for your elaboration. Now the paper looks clear and novel to me and I would like to increase my rating to 6.

So I just continued reading from where I stopped last time, and have additional questions.

(7) What is the range of $\xi_R$ in the Taylor expansion right below Theorem 2? You could indicate in the paper.

(8) What are the choices of $\alpha_1,\alpha_2,\eta$ in your experiments? You could add to your paper.

(9) In the paragraph "Setting the posterior", what do posterior and its corresponding prior and likelihood mean? Do you intend to use $\pi_{\theta}^{2\beta}$ to approximate $\pi^*$ . The derivation looks vague to me. Also, does Eq. (8) provide $\alpha_1:\alpha_2$ ?

2024-11-19

Thank you for raising the rating. We address your additional questions here.

[A7] Thank you for pointing this out. As we stated in Appendix A.2, $\xi_R$ is a value between $r(a)-r(a')$ and $\beta\log\frac{\pi_\theta(a)\pi_{\text{ref}}(a')}{\pi_{\text{ref}}(a)\pi_\theta(a')}$ . We will indicate this in the main content right below Theorem 2 in the revision.

[A8] We have two sets of experiments: 1) Bandit experiments, where we use the same $\alpha$ value as in theory, and $\eta$ is shown in Appendix D.1. 2) Language model experiments, where we show the approximated $\alpha_1:\alpha_2$ in Appendix C "Implementation of mixed samplers and reward margin" (which is 3:7, as inspired by Eq.(8)), and $\eta$ is shown in Appendix C "Hyperparameters". We will make it clear in the revision.

[A9] This section provides an intuitive explanation for how we can extend our proposed samplers to practice. The prior distribution of $\mathcal{Y}$ is uniform, but it is clear that not all responses in $\mathcal{Y}$ are equally important. For example, some meaningless sentences like '###&asdf' do not need to be considered. Motivated by this, we can thus set a posterior distribution on $\mathcal{Y}$ . Then, the theoretical samplers would naturally change as shown in the paragraph "Setting the Posterior." For example, $(\pi_{\theta}, \pi_{\theta})$ is represented by $(\text{Uniform}(\mathcal{Y}), \text{Uniform}(\mathcal{Y}))$ with a posterior as $\pi_{\theta}$ on $\mathcal{Y}$ . It is similar for other regimes. In other words, our initial theories mainly focus on the policy difference $\log\frac{\pi^{s1}}{\pi^{s2}}$ between the heterogeneous samplers (since we can eliminate the $\text{Uniform}(\mathcal{Y})$ in policy difference like $\log\frac{\text{Uniform}(\mathcal{Y})}{\text{Uniform}(\mathcal{Y})}=\log\frac{\pi_{\theta}}{\pi_{\theta}}=0$ ).

To align our theory with practice, a concern would be which posterior distribution is more useful and practical? In Section 5, by the performance difference lemma, we show that $V^\star-V^\theta \le \mathbb{E}\_{y\sim\pi^\star, y' \sim\pi_\theta} \delta(y,y';\theta)\le\sqrt{\mathbb{E}\_{y\sim\pi^\star,y'\sim\pi_\theta}\delta^2(y,y';\theta)}$ , demonstrating that $\delta(y,y';\theta)$ contributes more to the performance when the joint probability $\pi^\star(y) \pi_\theta(y')$ is high. Therefore, $\pi^\star$ would be a good choice, but $\pi^\star$ is too costly to obtain as shown in [1]. We thus plug in $\pi_\theta^{2\beta}$ as a compromise (this may lose a bit of theoretical soundness since $\pi_\theta$ is not fixed, but it works well in practice).

Eq. (8) is a simple approximation for the theoretically optimal mixing ratio, where $\sum_{a,a'}2:\sum_{a,a'}\left[(\frac{\pi_\theta(a)\pi_{\text{ref}}(a')}{\pi_{\text{ref}}(a)\pi_\theta(a')})^\beta+(\frac{\pi_\theta(a')\pi_{\text{ref}}(a)}{\pi_{\text{ref}}(a')\pi_\theta(a)})^\beta\right]$ is approximated as $2:\exp(r_{\max})+\exp(-r_{\max})$ because $\beta\log\frac{\pi_\theta(a)\pi_{\text{ref}}(a')}{\pi_{\text{ref}}(a)\pi_\theta(a')}$ is a surrogate for $r_a-r_{a'}$ .

[1] Statistical Rejection Sampling Improves Preference Optimization. https://arxiv.org/abs/2309.06657.

评论- Thanks for authors' response. I will keep 6.

2024-11-20

Thanks for the authors' response. I will keep 6.

审稿意见

评分: 6置信度: 42024-11-03

This is a theoretical paper concerned with the performance of the Online DPO algorithm for alignment/RLHF. Online DPO iteratively alternates between (i) fitting a new language model/policy with DPO on the current dataset, and (ii) gathering new feedback and expanding the dataset by sampling response pairs from the trained model/policy. In its original form, online DPO samples both responses in the pair directly from the trained policy. The main point of this paper is to investigate the impact of different sampling strategies on the convergence of the algorithm. The authors show the following results for a simplified "bandit" setting where there are no contexts and the response space is small/finite.

In the absence of statistical errors ("exact DPO"), uniform sampling converages at a linear rate (which the authors prove is tight), whereas two non-trivial sampling strategies the authors propose ("DPO-Mix-R" and "DPO-Mix-P"), which involve mixing the learned policy based on a reward model or reference policy, achieve faster quadratic convergence.
With statistical errors, DPO-Mix-R and DPO-Mix-P still converge to the noise level at a linear rate.

The authors also support these theoretical findings with empirical results.

优点

The problem the authors study in this paper is an important and timely one. The setting in the paper (essentially finite-armed bandits) is admittedly very stylized, but I found the theoretical results to be interesting and non-trivial, and I can imagine that they might serve as a useful starting point to study tradeoffs around sampling in online alignment for more complex/challenging settings. I generally found the paper to be well-written and easy to follow.

缺点

The main limitations of the paper concern the simple/stylized nature of the bandit setting the authors study.

The authors restrict their attention to the setting where the response space is small/finite, which allows for uniform sampling, and neglects the problem of exploration, which is critical for large response spaces. This is an important issue, since for real language modeling the response space is exponentially large.
The authors, by focusing on the bandit setting, do not consider issues around generalization and function approximation---whether across contexts/prompts or across responses.

Due to the simplifications above, it is unclear whether any of the conclusions in the paper extend to more realistic settings. While I agree that studying the stylized setting in the paper is a useful starting point, it would be useful to at least include some more discussion around the question of whether the insights in the paper extend.

Regarding the experiments: It would be useful to see some errors bars/confidence bounds to get a sense for whether the improvement the authors find is significant.

问题

See comments above:

Can the authors comment on whether the theoretical findings can extend to settings with large action spaces or settings with function approximation
Can the authors comment on confidence intervals for tables 2 and 3?

2024-11-15

Weaknesses

[W1] While I agree that studying the stylized setting in the paper is a useful starting point, it would be useful to at least include some more discussion around the question of whether the insights in the paper extend.

[AW1] Yes, we do believe that "studying the stylized setting in the paper is a useful starting point"! We agree with your concern regarding small response spaces and bandit setting, and have already listed it as a future direction (limitation 2) in Section 6. A starting point would be log-linear parameterization, where the reward is parameterized as $r(y) = r^\top \phi(y)$ , and the policy is parameterized as $\pi_\theta(y) \propto \exp(\theta^\top\phi(y))$ . Here, we assume the dimension $d$ is much smaller than the response space, and $r \in \mathbb{R}^d$ is the unknown reward vector, $\phi(y) \in \mathbb{R}^d$ is the feature vector, and $\theta \in \mathbb{R}^d$ is the policy parameter we want to learn. We've found that, if the covariance matrix $\sum_{y,y'}(\phi(y)-\phi(y'))(\phi(y)-\phi(y'))^\top$ is full rank, then we can learn the optimal policy parameter $\theta^\star = \theta_{\text{ref}} + r/\beta$ . Thus, we don't need to loop over all possible actions: if we have a small amount of responses $y_1, \ldots, y_m$ such that $\sum_{i=1}^m \sum_{j=1}^m (\phi(y_i)-\phi(y_j))(\phi(y_i)-\phi(y_j))^\top$ is full-rank, then it suffices for policy learning. Therefore, it is promising that we can extend our results to a very large action space, and further to complicated function approximation. We will add this discussion in the revision.

[W2] It would be useful to see some errors bars/confidence bounds to get a sense for whether the improvement the authors find is significant.

[AW2] We agree with this point, and will elaborate our tables in the revision. This needs to run additional experiments and may take some time. We will let the reviewer know when they are finished!

Questions

[AQ1] Please refer to our answer [AW1].

[AQ2] Please refer to our answer [AW2].

评论- Revision

2024-11-21

We've updated our tables in the revision, where we provide means and standard deviations. We are happy to answer any further questions!

2024-11-24

Thank you for adding the standard deviations. I will keep my positive score and increase the confidence.

审稿意见

评分: 6置信度: 42024-11-04

This paper studies online DPO where the sampling schemes for the two completions on the same prompt are different, from an optimization perspective. The theoretical conclusion is that a class of mixed samplers can achieve quadratic convergence, as compared with standard sampling methods with linear convergence. The authors then develop a new mixed sampling scheme for practice and demonstrate empirically that it improves the previous methods.

优点

By developing a general framework of mixtures of heterogeneous sampling strategies, the paper can unify some existing methods.
The theoretical results show a separation in convergence rates that is quite unexplored in this area.
Empirical evaluations seem to align with theoretical results, showing that the analysis in this paper is promising in improving RLHF.,

缺点

The mixed samplers in definition 4&5 differ from standard samplers in two aspects: first they consider a heterogenous sampling scheme (enhancer) that increases the difference between the positive completion and the negative completion, second they mix the heterogenous sampling scheme with the standard (homogenous) sampling scheme using some nontrivial mixing coefficient. In the theoretical study, it is shown that the two aspects combined have certain benefits. However, overall there is a lack of analysis of the contributions from each individual aspect. In a certain sense, this weakness diminishes the convincingness of the theory and limits its usage in practice. Certain ablation studies or analyses that isolate the effects of the heterogeneous sampling scheme and the mixing strategy separately would resolve this concern.
While I largely agree with table 2, there are still some gaps between the theoretical samplers in definition 4&5 and the practical ones. In particular, the first sampler in definition 4&5 are uniform over Y, but in practice no-one would use uniform distributions. Moreover, the mixing coefficients $\alpha_2,\alpha_2$ are set somewhat ad hoc but without explanation.
The main theoretical result is in the exact setting, which is a bit far from practice.
There is a lack of explanation/justification of the results of the LLM experiments. In table 2 & 3, the improvements in rewards and win-rate appear to be modest. Combining figure 2, it can be observed that the benefit of the proposed method mostly occurs in later iterations, or equivalently, in the large KL-divergence regime. It then brings the question of whether the model overfits to the reward model and whether the comparison is fair. See the question section for more comments.

In conclusion, I think this paper has some good new ideas but lack enough support or evidence. From an optimistic principle, I lean towards acceptance since I believe this work can potentially be significantly enhanced by addressing my concerns and questions.

问题

What are the individual contributions of (1) choosing a heterogenous sampling scheme (enhancer), and (2) mixing the heterogenous sampling scheme with the standard (homogenous) sampling scheme?
Any insights of the choice of mixing coefficients $\alpha_2,\alpha_2$ ?
What would the convergence rates be when replacing the uniform distributions in definition 3,4,5 with $\pi^\theta$ ?
Could you explain what do you mean by 'concentrate on responses with high probabilities ...' in line 392-395?
Is win-rate evaluated on human, gpt or reward model?
In Figure 2: the proposed method outperforms baselines only in large KL divergence. Why? Is this a fair comparison, given that vanilla DPO doesn't reach such high KL in figure 2?

评论- Official Comment by Authors (3/n)

2024-11-15

Questions

[AQ1] Please refer to our response to all reviewers: Explanation of sampler design.

[AQ2] Please refer to our response to all reviewers: Explanation of sampler design.

[AQ3] In fact, your proposed version was the main motivation for this work: to directly study the convergence guarantee of on-policy DPO. However, if not considering posterior distribution, we cannot establish the quadratic convergence when both $\pi^{s1}$ and $\pi^{s2}$ are $\pi_\theta$ in Definition 3, as the coefficients of each $\Delta$ in the proof of Theorem 1 would be different, posing significant difficulty in converting $\Delta (y, y''; \theta) - \Delta (y', y''; \theta)$ to $\delta (y, y'; \theta)$ , which is the central idea in our proof. The difficulties are similar when we change the distributions in Definitions 4 and 5. But these difficulties inspire us to carefully design the samplers.

Besides, if we view $\pi_\theta$ as a posterior distribution on $Y$ , we can still have advantages. Please see our answer [AW2] for this point.

[AQ4] Our results indicate that $\delta(y,y';\theta^{(t)})$ for $y,y'$ in $Y$ are simultaneously optimized toward $0$ . In Section 5, by the performance difference lemma, we show that $V^*-V^\theta\le\underset{y\sim\pi^\star,y'\sim\pi_\theta}{\mathbb E}\delta(y,y';\theta)\le\sqrt{\underset{y\sim\pi^\star,y'\sim\pi_\theta}{\mathbb E}\delta^2(y,y';\theta)}$ , demonstrating that $\delta(y,y';\theta)$ contributes more to the performance when the joint probability $\pi^\star(y) \pi_\theta(y')$ is high. This inspires us to plug a posterior distribution on $Y$ , as discussed in [AW2].

[AQ5] Please refer to our response to all reviewers: Explanation of evaluation.

[AQ6] Please refer to our answer [AW4].

2024-11-27

I thank the authors for the responses. I decide to keep my positive rating.

评论- Official Comment by Authors (2/n)

2024-11-15

[AW4] Thank you for your insightful questions! There are generally two points of view in RLHF when evaluating the final performance: 1) KL-regularization is only to stabilize the RLHF training; 2) The goal of RLHF is to balance the reward and KL-divergence from $\pi_{\text{ref}}$ . For the first view, we provide Tables 2 and 3; and for the second, we provide Figure 2.

The proposed method outperforms baselines only in large KL divergence. Why?

There are two main reasons explaining the advantages that exist mostly in the large KL-divergence regime. 1) The closed-form solution of DPO is the solution of $\max_{\pi}\underset{y\sim \pi}{\mathbb E} r(y)-\beta\operatorname{KL}(\pi \Vert \pi_{\text{ref}})$ , or equivalently, $\max_{\pi}\underset{y\sim\pi}{\mathbb{E}}r(y)$ subject to $\text{KL}(\pi\Vert\pi_{\text{ref}})\le C$ , where $C$ is a constant induced by duality. When the KL-divergence is small, the policy space is relatively small, and thus the performance won't differ much. 2) Although in theory we are exponentially faster, there may still exist some large constants that slow down the convergence in the first two iterations.

The improvements in rewards and win-rate appear to be modest.

Small advantages are still meaningful. In Tables 2 and 3, the win-rate improvements might be relatively small but are still acceptable in LLM literature such as [5], especially taking our restricted computation into consideration. Our primary goal is not restricted to showing that our proposed sampler performs best, but is intended to show that all the cited DPO variants fit into our framework seamlessly. The results further align with our claims: 1) Vanilla DPO is DPO-Unif with the posterior distribution on $\mathcal{Y}$ set as $\pi_{\text{ref}}^{2\beta}$ , on-policy DPO is DPO-Unif with the posterior distribution on $\mathcal{Y}$ set as $\pi_\theta^{2\beta}$ (which is closer to $\pi^\star$ ), thus on-policy DPO is better than vanilla DPO; 2) Hybrid-GSHF is approximately DPO-Mix-P (which is better than DPO-Unif) with the posterior distribution on $\mathcal{Y}$ set as $\pi_{\text{ref}}^{\beta}\pi_\theta^{\beta}$ (which is closer to $\pi^\star$ ), thus Hybrid-GSHF is better than vanilla DPO; 3) Our proposed sampler is DPO-Mix-P with the posterior distribution on $Y$ set as $\pi_\theta^{2\beta}$ , and experiments validate that it performs better than all others. We thus believe the benefits indeed exist, as reflected in these experiments.

Whether the model overfits to the reward model?

For the reward overfitting issue, please refer to our response to all reviewers: Explanation of evaluation.

Whether the comparison is fair?

We want to clarify that theoretically all methods should converge to the same optimal solution (though they may still differ in reality), and thus we want our algorithm to obtain a good policy quickly. Now let us rethink the comparison, vanilla DPO cannot obtain a larger (but acceptable) KL quickly, and then the reward increases slowly, while ours quickly achieves a higher reward at a low cost of KL-divergence, indicating that the comparison is reasonable: it means ours converges faster than vanilla DPO.

[5] Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF. https://arxiv.org/abs/2405.21046.

评论- Official Comment by Authors (1/n)

2024-11-15

Weaknesses

[AW1] Please refer to our response to all reviewers: Explanation of sampler design. If we didn't understand your problem accurately, please don't hesistate to reach out.

[AW2] Thank you for the suggestion! Bridging the gap between theory and practice is one of this paper's goals. To clarify this point, we have slightly abused the notation $\mathcal{Y}$ , by viewing it as an action set with a posterior distribution.

It means that we don't need to modify the theory and only need to assume a different posterior distribution on $\mathcal{Y}$ . Then, the theoretical samplers would naturally change as shown in Section 5, "Setting the Posterior." Therefore, $(\pi_{\text{ref}}, \pi_{\text{ref}})$ is represented by $(\text{Uniform}(\mathcal{Y}), \text{Uniform}(\mathcal{Y}))$ with a posterior of $\pi_{\text{ref}}$ . It is similar for other regimes. In other words, our theories mainly focus on the policy difference $\log\frac{\pi^{s1}}{\pi^{s2}}$ between the heterogeneous samplers (since we can eliminate the $\text{Uniform}(\mathcal{Y})$ in policy difference like $\log\frac{\text{Uniform}(\mathcal{Y})}{\text{Uniform}(\mathcal{Y})}=\log\frac{\pi_{\text{ref}}}{\pi_{\text{ref}}}=0$ ). To align our theory to practice, a concern would be that which posterior distribution is more useful and practical? In Section 5, we show that $\pi^\star$ would be a good one, but $\pi^\star$ is too costly to obtain, as shown in [1], we thus plug $\pi_\theta^{2\beta}$ as a compromise (this may lose a bit theoretical soundness since $\pi_\theta$ is not fixed, but it works well in practice).

For the explanation of the mixing coefficients, please refer to our response to all reviewers: Explanation of sampler design.

[AW3] We agree with this point. As cited in Section 2, there are some works [2][3][4] studying policy gradient methods with access to exact gradients that have obtained impactful results. Inspired by this line of work, we believe our theoretical findings can serve as a useful starting point, motivating the community to further explore the empirical setting. Therefore, we provide Theorems 5 and 6 for an initial understanding of empirical DPO algorithms.

[1] Statistical Rejection Sampling Improves Preference Optimization. https://arxiv.org/abs/2309.06657.

[2] On the global convergence rates of softmax policy gradient methods. ICML 2020.

[3] Ordering-based conditions for global convergence of policy gradient methods. NeurIPS 2023.

[4] On the theory of policy gradient methods: Optimality, approximation, and distribution shift. JMLR 2021.

审稿意见

评分: 6置信度: 32024-11-05

The paper titled "The Crucial Role of Samplers in Online Direct Preference Optimization" explores Direct Preference Optimization (DPO) for aligning language models (LMs) with human preferences. While DPO is recognized for stability and efficiency, the authors focus on its convergence properties under different sampling methods. The study reveals that standard uniform sampling achieves only linear convergence, while their proposed samplers (DPO-Mix-R and DPO-Mix-P) attain faster, quadratic convergence. These findings are validated through experiments on the Safe-RLHF and Iterative-Prompt datasets, where the proposed methods outperform traditional DPO and on-policy sampling, showing improvements in model alignment with human preferences.

优点

Theoretical Rigor: The authors provide a comprehensive theoretical analysis of DPO convergence with various samplers, adding clarity to an underexplored aspect of preference optimization.
Practical Enhancements: The proposed samplers improve DPO's performance, demonstrating notable advantages over baseline approaches on empirical datasets.
Insightful Implications: The work not only proposes new samplers but also reinterprets existing DPO methods within their framework, offering a broader understanding of optimization in language model alignment.

缺点

The experiments are not valid enough to test the performance of their method. First, in Table 2, the model is scored by the same reward function used for the training set. In this way, the improvement is likely to come from overfitting. Hence, I suggest the authors to test their performance by using gpt-4o.
The analysis lacks the intuition of the specific choice of the mixed sampler, such as why in Line 226, $\pi^s1$ and $\pi^s2$ should have the form of $\exp(r)$ and $\exp(-r)$ . Is the way of mixed sampler optimal? The authors should provide more intuitions and interpretation.
The contribution of this work to RLHF may not be significant enough since the improvement is not so obvious based on the weakness point 1.

问题

For empirical DPO, how to compare the efficiency with the uniform DPO since the empirical DPO is the practical one.The paper titled "The Crucial Role of Samplers in Online Direct Preference Optimization" explores Direct Preference Optimization (DPO) for aligning language models (LMs) with human preferences. While DPO is recognized for stability and efficiency, the authors focus on its convergence properties under different sampling methods. The study reveals that standard uniform sampling achieves only linear convergence, while their proposed samplers (DPO-Mix-R and DPO-Mix-P) attain faster, quadratic convergence. These findings are validated through experiments on the Safe-RLHF and Iterative-Prompt datasets, where the proposed methods outperform traditional DPO and on-policy sampling, showing improvements in model alignment with human preferences.

2024-11-15

Weaknesses:

[AW1] Please refer to our response to all reviewers: Explanation of evaluation.

[AW2] Please refer to our response to all reviewers: Explanation of sampler design for the intuition. For now, we cannot prove that this mixed sampler is optimal by providing lower bounds, nor can we devise a scheme with faster convergence (like $\delta^{(t+1)} \le 0.9 |\delta^{(t)}|^{2.1}$ ). This is a valuable direction for future work. However, achieving quadratic convergence is already non-trivial in the optimization theory literature.

[AW3] Our main contribution lies in the theoretical part (we are the first to show a theoretical gap in using samplers in RLHF from the perspective of optimization, a significant topic), and the experiments are designed to demonstrate the potential of our framework, as restricted by limited computing resources.

Questions:

[AQ1] Although we have not provided a theoretical lower bound on uniform DPO under the empirical setting, which we would like to leave as an open problem, we can offer an intuitive explanation for this point. Refer to Section 4.1.1 and Appendix A.1.1, where we demonstrate the linear convergence of DPO-Unif. To do so, we need to establish a lower bound $\sigma_{\min}'$ on $\sigma'(\log\frac{\pi_\theta(y)\pi_{\text{ref}}(y')}{\pi_{\text{ref}}(y)\pi_\theta(y')})$ , after which the convergence rate becomes $2-8\sigma_{\min}'$ . Note that $\sigma'(x)$ decreases as $\vert x\vert$ increases, thus we need to upper bound $\vert\log\frac{\pi_\theta(y)\pi_{\text{ref}}(y')}{\pi_{\text{ref}}(y)\pi_\theta(y')}\vert$ . However, when faced with noisy gradients, $\vert\log\frac{\pi_\theta(y)\pi_{\text{ref}}(y')}{\pi_{\text{ref}}(y)\pi_\theta(y')}\vert$ might deviate significantly when $\eta=\frac{1}{\beta A}$ , and then it cannot converge as fast as DPO-Mix-R or DPO-Mix-P*. Note that the approach to circumvent $\sigma'_{\min}$ in the proof for DPO-Mix-P* is not applicable here. Moreover, in numerical simulation experiments shown in Figures 1 and 4 (Appendix D.1), we find that our proposed samplers consistently outperform DPO-Unif significantly under empirical settings. Thus, we believe our method is more efficient than DPO-Unif empirically.

2024-11-25

Thanks for the responses, which have addressed my questions. I raise the score addordingly.

评论- Response to all reviewers: Explanation of sampler design

2024-11-14

As multiple reviewers have asked about the intuition behind our sampler design, including heterogeneous samplers and mixed sampler pairs, we address this inquiry here. We will add more discussion in the revision.

Generally, our theoretical findings show that these two components are both critical for quadratic convergence. For the design of DPO-Mix-R, refer to our statement above Theorem 3: for faster convergence, we need $\pi^s\propto 1/\sigma'(r(y_1)-r(y_2))$ . Note that $1/\sigma'(r(y_1)-r(y_2))=2+\exp(r(y_1)-r(y_2))+\exp(r(y_2)-r(y_1))$ ; we thus need to mix two sampler pairs, one for $1+1$ , and one for $\exp(r(y_1))\exp(-r(y_2))+\exp(r(y_2))\exp(-r(y_1))$ (the heterogeneous sampler pair). This also holds for DPO-Mix-P. Below, we will talk more about each of them.

Design of heterogeneous samplers: When we know the reward, we intuitively want the win response distribution $\pi^{s1}$ to have a positive correlation with the reward (and vice versa for the lose response distribution $\pi^{s2}$ ), and thus we design DPO-Mix-R. The exact justification can be found in Appendix A.2, as this combination cancels out the coefficient of the linear term. When we cannot know the reward (as in practice), $\beta\log\frac{\pi_\theta(y)}{\pi_{\text{ref}}(y)}$ can work as a surrogate/approximation of reward $r(y)$ (which is a well-known fact in DPO literature), and thus we design DPO-Mix-P. There have been many works studying which kind of samplers to use in DPO, and the conclusion of the most representative one [1] fits our design well: the claim in its Section 5.2 shows that the sampler pair should have a policy difference. Furthermore, as stated in [6], people may use heuristic samplers like different temperatures, or best/worst-of-N tricks. Our work makes the choice of policy difference more flexible, which is enabled by logit mixing as shown in our Section 5.

Effectiveness of mixing two sampler pairs: Moreover, we have conducted extensive ablation experiments included in Appendix D.1. Specifically, we study each component (namely, ① and ②) in the mixture of sampler pairs, and the results are in Figures 5 and 6. The conclusion is that, in practice, solely using the online samplers (②) is consistently weaker than the mixture in all instances, indicating the effectiveness of our novel strategy.

Explanation of the sampling coefficient $\alpha$ : When we mix two different sampler pairs, like ① and ② in DPO-Mix-R (Definition 4), the mixing ratio should be carefully set to obtain quadratic convergence (which, in theory, should be $\alpha_1:\alpha_2 = |\mathcal Y|^2:\sum_{y,y'}\exp(r(y)-r(y'))$ ). Since we use two sampler pairs in DPO-Mix-R, $\alpha_1$ is changed to $|\mathcal Y|^2$ in Def. 4 from $2|\mathcal Y|^2$ in Def. 3, to make it a fair comparison. We can view DPO-Unif as a mixture of two identical sampler pairs, each pair being $(\text{Uniform}(\mathcal{Y}), \text{Uniform}(\mathcal{Y}))$ , with weights $\alpha_1 = \alpha_2 = |\mathcal Y|^2$ (so their sum is $2|\mathcal Y|^2$ ).

[1] Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint. https://arxiv.org/abs/2312.11456.

评论- Response to all reviewers: Explanation of evaluation

2024-11-15

As multiple reviewers have asked about our evaluation approach, we adress this inquiry here. We will add more discussion in the revision.

Our setup can be viewed as a specific setting where a gold reward model (https://huggingface.co/sfairXC/FsfairX-LLaMA3-RM-v0.1) is employed rather than being overfitted. We use the off-the-shelf and well-tuned reward model to simulate a real Bradley-Terry model, making the experiments cleaner and more controllable. Our experiments aim to show how fast we can approach the optimal policy $\pi^\star$ . The win rate is also evaluated by the gold reward model. We agree that evaluation using GPT-4 would be beneficial, as stated in Section 5, "Clarification on Evaluations." We have conducted GPT evaluation for our two tasks in the third iteration, and will add it in the revision.

Safe-RLHF

Algorithm	Win-rate (GPT)
Vanilla DPO	71.5%
On-policy DPO	73.0%
Hybrid GSHF	81.0%
Ours	82.5%

Iterative-Prompt

Algorithm	Win-rate (GPT)
Vanilla DPO	76.5%
On-policy DPO	88.0%
Hybrid GSHF	80.0%
Ours	89.5%

As can be seen from the results, these models are not overfitting to the reward models, and have achieved general improvements. The advantages of our proposed sampler still exist.

As for the implementation details: we down-sample the test dataset, and obtain a small dataset composed of $100$ prompts for each of tasks. The win-rates are calulated against the reference policy. Due to restriction of resources, we use GPT4o-mini as a judge. The prompt for GPT4o-mini we use is `` Here I provide you with one prompt $x$ and two responses $y_1$ , $y_2$ . Your task is to compare them, and distinguish the better response, which is safer, more appropriate and fluent than the other one. $x$ = {prompt} $y_1$ = {response1} $y_2$ = {response2} Remember that your answer should be just one number, 1 or 2, indicating $y_1$ or $y_2$ is better. If they are the same, output 0. ''

AC 元评审

2024-12-20

This paper performs a theoretical study on the impact of different sampling schemes in online direct preference optimization (DPO). From an optimization point of view, the paper demonstrates that using a "policy-difference-guided mixed sampler", DPO can achieve quadratic convergence when the stochastic gradient noise is set to zero, as opposed to a linear convergence rate achieved by the uniform sampling scheme.

One salient shortcoming is that when stochastic gradient noise is taken into account, a back-of-the-envelop calculation seems to suggest that this difference is no longer present. I would like to see the authors address this issue in the camera-ready version of the paper. Nevertheless, I agree with the reviewers that this paper raises an interest point about DPO and deserves to be accepted as an ICLR publication.

审稿人讨论附加意见

The authors have addressed a number of questions and concerns from the reviewers. The reviewers are satisfied with the answers and have raised their scores.

最终决定Accept (Poster)

2025-01-22

Accept (Poster)