Ordinal Preference Optimization: Aligning Human Preferences via NDCG
We propose Ordinal Preference Optimization (OPO), a novel listwise approach to aligning human preferences via the NDCG metric.
Abstract
Reviews and Discussion
This paper presents OPO, a novel listwise approach that leverages Normalized Discounted Cumulative Gain (NDCG), a widely used ranking metric in Learning To Rank, to align LLMs with human preferences. By incorporating a differentiable surrogate loss function, OPO effectively utilizes the ranking information derived from multiple model responses.
Strengths
- The paper is generally well-organized and clearly written.
- The paper conducts comprehensive experiments with detailed examples, and the implementation code is publicly available.
Weaknesses
- Contributions of this paper are quite limited and appear to be incremental. The key technical components, including the NeuralNDCG objective [1], Sinkhorn scaling [3], and NeuralSort [2], have been previously established in the literature. Sections 3.1 and 3.2 present a reformulation of the established NeuralNDCG formulation (Sections 4.1 and 4.2 of [1]).
- Missing discussion on how this work is novel compared to other listwise methods, such as LiPO [4]. LiPO uses the DCG weighting scheme, which inherently captures the relative proximity within multiple responses. Additionally, the discussion of current listwise methods in line 53 requires further clarification to better distinguish their differences and respective limitations.
- While the paper claims to establish an explicit connection between LLM alignment and ranking model training, this theoretical connection is inadequately elaborated in the paper. Moreover, similar insights have been previously explored in LiPO's work.
- In traditional learning-to-rank tasks, NDCG serves primarily as an evaluation metric, with standard LTR loss functions only loosely correlated with these evaluation criteria. The use of a differentiable NDCG surrogate loss helps bridge this optimization-evaluation gap in ranking tasks. However, LLMs are commonly evaluated on open benchmarks rather than ranking metrics, and the paper lacks theoretical justification in two aspects: why NDCG optimization would lead to better preference alignment, and what motivates the choice of NDCG as the optimization objective.
- The primary motivation is that NDCG can better utilize relative proximity within ordinal multiple responses. However, in lines 207-208, the paper does not explain why "the gain function should be proportional to the label" can effectively capture the relative ranking of different responses. Moreover, lines 209-211 indicate that recommendation tasks focus on top-ranked items. This design stems from NDCG's discount function, which is derived from an approximation of human attention decay. However, the paper lacks discussion on whether such a discount function is appropriate for preference optimization tasks.
- The paper could benefit from comparing different learning-to-rank objectives, such as NeuralSort (which NeuralNDCG builds upon) [2], PiRank [5], and other ranking methods.
[1] NeuralNDCG: Direct Optimisation of a Ranking Metric via Differentiable Relaxation of Sorting.
[2] Stochastic Optimization of Sorting Networks via Continuous Relaxations
[3] A relationship between arbitrary positive matrices and doubly stochastic matrices.
[4] LiPO: Listwise Preference Optimization through Learning-to-Rank.
[5] PiRank: Scalable Learning To Rank via Differentiable Sorting.
Questions
- In the MT-Bench experiments, why do the reported results differ from those presented in other papers?
- In the ablation studies of NeuralNDCG, while different gain functions are compared, the paper does not explore the impact of different discount functions.
- Table 5 suggests that diverse negative samples are more crucial than varied positive samples. However, the performance gain of All Pairs over Single Pair might simply result from increased training data.
Details of Ethics Concerns
No
Q1
We provide the training hyperparameter settings of Mistral-7B-base in Table 7. Our dataset is a combination of UltraFeedback and SimPO, which is not identical to the data used in papers like DPO and LiPO, so it is normal to obtain different results on the MT-Bench evaluation benchmark. The performance order of the baselines generally aligns with previous alignment papers.
Q2
The ablation results on discount functions are provided below:
| Discount | | | | logarithmic |
|---|---|---|---|---|
| pair win rate | 62.78 | 62.47 | 63.16 | 64.25 |
| score win rate | 59.76 | 60.19 | 60.19 | 61.36 |
It shows that the logarithmic discount achieves the best performance. A potential reason is that the log function provides a smoother discount.
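For context, here is a minimal illustrative sketch (not the paper's training code) of how a discount function enters (N)DCG, using a linear gain for simplicity; the two candidate discounts shown are assumptions for illustration only:

```python
import math

def dcg(labels_in_ranked_order, discount):
    """DCG with a linear gain: sum of label / discount(position)."""
    return sum(label / discount(i + 1) for i, label in enumerate(labels_in_ranked_order))

def ndcg(labels_in_ranked_order, discount):
    ideal = sorted(labels_in_ranked_order, reverse=True)
    return dcg(labels_in_ranked_order, discount) / dcg(ideal, discount)

# Hypothetical discount choices for illustration only.
log_discount = lambda pos: math.log2(pos + 1)  # standard NDCG discount
linear_discount = lambda pos: float(pos)       # steeper, down-weights lower ranks faster

ranking = [3, 1, 4, 2]  # reward-model labels of responses in the model's current order
print(ndcg(ranking, log_discount), ndcg(ranking, linear_discount))
```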
Q3
There is no significant difference between BPR and All Pairs. If increased training data can enhance model performance, All Pairs should consistently achieve superior performance over the BPR method. We also add a variant that compares multiple positive responses against the worst response:
Qwen2-0.5B
| Approach | Single Pair | Multi Positive vs. Worst | Best vs. Multi Positive | All Pairs |
|---|---|---|---|---|
| pair win rate | 60.75 | 57.93 | 60.32 | 63.82 |
| score win rate | 56.86 | 55.16 | 58.33 | 60.54 |
Mistral-7B
| Approach | Single Pair | Multi Positive vs. Worst | Best vs. Multi Positive | All Pairs |
|---|---|---|---|---|
| pair win rate | 71.90 | 82.95 | 84.43 | 85.34 |
| score win rate | 70.66 | 80.84 | 82.37 | 83.31 |
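As a reading aid, the pairing strategies above can be viewed as different ways of selecting (preferred, non-preferred) index pairs from a list of responses sorted by reward score; below is a minimal sketch under that reading (the paper's exact definitions, e.g. of BPR, may differ):

```python
def build_pairs(num_responses, strategy):
    """Return (winner_idx, loser_idx) pairs; index 0 is the highest-reward response."""
    best, worst = 0, num_responses - 1
    if strategy == "single_pair":               # best vs. worst only
        return [(best, worst)]
    if strategy == "multi_positive_vs_worst":   # every better response vs. the worst
        return [(i, worst) for i in range(worst)]
    if strategy == "best_vs_multi_positive":    # the best vs. every other response
        return [(best, j) for j in range(1, num_responses)]
    if strategy == "all_pairs":                 # every (higher-ranked, lower-ranked) pair
        return [(i, j) for i in range(num_responses) for j in range(i + 1, num_responses)]
    raise ValueError(strategy)

for s in ["single_pair", "multi_positive_vs_worst", "best_vs_multi_positive", "all_pairs"]:
    print(s, build_pairs(4, s))
```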
I appreciate the authors' efforts in addressing some of the feedback. The additions, including more detailed experiments, are noted and helpful. However, the revised manuscript still does not sufficiently present the material, particularly in the context of prior work, such as LiPO.
While OPO represents an incremental improvement by applying neural sort to optimize the NDCG metric, it does not constitute a pioneering contribution as suggested by the authors. The method is primarily an empirical improvement, serving as a better proxy for NDCG in a specific context, but it provides limited new insights or learning. Furthermore, the paper lacks theoretical guarantees to support the claims regarding OPO's superiority as a proxy for NDCG, especially compared to LiPO.
Thanks for your constructive and valuable feedback which improves our work. Please see our detailed response below:
w1&w2&w3&w4
Inspired by the Listwise Preference Optimization (LiPO) framework, to the best of our knowledge we are the first to propose directly aligning human preferences with the NDCG metric. To further analyze why OPO outperforms other methods, we counted the correct ranking pairs and successful ranking flips (i.e., incorrect to correct) of each method during training. OPO exhibited significantly higher efficiency in flipping incorrect pairs compared to DPO and LiPO. We can also theoretically prove that the optimal ranking accuracy of NDCG-based methods is bounded by the ranking accuracy of the reference model. However, to achieve such an optimal policy, OPO does not necessarily maximize the reward margin as DPO and LiPO do, which leads to higher efficiency in increasing the ranking accuracy metric. The detailed results are provided in Appendix A.9.
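For readers interested in reproducing this analysis, here is a minimal sketch of the bookkeeping described above (hypothetical helpers, not the Appendix A.9 code), where `rewards` stands for the model's implicit rewards on one prompt's responses and `labels` for the reward-model annotations:

```python
def correct_pairs(rewards, labels):
    """Index pairs (i, j) whose model reward ordering agrees with the label ordering."""
    n = len(rewards)
    return {(i, j) for i in range(n) for j in range(n)
            if labels[i] > labels[j] and rewards[i] > rewards[j]}

def count_flips(rewards_before, rewards_after, labels):
    """Pairs ranked incorrectly before the update and correctly after it."""
    before = correct_pairs(rewards_before, labels)
    after = correct_pairs(rewards_after, labels)
    return len(after - before)

labels = [4, 3, 2, 1]                      # reward-model annotations for one prompt
print(count_flips([0.1, 0.5, 0.2, 0.0],    # implicit rewards before a training step
                  [0.6, 0.5, 0.2, 0.0],    # implicit rewards after the step
                  labels))                 # -> 2 pairs flipped to correct
```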
w5
"The gain function should be proportional to the label" means that if $\psi_i > \psi_j$, then $G(\psi_i) > G(\psi_j)$; that is, the gain function should correctly reflect the order of ground-truth labels. The ablation study on the gain function also reveals that there is no significant performance gap between the gain functions compared there. We also supplement an ablation study on discount functions, which is provided below.
w6
Thanks for suggesting additional baselines. NeuralSort and PiRank lack the column normalization used in OPO, as already shown in our ablation studies. We now summarize the model performance of NeuralSort, PiRank, and OPO:
| Approach | NeuralSort | PiRank | OPO |
|---|---|---|---|
| pair win rate | 57.36 | 56.30 | 64.25 |
| score win rate | 56.20 | 54.98 | 61.36 |
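To illustrate the column-normalization difference referred to above, here is a small sketch of Sinkhorn scaling applied to a row-stochastic relaxed permutation matrix (toy numbers; the paper's exact normalization may differ):

```python
import numpy as np

def sinkhorn(P, num_iters=20, eps=1e-9):
    """Alternately normalize rows and columns so P approaches a doubly stochastic matrix."""
    P = np.asarray(P, dtype=float)
    for _ in range(num_iters):
        P = P / (P.sum(axis=1, keepdims=True) + eps)  # make rows sum to 1
        P = P / (P.sum(axis=0, keepdims=True) + eps)  # make columns sum to 1
    return P

# A row-stochastic relaxation (rows sum to 1) whose columns do not sum to 1,
# so one response would be over-counted in the loss without normalization.
P_relaxed = np.array([[0.7, 0.2, 0.1],
                      [0.6, 0.3, 0.1],
                      [0.2, 0.3, 0.5]])
print(P_relaxed.sum(axis=0))            # columns: [1.5, 0.8, 0.7]
print(sinkhorn(P_relaxed).sum(axis=0))  # columns ~[1, 1, 1] after scaling
```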
We appreciate your thorough response, which has clarified some of our initial concerns. However, we share the perspective of other reviewers regarding the incremental nature of OPO's technical contributions. Additionally, it is worth noting that [1] demonstrates that LambdaRank loss (used in LiPO) inherently optimizes NDCG.
[1] The LambdaLoss Framework for Ranking Metric Optimization.
The paper proposes to use a listwise objective for LM alignment when multiple responses for a prompt are available. The objective is an existing one from the IR field. The proposed method is evaluated on datasets such as AlpacaEval and MT-Bench against baselines that focus on alternative objectives with the same backbones.
Strengths
- It is good to further study the listwise perspective of the important LLM alignment problem. However, there seem to be some factual misunderstandings of the literature, discussed below.
- The study of diversity of negative samples is interesting and has value.
- The paper essentially performs a study of additional listwise objectives for LM alignment. I think this is good to do, though I am not convinced of the significance and the trustworthiness of the empirical results; see below.
Weaknesses
- The major issue of this paper is its contribution. There are wrong claims, such as claiming LiPO to be a pairwise approach, while it is actually listwise - for example, LambdaRank is considered a listwise objective in the information retrieval literature and it is a proxy of NDCG, which the authors avoid discussing. This is because the weights use listwise information, so they are no longer pairwise. It is even more confusing that LiPO has pretty much the exact same idea (while having been public for almost a year) of optimizing NDCG. The paper uses another existing surrogate loss to do that. In the IR literature, this surrogate loss is not as acknowledged as the one studied in the LiPO work. The first two major contributions (out of 3) could even be put in the LiPO paper without any inconsistency.
- Despite the large overlap with existing work, the paper does not have a convincing argument on why the deployed objective is a better choice, so the experiments are at risk of cherry-picking. And given the extensive work on such topics, the contribution is too limited to meet the standard of a top-tier ML venue. Note that even if there is an argument specific to the objective, the contribution is still likely limited, since the formulation and general objective (e.g., optimizing NDCG) are more important and the loss alternatives are not that different (as studied intensively in the IR literature).
- Besides the formulation and method, the additional value provided by the paper is also relatively limited. For example, it is still limited to offline alignment (which might be acceptable in itself; the issue is that it does not add value beyond existing work, as discussed above).
Questions
See above.
w3
It is popular in industry to employ offline algorithms over several iteration rounds [1], and it is straightforward to apply OPO in an iterative paradigm. Traditional iterative pairwise methods sample best-of-n responses and assign them reward scores with a separate reward model, then choose the preferred and non-preferred responses based on the assigned rewards and apply alignment algorithms like DPO to these response pairs. OPO shares the same sampling-rewarding stage; the only difference is that we input a list of K responses with their assigned reward scores and apply listwise algorithms to them.
Regarding other added value, we conducted a comparison of flipping incorrect pairs during training across DPO, LiPO, and OPO, which is provided in Appendix A.9.
[1] Dubey, Abhimanyu, et al. "The llama 3 herd of models." arXiv preprint arXiv:2407.21783 (2024).
w2
The paper does not have a convincing argument on why the deployed objective is a better choice, so the experiments are at the risk of cherry-picking
To comprehensively highlight the differences in performance among various methods, we counted their correct ranking pairs and successful ranking flips (i.e., incorrect to correct) during training. OPO exhibited significantly higher efficiency in flipping incorrect pairs compared to DPO and LiPO. The detailed results are provided in Appendix A.9.
Note that even if there's an argument specific to the objective, the contribution is still likely limited since the formulation and general objective (e.g., optimizing NDCG) are more important and the loss alternatives are not that different (as studied intensively in the IR literature).
Experiments have shown that there is a performance gap among different learning to rank objectives. We added several LTR baselines to demonstrate the reliability of the experimental results. The results indicate that OPO consistently outperforms these methods.
Qwen2-0.5B
| Approach | BCE | RankNet | ListNet | PiRank |
|---|---|---|---|---|
| pair win rate | 51.63 | 62.27 | 52.03 | 56.30 |
| score win rate | 50.46 | 59.04 | 51.52 | 54.98 |
Mistral-7B
| Approach | BCE | RankNet | ListNet | PiRank |
|---|---|---|---|---|
| pair win rate | 53.76 | 85.62 | 54.80 | 65.23 |
| score win rate | 54.17 | 84.22 | 56.21 | 64.74 |
Thanks for your constructive and valuable feedback which improves our work. Please see our detailed response below:
w1
This is because the weights use listwise information so they are no longer pairwise.
LambdaRank uses the DCG difference to re-weight the pair error relative to the original DPO formula, so it can be interpreted as a pairwise logistic loss with lambda-NDCG weights. This is the major reason why we classified it as a pairwise method, but we will revise the claim to describe it as a listwise approach, in line with the LTR literature.
It is even more confusing that LiPO pretty much has the exact same idea (while being public almost a year ago) to optimize NDCG.
Thanks for pointing out this potential confusion. The LiPO framework proposed the concept of Listwise Preference Optimization and introduced LambdaRank [1]. Although LambdaRank incorporates DCG information, its theoretical relationship with NDCG has not been thoroughly analyzed [2]. In fact, based on their gradient formulations, LambdaRank is more closely related to DPO (pairwise logistic loss): when the $|\Delta \mathrm{NDCG}|$ of a pair is large, the gradient of LambdaRank is strengthened, and vice versa. However, it still shares the limitation of pairwise logistic losses, which maximize the reward margin instead of ranking accuracy or win rates [3]. We show that OPO corrects incorrectly ranked pairs (i.e., pairs whose implicit-reward ordering disagrees with the label ordering) faster than DPO and LiPO in Appendix A.9.
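To make the gradient-weighting distinction concrete, here is a schematic sketch of our reading (not LiPO's released code): both objectives reduce to pairwise logistic terms, and the lambda weight rescales each term by the $|\Delta \mathrm{NDCG}|$ obtained from swapping that pair.

```python
import torch
import torch.nn.functional as F

def pairwise_logistic_loss(margin):
    """DPO-style term for one pair: -log sigmoid(beta * (r_w - r_l))."""
    return -F.logsigmoid(margin)

def lambda_weighted_loss(margin, delta_ndcg):
    """LambdaRank-style term: the same logistic term, re-weighted by |dNDCG| of the pair."""
    return delta_ndcg.abs() * (-F.logsigmoid(margin))

margin = torch.tensor([0.3, 0.3])                 # identical reward margins for two pairs
delta_ndcg = torch.tensor([0.05, 0.60])           # swapping the second pair changes NDCG more
print(pairwise_logistic_loss(margin))             # equal losses
print(lambda_weighted_loss(margin, delta_ndcg))   # the high-|dNDCG| pair dominates the gradient
```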
[1] Burges, Christopher, Robert Ragno, and Quoc Le. "Learning to rank with nonsmooth cost functions." Advances in neural information processing systems 19 (2006).
[2] Wang, Xuanhui, et al. "The lambdaloss framework for ranking metric optimization." Proceedings of the 27th ACM international conference on information and knowledge management. 2018.
[3] Chen, Angelica, et al. "Preference Learning Algorithms Do Not Learn Preference Rankings." arXiv preprint arXiv:2405.19534 (2024).
This paper proposes Ordinal Preference Optimization (OPO), a listwise method that adapts existing learning-to-rank (LTR) methods to optimize LLMs based on multiple response rankings. By combining existing LTR methods such as NeuralNDCG with DPO, OPO can directly optimize on multi-response datasets. Experiments show the effectiveness of the proposed OPO.
Strengths
- The proposed OPO method is straightforward and intuitive.
- Experiments with various pairwise and listwise baselines demonstrate the effectiveness of OPO.
- The codes and scripts are provided in the supplementary materials, enhancing the reproducibility of this work.
Weaknesses
- The motivation for OPO is not clear. It is only briefly mentioned (e.g., line 53 "which is relatively simplistic") and requires a more detailed comparison with existing listwise approaches to highlight the motivation of the proposed new listwise method.
- The proposed OPO directly adapts existing learning-to-rank (LTR) methods (e.g., NeuralNDCG) to optimize DPO with NDCG, lacking novelty. In fact, as listed in this paper, other listwise LTR methods can also be applied to DPO.
- The paper lacks theoretical analysis and comparative discussion of the proposed OPO method, which would strengthen its contribution and clarify its advantages over existing methods.
Questions
- Why was NeuralNDCG chosen as the surrogate loss for optimizing NDCG, rather than other direct NDCG optimization losses?
- Given the potential sensitivity of the win rate metric with GPT-4, was a t-test conducted to assess its significance, and were selection biases (e.g., sensitivity to AB vs. BA ordering) considered?
Q1
There is a gap between using NDCG as an evaluation metric and as a training objective: the NDCG metric is non-differentiable with respect to the reward scores, which prevents using gradient descent to optimize the model. We therefore employ a differentiable surrogate approximation of NDCG to optimize it.
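For reference, in standard LTR notation (a textbook definition; the paper's symbols may differ), with $\pi$ the permutation obtained by sorting responses by the model's reward scores,

$$
\mathrm{DCG@K} = \sum_{i=1}^{K} \frac{G\big(\psi_{\pi(i)}\big)}{\log_2(i+1)},
\qquad
\mathrm{NDCG@K} = \frac{\mathrm{DCG@K}}{\mathrm{IDCG@K}},
$$

where $G$ is the gain over labels $\psi$ and $\mathrm{IDCG@K}$ is the DCG of the label-sorted ordering. The scores enter only through the discrete permutation $\pi = \mathrm{argsort}$, so the metric is piecewise constant in the scores and its gradient is zero almost everywhere, which is why a differentiable surrogate is needed.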
Q2
The win rate is the standard evaluation metric in the LLM alignment literature, e.g., in DPO and SimPO. To mitigate the sensitivity of any single metric, we employ four different win rate metrics across different reward models and evaluation benchmarks. As shown in the literature, a 5% improvement in win rate is considered significant, but achieving a significant p-value can be difficult because the test set contains only about 2k prompts.
Thank you for your response. Some of my concerns have been addressed. However, I agree with the other reviewers' comments that the contributions of OPO seem relatively incremental (i.e., directly applying direct NDCG optimization to the preference optimization of LLMs). Providing more theoretical analysis and insights would significantly enhance the contribution of this paper.
Additionally, regarding Question 1, I apologize if my initial question was not specific enough. My primary concern is why NeuralNDCG was chosen as the approximation method for NDCG, rather than other approaches, such as those proposed in [1-2] or other approximation techniques. Clarifying this choice would help strengthen the justification for your method.
Overall, I will maintain my original score.
[1] A Guided Learning Approach for Item Recommendation via Surrogate Loss Learning, SIGIR 2021
[2] Large-scale Stochastic Optimization of NDCG Surrogates for Deep Learning with Provable Convergence
We thank the reviewer for their constructive and valuable feedback which improves our work. Please see our detailed response below:
w1&w3
Logistic-based methods like RLHF, DPO, and LiPO maximize the reward margin of the preferred response over the non-preferred one, but experiments [1] have shown that they struggle to increase ranking accuracy, which is a more practical metric in generation scenarios. We counted the correct ranking pairs and successful ranking flips (i.e., incorrect to correct) of each method during training. OPO exhibited significantly higher efficiency in flipping incorrect pairs compared to DPO and LiPO. The detailed results are provided in Appendix A.9.
w2
LTR methods can be applied to the offline alignment problem, but their performance differs. To comprehensively investigate the LTR literature, we added additional baselines of BCE, RankNet, ListNet, and PiRank. The results indicate that OPO consistently outperforms these methods.
Qwen2-0.5B
| Approach | BCE | RankNet | ListNet | PiRank |
|---|---|---|---|---|
| pair win rate | 51.63 | 62.27 | 52.03 | 56.30 |
| score win rate | 50.46 | 59.04 | 51.52 | 54.98 |
Mistral-7B
| Approach | BCE | RankNet | ListNet | PiRank |
|---|---|---|---|---|
| pair win rate | 53.76 | 85.62 | 54.80 | 65.23 |
| score win rate | 54.17 | 84.22 | 56.21 | 64.74 |
[1] Chen, Angelica, et al. "Preference Learning Algorithms Do Not Learn Preference Rankings." arXiv preprint arXiv:2405.19534 (2024).
The paper introduces a new surrogate objective function for training generative models using reinforcement learning, where the objective function is based on NDCG ranking rewards of multiple candidate answers. The paper argues that strong RL training methods often depend on designing contrastive reward functions, which utilize the comparative values rather than the absolute values of multiple answers for self-normalization and uncertainty quantification purposes. The proposed NDCG objective is an extension of this idea, where the detailed ranking positions for a list of answers are considered at the same time.
To optimize the NDCG objective, the paper further introduces the differentiable NeuralSort operator (Grover et al., 2019), where the ranking steps are relaxed into a probabilistic permutation matrix using softmax operators. The paper then compares the proposed Neural-NDCG method with other popular NDCG optimization methods, such as LambdaRank and simple pairwise comparison baselines. The paper shows the generality of the proposed method by fine-tuning medium-scale LLMs on the ListUltraFeedback dataset and evaluating them on the AlpacaEval and MT-Bench datasets.
Strengths
Originality: The paper introduces a new way to train RLHF models by extending pairwise reward functions into list-wise ranking functions. The paper then introduces neural-sort operator to replace the current SOTA LambdaRank method. The introduction of the neural-sort operator looks natural and the intuitions are clear.
Quality: The proposed methods are evaluated on recent models and relevant benchmarks. The experiments and the ablation studies are well-designed. They serve as a good template for researchers setting up similar experiments.
Weaknesses
It does not seem very clear from the paper, but the proposed RL training steps seem to be offline. According to Eq. (2), the learned policies are only fitted on the likelihoods of the pre-generated answers, instead of self-generated answers. It is unclear to me whether the offline optimization paradigm presents a performance gap when the models are evaluated online or when they generate answers in the real world.
As a minor point, the NeuralSort algorithm in Eq. (5) could use more discussion. For example, the paper could include more details about the probabilistic P_sort matrix in the numerical examples.
Questions
Regarding online vs offline RL, can the authors:
- Clarify whether the proposed method is intended for offline or online RL.
- Discuss potential limitations of using an offline approach
- Compare the proposed method to online RL alternatives to verify if potential performance gaps may exist
- Give concrete examples of how the proposed method performs in real-world online generation scenarios.
Regarding NeuralSort details, please:
- Include a step-by-step example of how the probabilistic P_sort matrix is computed in Appendix A
- Clarify the benefits of probabilistic permutations. Will having ties or near-ties allow better characterization of the ranking uncertainties?
- Discuss potential limitations of probabilistic permutations. Will the P_sort matrix lead to miscalibration, where the generated matrix may degenerate? For example, having the same [0,1,0,0] rows in the entire matrix?
- Discuss uniqueness of the solutions. Are there other equivalent formulations to achieve neural-sort? Will these equivalent formulations alleviate some of the limitations? In what sense is the proposed algorithm the most obvious choice?
We would like to thank you for the detailed comments and suggestions. We have responded to each of the concerns below.
Q1
- OPO is an offline listwise alignment algorithm.
- The performance differences between online and offline RLHF approaches have been well discussed in the alignment literature [1][2]. Online methods integrate on-policy sampling during the training stage and demonstrate superior performance with a smaller KL divergence, which means the RLHF policy drifts less from the SFT policy. Offline algorithms also face the challenge of fitting sub-optimal pre-generated training data. A possible solution is to employ offline methods over multiple iteration rounds, where new preference data and annotations are generated from the latest policy and used for the next training round [3].
- Online algorithms can be viewed as offline algorithms running on data sampled from the current policy $\pi_\theta$, which is more on-policy than data pre-generated from $\pi_{\text{ref}}$. Under specific sampling methods such as rejection sampling or best-of-n [2], offline algorithms like DPO and our OPO can be naturally applied in online scenarios. Please also note that employing offline algorithms over several iteration rounds is more popular in industry, as discussed in point (2) above.
- Like other offline methods, applying OPO in an iterative paradigm is straightforward. Traditional iterative pairwise methods sample best-of-n responses and assign them reward scores with a separate reward model, then choose the preferred and non-preferred responses based on the assigned rewards and apply alignment algorithms like DPO to these response pairs. OPO shares the same sampling-rewarding stage; the only difference is that we input a list of K responses with their assigned reward scores and apply listwise algorithms to them.
[1] Tang, Yunhao, et al. "Understanding the performance gap between online and offline alignment algorithms." arXiv preprint arXiv:2405.08448 (2024).
[2] Gao, Leo, John Schulman, and Jacob Hilton. "Scaling laws for reward model overoptimization." International Conference on Machine Learning. PMLR, 2023.
[3] Dubey, Abhimanyu, et al. "The llama 3 herd of models." arXiv preprint arXiv:2407.21783 (2024).
Q2
- We have now provided the step-by-step example in Appendix A.8.
- The original NDCG metric is deterministic but non-differentiable with respect to the reward scores, which prevents using gradient descent to optimize the model. In order to apply backpropagation, we need to replace the discrete ranking operators (e.g., the rank/position function) with differentiable and continuous operators, such as the smooth rank approximation in ApproxNDCG, and such operators are probabilistic.
- The temperature hyperparameter $\tau$ in OPO controls the approximation precision. As shown in Table 6, when $\tau$ is set to a high value, the probabilistic permutation matrix P_sort cannot correctly reflect the true permutation, which is harmful to model performance.
For example, having the same [0,1,0,0] rows in the entire matrix
No. P_sort[i][j] denotes the probability that response j is ranked in the i-th position after re-sorting. If scores = [1, 1, 1, 1], every row of P_sort will be [0.25, 0.25, 0.25, 0.25].
- NeuralSort and its variant PiRank are both differentiable probabilistic sorting algorithms and can be employed for NDCG approximation. However, their permutation matrices may not be column-stochastic, meaning each column may not sum to one. This can cause some responses to contribute disproportionately to the overall loss objective and adversely affect model performance. The ablation study is shown below.
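For readers without access to Appendix A.8, here is a compact sketch of the relaxed permutation matrix following the published NeuralSort formula (Grover et al., 2019), not necessarily the paper's exact implementation; it also reproduces the tied-scores case mentioned above:

```python
import torch

def neural_sort(scores, tau=1.0):
    """NeuralSort relaxed permutation matrix: row i is a softmax over which
    response occupies rank i (ranks ordered by descending score)."""
    s = scores.float().unsqueeze(-1)                   # shape (n, 1)
    n = s.size(0)
    A = (s - s.T).abs()                                # A[j, k] = |s_j - s_k|
    row_sums = A.sum(dim=1)                            # sum_k |s_j - s_k|
    ranks = torch.arange(1, n + 1, dtype=s.dtype)      # 1..n
    # logits[i, j] = (n + 1 - 2i) * s_j - sum_k |s_j - s_k|
    logits = (n + 1 - 2 * ranks).unsqueeze(-1) * s.T - row_sums.unsqueeze(0)
    return torch.softmax(logits / tau, dim=-1)         # each row sums to 1

P = neural_sort(torch.tensor([0.2, 1.5, -0.3, 0.9]), tau=1.0)
print(P.sum(dim=1))  # rows sum to 1 by construction
print(P.sum(dim=0))  # columns need not sum to 1 -> motivates column normalization
print(neural_sort(torch.tensor([1.0, 1.0, 1.0, 1.0])))  # ties -> every row is [0.25, 0.25, 0.25, 0.25]
```

Lowering the temperature sharpens the rows toward a hard permutation matrix, while a large temperature flattens them toward uniform rows, consistent with the sensitivity to $\tau$ noted above.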
Thanks for the authors' responses. The authors addressed my concerns, but I also agree with the other reviewers in terms of novelty assessments. If DPO itself is not a fundamentally challenging problem, then it may not be the best application for differentiable list-wise optimization algorithms. I will keep my rating thanks to the clarity of the paper.
Response to All:
We thank the reviewers for your positive and constructive feedback. Your comments significantly help us improve the paper. We summarize responses to the shared concerns as follows:
1. Motivation and Theoretical analysis
To further analyze why OPO outperforms other methods, we counted the correct ranking pairs and successful ranking flips (i.e., incorrect to correct) of each method during training. OPO exhibited significantly higher efficiency in flipping incorrect pairs compared to DPO and LiPO. We can also theoretically prove that the optimal ranking accuracy of NDCG-based methods is bounded by the ranking accuracy of the reference model. However, to achieve such an optimal policy, OPO does not necessarily maximize the reward margin as DPO and LiPO do, which might be a reason for its higher efficiency in increasing the ranking accuracy metric. We show that OPO corrects incorrectly ranked pairs (i.e., pairs whose implicit-reward ordering disagrees with the label ordering) faster than DPO and LiPO. The detailed results are provided in Appendix A.9.
2. Online vs. Offline
OPO is an offline alignment algorithm, so, like other offline methods, it faces the challenge of fitting sub-optimal pre-generated training data. However, it can be applied in an iterative paradigm, which has proven effective in industry [1]. Traditional iterative pairwise methods sample best-of-n responses and assign them reward scores with a separate reward model, then choose the preferred and non-preferred responses based on the assigned rewards and apply alignment algorithms like DPO to these response pairs. OPO shares the same sampling-rewarding stage; the only difference is that we input a list of K responses with their assigned reward scores and apply listwise algorithms to them.
[1] Dubey, Abhimanyu, et al. "The llama 3 herd of models." arXiv preprint arXiv:2407.21783 (2024).
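A schematic sketch of this iterative paradigm is shown below; the `policy`, `reward_model`, and `train_opo` objects are placeholders for illustration, not the paper's API.

```python
def iterative_opo(policy, reward_model, prompts, num_rounds=3, k=8):
    """Schematic loop: sample K responses per prompt from the latest policy,
    score them with the reward model, then run listwise (OPO) training."""
    for _ in range(num_rounds):
        dataset = []
        for prompt in prompts:
            responses = [policy.generate(prompt) for _ in range(k)]          # sampling stage
            scores = [reward_model.score(prompt, r) for r in responses]      # rewarding stage
            dataset.append((prompt, responses, scores))                      # keep the full ranked list
        policy = train_opo(policy, dataset)  # listwise alignment on K responses per prompt
    return policy
```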
3. Supplement Experiments
To comprehensively investigate the LTR literature, we added additional baselines of BCE, RankNet, ListNet, and PiRank. We also supplemented ablation studies on the key components of OPO. The results shown below indicate that OPO consistently outperforms these methods.
The paper proposes a novel surrogate objective function for reinforcement learning-based generative model training, leveraging NDCG ranking rewards to evaluate multiple candidate answers. This approach extends contrastive reward functions by incorporating detailed ranking positions for better self-normalization and uncertainty quantification. The authors optimize the NDCG objective using the differentiable Neural Sort operator and demonstrate the method’s generality by fine-tuning language models on ListUltraFeedback and evaluating them on AlpacaEval and MT-Bench datasets, outperforming existing NDCG optimization methods like LambdaRank.
I agree with the common concerns of the reviewers (Reviewers Eyvn, bLeQ, Zehe, 9n1Z) regarding the motivation of the paper beyond that of DPO. Furthermore, the novelty of the proposed method over existing LTR algorithms is unclear. In general, the significance and efficacy of listwise objectives for LM alignment are not very convincing, even from the empirical results. Based on the common consensus, we therefore suggest the authors submit to the next suitable venue with all the necessary modifications.
Additional Comments from the Reviewer Discussion
See above.
Reject