PaperHub
COLM 2025 · Poster · 4 reviewers
Overall rating: 7.0/10 (individual ratings: 9, 6, 6, 7; min 6, max 9, std 1.2)
Average confidence: 3.3

In-context Ranking Preference Optimization

OpenReview · PDF
Submitted: 2025-03-21 · Updated: 2025-08-26
TL;DR

We introduce IRPO, a framework that optimizes LLMs using natural, in-context ranking feedback to enhance ranking quality while reducing computational cost.

Abstract

Keywords
direct preference optimization, large language model

Reviews and Discussion

Official Review
Rating: 9

This paper aims to solve the challenge of sparse pairwise feedback used in Direct Preference Optimization (DPO). DPO maximizes the margin between preferred and non-preferred responses and does not take into account the pairwise comparisons of all possible pairs. In contrast, this paper proposes a framework called ​​In-context Ranking Preference Optimization that accounts for the relevance of an item and its positional importance in a differentiable ranking objective. Experiments are conducted on ranking tasks such as conversational recommendation, generative retrieval, and question-answering re-ranking. Compared to other existing methods, such as supervised fine tuning (SFT) and DPO, the proposed model performs significantly better in both NDCG@k and Recall@k metrics.

Reasons to Accept

  1. The paper is well-organized and well-written. It provides the necessary background and identifies the challenge with existing DPO methods. The theoretical insights provided in the paper help in better understanding the method.
  2. The strong empirical results across three different tasks suggest that the method is generalizable and improves the alignment with human preferences.

Reasons to Reject

I don’t see any specific reason to reject this paper.

Comment

We sincerely thank Reviewer SMDr for their highly positive feedback and insightful evaluation of our work. We greatly appreciate the recognition of our paper's organization, clarity, and theoretical contributions. We're particularly encouraged by the reviewer's acknowledgment of our strong empirical results and the generalizability of the proposed approach.

Official Review
Rating: 6

This paper introduces IRPO, a novel approach motivated by real-world applications where LLMs generate multiple candidate outputs in a single inference, with users selecting relevant or preferred items. Unlike methods requiring exhaustive pairwise comparisons, IRPO optimizes models using naturally occurring ranked feedback.

The key innovation is extending the DPO objective to simultaneously incorporate graded relevance and positional importance. IRPO leverages the PL model for list-wise preferences, weighting each item's contribution according to its rank position's NDCG gain. This creates a differentiable surrogate for list-wise ranking objectives by decomposing ranking lists into weighted pairwise comparisons, with higher-ranked mistakes incurring larger losses to mimic NDCG penalties.
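
A minimal sketch of such a DCG-weighted pairwise surrogate (illustration only, not the paper's code; the function names, the exact gain/discount form, and the β scaling are assumptions):

```python
# Illustrative sketch only (not the paper's implementation): a listwise surrogate that
# decomposes a ranked list into pairwise DPO-style terms, each weighted by a DCG-style
# gain so that mistakes involving higher-ranked, more relevant items cost more.
import math
import torch
import torch.nn.functional as F

def dcg_weight(rank: int, relevance: float) -> float:
    """Assumed positional weight: graded-relevance gain times a log-rank discount."""
    return (2.0 ** relevance - 1.0) / math.log2(rank + 1.0)

def weighted_pairwise_loss(policy_logps: torch.Tensor,
                           ref_logps: torch.Tensor,
                           relevance: torch.Tensor,
                           beta: float = 0.1) -> torch.Tensor:
    """policy_logps, ref_logps, relevance all have shape (K,), ordered by the
    target ranking (index 0 = top of the list). Returns a scalar loss."""
    reward = beta * (policy_logps - ref_logps)        # implicit DPO-style rewards, shape (K,)
    loss, n_pairs = policy_logps.new_zeros(()), 0
    K = reward.shape[0]
    for i in range(K):
        w_i = dcg_weight(i + 1, float(relevance[i]))  # larger weight near the top of the list
        for j in range(i + 1, K):                     # item i should be preferred over item j
            loss = loss - w_i * F.logsigmoid(reward[i] - reward[j])
            n_pairs += 1
    return loss / max(n_pairs, 1)

# Toy usage with 4 listed items and graded relevance labels.
policy = torch.tensor([-1.0, -1.5, -2.0, -2.5])
reference = torch.tensor([-1.2, -1.4, -2.1, -2.3])
labels = torch.tensor([2.0, 1.0, 0.0, 0.0])
print(weighted_pairwise_loss(policy, reference, labels))
```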

Theoretically, IRPO prioritizes learning from the most "wrongly ordered" items, as its gradient automatically emphasizes items with large disparities between model and reference rankings. The authors demonstrate that this gradient functions as an unbiased importance sampling estimator with reduced variance for ranking tasks. Overall, IRPO outperforms the DPO and S-DPO baselines with significant improvements in performance.

Reasons to Accept

  • The paper is clearly written and accessible despite the technical complexity of the subject matter.
  • The proposal is theoretically sound with rigorous mathematical proofs supporting the approach.
  • Evaluation is comprehensive, covering multiple application domains: Conversational Recommendation, Generative Retrieval, and QA reranking.

Reasons to Reject

  • The novelty is somewhat limited, as this is not the first work combining DPO with differentiable NDCG objectives. The authors mention related approaches (OPO, DRPO, LiPO [1,2,3]) but provide insufficient comparative analysis against these methods, making the specific contributions of IRPO unclear.
  • The paper lacks exploration of applicability beyond recommender systems. Can IRPO effectively learn from ranked preferences to improve performance on general LLM benchmarks like Arena-Hard and AlpacaEval, which are standard targets for preference learning methods?
  • While the authors claim computational efficiency ("eliminating multiple forward passes and reducing computational cost," line 101), they provide no empirical measurements to substantiate this claim. Comparative runtime or resource utilization metrics would strengthen this argument.
  1. LiPO: https://arxiv.org/abs/2402.01878
  2. OPO: https://arxiv.org/abs/2410.04346
  3. DRPO: https://arxiv.org/abs/2410.18127

Questions for the Authors

  1. What specifically separates IRPO from OPO, DRPO, and LiPO? The related work section does not provide enough evidence on the differences; a direct comparison would help.
  2. Have you tried IRPO on standard LLM alignment benchmarks outside recommendation?
  3. How much faster/cheaper is this approach in practice compared to alternatives?

Addressing these questions from the reviewer would help improve the rating.

Comment

Response to Weakness 2 and Question 2

We have evaluated IRPO on standard LLM tasks beyond recommendation, including generative retrieval [1–3] and question answering [4–6] (Sections 5.3 and 5.4), where ARC-Challenge also appears in AlpacaEval. These results demonstrate IRPO’s applicability to general LLM alignment benchmarks.

Response to Weakness 3 and Question 3

As mentioned in Lines 37–38, unlike prior methods, IRPO aligns LLMs that generate an entire ranked list in one shot (Line 88 in Related Work). As a result, IRPO needs only a single forward pass per sample, whereas DPO, SDPO, and LiPO perform multiple passes to compute pairwise margins, yielding a substantial speedup. We report Run Time (per sample), which is averaged under consistent evaluation settings, to ensure a comprehensive and representative comparison. We show the following complexity comparison results.

| Method | Run Time (per sample) |
|--------|-----------------------|
| Base   | 0.1140 s |
| DPO    | 0.1145 s |
| SDPO   | 0.1144 s |
| LiPO   | 0.1154 s |
| IRPO   | 0.0313 s |

Since IRPO requires only a single forward pass, its per-sample runtime is roughly 4× faster.

[1] Shen, Tao, et al. "Retrieval-augmented retrieval: Large language models are strong zero-shot retriever." Findings of the Association for Computational Linguistics ACL 2024. 2024.

[2] Li, Xiaoxi, et al. "From matching to generation: A survey on generative information retrieval." ACM Transactions on Information Systems 43.3 (2025): 1-62.

[3] Xia, Yu, et al. "Knowledge-Aware Query Expansion with Large Language Models for Textual and Relational Retrieval." Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies. 2025.

[4] Li, Muzhi, et al. "Retrieval, Reasoning, Re-ranking: A Context-Enriched Framework for Knowledge Graph Completion." Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies. 2025.

[5] Chen, Shijie, Bernal Jiménez Gutiérrez, and Yu Su. "Attention in Large Language Models Yields Efficient Zero-Shot Re-Rankers." arXiv preprint arXiv:2410.02642 (2024).

[6] Yue, Murong. "A survey of large language model agents for question answering." arXiv preprint arXiv:2503.19213 (2025).

[7] He, Zhankui, et al. "Large language models as zero-shot conversational recommenders." Proceedings of the 32nd ACM international conference on information and knowledge management. 2023.

[8] Hou, Yupeng, et al. "Large language models are zero-shot rankers for recommender systems." European Conference on Information Retrieval. Cham: Springer Nature Switzerland, 2024.

Comment

Response to Weakness 1 and Question 1

While prior works, such as OPO, DRPO, and LiPO, combine DPO with differentiable NDCG-style objectives, IRPO is designed for a fundamentally different and practically motivated new setting: aligning LLMs based on sparse, in-context listwise feedback, where users select only a few relevant items from LLM-generated rankings. This sparse supervision setting, distinct from explicit pairwise or fully labeled listwise feedback, arises naturally in applications such as conversational recommendation [7-8], generative retrieval [1-3], and in-context QA [4-6] (see Lines 37–38 and 88). To address this new setting, IRPO introduces several novel contributions:

  • Feedback modeling: Unlike prior methods that require external supervision (for example, human-labeled rewards or pre-assigned scores), IRPO directly models partial, implicit preferences from ranked lists without relying on external reward signals.
  • Objective design: IRPO integrates both item-level relevance and positional importance using a Plackett-Luce-inspired model with DCG-based discounting, resulting in a fully differentiable surrogate objective tailored for sparse, in-context feedback (Lines 38–40).
  • Theoretical contributions: We provide a detailed gradient analysis showing that IRPO's optimization induces importance-weighted gradients that are unbiased and have reduced variance. This theoretical grounding, based on importance sampling, is not present in OPO or DRPO, which focus primarily on empirical performance without such guarantees.

In summary, IRPO is not just a variant of DPO. It identifies and addresses a realistic and previously underexplored setting where in-context ranked list feedback is naturally sparse and non-pairwise. To support robust optimization in this setting, IRPO introduces both modeling innovations, such as positional aggregation of pairwise preferences weighted by graded relevance, and theoretical insights, including unbiased gradient estimators linked to importance sampling with reduced variance.
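
Schematically (our paraphrase under stated assumptions, not the paper's exact equation), the objective described above can be viewed as a DCG-weighted sum of pairwise DPO-style terms over the ranked list:

```latex
% Schematic only; o_i is the item at target rank i, y_i its graded relevance,
% beta the DPO temperature. The precise weighting w(i) is our assumption.
\[
\mathcal{L}_{\mathrm{IRPO}}
  = -\,\mathbb{E}\!\left[
      \sum_{i<j} w(i)\,
      \log\sigma\!\left(
        \beta\log\frac{\pi_\theta(o_i\mid x)}{\pi_{\mathrm{ref}}(o_i\mid x)}
        - \beta\log\frac{\pi_\theta(o_j\mid x)}{\pi_{\mathrm{ref}}(o_j\mid x)}
      \right)
    \right],
\qquad
w(i)=\frac{2^{y_i}-1}{\log_2(i+1)}.
\]
```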

In addition to responding to the reviewer's question, we also added the suggested baseline LiPO and provided the comparative results. We chose LiPO because it is a recent and representative listwise preference optimization method that explicitly extends DPO using general learning-to-rank principles. Among comparable approaches, LiPO has a more generic design that allows straightforward integration into our setup. Moreover, it serves as a meaningful contrast to IRPO: while both methods handle ranked list feedback, LiPO emphasizes listwise LTR objectives, whereas IRPO is specifically designed to model sparse in-context rankings and positional importance. This comparison helps clarify IRPO’s contributions under realistic, feedback-constrained alignment settings.

| Model  | Method | ARC N@1 | ARC N@3 | ARC N@5 | ARC R@1 | ARC R@3 | ARC R@5 | MuSiQue N@1 | MuSiQue N@3 | MuSiQue N@5 | MuSiQue R@1 | MuSiQue R@3 | MuSiQue R@5 |
|--------|--------|---------|---------|---------|---------|---------|---------|-------------|-------------|-------------|-------------|-------------|-------------|
| LLaMA3 | Base | 13.2 | 21.7 | 28.6 | 13.2 | 28.4 | 44.9 | 42.5 | 43.5 | 51.9 | 18.8 | 44.9 | 60.7 |
| LLaMA3 | SFT  | 13.5 | 21.4 | 28.4 | 13.5 | 27.7 | 44.6 | 36.9 | 39.1 | 48.0 | 16.6 | 40.7 | 57.6 |
| LLaMA3 | SDPO | 23.4 | 28.3 | 35.4 | 23.4 | 32.5 | 50.0 | 58.6 | 60.8 | 66.7 | 26.0 | 62.3 | 73.3 |
| LLaMA3 | LiPO | 35.2 | 39.8 | 45.5 | 35.2 | 43.7 | 57.7 | 47.4 | 49.6 | 57.1 | 20.9 | 51.1 | 65.2 |
| LLaMA3 | IRPO | 27.4 | 38.9 | 46.5 | 27.4 | 47.6 | 66.3 | 65.1 | 59.4 | 69.4 | 27.6 | 59.2 | 78.0 |
| Gemma2 | Base | 27.4 | 35.6 | 42.1 | 27.4 | 42.6 | 58.4 | 40.9 | 42.8 | 51.1 | 17.9 | 44.3 | 59.9 |
| Gemma2 | SFT  | 23.6 | 32.9 | 39.3 | 23.6 | 40.2 | 55.7 | 39.8 | 43.3 | 51.7 | 17.2 | 45.3 | 61.3 |
| Gemma2 | SDPO | 28.7 | 36.5 | 42.6 | 28.7 | 43.0 | 58.0 | 40.6 | 42.3 | 50.6 | 17.9 | 43.7 | 59.4 |
| Gemma2 | LiPO | 23.2 | 30.5 | 37.6 | 23.2 | 36.0 | 53.5 | 49.8 | 51.8 | 58.9 | 22.3 | 53.2 | 66.7 |
| Gemma2 | IRPO | 27.0 | 45.5 | 53.3 | 27.0 | 59.8 | 78.7 | 55.4 | 51.0 | 61.1 | 23.8 | 51.6 | 70.6 |

As seen above, IRPO consistently outperforms the other baselines across all benchmarks.

Comment

Thank you for the author response and effort. The authors have addressed my questions 2 and 3, and I have therefore decided to raise my score. However, question 1 remains unclear, as the results and explanations still do not adequately relate to the two previous methods: OPO and DRPO. Could the authors focus more on discussing the differences between their method and these existing approaches? Clarifying this point would increase the reviewer's confidence that the paper meets the standards of the conference.

Comment

Thanks for the further suggestions. We summarize the key differences between OPO/DRPO and IRPO (ours) in the following table:

| Difference | OPO/DRPO | IRPO |
|---|---|---|
| Supervision Source | Requires fully labeled listwise feedback or external scores/rewards | Directly models implicit user selection; no external supervision |
| Feedback Setting | Offline, fully supervised | In-context, sparse listwise feedback at inference time |
| Objective Function | OPO: NeuralNDCG with soft sorting; DRPO: differentiable NDCG with margin-based policies | Closed-form Bradley-Terry with positional NDCG gains |
| Differentiable Sorting | Yes (OPO: soft sorting; DRPO: sorting networks) | No sorting networks required |
| Positional Discounting | OPO: implicit via differentiable NDCG; DRPO: yes | Explicitly modeled using DCG |
| On-Policy Extension | Not explicitly supported | Naturally supported via iterative rollouts |
| Theoretical Guarantees | Not emphasized | Proves unbiased, low-variance gradients via importance sampling |

Comment

Thank you for the effort in providing the additional comparison. It has resolved my concerns, and I have decided to raise my score.

Official Review
Rating: 6

This paper studies formulating direct preference optimization as a list-wise preference learning problem. The paper formulates the training loss as an approximation of the NDCG loss. The proposed approach improves LLMs such as Llama3, Phi3, and Gemma2 compared to SFT, DPO, and SDPO post-training.

Reasons to Accept

  • The approach is particularly interesting for the recommendation use case, as experimented with in the paper, where non-binary preference information is better reflected by the NDCG loss formulation.
  • The experimental results show improvements compared to DPO and SDPO.
  • The paper is overall well written and clear.

Reasons to Reject

  • The comparison in the related work to the LiPO approach is not clear; articulating how these two approaches differ would be a useful addition.
  • The paper only compares to pairwise formulations of preference ranking. Adding comparisons with other listwise approaches (such as LiPO, OPO, DRPO) would make the experimental results more convincing.

Questions for the Authors

  • How does the approach perform compared to existing listwise preference optimization approaches such as LiPO, OPO, and DRPO?
  • Could you please add some examples from the dataset that show the impact of the different training objectives?

Comment

Response to Question 2

In our paper, we have provided some qualitative comparisons in Appendix F. To further illustrate the strengths and distinct behavior of IRPO compared to baselines, we now provide additional qualitative examples showcasing ranked lists generated by IRPO and baseline methods. Specifically, we present three representative examples demonstrating how IRPO effectively prioritizes more contextually relevant and coherent responses compared to the baselines:

Example 1

Prompt: Pretend you are a question-answering model. I will give you a Question. Based on the Question, please rerank the provided answers in a list from the best to the worst without any extra words. Question: Sam wasn't lying, but he left out important details. He was being what? 1. dishonesty, 2. deceitful, 3. lying, 4. deceptive, 5. imagination, 6. more lies, 7. lie to himself, 8. deception, 9. telling lies, 10. poker face

SFT: 1. dishonesty 2. deceitful 3. lying 4. deceptive 5. imagination

DPO: 1. deceptive 2. lying 3. deceitful 4. deceptive 5. imagination

IRPO: 1. deceitful 2. dishonesty 3. more lies 4. deceptive 5. imagination 6. lie to himself 7. deception 8. telling lies 9. poker face 10. lying

Example 2

Prompt: Pretend you are a question-answering model. I will give you a Question. Based on the Question, please rerank the provided answers in a list from the best to the worst without any extra words. Question: At the end of the day as he began to eat he paused and thanked her, it wasn't often she would what? 1. cook dinner, 2. get food, 3. make food, 4. did chores, 5. stretch out, 6. wife asked to, 7. when holding her hand, 8. a meal, 9. were hungry, 10. she cannot do that when all are for her

SFT: 1. wife asked to 2. when holding her hand 3. she cannot do that when all are for her 4. she would 5. cook dinner 6. get food 7. make food 8. did chores 9. stretch out 10. were hungry

DPO: 1. stretch out 2. cook dinner 3. get food 4. make food 5. did chores 6. wife asked to 7. when holding her hand 8. were hungry 9. she cannot do that when all are for her

IRPO: 1. cook dinner 2. she cannot do that when all are for her 3. wife asked to 4. get food 5. a meal 6. stretch out 7. when holding her hand 8. make food 9. were hungry 10. did chores

Example 3

Prompt: Pretend you are a question-answering model. I will give you a Question. Based on the Question, please rerank the provided answers in a list from the best to the worst without any extra words. Question: No matter what date you put on it, we all know the universe to be what? 1. very old, 2. infinite, 3. infiniverse, 4. real, 5. universe, 6. getting younger, 7. miss universe, 8. the universe, 9. astronomy, 10. date

DPO: 1. infinite 2. astronomy 3. the universe 4. real 5. getting younger 6. universe 7. miss universe 8. date 9. infiniverse 10. very old

SFT: 1. infinite 2. astronomy 3. the universe 4. real 5. getting younger 6. miss universe 7. date 8. infiniverse 9. very old

IRPO: 1. very old 2. infinite 3. infiniverse 4. real 5. universe 6. getting younger 7. miss universe 8. the universe 9. astronomy 10. date

[1] Shen, Tao, et al. "Retrieval-augmented retrieval: Large language models are strong zero-shot retriever." Findings of the Association for Computational Linguistics ACL 2024. 2024.

[2] Li, Xiaoxi, et al. "From matching to generation: A survey on generative information retrieval." ACM Transactions on Information Systems 43.3 (2025): 1-62.

[3] Xia, Yu, et al. "Knowledge-Aware Query Expansion with Large Language Models for Textual and Relational Retrieval." Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies. 2025.

[4] Li, Muzhi, et al. "Retrieval, Reasoning, Re-ranking: A Context-Enriched Framework for Knowledge Graph Completion." Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies. 2025.

[5] Chen, Shijie, Bernal Jiménez Gutiérrez, and Yu Su. "Attention in Large Language Models Yields Efficient Zero-Shot Re-Rankers." arXiv preprint arXiv:2410.02642 (2024).

[6] Yue, Murong. "A survey of large language model agents for question answering." arXiv preprint arXiv:2503.19213 (2025).

[7] He, Zhankui, et al. "Large language models as zero-shot conversational recommenders." Proceedings of the 32nd ACM international conference on information and knowledge management. 2023.

[8] Hou, Yupeng, et al. "Large language models are zero-shot rankers for recommender systems." European Conference on Information Retrieval. Cham: Springer Nature Switzerland, 2024.

Comment

Response to Weaknesses 1 and 2, and Question 1

While prior works, such as OPO, DRPO, and LiPO, combine DPO with differentiable NDCG-style objectives, IRPO is designed for a fundamentally different and practically motivated new setting: aligning LLMs based on sparse, in-context listwise feedback, where users select only a few relevant items from the LLM-generated rankings. This sparse supervision setting, distinct from explicit pairwise or fully labeled listwise feedback, arises naturally in applications such as conversational recommendation [7-8], generative retrieval [1-3], and in-context QA [4-6] (see Lines 37–38 and 88).

To address this new setting, IRPO introduces several novel contributions:

  • Feedback modeling: Unlike prior methods that require external supervision (for example, human-labeled rewards or pre-assigned scores), IRPO directly models partial, implicit preferences from ranked lists without relying on external reward signals.
  • Objective design: IRPO integrates both item-level relevance and positional importance using a Plackett-Luce-inspired model with DCG-based discounting, resulting in a fully differentiable surrogate objective tailored for sparse, in-context feedback (Lines 38–40).
  • Theoretical contributions: We provide a detailed gradient analysis showing that IRPO's optimization induces importance-weighted gradients that are unbiased and have reduced variance. This theoretical grounding, based on importance sampling, is not present in OPO or DRPO, which focus primarily on empirical performance without such guarantees.

In summary, IRPO is not just a variant of DPO. It identifies and addresses a realistic and previously underexplored setting where in-context ranked list feedback is naturally sparse and non-pairwise. To support robust optimization in this setting, IRPO introduces both modeling innovations, such as positional aggregation of pairwise preferences weighted by graded relevance, and theoretical insights, including unbiased gradient estimators linked to importance sampling with reduced variance.

In addition to responding to the reviewer's question, we also added the suggested baseline LiPO and provided the comparative results. We chose LiPO because it is a recent and representative listwise preference optimization method that explicitly extends DPO using general learning-to-rank principles. Among comparable approaches, LiPO has a more generic design that allows straightforward integration into our setup. Moreover, it serves as a meaningful contrast to IRPO: while both methods handle ranked list feedback, LiPO emphasizes listwise LTR objectives, whereas IRPO is specifically designed to model sparse in-context rankings and positional importance. This comparison helps clarify IRPO’s contributions under realistic, feedback-constrained alignment settings.

| Model  | Method | ARC N@1 | ARC N@3 | ARC N@5 | ARC R@1 | ARC R@3 | ARC R@5 | MuSiQue N@1 | MuSiQue N@3 | MuSiQue N@5 | MuSiQue R@1 | MuSiQue R@3 | MuSiQue R@5 |
|--------|--------|---------|---------|---------|---------|---------|---------|-------------|-------------|-------------|-------------|-------------|-------------|
| LLaMA3 | Base | 13.2 | 21.7 | 28.6 | 13.2 | 28.4 | 44.9 | 42.5 | 43.5 | 51.9 | 18.8 | 44.9 | 60.7 |
| LLaMA3 | SFT  | 13.5 | 21.4 | 28.4 | 13.5 | 27.7 | 44.6 | 36.9 | 39.1 | 48.0 | 16.6 | 40.7 | 57.6 |
| LLaMA3 | SDPO | 23.4 | 28.3 | 35.4 | 23.4 | 32.5 | 50.0 | 58.6 | 60.8 | 66.7 | 26.0 | 62.3 | 73.3 |
| LLaMA3 | LiPO | 35.2 | 39.8 | 45.5 | 35.2 | 43.7 | 57.7 | 47.4 | 49.6 | 57.1 | 20.9 | 51.1 | 65.2 |
| LLaMA3 | IRPO | 27.4 | 38.9 | 46.5 | 27.4 | 47.6 | 66.3 | 65.1 | 59.4 | 69.4 | 27.6 | 59.2 | 78.0 |
| Gemma2 | Base | 27.4 | 35.6 | 42.1 | 27.4 | 42.6 | 58.4 | 40.9 | 42.8 | 51.1 | 17.9 | 44.3 | 59.9 |
| Gemma2 | SFT  | 23.6 | 32.9 | 39.3 | 23.6 | 40.2 | 55.7 | 39.8 | 43.3 | 51.7 | 17.2 | 45.3 | 61.3 |
| Gemma2 | SDPO | 28.7 | 36.5 | 42.6 | 28.7 | 43.0 | 58.0 | 40.6 | 42.3 | 50.6 | 17.9 | 43.7 | 59.4 |
| Gemma2 | LiPO | 23.2 | 30.5 | 37.6 | 23.2 | 36.0 | 53.5 | 49.8 | 51.8 | 58.9 | 22.3 | 53.2 | 66.7 |
| Gemma2 | IRPO | 27.0 | 45.5 | 53.3 | 27.0 | 59.8 | 78.7 | 55.4 | 51.0 | 61.1 | 23.8 | 51.6 | 70.6 |

As seen above, IRPO consistently outperforms the other baselines across all benchmarks.

Official Review
Rating: 7

This work addresses a relevant and important problem: limited and sparse pairwise feedback when using LLMs as ranking models. IRPO extends the DPO objective by including relevance and position of items in ranked lists. To address the challenge of non-differentiable ranking metrics, IRPO also introduces a differentiable objective based on positional aggregation of item preferences. Overall, this work theoretically and empirically shows the benefits of IRPO over standard DPO approaches across various ranking tasks like conversational recommendation and question answering.

Reasons to Accept

Novelty and Significance: The proposed IRPO framework is novel and addresses a significant problem of sparse feedback in ranking with LLMs. The superior performance of IRPO across diverse ranking tasks establishes the impact this work can have on the domain. The differentiable objective based on positional aggregation of pairwise preferences is a clever and novel way to address the non-differentiable nature of ranking metrics.

Theoretical Motivation: IRPO is well motivated and grounded in theory by building connections between the Plackett-Luce model and the Bradley-Terry model. The theoretical analysis providing insights into IRPO's optimization behavior, linking it to importance sampling, and demonstrating an unbiased gradient estimator with reduced variance is sound. Finally, the work motivates and grounds the importance of incorporating relevance and positional information.

Comprehensive Evaluation: The empirical implementations and results across various ranking tasks (and various models), such as conversational recommendation, generative retrieval, and question answering, are well designed. IRPO outperforms all selected baselines across all datasets, demonstrating the superiority of the proposed framework. The analysis of extending IRPO to online settings further shows the wide applicability of the framework.

Reasons to Reject

Qualitative Analysis: The work can benefit from including some qualitative examples of the ranked lists created by IRPO and baselines to further demonstrate the behavior and strength of the approach.

Computational Complexity: The work mentions that IRPO reduces the computational cost (of DPO) by removing multiple forward passes, but it lacks an analysis of the computational cost of IRPO compared to the baselines.

Impact of Components: I would like to see studies further investigating the impact of different components of IRPO, such as the positional preference model or the gradient estimator, on the overall ranking performance. It would also be interesting to see how much the relevance and positional importance affect the ranking results. This could further strengthen the work.

Questions for the Authors

How can IRPO be extended to handle more complex forms of user feedback, such as multiple relevance labels per item? Can IRPO be applied to tasks beyond ranking?

Can you provide some qualitative examples to show the differences between IRPO and baselines?

How sensitive is IRPO's performance to the choice of hyperparameters?

Comment

Response to Weakness 1 and Question 2

In our paper, we have provided some qualitative comparisons in Appendix F. To further illustrate the strengths and distinct behavior of IRPO compared to baselines, we now provide additional qualitative examples showcasing ranked lists generated by IRPO and baseline methods. Specifically, we present three representative examples demonstrating how IRPO effectively prioritizes more contextually relevant and coherent responses compared to the baselines:

Example 1

Prompt: Pretend you are a question-answering model. I will give you a Question. Based on the Question, please rerank the provided answers in a list from the best to the worst without any extra words. Question: Sam wasn't lying, but he left out important details. He was being what? 1. dishonesty, 2. deceitful, 3. lying, 4. deceptive, 5. imagination, 6. more lies, 7. lie to himself, 8. deception, 9. telling lies, 10. poker face

SFT: 1. dishonesty 2. deceitful 3. lying 4. deceptive 5. imagination

DPO: 1. deceptive 2. lying 3. deceitful 4. deceptive 5. imagination

IRPO: 1. deceitful 2. dishonesty 3. more lies 4. deceptive 5. imagination 6. lie to himself 7. deception 8. telling lies 9. poker face 10. lying

Example 2

Prompt: Pretend you are a question-answering model. I will give you a Question. Based on the Question, please rerank the provided answers in a list from the best to the worst without any extra words. Question: At the end of the day as he began to eat he paused and thanked her, it wasn't often she would what? 1. cook dinner, 2. get food, 3. make food, 4. did chores, 5. stretch out, 6. wife asked to, 7. when holding her hand, 8. a meal, 9. were hungry, 10. she cannot do that when all are for her

SFT: 1. wife asked to 2. when holding her hand 3. she cannot do that when all are for her 4. she would 5. cook dinner 6. get food 7. make food 8. did chores 9. stretch out 10. were hungry

DPO: 1. stretch out 2. cook dinner 3. get food 4. make food 5. did chores 6. wife asked to 7. when holding her hand 8. were hungry 9. she cannot do that when all are for her

IRPO: 1. cook dinner 2. she cannot do that when all are for her 3. wife asked to 4. get food 5. a meal 6. stretch out 7. when holding her hand 8. make food 9. were hungry 10. did chores

Example 3

Prompt: Pretend you are a question-answering model. I will give you a Question. Based on the Question, please rerank the provided answers in a list from the best to the worst without any extra words. Question: No matter what date you put on it, we all know the universe to be what? 1. very old, 2. infinite, 3. infiniverse, 4. real, 5. universe, 6. getting younger, 7. miss universe, 8. the universe, 9. astronomy, 10. date

DPO: 1. infinite 2. astronomy 3. the universe 4. real 5. getting younger 6. universe 7. miss universe 8. date 9. infiniverse 10. very old

SFT: 1. infinite 2. astronomy 3. the universe 4. real 5. getting younger 6. miss universe 7. date 8. infiniverse 9. very old

IRPO: 1. very old 2. infinite 3. infiniverse 4. real 5. universe 6. getting younger 7. miss universe 8. the universe 9. astronomy 10. date

Comment

Response to Weakness 2

As mentioned in Lines 37–38, unlike prior methods, IRPO aligns LLMs that generate an entire ranked list in one shot (Line 88 in Related Work). As a result, IRPO needs only a single forward pass per sample, whereas DPO, SDPO, and LiPO perform multiple passes to compute pairwise margins, yielding a substantial speedup. We report Run Time (per sample), which is averaged under consistent evaluation settings, to ensure a comprehensive and representative comparison. We show the following complexity comparison results.

| Method | Run Time (per sample) |
|--------|-----------------------|
| Base   | 0.1140 s |
| DPO    | 0.1145 s |
| SDPO   | 0.1144 s |
| LiPO   | 0.1154 s |
| IRPO   | 0.0313 s |

Since IRPO requires only a single forward pass, its per-sample runtime is roughly 4× faster.

Response to Weakness 3 and Question 3

IRPO inherently prioritizes relative comparisons over absolute weight magnitudes, making it less sensitive to specific weighting schemes. To validate this, we performed an additional ablation study evaluating two alternative positional weighting methods:

  • abl1: $w(i) = 1/\log(1 + i)$

  • abl2: $w(i) = (2^{y_i} - 1)/i$
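
For reference, a small sketch (illustration only; the exact form of the full weight that combines the relevance gain with a log discount is our assumption, not the paper's code) comparing the two ablated weightings to such a full DCG-style weight:

```python
# Sketch for illustration: the two ablated positional weightings above, next to an
# assumed full weight combining the relevance gain with a log-rank discount.
# i is the 1-indexed rank position; y is the graded relevance of the item at rank i.
import math

def w_full(i: int, y: float) -> float:   # assumed default: relevance gain with log discount
    return (2.0 ** y - 1.0) / math.log2(1.0 + i)

def w_abl1(i: int, y: float) -> float:   # abl1: log discount only (relevance term removed)
    return 1.0 / math.log(1.0 + i)

def w_abl2(i: int, y: float) -> float:   # abl2: relevance gain with a linear 1/i discount
    return (2.0 ** y - 1.0) / i

for i, y in [(1, 2.0), (2, 1.0), (3, 0.0)]:
    print(f"rank {i}: full={w_full(i, y):.3f}  abl1={w_abl1(i, y):.3f}  abl2={w_abl2(i, y):.3f}")
```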

We compared these variants using the Llama3 backbone across three benchmarks (one from each category).

Inspired

| Method | N@1 | N@5 | N@10 | R@1 | R@5 | R@10 |
|---|---|---|---|---|---|---|
| IRPO | 45.3 | 55.2 | 62.3 | 22.4 | 68.8 | 87.2 |
| IRPO (abl1) | 35.0 | 37.9 | 48.5 | 18.9 | 46.0 | 72.9 |
| IRPO (abl2) | 43.8 | 49.2 | 58.8 | 21.0 | 58.9 | 83.1 |

MuSiQue

| Method | N@1 | N@3 | N@5 | R@1 | R@3 | R@5 |
|---|---|---|---|---|---|---|
| IRPO | 65.1 | 59.4 | 69.4 | 27.6 | 59.2 | 78.0 |
| IRPO (abl1) | 50.9 | 54.0 | 65.8 | 21.0 | 56.9 | 78.9 |
| IRPO (abl2) | 68.3 | 62.8 | 72.7 | 29.0 | 62.9 | 81.5 |

ARC

| Method | N@1 | N@3 | N@5 | R@1 | R@3 | R@5 |
|---|---|---|---|---|---|---|
| IRPO | 27.4 | 38.9 | 46.5 | 27.4 | 47.6 | 66.3 |
| IRPO (abl1) | 9.4 | 17.2 | 23.4 | 9.4 | 22.9 | 38.2 |
| IRPO (abl2) | 27.1 | 37.8 | 46.1 | 27.1 | 45.8 | 66.3 |

These ablation results also demonstrate the importance of incorporating relevance into the positional weights: when the relevance term is removed in abl1, we observe a significant performance drop across all benchmarks.

Response to Question 1

As discussed in Section 5.2, we have already considered the scenario where multiple relevance labels exist per item. Regarding the design of IRPO, in Lines 38–40 we emphasized that modeling natural and flexible ranking feedback effectively requires capturing both item relevance and positional importance, which conventional DPO methods and their underlying preference models are limited in modeling directly. IRPO integrates both item-level relevance and positional importance (via DCG-based discounting) into a unified, differentiable surrogate objective.

Comment

Thanks for your detailed responses to my questions. Please include the ranked-list examples and the runtime comparison in the final draft for a clearer comparison. Overall, I am satisfied with the ablations, which indeed show the importance of relevance in the positional weights. I see that Section 5.2 does discuss multiple relevance labels per item, so Q1 is also addressed satisfactorily.

I will retain my current score as this is a good paper to accept.

Final Decision

This paper introduces In-context Ranking Preference Optimization (IRPO) by extending Direct Preference Optimization (DPO) to effectively learn from sparse, in-context ranked lists. A differentiable objective incorporating item relevance and positional importance is introduced, which aligns LLMs with more natural forms of user feedback. The method was shown to consistently outperform strong baselines like DPO and SDPO across diverse tasks, including conversational recommendation and question answering. The reviewers reached a strong consensus to accept the paper. Initial concerns raised by reviewers regarding the novelty compared to other methods (e.g., LiPO, OPO, DRPO) and claims of computational efficiency were thoroughly addressed during the author-reviewer discussion. The authors provided compelling new experimental results addressing these concerns. Overall, the work is original, effective, and makes meaningful contributions to the community.