Preference-based Reinforcement Learning beyond Pairwise Comparisons: Benefits of Multiple Options
We present the first theoretical analysis of PbRL with ranking feedback, showing that longer ranking feedback can provably improve sample efficiency.
Abstract
Reviews and Discussion
The paper proposes M-AUPO, a novel algorithm for preference-based reinforcement learning (PbRL) that leverages ranking feedback over multiple actions (beyond pairwise comparisons) under the Plackett-Luce (PL) model.
Strengths and Weaknesses
Strengths:
- First work to rigorously show that larger ranking feedback subsets improve sample efficiency in PbRL, with matching lower bounds.
- Addresses limitations of pairwise comparisons in RLHF and provides a principled way to exploit richer feedback.
- Validates theory on both synthetic and real-world data, outperforming baselines like DopeWolfe.
Weaknesses:
- The per-round cost of the rank-breaking (RB) loss grows with the assortment size $K$, which may limit scalability for very large $K$.
- Relies on linear reward models and the PL model for rankings; robustness to misspecification is unclear.
Questions
- The RB loss has tighter bounds but higher computational cost. In practice, how do you decide between PL and RB losses?
- Given the prominence of pairwise preference-based RLHF (e.g., PPO + BT model in [35] or DPO [38]), did you consider comparing M-AUPO against such methods?
- See the weaknesses.
Limitations
yes
Final Justification
Key Resolved Issues:
- Concerns about the baseline problem were addressed through additional experiments in the rebuttal.
- Concerns about "Robustness to Model Misspecification in Experiments" were addressed through extension to General Function Approximation in the rebuttal.
The paper’s strengths outweigh its weaknesses, justifying a score of Borderline accept.
Formatting Concerns
N/A
Thank you for your feedback and for engaging with our work.
The reviewer's main concerns are (i) computational efficiency for large $K$ and (ii) the linear reward assumption. However, $K$ is typically small in practice (e.g., rankings of length 4 to 9 in Ouyang et al., 2022), and the linear assumption is a standard and foundational choice in bandit and RL theory. Thus, these should not be viewed as weaknesses. We elaborate below.
Beyond the main concerns, we would like to reiterate the significance of our theoretical results:
- the first theoretical result showing improved sample efficiency with multiple-option feedback (to our best knowledge),
- the elimination of the exponential $e^B$ dependence common in prior PbRL—achieved without relying on any auxiliary techniques, such as self-sampling updates [1] or the use of a problem-dependent weighted covariance matrix [2] (whose weighting quantity is generally unknown in practice), and
- The sample efficiency of RB loss and PL loss is equivalent in terms of the leading-order term. This result provides theoretical justification for the common empirical practice of decomposing ranking feedback into pairwise comparisons (Ouyang et al., 2022).
- the first $K$-improving lower bound in PbRL.
For our first contribution, although it may seem intuitive that more options improve performance, no prior work has formally proven this—and most existing results actually degrade with larger $K$. Our result thus closes a long-standing gap between practical intuition and theory, marking a significant step forward for PbRL.
For our second contribution, removing the $e^B$ dependence is especially impactful, as many prior works failed to do so (e.g., Saha, 2021; Saha et al., 2023; Zhu et al., 2023; Xiong et al., 2023; Zhan et al., 2023). More importantly, the key lemmas used to eliminate the exponential dependence (Lemmas D.2 and E.1) are particularly significant, as they can be readily applied to many existing PbRL analyses without any modification to the underlying algorithms. This shows that the commonly observed $e^B$ dependence is not fundamental, but rather an artifact of loose analysis—an insight we find both surprising and broadly useful.
As we believe that all of your concerns have been thoroughly addressed and that our theoretical contributions are substantial, we respectfully ask the reviewer to reconsider the evaluation and recommend our paper for acceptance.
[1] Chen, Mingyu, et al. "Avoiding exp (rmax) scaling in rlhf through preference-based exploration." arXiv 2025.
[2] Di, Qiwei, Jiafan He, and Quanquan Gu. "Nearly Optimal Algorithms for Contextual Dueling Bandits from Adversarial Feedback." ICML 2025.
-
We Typically Consider Small-K Settings
The reviewer raised a concern about the large-$K$ scenario (i.e., when the maximum size of the assortment is large). However, in practice, small $K$ is typically used. For example, Ouyang et al. (2022) utilize ranking feedback of length 4 to 9.
Large-$K$ settings are generally unrealistic, as it is difficult to obtain accurate long rankings from human annotators. Therefore, we believe this concern should not be considered a weakness of our paper.
-
Linear Model Is a Foundational Starting Point
Assuming linear rewards is a widely adopted approach for analyzing the statistical efficiency of PbRL, including dueling bandits. This modeling choice has been used in numerous prior works—such as Saha (2021), Wu and Sun (2023), Saha et al. (2023), Zhu et al. (2023), Di et al. (2024), Thekumparampil et al. (2024), among others—as a foundational tool for theoretical analysis.
Therefore, we strongly believe that this assumption should not be regarded as a unique weakness of our work, nor as a reason for rejection. The linear assumption serves as a fundamental starting point for understanding how the learning mechanism operates in PbRL. If certain phenomena fail to hold in this foundational setting, it is difficult to expect them to hold in more complex scenarios. Hence, we believe that the contributions outlined above are both meaningful and valuable to the community.
-
Robustness to Model Misspecification in Experiments
Moreover, in our real-world experiment, there is indeed model misspecification: the features and the true rewards come from different models. Specifically, the features are extracted from the last hidden layer of Gemma-2B, while the true reward model is based on Mistral-7B. Despite this mismatch, our method still performs well, demonstrating a notable degree of robustness to model misspecification.
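For concreteness, below is a minimal sketch of how such fixed features might be extracted; the checkpoint id, the last-token pooling choice, and all function names are our illustrative assumptions, not the paper's exact pipeline.

```python
# Illustrative sketch (assumptions, not the paper's pipeline): extract a fixed
# feature vector for a prompt-response pair from the last hidden layer of a
# small language model; a linear reward model is then fit on these features.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "google/gemma-2b"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
encoder = AutoModel.from_pretrained(MODEL_ID, output_hidden_states=True)

@torch.no_grad()
def extract_feature(prompt: str, response: str) -> torch.Tensor:
    inputs = tokenizer(prompt + response, return_tensors="pt")
    outputs = encoder(**inputs)
    last_hidden = outputs.hidden_states[-1]   # shape: (1, seq_len, hidden_dim)
    return last_hidden[0, -1]                 # last-token embedding as the feature
```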
-
Extension to General Function Approximation
To further address the reviewer's concern about the linearity and PL model assumptions, we show that the MAU (Maximizing Average Uncertainty) method can be extended to general function classes (i.e., general preference models), following a similar approach to Theorem 4 of Das et al. (2024). Specifically, given a function class $\mathcal{F}$, we define a confidence set $\mathcal{C}_t(\mathcal{F}, \delta)$ and an exploration bonus
$$b_t(x, a, a') := \max_{f_1, f_2 \in \mathcal{C}_t(\mathcal{F}, \delta)} |f_1(x, a, a') - f_2(x, a, a')|.$$
We then select the reference action-assortment pair by maximizing the average bonus over the offered set:
$$(\bar{a}_t, S_t) = \arg\max_{\bar{a}, S \,:\, \bar{a} \in S} \frac{1}{|S|} \sum_{a \in S} b_t(x_t, a, \bar{a}).$$
Then, we can bound the suboptimality in terms of the covering number and the Eluder dimension of $\mathcal{F}$; the final inequality in the derivation follows from Lemma 2 of [1], with minor modifications. Therefore, this result suggests that multiple-option feedback can lead to improved sample complexity for general function classes (including LLMs) as well.
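For intuition, here is a minimal, purely illustrative sketch of this selection rule for a finite action set; the brute-force subset search and the generic `bonus(x, a, a_ref)` callable (standing in for $b_t$) are placeholders, not the paper's implementation.

```python
# Illustrative sketch: pick the (reference, assortment) pair that maximizes the
# average exploration bonus b_t(x, a, a_ref). Brute force over subsets of size K;
# the bonus function is a placeholder for a confidence-set-based bonus.
from itertools import combinations

def select_reference_and_assortment(x, actions, bonus, K):
    best_val, best_pair = float("-inf"), None
    for S in combinations(actions, K):       # candidate assortments of size K
        for a_ref in S:                      # reference must belong to S
            avg = sum(bonus(x, a, a_ref) for a in S) / len(S)
            if avg > best_val:
                best_val, best_pair = avg, (a_ref, S)
    return best_pair
```

In practice one would replace the exhaustive search with the cheaper variants discussed elsewhere in this thread (e.g., fixing the reference arbitrarily).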
We hope that this additional result addresses the reviewer's concern regarding the linear reward assumption and the PL model assumption.
[1] Russo, Daniel, and Benjamin Van Roy. "Eluder dimension and the sample complexity of optimistic exploration." NeurIPS 2013.
-
Q1. How do you decide between PL and RB losses?
Our theoretical results show that the suboptimality gaps of the PL and RB losses are comparable, especially as $T$ becomes large. While the RB loss may offer a slightly tighter non-leading term, this difference is negligible for large $T$. Therefore, both losses are statistically equivalent in the asymptotic regime.
However, the RB loss incurs a higher computational cost due to the rank-breaking procedure. Therefore, we recommend using the PL loss when handling ranking feedback (purely for its computational efficiency, not due to any statistical advantage).
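To make the trade-off concrete, here is a small sketch (not the paper's code) of the two losses for a single observed ranking under a linear reward model; the PL loss sums $K$ terms, whereas rank breaking sums over all $K(K-1)/2$ implied pairs, which reflects the computational gap mentioned above.

```python
# Illustrative sketch (not the paper's code): PL vs. rank-breaking (RB) losses for
# one ranking, assuming a linear reward r(a) = <phi(a), theta>. `ranking` lists
# item indices from most to least preferred.
import numpy as np

def pl_nll(theta, features, ranking):
    """Plackett-Luce negative log-likelihood of a full ranking (K terms)."""
    s = (features @ theta)[list(ranking)]
    nll = 0.0
    for k in range(len(s) - 1):
        tail = s[k:]
        m = tail.max()
        nll += m + np.log(np.exp(tail - m).sum()) - s[k]   # logsumexp(tail) - s[k]
    return nll

def rb_nll(theta, features, ranking):
    """Rank-breaking loss: Bradley-Terry losses over all K(K-1)/2 implied pairs."""
    s = (features @ theta)[list(ranking)]
    nll = 0.0
    for i in range(len(s)):
        for j in range(i + 1, len(s)):                     # item i beats item j
            nll += np.log1p(np.exp(-(s[i] - s[j])))
    return nll
```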
-
Q2. Did you consider comparing M-AUPO against such methods (e.g., PPO + BT model in [35] or DPO [38])?
The methods proposed in Ouyang et al. (2022) and Rafailov et al. (2023) operate purely in the offline setting—they optimize over a fixed dataset and do not make decisions about which actions to collect next. In contrast, M-AUPO is designed for the online setting, where the learner must actively select actions (or assortments) at each round. Therefore, a direct comparison would be methodologically inappropriate.
Instead, we compare against two baselines more relevant to the online RLHF setting:
- "Uniform", which selects actions uniformly at random (already included in Appendix I, and will be moved to the main paper in the revision), and
- "Best&Ref", which selects an action pair () using the current policy and a reference policy (e.g., uniform random or SFT), respectively—following a selection mechanism similar to those in Online GSHF [1] and XPO [2]..
We believe "Best&Ref" is the most appropriate baseline for online RLHF, as it closely mirrors how actions are selected in many existing online RLHF algorithms. Note again that DPO is best understood as an optimization algorithm for the offline setting, and thus not directly comparable in terms of online action selection or assortment design.
The results demonstrate that our method either consistently and significantly outperforms all other baselines or is second-best with a minimal gap. (PL loss only, due to limited space.)
i) Synthetic experiment
| K | DopeWolfe | Uniform | Best&Ref | M-AUPO (ours) |
|---|---|---|---|---|
| 2 | 0.1629 | 0.1671 | 0.1575 | 0.0396 |
| 3 | 0.1587 | 0.1654 | 0.1482 | 0.0331 |
| 5 | 0.1539 | 0.0797 | 0.1534 | 0.0244 |
| 7 | 0.1547 | 0.0561 | 0.1551 | 0.0204 |
ii) Real-world experiment
ii-1) TREC-DL dataset
| K | DopeWolfe | Uniform | Best&Ref | M-AUPO (ours) |
|---|---|---|---|---|
| 2 | 6.115 | 3.579 | 3.465 | 3.204 |
| 3 | 5.999 | 2.964 | 3.458 | 2.381 |
| 5 | 5.971 | 2.548 | 3.430 | 2.052 |
ii-2) Nectar dataset
| K | DopeWolfe | Uniform | Best&Ref | M-AUPO (ours) |
|---|---|---|---|---|
| 2 | 1.418 | 1.279 | 1.291 | 1.285 |
| 3 | 1.246 | 1.243 | 1.275 | 1.019 |
| 5 | 1.232 | 1.116 | 1.257 | 0.997 |
[1] Xiong, Wei, et al. "Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-constraint." ICML 2024
[2] Xie, Tengyang, et al. "Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF." ICLR 2025
Please don’t hesitate to reach out if you have any further questions!
I thank the authors for their comprehensive responses, and they have mostly addressed my concerns. Based on this, I will raise the score to 4.
We deeply appreciate your decision to raise the score.
We’re also truly glad to hear that our responses have successfully addressed your concerns. Once again, we are sincerely grateful for your support and thoughtful engagement throughout the review process. It has truly meant a great deal to us.
Dear Reviewer h2r7,
Thank you very much for your time and thoughtful evaluation of our work. As the discussion period comes to a close, we would like to kindly follow up on our rebuttal.
We hope that our responses have addressed your concerns. In particular, we respectfully ask that our use of linear models not be seen as a limitation. We believe that our theoretical contributions—the improved sample efficiency for longer ranking feedback and the elimination of the $e^B$ dependence in the leading term (which also extends to existing PbRL and dueling bandits)—are substantial and merit acceptance.
If any points remain unclear or merit further discussion, we would be more than happy to elaborate. We believe that a brief exchange at this stage could help resolve any remaining uncertainties and potentially contribute to a more favorable outcome.
Thank you once again for your time and consideration—we greatly value your input.
Best regards,
The Authors
This paper proposes M-AUPO, a novel algorithm that selects action subsets by maximizing average uncertainty, to enhance sample efficiency for online PbRL. The authors theoretically prove a suboptimality gap bound and a lower bound for M-AUPO, avoiding the exponential dependence on parameter norms. Experiments on synthetic and real-world datasets validate its effectiveness.
Strengths and Weaknesses
Strengths:
- The theoretical derivation is sufficient.
- The paper is well organized and easy to follow.
Weaknesses:
- In RLHF, it is no surprise that increasing the number of options for the same context leads to performance improvements. For example, for RB methods, this approach means more data, so the reward model will be more accurate. What I find more interesting is the discussion of whether to increase the number of questions or the number of options per question for a fixed amount of total data. For example, the first set of experiments has N questions with M options per question; the second set has N/2 questions with 2M options per question; which one is better?
- The experimental part is too simple. 1. Missing important baselines, e.g., other algorithms that can handle pairwise comparisons in the RB setting, such as DPO; 2. Missing DopeWolfe vs. M-AUPO comparisons with varying K (the last subfigure in Figures 1 and 2).
- Lack of hyperparameter analysis. The paper does not report the values of the regularization parameter and step size used in the experiments, and it lacks tuning or ablation studies on these hyperparameters.
Questions
- See weaknesses.
- The denominator of Eq. (1) appears to be incorrect, right?
Limitations
Suggest adding a limitation paragraph about the effect of inherent limitations of PL or BT assumptions on RLHF.
Final Justification
The authors respond to my question about whether longer or shorter options are better using Theorem 3, clarify the choice of baselines, and provide additional explanations about hyperparameters. The authors address all of my concerns.
Formatting Concerns
N/A
We appreciate your time and thoughtful feedback in reviewing our paper.
Before addressing the reviewer's specific concerns, we would like to emphasize the importance of our theoretical contributions, which we believe are substantial but may have been somewhat underappreciated in your initial evaluation. Our key contributions and insights are as follows:
- the first theoretical result showing improved sample efficiency with multiple-option feedback (to our best knowledge),
- the elimination of the exponential $e^B$ dependence common in prior PbRL and dueling bandit literature—achieved without relying on any complex or auxiliary techniques, such as self-sampling updates [1] or the use of a problem-dependent weighted covariance matrix [2] (whose weighting quantity is generally unknown in practice), and
- The sample efficiency of RB loss and PL loss is equivalent in terms of the leading-order term. This result provides theoretical justification for the common empirical practice of decomposing ranking feedback into pairwise comparisons, as done in works such as Ouyang et al. (2022).
- the first $K$-improving lower bound in PbRL.
From a theoretical perspective especially, the key lemmas used to eliminate the exponential $e^B$ dependence (Lemmas D.2 and E.1) are particularly significant, as they can be readily applied to many existing PbRL analyses—provided elliptical potential lemmas are used—without any modification to the underlying algorithms. This demonstrates that the widely observed $e^B$ dependence was not a fundamental limitation, but rather an artifact of overly loose analysis. We believe this insight is both surprising and broadly useful.
Given the novelty, generality, and potential impact of our theoretical contributions, we kindly ask the reviewer to reconsider their evaluation and support our paper for acceptance.
[1] Chen, Mingyu, et al. "Avoiding exp (rmax) scaling in rlhf through preference-based exploration." arXiv (2025).
[2] Di, Qiwei, Jiafan He, and Quanquan Gu. "Nearly Optimal Algorithms for Contextual Dueling Bandits from Adversarial Feedback." ICML 2025.
-
First Rigorous Theoretical Justification of the Insight!
While it may appear intuitive that offering more options per round should improve performance (though it is unclear whether this is a widely recognized insight), no prior work has formally established this. In fact, most existing theoretical results worsen as $K$ increases, contradicting this intuition. Our result therefore closes a long-standing gap between intuition and theory, representing a meaningful step forward for the PbRL literature. We believe this contribution should be viewed as a strength—providing a complete theoretical justification of the intuition—rather than as a weakness.
Furthermore, we would like to clarify that—even though ranking feedback can potentially provide more data—it is not obvious whether this feedback contains sufficient information to yield meaningful information gain. For example, under the RB loss, if many item pairs are very similar in preference, the gain from such feedback is not guaranteed. Our paper explicitly addresses this issue by studying how to maximize the average uncertainty in the offered options, a challenge that we believe has not been adequately explored in prior work.
-
Longer Options Are No Worse Than Shorter Options
We address your question comparing the following two settings:
- Option 1: $N$ questions with $M$ options each
- Option 2: $N/2$ questions with $2M$ options each
Our upper and lower bounds yield the corresponding suboptimality ranges for each of the two options.
These bounds suggest that the longer-option case (Option 2) is not worse than the shorter-option case (Option 1) in terms of sample efficiency. However, due to the remaining gap between our upper and lower bounds (i.e., the optimal rate is not yet identified), we cannot definitively say which option is always better. It is possible that Option 2 is better, as its lower bound is strictly smaller. Alternatively, the two settings may be equivalent if the current upper bound is tight.
We believe that rigorously closing this gap is an interesting and important direction for future work, and we thank the reviewer for raising this constructive point. That said, even with the current bounds, we find the result valuable, as it suggests that either setting can be used depending on practical convenience, with comparable (or similar) statistical efficiency.
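As a purely heuristic illustration of why the two options are comparable under our upper bound—assuming, as a rough reading only, that the leading term is governed by the total amount of ranking feedback $\sum_t |S_t|$—note that both options collect the same total number of ranked items:

$$\sum_{t=1}^{N} M \;=\; NM \quad \text{(Option 1)}, \qquad \sum_{t=1}^{N/2} 2M \;=\; NM \quad \text{(Option 2)}.$$

Under this reading the upper bounds coincide, while the lower bound additionally improves with the per-round ranking length, which is why Option 2 cannot be worse.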
-
Additional Experiments
In our experiments, we made every effort to include all relevant baselines and to cover a wide range of scenarios. We hope the reviewer finds our experimental evaluation sufficiently thorough.
1. M-AUPO vs. Other Baselines
In Appendix I, we have already included an additional comparison with a "Uniform" baseline, which selects actions uniformly at random (to be moved to the main paper in the revision). To provide a more comprehensive evaluation, we now also include:
- "Uniform", which selects actions uniformly at random (already included in Appendix I, and will be moved to the main paper in the revision), and
- "Best&Ref", which selects an action pair () using the current policy and a reference policy (e.g., uniform random or SFT), respectively—following a selection mechanism similar to those in Online GSHF [1] and XPO [2].
We believe that "Best&Ref" is the most relevant baseline for online RLHF, as it reflects how many online RLHF algorithms select actions in practice. Note that a direct comparison with DPO is not appropriate, since DPO is best understood as an optimization algorithm for the offline setting, rather than an action (or assortment) selection strategy for the online setting.
The results demonstrate that our method either significantly outperforms all other baselines or is second-best with a minimal gap. (PL loss only, due to limited space.)
i) Synthetic experiment
| K | DopeWolfe | Uniform | Best&Ref | M-AUPO (ours) |
|---|---|---|---|---|
| 2 | 0.1629 | 0.1671 | 0.1575 | 0.0396 |
| 3 | 0.1587 | 0.1654 | 0.1482 | 0.0331 |
| 5 | 0.1539 | 0.0797 | 0.1534 | 0.0244 |
| 7 | 0.1547 | 0.0561 | 0.1551 | 0.0204 |
ii) Real-world experiment
ii-1) TREC-DL dataset
| K | DopeWolfe | Uniform | Best&Ref | M-AUPO (ours) |
|---|---|---|---|---|
| 2 | 6.115 | 3.579 | 3.465 | 3.204 |
| 3 | 5.999 | 2.964 | 3.458 | 2.381 |
| 5 | 5.971 | 2.548 | 3.430 | 2.052 |
ii-2) Nectar dataset
| K | DopeWolfe | Uniform | Best&Ref | M-AUPO (ours) |
|---|---|---|---|---|
| 2 | 1.418 | 1.279 | 1.291 | 1.285 |
| 3 | 1.246 | 1.243 | 1.275 | 1.019 |
| 5 | 1.232 | 1.116 | 1.257 | 0.997 |
[1] Xiong, Wei, et al. "Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-constraint." ICML 2024
[2] Xie, Tengyang, et al. "Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF." ICLR 2025
2. Varying Values of K Are Already Included in Appendix I
A comparison between DopeWolfe and M-AUPO across varying values of $K$ is already included in Appendix I. Moreover, in the synthetic experiments, we evaluate our algorithm under diverse context structures. The results demonstrate the superior performance of our method across various settings. We also include runtime tables—covering both synthetic and real-world experiments—which highlight the computational efficiency of our algorithm. Please refer to Appendix I for further details.
-
We Use Exact Theoretical Parameter Values in All Experiments
In all of our experiments, we used the exact parameter values (e.g., the regularization parameter, step size, etc.) as prescribed by the theoretical results. For reference, please see the notation summary in Table C.1. Therefore, ablation studies on these parameters are unnecessary.
One key advantage of using theoretically grounded algorithms is that the prescribed parameter values can be directly applied in practice, without the need for exhaustive hyperparameter tuning. We will clarify this point more explicitly in the final version of the paper.
- (minor): As you pointed out, the denominator in Eq. (1) was indeed incorrect. In fact, we had already made the correction in the supplementary appendix, but we still appreciate your attention to detail—thank you for pointing it out.
Please feel free to reach out if you have any further questions or require clarification on any point!
Thank you for your detailed rebuttal. Upon revisiting the paper in light of your clarifications, I recognize the significance of the theoretical contributions, particularly how Theorem 3 enables the comparison between longer and shorter options. The analysis provides valuable insights, and your response has satisfactorily addressed my initial concerns. Accordingly, I raise my score to 5.
Best regards,
Reviewer fSmY
Thank you very much for your strong support of our work!
We are truly glad that the significance of our contributions came across clearly. It is always incredibly rewarding—and deeply encouraging—when our efforts are understood and appreciated. We are also sincerely grateful that you took the time to revisit and thoughtfully engage with our work.
Dear Reviewer fSmY,
We sincerely appreciate your time and thoughtful evaluation of our work. As the discussion period draws to a close, we would like to kindly follow up on our rebuttal.
We hope that our responses have addressed your concerns and that the significance of our theoretical contributions is recognized. If any aspects remain unclear or merit further discussion, we would be more than happy to provide additional clarification. We believe that a brief exchange at this stage could help resolve any outstanding uncertainties and contribute to a more informed final assessment.
We would also be truly grateful for any additional feedback or follow-up questions you may have.
Thank you once again for your time and consideration—we greatly value your input.
Best regards,
The Authors
The paper aims to design an algorithm for learning from multiple-option feedback in an online PbRL setting with stronger theoretical guarantees. The paper proposes M-AUPO, which uses online mirror descent on the comparison loss and selects an assortment of actions by maximizing the feature uncertainty relative to a reference. The paper provides a suboptimality gap bound and performs experiments on synthetic data and real-world data with fixed features.
Strengths and Weaknesses
Strengths:
- The paper considers a novel setting of multi-option PbRL where contexts are drawn from an unknown distribution and the assortment is selected.
- Background on PbRL and the algorithm are clearly presented.
- The paper proposes a new algorithm based on selecting a diverse set of actions and using an online mirror descent algorithm with theoretical guarantees.
- The paper performs numerical experiments following the theoretical setting and with real-world data.
Weaknesses:
- Full proofs are not provided.
- It seems it could be difficult to generalize M-AUPO to more practical settings beyond a linear model.
- The full bound consists of a leading term and an additional term with an $e^B$ dependency that decays faster with $T$. The second term is considered negligible for large enough $T$, but if $T$ is taken large enough for this term to be neglected, then under the same setting the bounds from existing work with $e^B$ dependency would actually be better than the bound presented, especially since the provided bound carries an additional factor. Based on this, the claims can be misleading and it is unclear whether the suboptimality of M-AUPO is better in practice.
- Experiments only compare to DopeWolfe.
- The synthetic experiments work with a fixed $B$, and the decrease in sub-optimality seems relatively small with increasing $K$; based on the bound, the effect of $K$ could be smaller with larger $B$. It would be useful to see how this generalizes with varying $B$.
- The real world setting uses fixed features which differs from most practical settings which often train a full model or train at least multiple layers of a model. As a result it seems difficult to be able to make claims about the effect of multi-option feedback for tasks such as training a language model based on the analysis.
Questions
- Can you provide synthetic experiments with varying B and also provide more details such as what the norms of the features look like in the real-world setting?
- Can you clarify why the provided bound is an improvement? It seems the second term alone, for larger $B$, is weaker than existing bounds.
- Can you provide comparison to other methods beyond DopeWolfe? In particular, it would be good to see comparisons of sub-optimality gap to the methods presented in Table 1.
- Would it be possible to extend the experiments/method beyond a linear model?
- Can you provide the proofs for all results?
Having the full proofs and clarifications about the provided bound in particular would increase the score.
Limitations
No; limitations around the method and analysis being restricted to linear models could be more explicitly presented, and limitations of the bound under different conditions (larger B, smaller T) should be addressed.
Final Justification
The authors provided clarifications showing that their results are novel and can be verified. Additional discussion on implications has been provided, but significant revision would be required.
Formatting Concerns
No formatting concerns
We appreciate the time you took to review our paper and your valuable feedback.
Before addressing the reviewer’s concerns, we would like to re-emphasize the significance of our theoretical contributions. Our primary contributions and interesting insights are:
- the first theoretical result showing improved sample efficiency with multiple-option feedback (to our best knowledge),
- the elimination of the exponential $e^B$ dependence common in prior PbRL and dueling bandit literature—achieved without relying on any complex or auxiliary techniques, such as self-sampling updates [1] or the use of a problem-dependent weighted covariance matrix [2] (whose weighting quantity is generally unknown in practice), and
- The sample efficiency of RB loss and PL loss is equivalent in terms of the leading-order term. This result provides theoretical justification for the common empirical practice of decomposing ranking feedback into pairwise comparisons, as done in works such as Ouyang et al. (2022).
- the first $K$-improving lower bound in PbRL.
From a theoretical perspective, the key lemmas used to eliminate the exponential $e^B$ dependence (Lemmas D.2 and E.1) are particularly significant, as they can be readily applied to many existing PbRL analyses—provided elliptical potential lemmas are used—without any modification to the underlying algorithms. This demonstrates that the widely observed $e^B$ dependence was not a fundamental limitation, but rather an artifact of overly loose analysis—an insight we find both surprising and highly impactful.
We believe these contributions mark a significant step forward in the theory of PbRL and are sufficient to merit acceptance. We also address all of the reviewer’s concerns below and hope the reviewer will reconsider our paper in light of these points.
[1] Chen, Mingyu, et al. "Avoiding exp (rmax) scaling in rlhf through preference-based exploration." arXiv (2025).
[2] Di, Qiwei, Jiafan He, and Quanquan Gu. "Nearly Optimal Algorithms for Contextual Dueling Bandits from Adversarial Feedback." ICML 2025.
-
Full Proofs Are Provided in the Appendix
We clarify that all theoretical proofs have already been provided in the appendix, which was submitted as part of the supplementary material in accordance with NeurIPS 2025 guidelines.
-
Linear Model Is a Foundational Starting Point
Assuming linear rewards (with fixed features) is a widely adopted approach for analyzing the statistical efficiency of PbRL, including dueling bandits. This modeling choice has been used in numerous prior works—such as Saha (2021), Wu and Sun (2023), Saha et al. (2023), Zhu et al. (2023), Di et al. (2024), Thekumparampil et al. (2024), among others—as a foundational tool for theoretical analysis.
Therefore, we strongly believe that this assumption should not be regarded as a unique weakness of our work, nor as a reason for rejection. The linear assumption serves as a fundamental starting point for understanding how the learning mechanism operates in PbRL. If certain phenomena fail to hold in this foundational setting, it is difficult to expect them to hold in more complex scenarios. Hence, we believe that the contributions outlined above are both meaningful and valuable to the community.
-
Extension to General Function Approximation
To further address the reviewer's concern about the linearity and fixed feature assumptions, we show that the MAU (Maximizing Average Uncertainty) method can be extended to general function classes, following a similar approach to Theorem 4 of Das et al. (2024). Specifically, given a function class $\mathcal{F}$, we define a confidence set $\mathcal{C}_t(\mathcal{F}, \delta)$ and an exploration bonus:
$$b_t(x, a, a') := \max_{f_1, f_2 \in \mathcal{C}_t(\mathcal{F}, \delta)} |f_1(x, a, a') - f_2(x, a, a')|.$$
We then select the reference action-assortment pair by maximizing the average bonus over the offered set:
$$(\bar{a}_t, S_t) = \arg\max_{\bar{a}, S \,:\, \bar{a} \in S} \frac{1}{|S|} \sum_{a \in S} b_t(x_t, a, \bar{a}).$$
Then, we can bound the suboptimality in terms of the covering number and the Eluder dimension of $\mathcal{F}$; the final inequality in the derivation follows from Lemma 2 of [1], with minor modifications. Therefore, this result suggests that multiple-option feedback can lead to improved sample complexity for general function classes as well.
We hope that this additional result addresses the reviewer's concern regarding the linear reward assumption and the fixed feature assumption.
[1] Russo, Daniel, and Benjamin Van Roy. "Eluder dimension and the sample complexity of optimistic exploration." NeurIPS 2013.
-
Our Bound Is a Clear Improvement
It seems there is a possibility that readers may misunderstand the prior bounds presented in Table 1. To clarify, we highlight two key points:
- Prior works also incur $e^B$ dependence: for example, the bounds of Zhu et al. (2023), Mukherjee et al. (2024), and Thekumparampil et al. (2024) all contain terms that imply an $e^B$ dependence. Thus, the second term in our bound is NOT weaker than theirs.
- The Sq. Pred. Error measure used in Mukherjee et al. (2024) and Thekumparampil et al. (2024) roughly corresponds to the square of our suboptimality: taking square roots, their bounds become worse than ours.
Therefore, our bound is a strict improvement, and even the second term alone is NOT weaker than existing results for large $B$. We will make this clarification more explicit in the revised version. Thank you for pointing this out.
-
Additional Experiments
1. vs. Other Baselines
In Appendix I, we have already included an additional comparison with a "Uniform" baseline, which selects actions uniformly at random (to be moved to the main paper in the revision). To provide a more comprehensive evaluation, we now also include:
- "Uniform", which selects actions uniformly at random (already included in Appendix I), and
- "Best&Ref", which selects an action pair () using the current policy and a reference policy (e.g., uniform random or SFT), respectively—following a selection mechanism similar to those in Online GSHF [1] and XPO [2].
The results demonstrate that our method either significantly outperforms all other baselines or is second-best with a minimal gap. (PL loss only, due to limited space.)
i) Synthetic experiment
| K | DopeWolfe | Uniform | Best&Ref | M-AUPO (ours) |
|---|---|---|---|---|
| 2 | 0.1629 | 0.1671 | 0.1575 | 0.0396 |
| 3 | 0.1587 | 0.1654 | 0.1482 | 0.0331 |
| 5 | 0.1539 | 0.0797 | 0.1534 | 0.0244 |
| 7 | 0.1547 | 0.0561 | 0.1551 | 0.0204 |
ii) Real-world experiment
ii-1) TREC-DL dataset
| K | DopeWolfe | Uniform | Best&Ref | M-AUPO (ours) |
|---|---|---|---|---|
| 2 | 6.115 | 3.579 | 3.465 | 3.204 |
| 3 | 5.999 | 2.964 | 3.458 | 2.381 |
| 5 | 5.971 | 2.548 | 3.430 | 2.052 |
ii-2) Nectar dataset
| K | DopeWolfe | Uniform | Best&Ref | M-AUPO (ours) |
|---|---|---|---|---|
| 2 | 1.418 | 1.279 | 1.291 | 1.285 |
| 3 | 1.246 | 1.243 | 1.275 | 1.019 |
| 5 | 1.232 | 1.116 | 1.257 | 0.997 |
The reviewer also suggested comparing sub-optimality gaps with the methods in Table 1. However, Zhu et al. (2023) and Mukherjee et al. (2024) assume fixed (given) assortments, a different and simpler setting that does not account for interactive decision-making—crucial in real-world tasks like web search, where only the top-ranked results can be labeled. Therefore, a direct comparison with these methods is either infeasible or inappropriate.
[1] Xiong, Wei, et al. "Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-constraint." ICML 2024
[2] Xie, Tengyang, et al. "Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF." ICLR 2025
2. Robustness to Varying B
In response to the reviewer's request, we report suboptimality across varying values of $B$ in our synthetic experiments. As shown below, our algorithm is quite robust to the scale of $B$ in practice, e.g., for $B \in \{1, 3, 5, 10\}$.
i) $K = 2$
| B | DopeWolfe | Uniform | Best&Ref | TopK | M-AUPO (ours) |
|---|---|---|---|---|---|
| 1.0 | 0.1629 | 0.1671 | 0.1575 | 0.1260 | 0.0396 |
| 3.0 | 0.1829 | 0.1673 | 0.1598 | 0.1349 | 0.0425 |
| 5.0 | 0.1859 | 0.1664 | 0.1587 | 0.1431 | 0.0482 |
| 10.0 | 0.1821 | 0.1686 | 0.1624 | 0.1476 | 0.0543 |
ii) $K = 5$
| B | DopeWolfe | Uniform | Best&Ref | TopK | M-AUPO(ours) |
|---|---|---|---|---|---|
| 1.0 | 0.1539 | 0.0797 | 0.1534 | 0.1026 | 0.0244 |
| 3.0 | 0.1623 | 0.0760 | 0.1550 | 0.0936 | 0.0269 |
| 5.0 | 0.1698 | 0.0785 | 0.1576 | 0.1043 | 0.0286 |
| 10.0 | 0.1766 | 0.0812 | 0.1618 | 0.1081 | 0.0395 |
Q. What do the norms of the features look like in the real-world setting?
In our real-world experiments, we use the reward from Mistral-7B as the true reward signal, which ranges from approximately –20 to 20. Since we assume the feature norms are bounded by 1, this implies that the norm bound on $\theta^*$ (i.e., $B$) is approximately on the order of 20.
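As a quick worked check of this scale (our own back-of-the-envelope reasoning, under the standard normalization $\|\phi(x,a)\|_2 \le 1$):

$$|r(x,a)| = \bigl|\phi(x,a)^{\top}\theta^*\bigr| \le \|\phi(x,a)\|_2\,\|\theta^*\|_2 \le B, \qquad\text{so}\quad \max_{x,a}|r(x,a)| \approx 20 \;\Longrightarrow\; B \gtrsim 20.$$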
We welcome any additional questions!
- Thank you for the clarification and pointers to the appendix and I apologize for my earlier error. This addresses my concerns about the proofs and experiments.
- I appreciate the experiments and they have addressed my concern about robustness to B.
- I would like to clarify that my concern on the linear model is not about its simplicity or the extent of the theoretical analysis but rather the method proposed. The method itself is restricted to fixed features, and while I appreciate the theoretical analysis for general functions, the extension of the method appears to be infeasible for practical use. These limitations should be explicitly mentioned, and further discussion on how this method and the idea of maximizing uncertainty could inform preference learning for models in practice would make the contributions of the ideas proposed more clear.
- While the bound is an improvement, the claim about removing the $e^B$ dependence still seems misleading: especially with larger $B$, $T$ would need to be extremely large before the leading term is actually the larger one, which seems like an impractical amount of training. It should be clarified that the $e^B$ dependence is removed from the leading term.
Based on this, I believe the theoretical analysis and verification are comprehensive. However, the limitations of the method need to be more explicitly mentioned in an additional limitations section and also throughout the paper. Further discussion on the implications of the method for practical preference learning would also be valuable.
I will update the score accordingly.
We sincerely thank you for taking the time to carefully read our lengthy rebuttal. We're very glad to hear that you recognized the significance of our theoretical contributions and chose to update your score. (While we cannot yet see the updated score, we truly appreciate your support for our work!)
We’re very glad to have addressed your remaining concerns.
Maximizing Average Uncertainty Beyond Linear Model
Thank you for raising this point. We now understand that your concern pertains to the practicality of our method for selecting assortments (i.e., sets of actions) in settings beyond the linear model.
One promising direction is to leverage the neural tangent kernel (NTK) framework [1]. Specifically, following the neural contextual bandit literature [2,3] and recent advances in neural dueling bandits [4], we can estimate the uncertainty between actions $a$ and $a'$ given context $x_t$ as:
$$\left\| g(x_t, a; \theta_t) - g(x_t, a'; \theta_t) \right\|_{H_t^{-1}},$$
where $g(x, a; \theta)$ denotes the gradient of the neural network output with respect to its parameters $\theta$, and $H_t$ is a regularized Fisher or NTK Gram matrix.
We can then select the reference-assortment pair as follows:
$$(\bar{a}_t, S_t) = \arg\max_{\bar{a}, S \,:\, \bar{a} \in S} \frac{1}{|S|} \sum_{a \in S} \left\| g(x_t, a; \theta_t) - g(x_t, \bar{a}; \theta_t) \right\|_{H_t^{-1}}.$$
Note that this is a high-level idea, and formalizing it would involve several technical challenges. That said, if this method is feasible, one particularly appealing aspect is that it may eliminate the need for a confidence radius term—and consequently, it would NOT require knowledge of the effective dimension, which is typically difficult to estimate in practice. This highlights the potential practicality of our approach based on maximizing average uncertainty.
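To illustrate, the following sketch computes this NTK-style uncertainty for one action pair with a toy network; the network, the identity-initialized $H_t^{-1}$, and all function names are our illustrative assumptions, and maintaining $H_t$ across rounds is omitted.

```python
# Minimal sketch of the NTK-style pairwise uncertainty described above.
# g(x, a; theta) is the gradient of the scalar network output w.r.t. parameters;
# H_inv stands in for a maintained regularized (Fisher / NTK Gram) matrix inverse.
import torch

def param_grad(net, feat):
    """Flattened gradient of net(feat) with respect to all network parameters."""
    out = net(feat).squeeze()
    grads = torch.autograd.grad(out, list(net.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

def pairwise_uncertainty(net, feat_a, feat_ref, H_inv):
    """|| g(x, a; theta) - g(x, a_ref; theta) ||_{H^{-1}} for one action pair."""
    diff = param_grad(net, feat_a) - param_grad(net, feat_ref)
    return torch.sqrt(diff @ H_inv @ diff).item()

# Toy usage (all shapes and the identity H_inv are placeholders):
net = torch.nn.Sequential(torch.nn.Linear(8, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))
n_params = sum(p.numel() for p in net.parameters())
H_inv = torch.eye(n_params)
u = pairwise_uncertainty(net, torch.randn(8), torch.randn(8), H_inv)
```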
We believe this is a very promising and important direction for future research. We plan to include a clearer discussion of this idea, along with its current limitations, in our revision. Thank you again for your insightful and constructive feedback—it will undoubtedly help advance the field!
On the other hand, we would like to emphasize that all existing methods in PbRL also come with their own practical limitations. For example, XPO (Xie et al., 2025), though relatively easy to implement, requires knowledge of the coverability coefficient—a quantity that is typically unknown in practice. (In contrast, our method does not require access to any such complexity measure.) Moreover, Online GSHF (Xiong et al., 2024) provides theoretical guarantees only under linear reward assumptions (as we do), but employs a heuristic method in experiments that has no theoretical justification.
Thus, limited practicality is not a unique drawback of our approach, but rather a prevalent challenge across PbRL theory. In particular, quantifying uncertainty beyond the linear model remains a long-standing and open problem, and addressing it is crucial for advancing the field.
[1] Jacot, Arthur, Franck Gabriel, and Clément Hongler. "Neural tangent kernel: Convergence and generalization in neural networks." Advances in neural information processing systems 31 (2018).
[2] Zhou, Dongruo, Lihong Li, and Quanquan Gu. "Neural contextual bandits with ucb-based exploration." International conference on machine learning. PMLR, 2020.
[3] Zhang, Weitong, et al. "Neural Thompson Sampling." International Conference on Learning Representation (ICLR). 2021.
[4] Verma, Arun, et al. "Neural Dueling Bandits: Preference-Based Optimization with Human Feedback." The Thirteenth International Conference on Learning Representations.
Dependency on B in the Lower-order Term
First, we would like to correct a minor typo in how the lower-order term of our bound was written in your comment.
As you noted, when $B$ is very large and $T$ is small, this term can dominate and lead to large suboptimality gaps. However, we emphasize the following:
- The exponential dependence on $B$ in the lower-order term is not unique to our work. It appears in nearly all related literature—including logistic bandits (Faury et al., 2020; 2022; Abeille et al., 2021), GLM bandits (Lee et al., 2024; Sawarni et al., 2024), MNL bandits (Zhang and Sugiyama, 2024; Lee and Oh, 2024), and dueling bandits (Di et al., 2025; Chen et al., 2025). This dependence is widely acknowledged as unavoidable, due to the intrinsic non-linearity of sigmoid and softmax functions.
- Importantly, this exponential term is overly conservative in practice. For instance, in our real-world experiments where $B$ is on the order of 20, the worst-case theory would suggest a prohibitively large sample complexity—yet our algorithm performs well with only 20,000 rounds. This shows that our method remains practical even in large-$B$ regimes.
To summarize, the remaining concerns—(1) uncertainty quantification under general function classes, and (2) the $e^B$ dependence in the lower-order term—are not unique limitations of our paper. Rather, they are prevalent challenges in the existing literature and reflect long-standing open problems in the field. We respectfully hope that these shared limitations, which are well-known and broadly present in prior works, will not serve as the sole basis for a negative assessment of our contribution.
Thank you once again for your time and thoughtful review. We would be glad to provide any additional clarification if needed.
Thank you for the comments; they have mostly addressed my concerns. I would like to note that, for the $e^B$ dependence, the time it takes for the second term to become smaller than the first term is roughly $O(e^{8B} d^2 / T)$, since the leading term roughly differs in $T$ by a factor of $\sqrt{T}$. However, my main point was that it should be clarified in the writing that the removal of the $e^B$ dependence is in the leading term, not that this is a key limitation of the theory.
We sincerely appreciate your time and patience in reading through our lengthy response once again, and we’re glad to hear that most of your concerns have been addressed.
We will make sure to clarify the existence of the $e^B$ dependence in the lower-order term in our revision. This clarification will improve the clarity of the paper and help avoid any potential misunderstanding. Thank you again for your constructive feedback!
This paper proposes selecting multiple actions in preference-based learning to improve sampling efficiency. The authors provide theoretical guarantees for the proposed algorithm, showing a suboptimality gap that shrinks with the size $|S_t|$ of the action subset offered at round $t$. This result eliminates the exponential dependence on the parameter norm bound $B$. The authors also establish a theoretical lower bound that matches the upper bound.
Strengths and Weaknesses
Strengths:
- This work is the first theoretical guarantee that larger subsets directly improve sample efficiency in PbRL.
- This result eliminates the exponential dependence on the parameter norm bound $B$, and previous related works that eliminate this dependence only considered two options.
Weaknesses:
- The parameter $B$ is the upper bound on the norm of $\theta^*$ in the linear reward model, i.e., $\|\theta^*\|_2 \le B$. Generally, it is a small constant. Therefore, the theoretical improvement does not appear to be particularly significant.
- Each iteration requires solving the optimization problem (9), and the authors report its time complexity. When the number of arms is large, the algorithm incurs a high computational cost.
- The paper assumes a finite number of arms. Under the same setting (linear rewards with a finite number of arms), previous works [1] and [2] have provided regret upper bounds (and corresponding suboptimality gaps) that are better than the bound presented in this paper.
[1] Saha, A. (2021). Optimal algorithms for stochastic contextual preference bandits. Advances in Neural Information Processing Systems 34 30050–30062.
[2] Bengs, V., Saha, A. and Hüllermeier, E. (2022). Stochastic contextual dueling bandits under linear stochastic transitivity models. In International Conference on Machine Learning. PMLR.
Questions
- In the field of contextual dueling bandits, theoretical guarantees are also often provided in terms of regret. The authors use the weaker notion of suboptimality gap. Is there an inherent difficulty in this work to provide a regret analysis?
- Saha, A. (2021) also provides a regret lower bound matching the upper bound. In contrast, this paper gives a lower bound on the suboptimality gap. How can this difference be explained?
Limitations
Yes
Final Justification
The authors have addressed my concerns during the rebuttal phases. Overall, my evaluation is positive.
Formatting Concerns
N/A
Thank you very much for your positive review! Below, we provide our responses to your comments and concerns.
-
The Parameter Bound B Is Typically Large in Practice
We respectfully disagree with the claim that $B$ is generally a small constant. In many real-world scenarios, $B$ can in fact be quite large. For example, in our real-world experiments, the reward function derived from Mistral-7B spans a range of approximately –20 to 20. Given that we assume the feature norms are bounded by 1, this implies that the norm bound on $\theta^*$, i.e., $B$, must also scale with 20. Therefore, we strongly believe that eliminating the exponential dependence on $B$ (e.g., $e^B$) represents a substantial and meaningful improvement.
In many prior works on logistic bandits (Faury et al., 2020; 2022; Abeille et al., 2021), GLM bandits (Lee et al., 2024; Sawarni et al., 2024 [1]), and MNL bandits (Zhang and Sugiyama, 2024; Lee and Oh, 2024), eliminating or reducing the exponential dependence on $B$ has been recognized as a central research goal. In contrast, the PbRL and dueling bandit literature (e.g., Saha, 2021; Saha et al., 2023; Zhu et al., 2023; Xiong et al., 2023; Zhan et al., 2023) has largely overlooked this issue, and whether the $e^B$ dependence can be avoided has remained a long-standing open question.
In this paper, we show that it is indeed possible to eliminate the $e^B$ dependence without relying on auxiliary techniques, such as specialized sampling schemes [2] or prior knowledge of problem-dependent quantities [3], which are often impractical. Our improvement is enabled by rethinking the Hessian as a conditional expectation of the covariance matrix and applying matrix concentration inequalities repeatedly (see Lemmas D.2 and E.1).
This proof technique is not only theoretically elegant but also broadly applicable: it can be directly applied to any existing PbRL or dueling bandit analysis, including regret-minimization settings, as long as elliptical potential lemmas are used—without modifying the original algorithm! This reveals that the exponential dependence on $B$ was not fundamentally necessary, but rather a byproduct of loose analysis.
We strongly believe this new perspective deepens the theoretical understanding of PbRL and dueling bandits, and could inspire new research directions in the field.
[1] Sawarni, Ayush, et al. "Generalized linear bandits with limited adaptivity." Advances in Neural Information Processing Systems 37 (2024): 8329-8369.
[2] Chen, Mingyu, et al. "Avoiding exp (rmax) scaling in rlhf through preference-based exploration." arXiv (2025).
[3] Di, Qiwei, Jiafan He, and Quanquan Gu. "Nearly Optimal Algorithms for Contextual Dueling Bandits from Adversarial Feedback." ICML 2025.
-
Efficient Assortment Selection Provided in Appendix
In Appendix H.1, we introduce a more efficient assortment selection strategy. Instead of selecting the reference action that maximizes the average uncertainty, we simply choose the reference arbitrarily. We show that this simplification still yields the same theoretical guarantees while substantially reducing the per-round computational cost. We believe this is a significant improvement, as prior works in the dueling bandit literature typically incur higher computational costs (e.g., Saha, 2021; Bengs et al., 2022), which are worse than ours since the assortment size is much smaller than the total number of arms in general.
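As an illustration of why this variant is cheap, here is one plausible instantiation under linear features (our sketch, not the exact Appendix H.1 procedure): fix an arbitrary reference, score every action by its uncertainty relative to it, and keep the highest-scoring ones; the cost of the sketch is dominated by $N$ Mahalanobis-norm evaluations plus a sort.

```python
# Illustrative sketch of the cheaper selection rule: arbitrary reference action,
# then the K-1 actions with the largest uncertainty relative to that reference.
import numpy as np

def select_assortment_fast(Phi, H_inv, K, ref_idx=0):
    """Phi: (N, d) feature matrix; H_inv: (d, d) inverse covariance; K: max size."""
    diffs = Phi - Phi[ref_idx]                                   # gaps to the reference
    scores = np.einsum("nd,de,ne->n", diffs, H_inv, diffs)       # squared H^{-1}-norms
    scores[ref_idx] = -np.inf                                    # exclude the reference
    top = np.argsort(scores)[::-1][: K - 1]                      # K-1 most uncertain
    return ref_idx, np.concatenate(([ref_idx], top))
```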
-
Saha (2021) and Bengs et al. (2022) Assume Stronger Context Conditions
We would like to clarify that both Saha, 2021 and Bengs et al., 2022 make stronger assumptions on the context (feature) vectors. Specifically:
- Saha (2021) assumes that the feature vectors at each round are drawn i.i.d. from a fixed distribution.
- Bengs et al. (2022) imposes a more stringent condition (Assumption A1), requiring that the offered feature vectors contain a set of basis vectors satisfying a spanning condition with a constant lower bound.
In contrast, we do not require any structural assumptions on the feature vectors; they can be specified arbitrarily. Even though we assume a fixed context distribution and a stationary action space in our main presentation for simplicity, our analysis readily extends to settings where both the context distribution and the action space vary over time (and can be given arbitrarily). We will make this point more explicit in the revised version, in addition to the brief note already provided in Footnote 1.
Therefore, we strongly believe that comparisons to their algorithms and results are not meaningful, as they rely on significantly stronger assumptions. In particular, without these contextual assumptions, their algorithms and analyses break down entirely and cannot achieve the stated regret bounds in the more general setting we consider.
-
Q1. Is there an inherent difficulty in this work to provide a regret analysis?
No, there is no inherent difficulty. In fact, our method readily admits a regret analysis. Specifically, define the greedy policy at round $t$ with respect to the current reward estimate. Then, following a similar analysis to the proof of Theorem 1 (and ignoring non-leading terms for ease of presentation), we obtain a regret bound analogous to our suboptimality result.
This bound also shows that the regret improves with larger comparison sets $S_t$, clearly demonstrating the benefit of multiple-option feedback. We appreciate the constructive question.
-
Q2. Comparison to the lower bound of Saha (2021). How can the difference be explained?
First, we would like to emphasize that there are fundamental differences in the problem setting. Saha (2021) considers one-click (single-winner) feedback among the offered items, whereas our setting assumes access to the full ranking feedback over the offered items. Hence, our setup reflects a richer information setting, which naturally allows for improved statistical efficiency—especially in terms of the dependence on $K$.
Moreover, the definitions of regret differ. In Saha (2021), the regret is defined as an average reward gap:
whereas in our case, the suboptimality gap is defined as the gap between the optimal and the current best item (for simplicity, let ):
Due to these key differences in feedback type and regret formulation, a direct comparison to the lower bound in Saha (2021) is not appropriate. That said, if one were to hypothetically align the performance measures and consider the special case of , then our lower bound of is strictly tighter than theirs, which is . (We also note that their bound does not include any dependence on .)
To directly address your question on why the bounds differ, we cautiously suggest that the looser bound in Saha (2021) arises from the choice of instance construction. A similar phenomenon can be observed in Theorem 3.1 of Bengs et al. (2022), where the authors derive a sharper lower bound under additional assumptions on the problem instance, but only a weaker bound without them.
We speculate that a comparable suboptimality gap could also be derived in our setting by constructing a different (weaker) instance. However, since this would yield a looser bound than our current result, we do not explore this direction further in this rebuttal.
Please feel free to reach out if you have any further questions!
Thank you for the authors' response, which has addressed most of my concerns.
That said, I still have some concerns regarding the regret upper bound. As pointed out in Li (2024) (Remark 5.5), the upper bounds in Saha (2021) and Bengs (2022) achieve their rates due to the assumption of a finite number of arms, which also applies to this paper; their bounds only degrade to the larger order when the number of arms grows very large. In contrast, the upper bound presented in this paper (Equation D.7) already appears to be of that larger order, and seems to degrade further as the number of arms grows. I'm not sure whether I have missed something, or whether the derivation in the paper can easily eliminate the logarithmic dependency on the number of arms. However, based on my current understanding, I will keep my score unchanged.
Li, X., Zhao, H., & Gu, Q. (2024, July). Feel-Good Thompson sampling for contextual dueling bandits. In Proceedings of the 41st International Conference on Machine Learning (pp. 29406-29426).
Thank you for your quick response and for taking the time to read through our previous, albeit lengthy, rebuttal.
However, we believe that some confusion may have arisen due to notational differences across the papers. Specifically, in Saha (2021) and Bengs et al. (2022), $K$ denotes the total number of arms. In contrast, in our paper, $K$ refers to the maximum size of the offered item set, and the total number of arms is a separate quantity. When $K = 2$ in our setting (i.e., the pairwise comparison case), our problem reduces to theirs.
To enable a fair comparison of the regret bounds, we restate them below for the pairwise case ($K = 2$ in our notation):
- Saha (2021), Bengs et al. (2022): a regret bound that explicitly involves the total number of arms.
- Ours: a regret bound free of any dependence on the total number of arms, again when $K = 2$ (i.e., the pairwise comparison setting).
Notably, our bound does NOT depend on the total number of arms. Even in extreme cases where the number of arms is exponential in the dimension $d$, our regret remains unchanged.
Furthermore, we re-emphasize that our analysis holds in the more general adversarial feature setting, whereas the results of Saha (2021) and Bengs et al. (2022) crucially rely on i.i.d. stochastic features or strong basis vector assumptions. Their analyses do not extend to the adversarial setting we consider and, in fact, would fail entirely under such conditions.
To the best of our knowledge, no prior work provides a regret bound that is strictly better than ours in this adversarial setting when $K = 2$. Moreover, for larger $K$, our derived regret bound is the tightest known to date!
Please don’t hesitate to reach out if any questions remain—we would be more than happy to clarify further. Otherwise, if our responses have fully addressed the reviewer’s concerns, we would respectfully and sincerely appreciate any consideration to strengthen the evaluation in light of these key clarifications and contributions.
As the discussion period is nearing its end in two days, we would like to kindly follow up to confirm whether the reviewer has had an opportunity to review our earlier clarification regarding the distinction between the subset size $K$ and the total number of arms in our work compared to the prior literature. We believe this may have been a simple notational misunderstanding, and we hope this point is now clear.
Please do not hesitate to let us know if anything is still unclear—we would be glad to clarify further.
Thank you for your further clarification. My evaluation is overall positive.
We sincerely appreciate your consistent support for our paper and your thoughtful, constructive feedback! In particular, we will take care to clearly explain how our setting differs from that of Saha (2021) and Bengs et al. (2022), and provide a more explicit comparison to their results. Once again, thank you for your valuable time and for serving as a reviewer.
We sincerely thank all reviewers for generously dedicating their time to reviewing our paper, offering constructive feedback, and actively participating in the discussion. We are also deeply grateful to the area chair for facilitating and supporting this productive exchange.
Personally, we found this rebuttal process to be especially impressive, rewarding, and motivating. We are truly grateful for the reviewers’ willingness to engage in thoughtful and constructive discussions, as well as for their generous effort to carefully understand our main contributions, even if these may not have fully resonated with them at first. We will also make sure to reflect the reviewers’ valuable feedback.
As we conclude this rebuttal phase, we would like to summarize the main contributions of our work.
Summary of Contributions
PbRL is increasingly used in practice (for example, in LLM fine-tuning and recommender systems) and has a rich body of theoretical work. However, all existing theoretical results show degraded performance even as the length of the (ranking) feedback increases, and they suffer from the harmful $e^B$ dependency (a few recent works avoided this dependency only through complex algorithms).
Our paper addresses these issues and provides several new insights:
-
To our knowledge, it is the first to show improved sample efficiency with longer (larger $K$) feedback. We believe this closes a long-standing gap between intuition (that longer feedback could be helpful) and theory.
-
We eliminate the exponential $e^B$ dependency in the leading term without adding any complex algorithmic machinery. Notably, the key lemmas (Lemmas D.2 and E.1) used to remove this dependency can be readily applied to most existing PbRL analyses—without any modification to the underlying algorithms—to eliminate it there as well.
-
We show that the sample efficiency of RB loss and PL loss is equivalent in terms of the leading-order term. This provides theoretical justification for the common empirical practice of decomposing ranking feedback into pairwise comparisons, as done in works such as Ouyang et al. (2022).
- We present the first $K$-improving lower bound in PbRL, leaving open the possibility that longer feedback (larger $K$) with fewer data points (smaller $T$) can be more beneficial than shorter feedback (smaller $K$) with more data (larger $T$).
We believe these results are significant and deepen the theoretical understanding of the PbRL framework. As future work, we aim to extend these insights to more practical settings—beyond the linear model—and to easy-to-implement algorithms with more complex models.
Once again, we sincerely appreciate the active discussion and the encouraging (possibly unanimous) positive assessments at the end. We truly enjoyed this discussion, and we will ensure that all valuable feedback from the rebuttal period is reflected in our revision (e.g., additional experiments, extensions to general function approximation, and clearer explanations of contributions and limitations).
The paper improves preference-based reinforcement learning by providing a theoretical analysis with sub-group ranking feedback.
As shown in early dueling bandits research, if multiple options are ranked simultaneously, algorithms can be more efficient than traditional pairwise comparisons (e.g., Multi-dueling Bandits with Dependent Arms, UAI 2017).
Generalizing the results to full RL settings is important for applying PbRL to large-scale real-world problems.
The authors further confirmed that increasing the maximum selection size K improves performance, aligning with the reference and the paper's own theoretical predictions. These efforts highlight the advantage of utilizing ranking feedback to achieve more data-efficient and scalable PbRL.