Efficient and Near-Optimal Algorithm for Contextual Dueling Bandits with Offline Regression Oracles
We design the first efficient, near-optimal regret algorithm for contextual dueling bandits using offline oracles, enabling scalable preference-based learning in RLHF and resolving a key open problem in AI alignment.
Abstract
Reviews and Discussion
This work investigates dueling contextual bandits in the presence of offline regression oracles. It introduces dueling bandit algorithms for both finite and infinite action spaces and establishes square-root regret upper bounds. The contributions are theoretically well-founded and significant, though the presentation could benefit from improved clarity and organization.
Strengths and Weaknesses
The need for this work is evident, as interest in dueling bandits with offline regression oracles continues to grow. The theoretical foundations are solid and promising. However, the current presentation requires substantial improvement before the work can be considered for acceptance. In particular, the proofs and notation are difficult to follow. Clarifying the following points would lead to a significant improvement:
- Line 54 contains a repeated expression; Line 55 includes a typographical error.
- The range of the variable in Line 108 should be explicitly clarified.
- Does the subscripted notation refer to the i-th component of the corresponding vector? Please clarify.
- There appear to be typos in Equation (2).
- Lines 174–175 should be corrected, with attention to whether the distribution in question is a joint or a marginal distribution.
- Lines 176–184 should be rewritten for improved readability, including clearer notation. Specifically, the meaning of the introduced quantity, the interpretation of its positive or negative values, and the role of the Nash equilibrium need to be better explained.
- What does the symbol in Assumption 3 denote? This should be made explicit.
- Multiple notions of regret are introduced throughout the paper without sufficient explanation or connection. These definitions should be clearly distinguished and related to each other.
- Is the symbol in Line 290 referring to the sigmoid function? Please clarify.
Questions
- The proof of Corollary 4 is hard to follow. Please explain how the intermediate bound leads to the regret bound stated in the corollary, focusing especially on the final two lines of the proof.
- Could you derive the regret bound for the continuous case by plugging an appropriate choice into the general bound?
Limitations
The simulation results do not cover a sufficiently varied set of cases.
Final Justification
I keep my rating based on the authors' response.
Formatting Issues
NA
Thanks a lot for your careful reading and insightful comments.
W1: Some notational clarification:
- Lines 54-55: We will fix the repeated expression in L54 and the Fig. 2 expression in L55.
- Line 108: Please note we clarified this in L107: it is the binary preference feedback, so its range is {0, 1}.
- Component notation: Yes, note that it is a joint distribution over item pairs; hence the subscripted entry denotes the probability of the i-th pair under that distribution.
- Equation (2): We will correct the typo in the first term of Eq. (2).
- Lines 174-175: Note we specified this earlier (e.g., L166, L173), so it is always a joint distribution. We will clarify further if needed.
- Lines 176-184: For any two mixed strategies, the quantity in question denotes the expected probability that the first strategy wins over the second. A positive value therefore indicates that the first policy is stronger than the second, and vice versa. [9, 13] also give a detailed explanation of Nash (von Neumann) policies of symmetric two-player zero-sum games, as referenced. We will add more details to improve the readability of the remark.
- Notation in Assumption 3: Thanks for catching this. It denotes the "Best-Response Policy"; we will add a reference (Defn. 6, A.3) to it in Assumption 3.
- Line 290: Yes, it refers to the sigmoid function, as clarified in the objective, L125.
Many thanks for pointing them out, will clarify all of them in the update.
Q1: The final two lines in the proof of Cor. 4: This is a standard step in the regret analysis of epoch-based contextual bandit algorithms; e.g., please see the last steps of the derivation provided in [30] for their proof of Cor. 1 (or even Thm. 1). We will also add a few extra lines of derivation to our proof of Cor. 4 (C.5) to make the proof clearer.
Q2: Deriving the regret bound for the continuous case: Yes, absolutely. When the decision space is the unit ball and every feature vector is norm-bounded by 1, using the simple vanilla least-squares regression oracle yields the desired regret bound. We will add this as a corollary analyzing some special cases of Thm. 9 (analogous to Cor. 4). Thanks again for the great suggestion!
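For concreteness, a minimal sketch of such a vanilla least-squares oracle is below (the feature construction, dimensions, and noise model here are illustrative assumptions of this sketch, not the exact setup of the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative setup: unit-ball pairwise features and noisy regression targets.
d, n = 8, 500
Phi = rng.normal(size=(n, d))
Phi /= np.maximum(np.linalg.norm(Phi, axis=1, keepdims=True), 1.0)  # keep each feature in the unit ball
theta_true = rng.normal(size=d) / np.sqrt(d)
y = Phi @ theta_true + 0.1 * rng.normal(size=n)                     # noisy preference-score targets

theta_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)                 # offline least-squares fit
f_hat = lambda phi: phi @ theta_hat                                 # predictor returned by the oracle
print(float(np.linalg.norm(theta_hat - theta_true)))                # estimation error shrinks with n
```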
We hope this addresses all of your concerns. We are happy to clarify any further questions and respectfully urge the reviewer to reconsider the final evaluation in light of the rebuttal.
Dear Reviewer czBn,
We sincerely appreciate your time and thoughtful evaluation. As the discussion phase is now active and the decision deadline approaches, we wanted to kindly follow up on our rebuttal.
We hope our responses addressed your concerns as well, and we would be truly grateful for any additional feedback or follow-up questions you may have.
If any points remain unclear or merit further discussion, we would be more than happy to elaborate. We believe that a brief exchange at this stage could help resolve any remaining uncertainties and potentially contribute to a more favorable outcome.
Thank you once again for your time and consideration—we greatly value your input.
Thanks, Authors
This paper addresses the problem of contextual dueling bandits by leveraging an offline regression oracle, rather than relying on an online regression oracle. This improves the practicality of the proposed algorithm. Furthermore, the authors present the first theoretical analysis of general contextual dueling bandits over continuous action spaces. In a regularized min-max optimization framework, they establish a near-optimal square-root regret bound. Their theoretical results are further supported by empirical evaluations in synthetic environments.
Strengths and Weaknesses
Strengths
This paper makes a meaningful contribution to the community by advancing dueling bandit research toward more practical settings—specifically, by addressing general function classes and continuous decision spaces. The theoretical arguments are well-supported, and the presentation is clear and well-structured. Additionally, the inclusion of empirical experiments in the context of oracle-based approaches is both interesting and valuable.
Weaknesses
There are some concerns that, if addressed, could further strengthen the paper:
- Computational complexity:
- What is the main computational bottleneck in the proposed algorithm?
- Moreover, what are the theoretical computational complexities of the Convex-Constraint-Solver and Cont-CvxConstraint-Solver subroutines used in the method?
- Experiments:
- No implementation details or code are provided: It is unclear what parameters were used in the experiments. Some of these parameters appear difficult to set without prior knowledge, so I wonder whether the authors used theoretically justified values or tuned them heuristically. Specifically, how is the model class defined in practice? Clarifying this would improve reproducibility.
- The model used in the experiments is quite small: Since one of the main motivations of this paper is RLHF, it would be more interesting to evaluate the algorithm with larger models, such as neural networks or transformers.
- No real-world dataset experiments.
- No comparison of computational cost with baselines: It is important to assess the practical efficiency of the algorithm in terms of runtime or wall-clock time.
- What are the main technical challenges addressed in this work compared to existing methods? Providing such clarification would strengthen the paper.
Questions
My questions are as follows:
- Clarification on IBR: Could the authors provide a more detailed explanation of the IBR assumption? In particular, could you give concrete examples of settings where IBR holds, while other models such as RUM or SST do not?
- Function class in practice: Is there an efficient or practical way to define and implement the function class used in this work?
- Regret dependence on the sigmoid non-linearity: In dueling or logistic bandits with linear rewards, it is common for the regret to scale with the non-linearity of the sigmoid function (e.g., [1, 2]). However, despite using a sigmoid function in Objective (2), the regret bound in this work appears to have no such dependence. Could the authors clarify why this is the case?
[1] Zhu, Banghua, Michael Jordan, and Jiantao Jiao. "Principled reinforcement learning with human feedback from pairwise or k-wise comparisons." International Conference on Machine Learning. PMLR, 2023.
[2] Abeille, Marc, Louis Faury, and Clément Calauzènes. "Instance-wise minimax-optimal algorithms for logistic bandits." International Conference on Artificial Intelligence and Statistics. PMLR, 2021.
Limitations
The authors are encouraged to create a separate "Limitations" section in their paper.
Final Justification
Most of my concerns have been addressed.
Thus, I will maintain my positive score.
Formatting Issues
None
Thanks a lot for your careful reading and insightful comments.
W1: Computation Complexity
Please see Q3 of Reviewer g3ya.
W2: Experiments:
Please see Q1 of Reviewer g3ya.
Q1. IBR Assumption:
Assumption 3 posits the existence of one or more "good" items that strictly outduel every other item and are equally strong among themselves. Concretely, the class of preference matrices must admit a "best set" of actions that tie one another and simultaneously outscore all remaining actions.
Example of IBR but not RUM or SST:
There are many such examples. Any random-utility-based model follows SST by design, so RUM is contained within SST, but the class of IBR preferences is much larger than SST. For instance, any preference relation with a strong Condorcet winner automatically satisfies the IBR property without requiring any transitivity: take a preference matrix in which one arm beats every other arm with probability strictly greater than 1/2 (that arm is then a Condorcet winner), while the remaining arms may be related arbitrarily, even cyclically. Another example is an equally strong set of top-cycle arms that tie one another and each beat every arm outside the set. Such top-cycle classes of preferences need not satisfy any transitivity and can easily violate SST throughout.
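To make the first example concrete, here is a small numerical sketch (the specific 4-arm matrix below is an illustrative choice for this response, not a matrix from the paper): arm 0 is a strong Condorcet winner, so the IBR-style structure holds, while the remaining arms form a cycle, so SST (and hence any RUM) fails.

```python
import numpy as np

# Illustrative 4-arm pairwise preference matrix P, where P[i, j] = Pr(arm i beats arm j).
# Arm 0 beats every other arm (strong Condorcet winner), while arms 1, 2, 3 form a
# cycle (1 > 2 > 3 > 1), so no total order / transitivity exists.
P = np.array([
    [0.5, 0.7, 0.6, 0.8],
    [0.3, 0.5, 0.9, 0.2],
    [0.4, 0.1, 0.5, 0.9],
    [0.2, 0.8, 0.1, 0.5],
])
assert np.allclose(P + P.T, 1.0)                              # valid preference matrix

condorcet = all(P[0, j] > 0.5 for j in range(1, 4))           # arm 0 outduels all others
cycle = P[1, 2] > 0.5 and P[2, 3] > 0.5 and P[3, 1] > 0.5     # non-transitive cycle among the rest
print(condorcet, cycle)                                       # True True: Condorcet/IBR structure without SST
```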
Thanks for bringing this up, we will add these examples in the paper (remark after Assump3).
Note IBR holds by default for the Continuous decision space setting. Since RUM-like preference structures automatically satisfy the condition of having a best (or set of best) item(s), Assumption 3 trivially holds in our continuous setting (Alg. 2). Thus, in large or potentially infinite action spaces where utility-based models are commonplace, the assumption naturally remains valid and helps maintain the same recursion-based analysis as in the finite-armed case.
Q2: "Efficient or practical way to define and implement the function class used in this work"
To the best of our understanding, there might be some misunderstanding on the reviewer's part: we can define the function class as broadly and generally as we want, based on the requirements and complexity of the underlying preference environment. We have discussed multiple such function classes (and the choice of efficient regression oracles for the corresponding hypothesis preference classes) in A.3 and Cor. 4.
Note that this hypothesis-class premise is identical to the one in many (highly cited) contextual-bandit works (ILTCB, SquareCB, FALCON) and in classical supervised-learning settings such as linear and generalized-linear regression, RKHS models, and deep networks (see [17, 30], Li et al. '22, Zhu et al. '22, Blum et al. '21).
It's important to note that, despite the primary theoretical focus of our work, the assumption is also not restrictive in practice, since the algorithm designer is free to choose the function class! It can be arbitrarily rich: finite sets, GLMs, RKHS classes, Banach spaces, or deep networks that approximate Sobolev-ball functions (Farrell '21). Of course, any theoretical guarantee requires some structural assumptions, but realizability is a very primitive (broad) assumption, satisfied in practice for an appropriately rich choice of function class.
Additionally, since our algorithm relies on an offline regression oracle (rather than an online oracle; we describe the advantages in Rem. 1 and App. A), this flexibility makes the assumption milder, not stronger.
Does that answer your question (esp the discussion in A.3 and Cor 4)? If not, please let us know in the follow-up discussion if we misunderstood the question and you wanted a different clarification. We will be happy to clarify the details accordingly.
Q3. Dependency on the sigmoid non-linearity
Thanks for the great question. The dependency on the slope of the sigmoid comes through the performance of the regression oracle (see Thm. 9), as the regression oracle is responsible for finding a good estimator of the underlying preference relation (which is sigmoid-based in this case). We can use any MLE-based regression oracle (e.g., Li et al. '17 for GLM bandits) for this purpose.
We hope this addresses all of your concerns. We are happy to clarify any further questions and respectfully urge the reviewer to reconsider the final evaluation in light of the rebuttal.
Thank you for your thorough response. As most of my concerns have been adequately addressed, I will maintain my positive score.
Dear Reviewer ACKv,
Many thanks for taking the time to review our rebuttal and share your feedback.
We are glad to hear that our responses helped clarify your concerns adequately, and we sincerely appreciate your consideration to maintain the positive score. Your comments and suggestions have been very helpful in strengthening the paper, and we will be sure to incorporate the additional clarifications provided in the rebuttal in the revised version.
Finally, we are happy to address any further questions or suggestions you may have and would be glad to incorporate any additional feedback. Thanks once again for your thoughtful insights.
Thanks,
Authors
Dear Reviewer ACKv,
We sincerely appreciate your time and thoughtful evaluation. As the discussion phase is now active and the decision deadline approaches, we wanted to kindly follow up on our rebuttal.
We hope our responses addressed your concerns as well, and we would be truly grateful for any additional feedback or follow-up questions you may have.
If any points remain unclear or merit further discussion, we would be more than happy to elaborate. We believe that a brief exchange at this stage could help resolve any remaining uncertainties and potentially contribute to a more favorable outcome.
Thank you once again for your time and consideration—we greatly value your input.
Thanks, Authors
The paper studies contextual dueling bandits (CDB) with continuous action spaces under a realizability assumption. It introduces two algorithms:
Double-Monster for finite-arm CDB, which replaces online regression oracles with offline ones, and Double-Monster-Inf for continuous action spaces.
These algorithms are the first to provide near-optimal regret guarantees for contextual dueling bandits in both finite and continuous action spaces, using more practical offline regression oracles.
Strengths and Weaknesses
Strength:
- First theoretical treatment of general CDB over continuous action sets with offline oracles.
- Near-optimal regret bound (Thm 9).
- Replacing powerful online oracles with offline ones broadens practical applicability.
- Presentation of this paper is clear to me.
Weakness:
- Experimental results do not include real-world datasets.
- Proofs are hard to follow; I did not check the correctness of the proof.
Questions
- Could the authors report results on at least one public CDB dataset or RLHF-style simulated environment?
- How would the algorithms' performance and theoretical guarantees be affected by model misspecification, where the true model lies outside of the assumed function class?
- The core of the proposed algorithms involves calling a Convex-Constraint-Solver at each time step within an epoch to determine the dueling strategy. What is the computational complexity of this solver, particularly for the Double-Monster-Inf algorithm in high-dimensional, continuous action spaces, and how does it impact the overall runtime efficiency that the paper highlights as a key advantage?
Limitations
yes
Final Justification
I will maintain my current score
Formatting Issues
NA
Thanks a lot for your careful reading and insightful comments.
Q1. Experiments on public datasets / RLHF environments:
Thanks for the great suggestion. Because the few publicly released RLHF corpora were still maturing at submission time, our paper reports controlled synthetic benchmarks that isolate the continuous-action difficulty. Based on the suggestions, we ran our code on two open, human-preference datasets: OpenAI's "Summarize-from-Feedback" corpus and the Anthropic Helpful-Harmless (HH) preferences. In each case, we fit the required regression oracle with the standard reward-model architecture (a 6-layer transformer encoder fine-tuned by cross-entropy on the pairwise labels), invoke it only a handful of times per experiment, and let Double-Monster-Inf drive sampling through its log-det barrier solver. The implementation is a ~300-line PyTorch module that plugs straight into TRL/PEFT pipelines; after every epoch of duels we refresh the model, solve the mirror-descent step in under one second on a single GPU, and resume data collection—mirroring the workflow practitioners already follow in RLHF training loops. In summary, our algorithm performs competitively on the Anthropic HH preference dataset, as expected.
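For clarity, the core of this reward-model fitting step is sketched below in a simplified, self-contained form (a small MLP stands in for the transformer encoder; all names, dimensions, and hyperparameters are illustrative, not the exact ones from our implementation):

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Scores a (context, action) embedding; trained on pairwise preference labels."""
    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z).squeeze(-1)

def fit_offline_oracle(model, z_win, z_lose, epochs: int = 50, lr: float = 1e-3):
    """Offline regression oracle: Bradley-Terry / sigmoid cross-entropy fit on pairwise wins."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        # Pr(winner preferred) = sigmoid(score(winner) - score(loser)); maximize its log-likelihood.
        logits = model(z_win) - model(z_lose)
        loss = nn.functional.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
        loss.backward()
        opt.step()
    return model

# Usage with random stand-in embeddings (real runs use transformer-encoded continuations).
z_win, z_lose = torch.randn(256, 32), torch.randn(256, 32)
oracle = fit_offline_oracle(RewardModel(dim=32), z_win, z_lose)
```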
We are also running a similar experiment for the Preference-Driven Sim-to-Real Locomotion dataset recently released by Google DeepMind, illustrating the robotics angle. We will include these real-world experiments in the revision together with comparisons against REX3, RMED, and gradient-based RLHF baselines, which corroborate the practical implementability of our proposed method.
Unfortunately, this time, due to the NeurIPS rebuttal policy, we are unable to include the figures or even anonymous links to these new experiments. But will be happy to include them—with full results and implementation details—in the final version.
Q2. Performance under misspecified model:
Thank you for raising this important question regarding model misspecification—i.e., the setting where the true reward function lies outside the assumed function class. Our algorithm and theoretical guarantees remain robust under such misspecification, and we discuss both the theoretical and empirical implications below.
Theoretical robustness.
Our regret bounds depend on the oracle's excess prediction error, captured by the offline regression error term (see Eq. (5)). This quantity is always well-defined, even when the true reward function lies outside the assumed class. Note that our regret bound depends on the oracle's excess-risk term, which (by Assump. 2) decomposes into the oracle's estimation error plus the best-in-class approximation error.
Thus, if the true function lies outside the class, the regret increases only by the corresponding approximation-error term, the standard agnostic penalty. Enlarging the function class (e.g., adding wider networks) drives this term down without changing the algorithm, making it easily adaptable to complex environments.
Thus, the performance of the algorithm degrades gracefully with the degree of misspecification, mirroring what is observed in classical supervised learning. This also mirrors prior work on offline contextual bandits (e.g., Foster et al., 2021), where regret bounds are likewise stated in terms of the approximation error of the chosen hypothesis class. More specifically, if we assume the best-in-class approximation error is bounded, then following an analysis similar to Foster et al. '21 or Ghosh et al. '17, under misspecification our regret bound incurs only an additional approximation-error term (see Cor. 1, Foster et al. '21).
Thus, in such cases, approximation errors in the preference model propagate into downstream regret bounds in exactly the same way—typically through the risk of the best-in-class predictor. Like our setup, PFA accommodates rich function classes (e.g., neural nets) and tolerates approximation errors without invalidating guarantees, as long as the oracle performs consistent empirical risk minimization.
Our flexibility: Importantly, our framework allows the designer to choose the function class. When richer function approximators (e.g., deep nets or RKHS models) are used, the approximation error can be driven arbitrarily small. So in practice, one can trade off computational cost and model expressivity to balance performance and tractability.
In summary, model misspecification does not break our theoretical guarantees—it only introduces an additional approximation-error term into the regret. This structure is standard in agnostic learning theory and is well handled by both our analysis and related works such as PFA. Empirically, our method remains effective even under non-realizable scenarios, and richer model classes can further reduce the practical gap.
Q3: Computation complexity:
Firstly, note that Eq. (6) is concave in the optimization variable (since the first term is linear and the regularization term is concave), and it is a maximization objective. So the equivalent minimization objective is convex. Thus any convex-programming solver (minimizing the negated objective of Eq. (6)) will find the optimizer in polynomial time (e.g., see the convex-optimization monograph by Bubeck, https://arxiv.org/pdf/1405.4980).
Another easy and practical trick: for all computational purposes, we can consider an epsilon-net of the duel space (for small enough epsilon) and find the optimal distribution over that net, which then has finite support; any convex-programming solver will find the solution efficiently for all practical purposes (assuming a reasonably sized net). This is the most commonly used standard approach for solving simplex optimization over a continuous action space (since we can simply use convex programming, it is efficient and tractable).
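As an illustration of the epsilon-net approach, a minimal sketch is below (the objective here, a linear term plus a concave entropy regularizer over the discretized simplex, is only a stand-in for Eq. (6); all names and values are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical stand-in for Eq. (6): an epsilon-net of a 1-D continuous duel space,
# a linear "win-probability" term, and a concave entropy regularizer.
eps_net = np.linspace(0.0, 1.0, 51)          # epsilon-net with spacing 0.02
r = np.sin(3 * eps_net)                      # placeholder coefficients for the linear term
lam = 0.1                                    # regularization weight (illustrative)

def neg_objective(q):
    q = np.clip(q, 1e-12, None)              # guard against log(0)
    return -(q @ r + lam * (-(q * np.log(q)).sum()))   # negate the concave objective -> convex

n = len(eps_net)
res = minimize(
    neg_objective,
    x0=np.full(n, 1.0 / n),                  # start from the uniform distribution over the net
    bounds=[(0.0, 1.0)] * n,
    constraints=[{"type": "eq", "fun": lambda q: q.sum() - 1.0}],  # simplex constraint
    method="SLSQP",
)
q_star = res.x                               # finite-support distribution over the epsilon-net
print(q_star.round(3).max(), q_star.sum())
```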
Our reported experiments also note the runtime of our algorithm in the two environments, justifying the practicality of the solvers, as reported in Appendix E.
We hope this addresses all of your concerns. We are happy to clarify any further questions and respectfully urge the reviewer to reconsider the final evaluation in light of the rebuttal.
Thank you for your response. I will maintain my current score
Dear Reviewer g3ya,
We sincerely appreciate your time and thoughtful evaluation. As the discussion phase is now active and the decision deadline approaches, we wanted to kindly follow up on our rebuttal.
We hope our responses addressed your concerns as well, and we would be truly grateful for any additional feedback or follow-up questions you may have.
If any points remain unclear or merit further discussion, we would be more than happy to elaborate. We believe that a brief exchange at this stage could help resolve any remaining uncertainties and potentially contribute to a more favorable outcome.
Thank you once again for your time and consideration—we greatly value your input.
Thanks, Authors
This paper studies the contextual dueling bandits problem in which a learner observes and learns from noisy, pairwise preference feedback rather than absolute rewards. The goal is to select arms for each context so that the best-response regret is minimized (defined in Section 2). Existing algorithms for contextual dueling bandits either made strong modeling assumptions, were limited to finite action spaces, or used computationally infeasible online regression oracles. To overcome these shortcomings, the authors propose two algorithms: Double-Monster (for finite action spaces) and Double-Monster-Inf (for infinite action spaces), both using offline regression oracles. The authors theoretically show sub-linear regret upper bounds for both algorithms. Furthermore, they empirically validate the performance of the proposed algorithms using synthetic problem instances.
Strengths and Weaknesses
The following are the strengths of the paper:
- This paper studies contextual dueling bandit algorithms that use offline regression oracles. These algorithms are very useful in many areas such as LLM alignment, recommendation, online advertisement, and so on.
- This paper proposes two algorithms: Double-Monster (for finite action spaces) and Double-Monster-Inf (for infinite action spaces). Both algorithms use offline regression oracles and are theoretically shown to have sub-linear regret upper bounds. The authors have also empirically validated the performance of the proposed algorithms using synthetic problem instances.
- The authors have empirically demonstrated that the proposed algorithms outperform some existing dueling bandit algorithms and have comparable performance to DTS on synthetic problem instances.
The following are the weaknesses of the paper:
- Realizability assumption: There should be a discussion of how practical realizability is in real-life applications and of the consequences for the regret upper bounds when this assumption does not hold (misspecification).
- Gap between motivation in the introduction and experiments: The proposed problem setting is heavily motivated by practical applications of contextual dueling bandits, such as AI alignment (or LLMs), in the Abstract and Introduction sections. However, no relevant experiments (even on simpler real-life datasets) have been done to show the empirical performance of the proposed algorithms. Furthermore, it is unclear how the reward function, which is estimated only a few times, will affect empirical performance.
- Missing high-level ideas: The paper would become more readable if the authors added the algorithms and observations about key results instead of mentioning all intermediate results in the main paper. The current paper is too dense, and insufficient details and motivation are given.
- Missing key related work: The authors have missed the existing work on neural contextual dueling bandits [1], which considered a general class of latent reward functions. Since the proposed algorithms only have comparable performance to DTS (or are sometimes even worse), these algorithms may not be able to empirically outperform the algorithms proposed in [1].
[1] Verma et al. "Neural dueling bandits: Preference-based optimization with human feedback." ICLR 2025.
Questions
Please address the weaknesses raised in the Strengths and Weaknesses section.
Minor comments:
- Line 11: Is the stated regret bound a typo, or should it match the bound in the main result?
- Line 54: There is a typo.
I may change my score based on the authors' responses.
Limitations
I have raised a few limitations of the paper in my response to the Strengths and Weaknesses section. Since the paper is a theoretical contribution to the bandit literature, I do not find any potential negative societal impact of this work.
Final Justification
This paper introduces novel contextual dueling bandit algorithms that use offline regression oracles. Such algorithms have many real-world applications in areas such as LLM alignment, recommendation systems, and online advertising.
The authors' rebuttal has addressed my concerns. As a result, I am also raising my rating from 3 to 4.
Formatting Issues
I found no major formatting issues in this paper.
Thanks a lot for your careful reading and insightful comments.
W1: Realizability assumption
The realizability assumption (Assumption 1, page 3) is standard in the contextual bandit literature and is necessary for any meaningful theoretical guarantee. This hypothesis-class premise is identical to the one in many (highly cited) contextual-bandit works (ILTCB, SquareCB, FALCON) and in classical supervised-learning settings such as linear and generalized-linear regression, RKHS models, and deep networks (see [17, 30], Li et al. '22, Zhu et al. '22, Blum et al. '21).
It's important to note that, despite the primary theoretical focus of our work, the assumption is also not restrictive in practice, since the learner is free to choose the function class! It can be arbitrarily rich: finite sets, GLMs, RKHS classes, Banach spaces, or deep networks that approximate Sobolev-ball functions (Farrell '21). Of course, any theoretical guarantee requires some structural assumptions, but realizability is a very primitive (broad) assumption, satisfied in practice for an appropriately rich choice of function class.
Additionally, since our algorithm relies on an offline regression oracle (rather than an online oracle; we describe the advantages in Rem. 1 and App. A), this flexibility makes the assumption milder, not stronger.
One more important point to observe is how gracefully the regret bound degrades even without the realizability assumption (the misspecified case). Note that our regret bound depends on the oracle's excess-risk term, which (by Assump. 2) decomposes into the oracle's estimation error plus the best-in-class approximation error.
If the true function lies outside the class, the regret increases only by the corresponding approximation-error term, the standard agnostic penalty. Enlarging the function class (e.g., adding wider networks) drives this term down without changing the algorithm, making it easily adaptable to complex environments. Please also see Q2 of Reviewer g3ya for a more technical discussion on this topic.
We also report experiments on both linear and deliberately quadratic (misspecified) rewards in Sec. 7, which show that regret still follows the predicted curve, demonstrating robustness across different models.
In summary, `Realizability' is (i) theoretically indispensable, (ii) practically mild when using high-capacity model classes and offline oracles, and (iii) non-fragile—under modest misspecification our regret guarantees degrade gracefully by at most the unavoidable approximation term.
W2: Gap between introduction, motivation and experiments
Our framework naturally subsumes the preference-optimization tasks that motivate the paper. In RLHF for LLMs, a context is the user prompt (plus dialogue history), and each action is a full model continuation, mapped after log-prob "decoding" into a continuous logit space; human raters supply pairwise wins/losses between two continuations—exactly the observables of a contextual dueling bandit. The same reduction applies to robotic skill refinement (contexts are sensory states, actions are torque vectors, duels come from preference clicks on tele-operation replays), recommender re-ranking (context = user/session features; actions = real-valued score vectors fed to a differentiable sorter; click-throughs induce pairwise order feedback), and even autonomous-driving policy search, where safety drivers choose the nicer of two trajectories under identical traffic scenarios.
Because the few publicly released RLHF corpora were still maturing at submission time, our paper reports controlled synthetic benchmarks that isolate the continuous-action difficulty. Based on the suggestions, we ran our code on two open, human-preference datasets: OpenAI's "Summarize-from-Feedback" corpus and the Anthropic Helpful-Harmless (HH) preferences. In each case, we fit the required regression oracle with the standard reward-model architecture (a 6-layer transformer encoder fine-tuned by cross-entropy on the pairwise labels), invoke it only a handful of times per experiment, and let Double-Monster-Inf drive sampling through its log-det barrier solver. The implementation is a ~300-line PyTorch module that plugs straight into TRL/PEFT pipelines; after every epoch of duels we refresh the model, solve the mirror-descent step in under one second on a single GPU, and resume data collection—mirroring the workflow practitioners already follow in RLHF training loops. In summary, our algorithm performs competitively on the Anthropic HH preference dataset, as expected.
We are also running a similar experiment for the Preference-Driven Sim-to-Real Locomotion dataset recently released by Google DeepMind, illustrating the robotics angle. We will include these real-world experiments in the revision together with comparisons against REX3, RMED, and gradient-based RLHF baselines, which corroborate the practical implementability of our proposed method.
Unfortunately, this time, due to the NeurIPS rebuttal policy, we are unable to include the figures or even anonymous links to these new experiments. But we will be happy to include them—with full results and implementation details—in the final version.
W3: Missing high-level ideas
We appreciate the suggestion and fully agree that improving the high-level clarity and organization will significantly enhance the readability of the paper. In the revised version, we will restructure the presentation to better emphasize the core algorithmic and theoretical contributions.
Specifically, the key algorithmic insights are captured in Sections 4 and 6, which introduce Double-Monster and Double-Monster-Inf, respectively. We will move the detailed pseudocode of Algs. 1 and 2 into the main draft (with the help of the additional content page in the final version). We will also reduce the technical proof discussions in the regret analysis of Alg. 1, simplifying the content of Sec. 4.1 or 5.
In the revised version, we will leverage the additional page to move all algorithm pseudocode into the main body (rather than the appendix), while also pruning or simplifying intermediate technical sections such as Section 4.1 or parts of Section 5. This will allow us to declutter the regret analysis, isolate the key intuitions, and streamline the theoretical flow for readers.
These changes will make the paper significantly more accessible and self-contained, especially for readers less familiar with the contextual bandit literature. Thanks for the suggestions again.
W4: Related work
Thank you for pointing this out—we will certainly include a discussion of [Verma et al., ICLR 2025] in the updated related work section. That said, our contributions are fundamentally different in scope and setting. While [Verma et al.] study neural methods for non-contextual dueling bandits, our work addresses the significantly more general and challenging setting of contextual dueling bandits over continuous action spaces, with theoretical guarantees under offline regression oracles. Neural dueling approaches typically rely on online optimization procedures (as opposed to our primary motivation of using offline oracles), and the max operation (in L5 and L6 of NDB-UCB) is harder to efficiently implement for purely continuous decision spaces.
Moreover, as noted in our paper, DTS is also designed for non-contextual, finite-armed problems. Nevertheless, our algorithm performs competitively—even outperforming DTS in several finite-arm simulations (see Figure 2)—despite the fact that our setting is more general. As explained in Q1, our method is modular in the choice of function class, so we can instantiate it with neural networks, including architectures similar to those used in [1], and still retain our offline-oracle-based framework.
We are currently running experiments to evaluate this comparison directly and will include the resulting plots and analysis in the updated version.
Minor Comments
- Line 11: Indeed, it should be the bound stated in our main result (Thm. 9). This regret bound for continuous action spaces is optimal up to log factors.
- Line 54: Thanks a lot for catching the typo; we will remove the repeated expressions in Line 54 in the revised version.
We hope this addresses all of your concerns. We are happy to clarify any further questions and respectfully urge the reviewer to reconsider the final evaluation in light of the rebuttal.
Thank you for your detailed rebuttal. Since all my concerns have been addressed, I will be increasing my rating. I encourage you to incorporate the high-level ideas into the main paper.
Dear Reviewer FxvP,
Thanks a lot, we are glad to know that our response helped in clarifying your concerns. Thank you for considering increasing your score. We will certainly ensure that all the high-level ideas (W3) and other key points outlined in the rebuttal clarifications are incorporated into the revised version of the paper.
Thank you once again for your time, thoughtful feedback, and constructive suggestions, which have been invaluable in enhancing the quality and strengthening the contribution of our work.
Sincerely, Authors
Dear Reviewer FxvP,
We sincerely appreciate your time and thoughtful evaluation. As the discussion phase is now active and the decision deadline approaches, we wanted to kindly follow up on our rebuttal.
We hope our responses addressed your concerns as well, and we would be truly grateful for any additional feedback or follow-up questions you may have.
If any points remain unclear or merit further discussion, we would be more than happy to elaborate. We believe that a brief exchange at this stage could help resolve any remaining uncertainties and potentially contribute to a more favorable outcome.
Thank you once again for your time and consideration—we greatly value your input.
Thanks, Authors
Dear Reviewer FxvP,
Thank you again for your time and insightful comments.
We hope we have addressed all of your questions in the rebuttal. With limited time remaining in the author-reviewer discussion phase, we thought of checking in once more to see if there are any outstanding concerns we can address. We would be glad to provide any additional clarification as necessary.
Thanks, Authors
Dear Reviewer FxvP,
Thank you again for your time and insightful comments. Since we are only a few hours away from the end of the author-reviewer discussion phase, we wanted to check one last time if we can clarify any remaining concerns. Please let us know, we would be happy to.
Sincerely, Authors
This paper studies the contextual dueling bandits problem, which is central to the theoretical modeling of RLHF. It makes a strong theoretical contribution, as recognized by both the reviewers and the AC. The presentation is clear, and the paper is well-situated within the existing literature. During the rebuttal, the authors sufficiently addressed the reviewers’ concerns, leading to increased scores from some reviewers. Overall, I recommend accepting this paper.