PaperHub
Overall: 5.8 / 10 · Rejected · 4 reviewers (min 5, max 6, std 0.4)
Ratings: 5, 6, 6, 6
Confidence: 3.3 · Correctness: 3.0 · Contribution: 2.8 · Presentation: 2.8
NeurIPS 2024

Nearly Optimal Algorithms for Contextual Dueling Bandits from Adversarial Feedback

OpenReview · PDF
Submitted: 2024-05-15 · Updated: 2024-11-06
TL;DR

We prove the optimal result for contextual dueling bandits with adversarial feedback

Abstract

Learning from human feedback plays an important role in aligning generative models, such as large language models (LLMs). However, the effectiveness of this approach can be influenced by adversaries, who may intentionally provide misleading preferences to manipulate the output in an undesirable or harmful direction. To tackle this challenge, we study a specific model within this problem domain--contextual dueling bandits with adversarial feedback, where the true preference label can be flipped by an adversary. We propose an algorithm named Robust Contextual Dueling Bandit (RCDB), which is based on uncertainty-weighted maximum likelihood estimation. Our algorithm achieves an $\tilde O(d\sqrt{T}+dC)$ regret bound, where $T$ is the number of rounds, $d$ is the dimension of the context, and $0 \le C \le T$ is the total number of adversarial feedback. We also prove a lower bound to show that our regret bound is nearly optimal, both in scenarios with and without ($C=0$) adversarial feedback. Additionally, we conduct experiments to evaluate our proposed algorithm against various types of adversarial feedback. Experimental results demonstrate its superiority over the state-of-the-art dueling bandit algorithms in the presence of adversarial feedback.
Keywords
Dueling bandits, bandits, preference feedback, adversarially robust, weighted MLE

Reviews and Discussion

Review (Rating: 5)

The work studies linear contextual dueling bandits with adversarial feedback. In each round $t$ the agent observes a context $x_t$ and chooses two actions $(a_t, b_t)$. The environment generates a binary preference label $\ell_t = \mathbb{I}(a_t > b_t)$. The underlying assumption is that there exists a linear reward function $r(x,a) = \theta_*^\top \phi(x,a)$, where $\theta_*$ is a latent $d$-dimensional vector and $\phi$ is a known feature map such that $\|\theta_*\|_2 \le B$ and $\|\phi\|_2 \le 1$.

Based on this, the preference $\ell_t$ is a random variable such that

$$\mathbb{P}(a > b \mid x) = \sigma\big(r(x,a) - r(x,b)\big)$$

The link function $\sigma$ is antisymmetric, $\sigma(-z) = 1 - \sigma(z)$, and satisfies $\sigma' \ge \kappa > 0$.

It is further assumed that a nonoblivious adversary may occasionally flip the preference label with knowledge of $(a_t, b_t)$, where $C_T$ denotes the number of flips in the first $T$ rounds.

The regret is measured according to the following formula

$$\max_a \sum_t 2r(x_t, a) - \sum_t \big( r(x_t, a_t) + r(x_t, b_t) \big)$$

The main result is a regret bound of order $d\sqrt{T} + dC$, ignoring logarithmic factors. This bound is tight because $\Omega(d\sqrt{T})$ is the known lower bound without adversarial corruption and $\Omega(dC)$ is shown to be the regret due to the adversarial corruption.

Experiments complete the contribution.

Strengths

The topic of adversarial feedback in dueling bandits is interesting and was previously studied only in a $K$-armed setting.

The regret bound is tight up to log factors.

The related work section is complete and accurate.

The experimental section is significant.

Weaknesses

The main results (upper and lower bounds) appear to be mostly based on combinations of known techniques from previous papers. The originality of the technical contributions is unclear.

The analysis of the unknown-$C$ case (Section 5.2) is trivial.

Assumption 3.2 could create bad dependencies in the bounds on $B$ (and $\|\theta_*\|_2$).

The conditions $\sigma' \le 1$ and $\phi_t^\top \theta_* \le 1$ in the paragraph before (4.3) are not explicitly stated in the assumptions. Moreover, the second condition seems to imply that $B \le 1$.

There is no discussion on the hardness of computing $(a_t, b_t)$.

Questions

  • Can the authors point out what were the main technical challenges in the proof of Theorem 5.3?
  • Please compute explicit values for $\kappa$ in terms of $B$ and $\|\theta_*\|_2$ for concrete choices of $\sigma$.
  • Please elaborate on the hardness of computing $(a_t, b_t)$.

Limitations

There is an explicit limitations paragraph in the conclusions.

Author Response

Thank you for your insightful comments and suggestions! We answer your questions as follows.


Q1: Technical novelties

A1: We want to emphasize that we study the dueling bandit problem, which is different from the standard linear bandit problem and incurs several challenges when using uncertainty-based weights in dueling bandits. First, the feedback is binary, given by a preference model $\mathbb{P}(a \succ b \mid x) = \sigma\big(r^*(x,a) - r^*(x,b)\big)$. This difference in settings leads to a different analysis.

Unlike the weighted regression method, our model employs weighted maximum likelihood estimation (MLE) to estimate the underlying parameter $\theta$. The nonlinearity of the function prevents us from obtaining a closed-form solution for $\theta$, adding difficulty in determining the confidence radius. Moreover, our analysis of weights for the nonlinear model in Section 4 is novel. We explain how our weight selection can cancel out the variance, resulting in a more robust theoretical analysis under adversarial feedback.
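Since the weighted MLE has no closed form, the estimator must be computed iteratively. As a sanity check of the setup, here is a minimal sketch (not the paper's actual solver; plain gradient descent on the $\ell_2$-regularized weighted log-likelihood, with all constants chosen purely for illustration) that recovers a planted parameter from simulated preference labels:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def weighted_mle(Z, y, w, lam=1.0, lr=0.5, iters=3000):
    # Minimize sum_i w_i * logistic_loss(z_i^T theta, y_i) + (lam/2)||theta||^2
    # by gradient descent; the logistic link admits no closed-form solution.
    d = Z.shape[1]
    n = len(y)
    theta = np.zeros(d)
    for _ in range(iters):
        p = sigmoid(Z @ theta)
        grad = (Z.T @ (w * (p - y)) + lam * theta) / n  # averaged gradient
        theta -= lr * grad
    return theta

# Toy check: recover a planted parameter from simulated preference labels.
rng = np.random.default_rng(0)
d, n = 3, 5000
theta_star = rng.normal(size=d)
Z = rng.normal(size=(n, d))                 # feature differences phi(x,a) - phi(x,b)
y = (rng.random(n) < sigmoid(Z @ theta_star)).astype(float)
theta_hat = weighted_mle(Z, y, np.ones(n))  # uniform weights for this check
```

With uniform weights this reduces to ordinary regularized logistic regression; the uncertainty-based weights of Section 4 would enter through the `w` argument.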

In our proof, we bypass the difficulty of nonlinearity by utilizing an auxiliary vector function
$$G_t(\theta) = \lambda\kappa\theta + \sum_{i=1}^{t-1} w_i \Big[\sigma\big((\phi(x_i,a_i)-\phi(x_i,b_i))^\top \theta\big) - \sigma\big((\phi(x_i,a_i)-\phi(x_i,b_i))^\top \theta^*\big)\Big]\big(\phi(x_i,a_i)-\phi(x_i,b_i)\big).$$
The elliptical potential lemma provides an upper bound on $\|G_t(\theta_t)\|_{\Sigma_t^{-1}}$. We bridge the gap between this and the confidence radius $\|\theta_t - \theta^*\|_{\Sigma_t}$ by the mean value theorem (Lines 525 to 527).

To make it clearer, we have provided a roadmap of the proof in Appendix A.


Q2: Analysis of unknown $C$ is trivial

A2: While the analysis is simple, it does not diminish the significance of our result that our algorithm is able to achieve optimal regret even when faced with an unknown $C$ (see Remark 5.8). This is also included for the completeness of our work.


Q3: Conditions are not explicitly stated in the assumptions

A3: In Section 4, our argument aims to explain the motivation for introducing uncertainty-based weights. We want to clarify that we provide rigorous proof for the algorithm's performance in Appendix B.1, which relies solely on our specific choice of weights and does not depend on the conditions or analysis from Lines 207 to 210.

In the motivating analysis, we use $\sigma' \le 1$ since for many useful link functions (e.g., the sigmoid function), the inequality does hold. In addition, $\phi_t^\top \theta \le 1$ is a typo: we only need $\sigma'(\phi_t^\top \theta^*) \le 1$, which always holds for the sigmoid link function. In our revision, we will correct the typo and make the writing clearer.


Q4: Assumption 3.2 could create bad dependencies in the bounds on $B$

A4: We can calculate $\kappa$ in Assumption 3.2 using $B$. Take $\sigma(x) = 1/(1+e^{-x})$ for example; then $\sigma'(x) = e^{-x}/(1+e^{-x})^2$. Due to Assumption 3.1, $|(\phi(x,a)-\phi(x,b))^\top\theta| \le 2B$; therefore, $\sigma'\big((\phi(x,a)-\phi(x,b))^\top\theta\big) \ge 1/(e^{-2B} + 2 + e^{2B})$ for all $a, b \in \mathcal{A}$ and $\theta$. As a result, $\kappa = 1/(e^{-2B} + 2 + e^{2B})$. Our regret's dependency on $B$ will be $O(e^{2B} + e^{-2B})$. Usually, $B$ is considered a constant, and thus it will not adversely affect our regret. A similar dependency on $B$ is widely seen in the literature, for example [1][2][3].
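As a quick numerical check of this calculation (with an illustrative value $B = 1$): $\sigma'$ is even and decreasing in $|z|$, so its minimum over $[-2B, 2B]$ is attained at the endpoints $z = \pm 2B$, where it equals $\kappa = 1/(e^{-2B} + 2 + e^{2B})$.

```python
import numpy as np

def sigmoid_prime(z):
    # Derivative of the sigmoid link sigma(z) = 1 / (1 + e^{-z}).
    return np.exp(-z) / (1.0 + np.exp(-z)) ** 2

B = 1.0  # illustrative bound on ||theta*||_2; any positive value works
kappa = 1.0 / (np.exp(-2 * B) + 2 + np.exp(2 * B))

# Minimum of sigma' over a fine grid of the feasible interval [-2B, 2B];
# it should coincide with kappa, attained at the endpoints.
grid_min = sigmoid_prime(np.linspace(-2 * B, 2 * B, 10001)).min()
```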


Q5: Hardness of computing the arms

A5: In this work, we assume there is a computation oracle to solve the optimization problems over the action set $\mathcal{A}$ (e.g., Line 6 of Algorithm 1). A similar oracle is implicitly assumed in almost all existing works for solving standard linear bandit problems with infinite arms (e.g., the OFUL and CW-OFUL algorithms). Without such an oracle, choosing a pair of actions from the infinite decision set would be computationally intractable.

In the special case where the decision set is finite, we can iterate across all actions, resulting in $O(k^2 d^2)$ complexity for each iteration, where $k$ is the number of actions and $d$ is the feature dimension.
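That brute-force iteration can be sketched as follows; the optimistic score used here is a hypothetical stand-in for the actual criterion in Line 6 of Algorithm 1, and serves only to show where the $O(k^2 d^2)$ cost comes from (each of the $k^2$ pairs needs an $O(d^2)$ quadratic form for its exploration bonus):

```python
import numpy as np

def select_pair(phi, Sigma_inv, theta_hat, beta):
    # Brute force over all k^2 action pairs.  phi has shape (k, d).
    # Hypothetical optimistic score; the paper's actual rule may differ.
    k = phi.shape[0]
    best_pair, best_score = (0, 0), -np.inf
    for a in range(k):
        for b in range(k):
            z = phi[a] - phi[b]                        # phi(x,a) - phi(x,b)
            bonus = beta * np.sqrt(z @ Sigma_inv @ z)  # O(d^2) per pair
            score = z @ theta_hat + bonus
            if score > best_score:
                best_pair, best_score = (a, b), score
    return best_pair

rng = np.random.default_rng(1)
k, d = 8, 4
pair = select_pair(rng.normal(size=(k, d)), np.eye(d),
                   rng.normal(size=d), beta=0.1)
```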

In our revision, we will add more discussion on the computational complexity.


[1] Zhu et al. Principled Reinforcement Learning with Human Feedback from Pairwise or K-wise Comparisons ICML

[2] Xiong et al. Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-constraint ICML

[3] Li et al. Feel-Good Thompson Sampling for Contextual Dueling Bandits ICML

Comment

Thank you for your answers. I found them useful. I still have some reservations concerning novelty, though, so I will raise my score by one point.

Comment

We're glad that we were able to address your questions. Thank you for raising the score. To the best of our knowledge, our work is the first to achieve nearly minimax optimal regret for dueling bandits in the presence of adversarial preference feedback, regardless of whether the amount of adversarial feedback is known. We will highlight this in the final version.

Review (Rating: 6)

This paper investigates contextual dueling bandits with adversarial feedback, where the adversary can corrupt the binary feedback of the agent up to a certain level. A new algorithm named RCDB is proposed. The key idea lies in the use of uncertainty-weighted MLE. A regret analysis of RCDB is provided along with some experimental evaluations.

Strengths

  • This paper studies a known problem but from a new angle, i.e., the adversary can corrupt the binary feedback of the agent up to a certain level. The problem is well motivated.
  • A novel algorithm named RCDB was designed and incorporated uncertainty-dependent weighting into the MLE.
  • The theoretical performance of RCDB in terms of regret is provided.
  • Experimental results validating the performance of RCDB are presented and compared to existing methods.

Weaknesses

  • Assumption 3.1 assumes a linear reward. The reviewer agrees that this is a "widely-used" assumption in the recent RLHF literature, but was curious if your framework and regret analysis can be extended without such an assumption? If not, what are the new challenges? Can you comment on this?
  • The construction of the parameter estimator of $\theta$ requires a Taylor expansion. How do the $\approx$s in Section 4 impact the regret analysis?
  • The experiments were run multiple times; however, the variance is not shown in Figure 1.
  • The paper is very dense, and the authors have changed the template a bit, e.g., the space has been largely squeezed throughout the paper.

Questions

  • Assumption 3.1 assumes a linear reward. The reviewer agrees that this is a "widely-used" assumption in the recent RLHF literature, but was curious if your framework and regret analysis can be extended without such an assumption? If not, what are the new challenges? Can you comment on this? The reviewer noticed that the authors discussed this at the end of the paper and mentioned Li et al. 2024. Will such an extension be straightforward?
  • The construction of the parameter estimator of $\theta$ requires a Taylor expansion. How do the $\approx$s in Section 4 impact the regret analysis?
  • The experiments were run multiple times; however, the variance is not shown in Figure 1.
  • The paper is very dense, and the authors have changed the template a bit, e.g., the space has been largely squeezed throughout the paper.

Limitations

N/A.

Author Response

Thank you for your positive feedback! We answer your questions as follows.


Q1: Challenges to extend the linear reward to more general settings.

A1: It might be possible to consider the corruption problem for a nonlinear reward function class with finite Eluder dimension, as in [1]. However, it is worth noting that the observed comparison feedback depends on the reward gap $r(x,a) - r(x,b)$, whereas the regret/sub-optimality of the selected actions is based on the summation of the two rewards $r(x,a) + r(x,b)$. For a linear reward function, both the reward gap $r(x,a) - r(x,b)$ and the reward summation $r(x,a) + r(x,b)$ fall into a linear region. In a general setting, however, the gap and the summation may belong to different function classes, which makes the analysis difficult. This is beyond the scope of our work and we leave it as future work.


Q2: Will Taylor's expansion impact the regret analysis?

A2: In Section 4, we aim to provide a clear explanation of the chosen weights. Therefore, we apply Taylor's expansion to clarify its relation to the variance and use approximations to illustrate the motivation, making it easier to understand. We want to clarify that we provide rigorous proof for the algorithm's performance in Appendix B.1, which relies solely on our specific choice of weights and does not need to use Taylor’s expansion.


Q3: Variance for the experiments

A3: Thank you for your advice. In the uploaded pdf, we have included the variance information for the experiments in the plot.


Q4: Paper is too dense

A4: Thanks for your suggestion. We will restructure the formulas to create more space and improve readability in our revision, given that there is one extra page for camera-ready.


[1] Ye et al.(2023) Corruption-robust algorithms with uncertainty weighting for nonlinear contextual bandits and Markov decision processes.

Comment

Thanks for the clarification. I will keep the current score.

Comment

Thank you for your support!

Review (Rating: 6)

The authors proposed an algorithm, coined Robust Contextual Dueling Bandits (RCDB), for adversarial feedback, using uncertainty-weighted maximum likelihood estimation. The algorithm guarantees $\widetilde{O}(d\sqrt{T} + dC)$ regret.

Strengths

  1. Their algorithm is not limited to a finite number of arms.

  2. Their algorithm considers adversarial attacks based on selected actions (although the maximum number of adversarial attacks is restricted).

Weaknesses

  1. Hard to follow the paper

Questions

  1. $\Sigma_t$ must be the Gram matrix; in line 209, $\Sigma_t$ appeared without being introduced.

  2. Please specify exactly where in the supplementary materials proofs of these results (Lemma 5.1 ~ Theorem 5.7) are located.

  3. It seems necessary to specify the initial weight $w_t$ in Algorithm 1. $w_t$ is specified in Line 8 of Algorithm 1, but it is used from Line 3.

  4. Further explanation is needed for lines 204-209. For instance, clarify how the property of $\mathcal{F}_t$-measurability is used in the conditional expectation in line 207. Additionally, explain why the assumption $|\theta_t - \theta^*| \ll 1$ is made to consider Taylor's expansion.

  5. It would be better to clearly state that the proposed algorithm is effective under both known and unknown numbers of adversarial attacks. Prior to Section 5, it is unclear whether this paper considers both aspects or only one of them.

Limitations

  1. The authors precisely pointed out the limitations of their results, such as the reward function being linear with a known feature map.

  2. Additionally, choosing $a_t, b_t$ (by computing the argmax) might be infeasible when interaction with the environment needs to be fast.

Author Response

Thank you for your positive feedback! We answer your questions as follows.


Q1: Hard to follow the paper.

A1: Thank you for pointing out these issues. We will address your concerns one by one:

(1) $\Sigma_t$ appeared without being introduced.

Our definition of $\Sigma_t$ is provided in line 3 of Algorithm 1. In our revision, we will add a reference to this definition when we use it to avoid confusion.

(2) specify exactly where in the supplementary materials proofs of these results (Lemma 5.1 ~ Theorem 5.7)

Due to page limits, we do not provide links in the main paper. However, the subtitles in the appendix clearly indicate where the results are proved. To be more specific, the proofs for the main theorems are in Appendix B, while other technical lemmas are placed in Appendix C. We will be sure to specify it in the revision.

(3) The initial weight $w_t$ is not specified before use.

It is important to note that for each round $t$, the covariance matrix $\Sigma_t$ in Algorithm 1 (Line 3) depends only on the previous weights $w_1, \ldots, w_{t-1}$. For the initial round $t = 1$, the covariance matrix is initialized as $\Sigma_1 = \lambda \mathbf{I}$ and is not related to $w_0$. For subsequent rounds, the weights $w_1, \ldots, w_{t-1}$ have already been calculated in the previous rounds.
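To make the order of computation concrete, here is a minimal sketch of the recursion. The CW-OFUL-style rule $w_t = \min\{1, \alpha/\|\phi_t\|_{\Sigma_t^{-1}}\}$ and all constants are assumed for illustration; the paper's exact formula may differ.

```python
import numpy as np

def run_weight_recursion(phis, lam=1.0, alpha=0.5):
    # At round t, Sigma_t uses only the past weights w_1, ..., w_{t-1}
    # (Sigma_1 = lam * I); w_t is then computed from the current Sigma_t.
    d = phis.shape[1]
    Sigma = lam * np.eye(d)
    weights = []
    for phi in phis:
        norm = np.sqrt(phi @ np.linalg.solve(Sigma, phi))  # ||phi||_{Sigma^{-1}}
        w = min(1.0, alpha / norm)                         # uncertainty weight
        weights.append(w)
        Sigma = Sigma + w * np.outer(phi, phi)             # affects rounds t+1, ...
    return np.array(weights), Sigma

rng = np.random.default_rng(0)
weights, Sigma = run_weight_recursion(rng.normal(size=(50, 4)))
```

Note that the update of `Sigma` happens strictly after `w` is computed, which is the point of the response: no circular dependency exists.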

(4) Further explanation is needed for lines 204-209. For instance, clarify the property of $\mathcal{F}_t$-measurability, and explain Taylor's expansion.

In Section 4, we introduce Taylor's expansion to explain the motivation for introducing uncertainty-based weights. We want to clarify that we provide rigorous proof for the algorithm's performance in Appendix B.1, which relies solely on our specific choice of weights and does not need to use Taylor’s expansion.

Intuitively, even without the weighting, a similar analysis to Lemma 5.1 can show that the estimated parameter $\theta_t$ will approach $\theta^*$ with a larger confidence radius, e.g., $\|\theta_t - \theta^*\|_{\Sigma_t} \le \widetilde{O}(\sqrt{d}\,C)$. Under this situation, we almost have $|\phi_t^\top \theta - \phi_t^\top \theta^*| \le \widetilde{O}(\sqrt{d}\,C) \cdot \|\phi_t\|_{\Sigma_t^{-1}} \ll 1$ for a large round $t$, which encourages us to use Taylor's expansion.

For the property of $\mathcal{F}_t$-measurability, we aim to use the following property of conditional expectation with respect to a sub-$\sigma$-algebra.

Lemma (pulling out known factors): If $X$ is $\mathcal{H}$-measurable, then $E[XY \mid \mathcal{H}] = X\,E[Y \mid \mathcal{H}]$. By taking $Y = 1$, we have $E[X \mid \mathcal{H}] = X$. Therefore, when calculating $\sigma(\phi_t^\top \theta_t) - \mathbb{E}[\sigma(\phi_t^\top \theta_t) \mid \mathcal{F}_t]$, the $\mathcal{F}_t$-measurable part will be canceled out. In our revision, we will add more explanation for our analysis.

(5) Prior to Section 5, it is unclear whether this paper considers both aspects or only one of them.

Thank you for your advice. Our paper considers both known and unknown number of adversarial feedbacks. We will add more discussion to clarify this point in our revision.

In our revision, we will modify our paper to address these issues and make the content easier to understand.


Q2: Calculating the argmax might be infeasible

A2: In this work, we assume there is a computation oracle to solve the optimization problems over the action set $\mathcal{A}$ (e.g., Line 6 of Algorithm 1). A similar oracle is implicitly assumed in almost all existing works for solving standard linear bandit problems with infinite arms (e.g., the OFUL and CW-OFUL algorithms). Without such an oracle, choosing a pair of actions from the infinite decision set would be computationally intractable.

In the special case where the decision set is finite, we can iterate across all actions, resulting in $O(k^2 d^2)$ complexity for each iteration, where $k$ is the number of actions and $d$ is the feature dimension.

In our revision, we will add more discussion about the calculation complexity.

Comment

Thank you for your kind explanation.

While there are lines where mathematical notation appears abruptly (e.g., the Gram matrix on line 209), these are minor issues. If the authors provide additional guidance to help readers follow more easily, I believe the paper is overall good. I will therefore raise my rating from 5 to 6.

Comment

Thank you for raising your rating! We will be sure to improve the presentation according to your suggestion.

Review (Rating: 6)

This paper studies the Contextual Dueling Bandits from Adversarial Feedback problem, in a linear reward setting. The authors propose an algorithm named robust contextual dueling bandits (RCDB), which is designed based on uncertainty-weighted regression and MLE. The authors prove that the proposed algorithm achieves a nearly optimal regret upper bound that matches the lower bound both in scenarios with and without (C=0) adversarial feedback. Experimental evaluations are provided to validate the theoretical results.

Strengths

  1. The problem is well-motivated and important.
  2. The paper is well-written.
  3. The authors prove a nearly optimal regret upper bound for the proposed algorithm that matches the lower bound with and without (C=0) adversarial feedback.
  4. The authors also conduct some experiments to validate the theoretical results.

Weaknesses

  1. The title may be a little misleading: I think the setting of this paper is the adversarial corruption setting, not a setting with completely adversarial feedback. Also, the setting is the linear reward model, which is not specified in the title.

  2. I have not checked the details, but I feel the uncertainty-weighted technique (which is the key to dealing with the corruption in the linear bandits) is mostly based on the previous works, could the authors highlight the technical difficulties in the dueling bandit setting?

Questions

See the weaknesses above.

Limitations

No negative societal impact of this work.

Author Response

Thank you for your positive feedback. We will address your questions one by one.


Q1: The title may be a little bit misleading, I think the setting of this paper is the adversarial corruption setting, not the setting with completely adversarial feedback. And the setting is the linear reward model, which is not specified in the title.

A1: In previous studies on standard linear bandit problems, including both adversarial corruption and adversarial bandits, the adversary directly targets the reward. In contrast, our setting involves binary preference feedback rather than direct reward feedback, with the adversary attacking by flipping the preference labels. We use “adversarial feedback” to differentiate our work from prior studies on corrupted or adversarial reward settings, for example [1][2]. Indeed, our setting focuses on the linear reward model, and we will specify this in the title in the revision. Thank you for your suggestion.


Q2: The uncertainty-weighted technique is based on previous works. What are the technical difficulties?

A2: We want to emphasize that we study the dueling bandit problem, which differs from the standard linear bandit problem in [3] and incurs several challenges when using uncertainty-based weights in the dueling bandit context. Specifically, in the dueling bandit setting, the feedback is binary and given by a preference model $\mathbb{P}(a \succ b \mid x) = \sigma\big(r^*(x,a) - r^*(x,b)\big)$. This difference in feedback necessitates a different analytical approach.

Unlike the weighted regression method, our model employs weighted maximum likelihood estimation (MLE) to estimate the underlying parameter $\theta$. The nonlinearity of the function prevents us from obtaining a closed-form solution for $\theta$, making it difficult to determine the confidence radius. Additionally, in Section 4, we discuss the weight selection, explaining how our uncertainty-based weights can cancel out the variance of the estimated preference probability.

To overcome the nonlinearity challenge in our proof, we utilize an auxiliary vector function
$$G_t(\theta) = \lambda\kappa\theta + \sum_{i=1}^{t-1} w_i \Big[\sigma\big((\phi(x_i,a_i)-\phi(x_i,b_i))^\top \theta\big) - \sigma\big((\phi(x_i,a_i)-\phi(x_i,b_i))^\top \theta^*\big)\Big]\big(\phi(x_i,a_i)-\phi(x_i,b_i)\big).$$
Then, the elliptical potential lemma provides an upper bound on $\|G_t(\theta_t)\|_{\Sigma_t^{-1}}$. We bridge the gap between this and the confidence radius $\|\theta_t - \theta^*\|_{\Sigma_t}$ by the mean value theorem (Lines 525 to 527).

To make it clearer, we have provided a roadmap of proof in Appendix A.


[1] Gajane et al., A Relative Exponential Weighing Algorithm for Adversarial Utility-based Dueling Bandits. ICML

[2] Saha et al., Versatile dueling bandits: Best-of-both world analyses for learning from relative preferences. ICML

[3] He et al., Nearly Optimal Algorithms for Linear Contextual Bandits with Adversarial Corruptions. NeurIPS

Comment

Thanks for your reply. I keep my score. Good luck!

Comment

Thank you for your continued support!

Final Decision

This paper is the first paper to study contextual (linear) dueling bandits with adversarial corruption in the preference labels. They propose an algorithm that is robust to the level of corruption and achieves matching upper and lower bounds.

The problem is of interest due to the extensive application of preference based learning for LLMs. Assuming preferences to be corrupted in a "label-flipping" manner is a reasonable assumption.

However, while the problem itself has not been studied in this exact form, the algorithm and the analysis are borderline incremental. There are some concerns among the reviewers, which I share, that there is not sufficient novelty over prior work in the techniques used. While some reviewers deem the results themselves noteworthy enough to warrant publication, I disagree, for the reason that the method requires knowledge of an upper bound on the corruption level. Adapting near-optimally to an unknown level of corruption is in my opinion the most interesting and challenging question in this problem setup. This question has been fully solved in most bandit and RL regimes, so it is reasonable to expect an answer (or an impossibility result) in the dueling bandits regime as well.