PaperHub

NeurIPS 2024 · Poster · 4 reviewers
Rating: 5.5/10 (scores: 4, 5, 8, 5; min 4, max 8, std. dev. 1.5)
Confidence: 2.8 · Correctness: 3.0 · Contribution: 2.5 · Presentation: 2.8

Bandits with Abstention under Expert Advice

OpenReview · PDF
Submitted: 2024-05-14 · Updated: 2024-11-06
TL;DR

We study bandits with expert advice when given an option to abstain from making any action, and achieve novel reward bounds for confidence rated predictors.

Abstract

Keywords

Multi-armed bandits · Expert advice · Abstention · Contextual bandits

Reviews and Discussion

Review
Rating: 4

This paper considers the problem of prediction with expert advice in the bandit feedback setting with the possibility of abstention. The game is sequential: in each round, each of $E$ experts gives a distribution over $K+1$ actions/arms. One of these actions represents abstention, yielding a reward of 0, while the remaining $K$ actions have rewards in $[-1,1]$ and are set by an oblivious adversary.

The contributions of this paper include a procedure called confidence-rated bandits with abstention (CBA), which is an adaptation of the well-known exponential weighted scheme. The authors present guarantees in the form of an upper bound on the expected cumulative regret. Additionally, they explore applications of the resulting algorithm in various other problems.
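To make the setting concrete, the following sketch implements a generic Exp4-style learner with an abstention arm. This is our own simplified baseline for intuition only, not the paper's CBA algorithm; the function name, the update rule, and all parameters are assumptions.

```python
import numpy as np

def exp4_with_abstention(advice, rewards, eta=0.1, seed=0):
    """Generic Exp4-style learner over K real arms plus an abstention arm
    (index 0) whose reward is always 0.  Illustrative baseline only -- the
    paper's CBA algorithm uses a different (confidence-rated) update.

    advice : (T, E, K+1) array; advice[t, e] is expert e's distribution.
    rewards: (T, K+1) array with rewards[t, 0] == 0, others in [-1, 1].
    Returns the learner's cumulative reward.
    """
    rng = np.random.default_rng(seed)
    T, E, K1 = advice.shape
    logw = np.zeros(E)                 # log-weights, for numerical stability
    total = 0.0
    for t in range(T):
        w = np.exp(logw - logw.max())
        p = (w / w.sum()) @ advice[t]  # mixture distribution over K+1 arms
        a = rng.choice(K1, p=p)
        total += rewards[t, a]
        r_hat = np.zeros(K1)           # importance-weighted reward estimate
        r_hat[a] = rewards[t, a] / p[a]
        logw += eta * (advice[t] @ r_hat)
    return total
```

Abstention is just another arm here (index 0, constant reward 0); the confidence-rated structure that distinguishes CBA is not captured by this sketch.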

Strengths

The problem under consideration is both interesting and practically relevant, as the ability for a prediction strategy to abstain is crucial in many scenarios, particularly when the risk of making poor decisions is high. Additionally, accounting for the context is a significant and pertinent feature of modern applications.

Weaknesses

Overall, I find this paper unclear and difficult to read. The results obtained are not easily comparable to existing ones in the literature.

For instance, in the case of regret minimization with expert advice, such as the analysis of the EXP4 algorithm, regret is typically defined with respect to the best fixed expert in hindsight. The fact that the authors develop guarantees with respect to the best combination of experts (where this combination is not necessarily convex, since the set of vectors $u \in \mathcal{V}$, with $\mathcal{V}$ as defined in line 123, is not the set of convex weights) can be very confusing for the reader. The authors discuss comparisons with the guarantees of the EXP4 algorithm and state that "The EXP4 bound essentially replaces the term $\rho(u)$ in our bound by $\rho(u)/u$" in line 138. However, it is unclear how this translates in terms of worst-case guarantees. This part clearly needs a thorough discussion to be convincing.

Additionally, it is not clear how this work compares to previous research on online learning with abstention.

Questions

How do your results compare to [1]?

[1] Gergely Neu and Nikita Zhivotovskiy. Fast rates for online prediction with abstention. In Proc. COLT, pages 3030–3048, 2020.

Limitations

See weaknesses section.

Author Response

Thank you for your time and effort in reviewing our paper.

The results obtained are not easily comparable to existing ones in the literature.

Our results can be, for example, compared to EXP4 (see reply below) and SpecialistExp (see, e.g., our reply to reviewer Edre). In both cases we achieve at least the same bounds, and in many cases significantly better bounds. We will improve our exposition and add a more thorough discussion in the final version.

For instance, in the case of regret minimization with expert advice, such as the analysis of the EXP4 algorithm, regret is typically defined with respect to the best fixed expert in hindsight. The fact that the authors develop guarantees with respect to the best combination of experts (where this combination is not necessarily convex, since the set of vectors $u \in \mathcal{V}$, with $\mathcal{V}$ as defined in line 123, is not the set of convex weights) can be very confusing for the reader. The authors discuss comparisons with the guarantees of the EXP4 algorithm and state that "The EXP4 bound essentially replaces the term $\rho(u)$ in our bound by $\rho(u)/u$" in line 138. However, it is unclear how this translates in terms of worst-case guarantees. This part clearly needs a thorough discussion to be convincing.

You are right in that both the reward of our algorithm and that of Exp4 are measured with respect to a linear combination of experts, and for Exp4 that linear combination is required to be convex (i.e. the weights have to sum up to 1). Typically the weights of our linear combination sum to some value $S$ far greater than 1 (and never less than 1). Hence (as the regret term limits to 0) our reward is $S$ times that of Exp4. We will add this discussion to the paper for clarity.
The fact that the "Exp4 bound essentially replaces the term $\rho(u)$ in our bound by $\rho(u)/\|u\|_1$" is because Exp4 plays with the weight vector $u$ normalized at each time step and therefore operates in a stricter space.

The results obtained are not easily comparable to existing ones in the literature. [...] Additionally, it is not clear how this work compares to previous research on online learning with abstention. How do your results compare to [1]?

[1] Gergely Neu and Nikita Zhivotovskiy. Fast rates for online prediction with abstention. In Proc. COLT, pages 3030–3048, 2020.

We note that, to the best of our knowledge, there are no previous results on adversarial contextual bandits with the abstention option. The main novelty of the paper lies in merging the abstention action with confidence-rated predictors. This allows us to extend our result to (adversarial) contextual bandits with the abstention option. We introduce the first cumulative reward bounds on confidence-rated experts and we investigate contextual bandits as a potential application. Note that the cumulative reward bound we obtain is not achievable by previous work on confidence rated experts as they scale the regret by the confidence of the best expert in hindsight. We discuss previous results on online learning with abstention in the main body, but the previously considered settings are different from the ones we consider. Moreover, previous methods cannot be directly applied to confidence-rated predictors. Specifically, our results cannot be directly compared to [1], as their learning setting is different; it is a full information (in contrast to our bandit setting) binary classification problem.

Comment

Thank you for your rebuttal; I have read your response.

Review
Rating: 5

This paper investigates the problem of prediction with expert advice in contextual adversarial bandits, introducing the CBA (Confidence-rated Bandits with Abstentions) algorithm. The authors apply CBA to adversarial contextual bandits and achieve a near-optimal regret upper bound.

Strengths

CBA achieves a near-optimal regret upper bound in adversarial contextual bandits.

Weaknesses

Weakness 1: This algorithm is not as efficient as Exp4.

Weakness 2: In the algorithm, they mentioned using interval bisection to find $\lambda$, but the regret upper bound does not include the approximation error of the interval bisection.

Questions

Q1: Why not consider the reward of the abstention action to be non-negative instead of zero?

Q2: What is the novelty in the proofs?

Limitations

The algorithm is not as efficient as Exp4, and the upper bound does not include the approximation error of $\lambda$.

Author Response

Thank you for your time and effort in reviewing our paper.

Weakness 1: This algorithm is not as efficient as Exp4.

Given that the numerical precision of our machine is bounded, our algorithm has exactly the same $\mathcal{O}(EK)$ efficiency as Exp4. See the additional result presented in the general rebuttal for our time complexity (equal to that of Exp4 up to logarithmic factors) when the precision is unbounded.

Weakness 2: In the algorithm, they mentioned using interval bisection to find $\lambda$, but the regret upper bound does not include the approximation error of the interval bisection.

We made the assumption that the degree of precision was bounded (the interval bisection method requires $\mathcal{O}(\ln(P))$ steps, where $P$ is the precision). Thanks to your question, we attached a proof in the general rebuttal that removes this assumption. When the magnitudes of the components of the comparator $u$ are bounded by some arbitrary value $Z$ (which will typically be very small) we can achieve a time complexity of $\mathcal{O}(EK+E\ln(EZT))$ whilst only adding an additive factor of 1 (which can be made arbitrarily small) to the regret.

Q1: Why not consider the reward of the abstention action to be non-negative instead of zero?

We can have any range $[-a,b]$ with any $a,b > 0$, and the reward of the abstention action can be any fixed constant. This can be achieved by scaling and translating the reward (on any trial) when it is received, bringing it into the constraints of our paper (hence Thm 3.1 still applies, but the regret term is scaled by some constant in $\mathcal{O}(a+b)$). Such a scaling approach is standard in online learning.
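The scaling argument described here can be written out explicitly. The helper below is our own illustration of the standard construction (the function name and form are assumptions, not the paper's): it sends the abstention constant to 0 and the range $[-a,b]$ into $[-1,1]$.

```python
def normalize_reward(r, a, b, c):
    """Affinely map a reward r in [-a, b], where abstention pays a fixed
    constant c in that range, into the canonical setting of the paper:
    abstention -> 0, all rewards -> [-1, 1].  Illustrative sketch of the
    standard scaling argument; names are ours."""
    assert a > 0 and b > 0 and -a <= c <= b
    s = max(b - c, c + a)  # largest possible deviation from the abstention level
    return (r - c) / s
```

Dividing by the largest deviation from the abstention level guarantees both endpoints land inside $[-1,1]$, at the cost of shrinking rewards by a factor in $\mathcal{O}(a+b)$, which is exactly where the constant in the rescaled regret bound comes from.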

Q2: What is the novelty in the proofs?

Our proof borrows ideas from the analysis of mirror descent, EXP3, and the method of Lagrange multipliers (also with unbounded precision; see the general rebuttal). However, we note that the main novelty of our work was the realization that making such a modification to mirror descent would lead to true reward bounds for confidence-rated predictors. We note that mirror descent itself would not work, as the constraint set is not known until after the last trial (and, even if it were known, computing the projections would be computationally hard).

Comment

I have read your response and will keep my positive score.

Review
Rating: 8

The paper considers the problem of prediction with expert advice under the bandit framework.

There are $K$ arms plus a special $(K+1)$-th arm, which always incurs a gain of 0; this arm may be interpreted as the action of abstaining. The learner outputs a distribution over the $K+1$ arms and earns the scalar-product gain. In other terms, the learner outputs a vector of $K$ non-negative components summing to a number between 0 and 1; the sum can be thought of as the measure of its confidence.

For this setup, the problem of prediction with expert advice is considered and a bound is obtained. There are two terms in the regret: one involves a KL-divergence (there is a class of prediction-with-expert-advice bounds of this kind) and the other has the form $\mathcal{O}(\eta KT)$. It is seemingly linear, but $\eta$ can be tuned to get sublinear growth.
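The reviewer's last remark follows from the standard learning-rate tradeoff. As a generic sketch (with $D$ standing for the KL-divergence term; the paper's exact constants may differ):

```latex
% Balancing the two terms of a generic bound  R(T) \le D/\eta + \eta K T:
\eta^* = \sqrt{\frac{D}{KT}}
\quad\Longrightarrow\quad
\frac{D}{\eta^*} + \eta^* K T \;=\; 2\sqrt{D K T} \;=\; \mathcal{O}\!\left(\sqrt{KT}\right).
```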

The result has been applied to the framework with side information $x_t$. Each expert is associated with a pair $(B,k)$. Whenever $x_t\in B$, the expert bets everything on arm $k$ and abstains otherwise.

The main result of the paper is applied to this scenario with finitely many experts based on disjoint balls and a matching lower bound is obtained.

The case where the $B$s are balls in a metric space is considered separately.

Strengths

I think this is a very interesting result in prediction with expert advice.

Weaknesses

No obvious weaknesses.

Questions

None

Limitations

Yes.

Author Response

Thanks for the very encouraging feedback!

Comment

Thank you. This is to acknowledge the response.

Review
Rating: 5

This paper first considers the problem of bandits with expert advice, allowing the learner to abstain in any round. It introduces an algorithm, CBA, based on mirror descent, which can achieve a better cumulative regret bound in comparison with EXP4.

The algorithm is then applied to adversarial contextual bandits, where the learner abstains in rounds where the given context is not covered by any expert. The paper provides upper and lower bounds on the cumulative reward, and details a more computationally-efficient implementation for settings where the contexts have inherent structure.

Strengths

  • Online learning with abstention is an interesting and relevant problem for the community, and the extension of CBA to adversarial contextual bandits with abstention seems novel.

  • The paper presents reward bounds for both the bandits with expert advice problem and the adversarial contextual bandit problem. These bounds can improve upon those of existing algorithms. A lower bound is also provided for the latter setting.

  • The experiments offer some insights into the construction of basis functions for contextual bandits with abstention.

Weaknesses

  • The paper can be quite difficult to follow at times. For instance, in Section 5, the discussion about covering foreground and background classes seems very abrupt. I understand this is mentioned in the introduction, but since the problem settings are not formally introduced until later, the initial discussion is also confusing. In addition, the purpose of the experiments remains somewhat unclear to me -- they are not properly discussed, and there is a sudden change of theme from learning in bandits with abstentions to constructing basis functions for various graphs. I also find the contextualization relative to prior work somewhat inadequate -- there are not enough details about the SpecialistExp algorithm to make the comparison concrete (lines 215 - 223). See also questions below.

  • The improvement of CBA over EXP4 is somewhat vague and seems to largely depend on the set of experts. It would be nice if some concrete examples are provided.

  • There is an $\mathcal{O}(\ln|\mathcal{B}|)$ gap between the upper and lower bounds in Section 5, which is substantial as $|\mathcal{B}|$ can be exponential in the size of the set of contexts.

  • There seems to be a lack of discussion on the intuition behind why CBA should be applied for adversarial contextual bandits with abstention in this particular way.

Questions

  • Following Theorem 3.1, can you provide a concrete example where $\|u\|_1 = E$? The example given in the footnote seems to require all other experts to have confidence 0 in the round where the expert has confidence 1. What conditions need to be imposed on the set of experts?

  • In Corollary 5.1, can $M$ be any number in $\mathbb{N}$ for the sequence of basis elements to be disjoint?

  • How are the corresponding actions $b_j$ chosen?

  • In Section 5, what are the specific benefits of allowing the learner to abstain?

Limitations

It would be nice if the authors included a separate section/paragraph that discusses the limitations in detail.

Author Response

Thank you for your time and effort in reviewing our paper.

[...] $\log|B|$ gap and $|B|$ can be exponential in the number of contexts $N$.

While the comment is true and $|B|$ can indeed be as large as $2^N$ in principle, such a large basis set would defy its purpose (as an inductive prior on the shape of the clusters). All the basis sets we consider have size $N^2$; thus, the $\log|B|$ gap is just logarithmic in $N$. Generally, we think of $B$ as a set of "simple" stereotypical sets, which, for example, can be quantified by assuming $B$ has finite VC-dimension. In that case $\log|B|=\mathcal{O}(\mathrm{vc}(B)\log N)$ by the Sauer lemma, so the dependence is still logarithmic in $N$.
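The Sauer lemma count invoked here is easy to check numerically; a minimal sketch (the helper name is ours):

```python
from math import comb, log, e

def sauer_bound(N, d):
    """Sauer-Shelah lemma: a set system of VC dimension d over N contexts
    contains at most sum_{i<=d} C(N, i) distinct sets."""
    return sum(comb(N, i) for i in range(d + 1))

# The log of this bound grows like d * log(N), i.e. log|B| = O(vc(B) log N),
# which is why the log|B| gap stays logarithmic in N for finite-VC bases.
```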

[...] there is this sudden change of theme from learning in bandits with abstentions to constructing basis functions for various graphs [...] There seems to be a lack of discussion on the intuition behind why CBA should be applied for adversarial contextual bandits with abstention in this particular way.

A natural application of experts, and therefore confidence-rated experts, is adversarial contextual bandits with a finite context space. In this setting, it is quite common in the literature (e.g., [1], [2]) to consider a graph structure over the contexts. Therefore, we decided to study a practical application of confidence-rated experts, where the experts, in our case, are the basis elements we introduce. We will improve the exposition here in the final version.

[1] Cesa-Bianchi et al. "A gang of bandits." NeurIPS 2013.

[2] Wu, Qingyun, et al. "Contextual bandits in a collaborative environment." SIGIR 2016.

[...] there are not enough details about the SpecialistExp algorithm to make the comparison concrete (lines 215 - 223). See also questions below.

We shall add the regret bound of SpecialistExp to the paper. The difference is that while our regret is $\mathcal{O}(\sqrt{MKT\ln(N)})$, where $M$ is the number of disjoint basis elements that exactly cover each of the foreground (non-abstention) classes, the regret of SpecialistExp is $\mathcal{O}(\sqrt{M'KT\ln(N)})$, where $M'$ is the number of disjoint basis elements that cover each of the classes (including the background/abstention class). As depicted in Figure 1, $M'$ can be much larger than $M$.

In Section 5, what are the specific benefits of allowing the learner to abstain?

The benefits of the abstention action are clear when dealing with contexts (and therefore nodes in graph-structured contexts) that do not have a behavior that reflects the inductive bias. In Figure 1 of the provided example, you can think of any point in white as being colored in any possible way, but without abstention we would need a ball that covers each of these points, as in SpecialistExp. Therefore, abstaining avoids having to deal with points that present "unnatural" behavior with respect to our inductive bias.

Following Theorem 3.1, can you provide a concrete example where $\|u\|_1=|E|$? The example given in the footnote seems to require all other experts to have confidence 0 in the round where the expert has confidence 1. What conditions need to be imposed on the set of experts?

Assuming each expert has confidence 1 on at least one trial (otherwise $\|u\|_1$ could be much higher), $|E|$ is the maximum possible value of $\|u\|_1$. The condition for attaining this value is that the sum of the confidences of all experts is, on any trial, no greater than 1. Such a $u$ is not that interesting - we place equal weight on each expert - but it was introduced to show just how high $\|u\|_1$ can be, so that the potential degree of improvement over Exp4 is huge. Things are much more interesting at slightly lower values of $\|u\|_1$.

In Corollary 5.1, can $M$ be any number in $\mathbb{N}$ for the sequence of basis elements to be disjoint?

$M$ corresponds to the number of basis elements in the comparator. Hence, as we assume the basis elements to be disjoint in Corollary 5.1, $M$ is only restricted by the number of contexts $|\mathcal{X}|$ and the size of the basis $|\mathcal{B}|$. We will clarify this in the final version.

How are the corresponding actions $b_j$ chosen?

The $b_j$ are simply the actions in the comparator sequence. The result holds for an arbitrary such sequence.

Comment

Thank you for the responses. I updated my confidence level to 2, and intend to maintain my rating.

Author Response

We thank all reviewers for their feedback.

We are happy to provide an additional result that was prompted by a question from Reviewer 3cAo. The reviewer asked how the regret bound and efficiency are affected if $\lambda$ is not exactly determined but only approximated (we previously assumed bounded precision, so we did not address this). We will now show that, even in this case, we incur a time complexity equal to that of Exp4, up to logarithmic factors, adding only 1 to the regret. This additive factor, however, can be made arbitrarily small.

Let us restrict ourselves to comparing against $\boldsymbol{u}$ with $u_i\leq Z$ for some arbitrary $Z$. Note that this always has to be the case when each expert has confidence of at least $1/Z$ on some trial. Due to this restriction, we can always project, at the beginning of trial $t$, $\boldsymbol{w}_t$ into the set $\{\boldsymbol{v} \mid v_i \leq Z\}$, which simply requires clipping its components.

For any $q\in\mathbb{R}$ let $\mathcal{V}_t(q)$ be the set of all $\boldsymbol{v}$ with $\boldsymbol{v}\cdot\boldsymbol{c}\leq q$. We note that given, for all $t\in[T]$, a value $q_t\in[1-1/T,1]$, there exists $\hat{\boldsymbol{u}}\in\bigcap_t\mathcal{V}_t(q_t)$ such that the cumulative reward of $\boldsymbol{\pi}(\hat{\boldsymbol{u}})$ is no less than that of $\boldsymbol{\pi}(\boldsymbol{u})$ minus 1. This means that as long as, on any trial $t$, we project into the set $\mathcal{V}_t(q_t)$ for some $q_t\in[1-1/T,1]$, we will not add more than 1 to the regret.

So the problem (for the projection step at time $t$, if necessary) is now to project into the set $\{\boldsymbol{v}\mid\boldsymbol{v}\cdot\boldsymbol{c}\leq q_t\}$ for some arbitrary $q_t\in[1-1/T,1]$. Following our use of Lagrange multipliers, this means that we need to find $\lambda^*>0$ with $\sum_{i}c_{t,i}w_{t,i}\exp(-\lambda^*c_{t,i})\in[1-1/T,1]$. So consider the function $f$ defined by $f(\lambda):=\sum_{i}c_{t,i}w_{t,i}\exp(-\lambda c_{t,i})$.

Consider $\lambda:=ZE\ln(ZE)$. Since $w_{t,i}\leq Z$, we have that when $c_{t,i}<1/ZE$ then $c_{t,i}w_{t,i}\exp(-\lambda c_{t,i})\leq c_{t,i}w_{t,i}<1/E$, and that when $c_{t,i}\geq 1/ZE$ then $c_{t,i}w_{t,i}\exp(-\lambda c_{t,i})\leq Z\exp(-\lambda/ZE)=1/E$. This implies that $f(\lambda)\leq 1$ and hence (since $f$ is monotonically decreasing) an acceptable $\lambda^*$ lies in $[0,ZE\ln(ZE)]$.

For general $\lambda$ we note that $\nabla f(\lambda)=-\sum_i c_{t,i}^2 w_{t,i}\exp(-\lambda c_{t,i})\geq -f(\lambda)$. This means that $|\nabla f(\lambda^*)|\leq 1$. Since the length of the interval $[1-1/T,1]$ is $1/T$, this means that the length of the interval containing acceptable values of $\lambda^*$ is at least approximately $1/T$.

So we have shown that either $\lambda^*=ZE\ln(ZE)$ is acceptable or the range of acceptable values of $\lambda^*$ is of length approximately $1/T$ and lies in $[0,ZE\ln(ZE)]$ (which has length $ZE\ln(ZE)$). The ratio of these lengths is $ZET\ln(ZE)$, so interval bisection will find an acceptable value of $\lambda^*$ in $\mathcal{O}(\ln(ZET\ln(ZE)))=\mathcal{O}(\ln(EZT))$ steps.

So we have a time complexity of $\mathcal{O}(EK+E\ln(EZT))$, and we have only added 1 to the regret (although this additive factor can be made arbitrarily small).
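The bisection argument above can be turned into a short routine. The sketch below is our own illustration (variable names assumed): it searches for $\lambda^*$ with $f(\lambda^*)=\sum_i c_{t,i} w_{t,i} e^{-\lambda^* c_{t,i}} \in [1-1/T,\,1]$, using the rebuttal's interval $[0,\, ZE\ln(ZE)]$.

```python
import numpy as np

def find_lambda(c, w, T, Z):
    """Bisection sketch for the projection step: find lambda with
    f(lambda) in [1 - 1/T, 1], where f(lambda) = sum_i c_i w_i exp(-lambda c_i)
    is monotone decreasing for c_i in [0, 1] and w_i <= Z.  Assumes f(0) > 1
    (otherwise no projection is needed) and Z * E > 1.  Illustrative only."""
    E = len(c)
    f = lambda lam: float(np.sum(c * w * np.exp(-lam * c)))
    lo, hi = 0.0, Z * E * np.log(Z * E)  # f(hi) <= 1 by the argument above
    if f(hi) >= 1 - 1 / T:               # hi itself is already acceptable
        return hi
    for _ in range(200):                 # O(ln(E Z T)) steps suffice in theory
        mid = (lo + hi) / 2
        if f(mid) > 1:                   # invariant: f(lo) > 1 >= f(hi)
            lo = mid
        else:
            hi = mid
            if f(hi) >= 1 - 1 / T:
                return hi
    return hi
```

Because $f$ is continuous and monotone decreasing, the endpoints squeeze toward the crossing of $f=1$, so the returned value satisfies the $[1-1/T,1]$ constraint well within the iteration budget.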

Final Decision

This paper received a significant amount of attention from the reviewers, and we appreciate the authors for making a serious effort to address their concerns and answer questions. Overall, the paper was very borderline, but the most negative reviewer was largely concerned with the difficulty of comparing the method to previous methods, due to a distinction in how the comparator set is defined. I looked into this more carefully, and I don't believe the comparator class is a problem. Indeed, the problem formulation is unique and I think it's pretty cool. So I am inclined to accept this paper.

However, I would urge the authors to prepare their final revision for the camera-ready with an eye towards clarity. The question about the comparator set is important, but the reviewers raised other issues as well. Please read their feedback carefully when updating the manuscript.