Bandits with Abstention under Expert Advice
We study bandits with expert advice when the learner is given the option to abstain from taking any action, and we achieve novel reward bounds for confidence-rated predictors.
Abstract
Reviews and Discussion
This paper considers the problem of prediction with expert advice in the bandit feedback setting with the possibility of abstention. The game is sequential: in each round, each of the experts gives a distribution over actions/arms. One of these actions represents abstention, yielding a reward of 0, while the remaining actions have rewards in $[-1,1]$ set by an oblivious adversary.
The contributions of this paper include a procedure called confidence-rated bandits with abstention (CBA), which is an adaptation of the well-known exponentially weighted scheme. The authors present guarantees in the form of an upper bound on the expected cumulative regret. Additionally, they explore applications of the resulting algorithm to various other problems.
Strengths
The problem under consideration is both interesting and practically relevant, as the ability for a prediction strategy to abstain is crucial in many scenarios, particularly when the risk of making poor decisions is high. Additionally, accounting for the context is a significant and pertinent feature of modern applications.
Weaknesses
Overall, I find this paper unclear and difficult to read. The results obtained are not easily comparable to existing ones in the literature.
For instance, in the case of regret minimization with expert advice, such as in the analysis of the EXP4 algorithm, regret is typically defined with respect to the best fixed expert in hindsight. The fact that the authors develop guarantees with respect to the best linear combination of experts (where this combination is not necessarily convex, since the set of comparator vectors defined in line 123 is not the set of convex weights) can be very confusing for the reader. The authors compare with the guarantees of the EXP4 algorithm, stating (line 138) that the EXP4 bound essentially replaces one term in their bound by another. However, it is unclear how this translates into worst-case guarantees. This part clearly needs a thorough discussion to be convincing.
Additionally, it is not clear how this work compares to previous research on online learning with abstention.
Questions
How do your results compare to [1]?
[1] Gergely Neu and Nikita Zhivotovskiy. Fast rates for online prediction with abstention. In Proc. COLT, pages 3030–3048, 2020.
Limitations
See weaknesses section.
Thank you for your time and effort in reviewing our paper.
The results obtained are not easily comparable to existing ones in the literature.
Our results can, for example, be compared to EXP4 (see reply below) and SpecialistExp (see, e.g., our reply to reviewer Edre). In both cases we achieve at least the same bounds, and in many cases we achieve significantly better bounds. We will improve our exposition and add a more thorough discussion in the final version.
For instance, in the case of regret minimization with expert advice, such as in the analysis of the EXP4 algorithm, regret is typically defined with respect to the best fixed expert in hindsight. The fact that the authors develop guarantees with respect to the best linear combination of experts (where this combination is not necessarily convex, since the set of comparator vectors defined in line 123 is not the set of convex weights) can be very confusing for the reader. The authors compare with the guarantees of the EXP4 algorithm, stating (line 138) that the EXP4 bound essentially replaces one term in their bound by another. However, it is unclear how this translates into worst-case guarantees. This part clearly needs a thorough discussion to be convincing.
You are right in that both the reward of our algorithm and that of Exp4 are measured with respect to a linear combination of experts, and for Exp4 that linear combination is required to be convex (i.e., the weights have to sum up to 1). Typically the weights of our linear combination sum to some value far greater than 1 (and they never sum to less than 1). Hence (since the regret term is of lower order) our reward is effectively that multiple of the reward that Exp4 competes with. We will add this discussion to the paper for clarity.
The reason the EXP4 bound essentially replaces one term of our bound by another is that Exp4 plays with a weight vector that is normalized at each time step, and therefore operates in a stricter space.
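For concreteness, here is a sketch of the distinction between the two comparator classes (the notation u for the comparator and N for the number of experts is ours, and we ignore any additional per-trial validity constraints the paper may impose):

```latex
% Exp4 competes with convex combinations of the N experts:
\mathcal{U}_{\mathrm{Exp4}} = \Big\{ u \in \mathbb{R}_{\ge 0}^{N} :
  \textstyle\sum_{i=1}^{N} u_i = 1 \Big\},
% whereas the comparator class discussed here only lower-bounds the
% total weight:
\mathcal{U}_{\mathrm{CBA}} \approx \Big\{ u \in \mathbb{R}_{\ge 0}^{N} :
  \textstyle\sum_{i=1}^{N} u_i \ge 1 \Big\},
% so the benchmark reward (and hence the achievable reward, once the
% lower-order regret term is ignored) can exceed Exp4's by the
% factor \sum_i u_i.
```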
The results obtained are not easily comparable to existing ones in the literature. [...] Additionally, it is not clear how this work compares to previous research on online learning with abstention. How do your results compare to [1]?
[1] Gergely Neu and Nikita Zhivotovskiy. Fast rates for online prediction with abstention. In Proc. COLT, pages 3030–3048, 2020.
We note that, to the best of our knowledge, there are no previous results on adversarial contextual bandits with the abstention option. The main novelty of the paper lies in merging the abstention action with confidence-rated predictors. This allows us to extend our result to (adversarial) contextual bandits with the abstention option. We introduce the first cumulative reward bounds for confidence-rated experts, and we investigate contextual bandits as a potential application. Note that the cumulative reward bound we obtain is not achievable by previous work on confidence-rated experts, as those works scale the regret by the confidence of the best expert in hindsight. We discuss previous results on online learning with abstention in the main body, but the previously considered settings differ from ours. Moreover, previous methods cannot be directly applied to confidence-rated predictors. Specifically, our results cannot be directly compared to [1], as their learning setting is different: it is a full-information binary classification problem (in contrast to our bandit setting).
Thank you for your rebuttal; I have read your response.
This paper investigates the problem of prediction with expert advice in contextual adversarial bandits, introducing the CBA (Confidence-rated Bandits with Abstention) algorithm. The authors apply CBA to adversarial contextual bandits and achieve a near-optimal regret upper bound.
Strengths
CBA achieves a near-optimal regret upper bound in adversarial contextual bandits.
Weaknesses
Weakness 1: This algorithm is not as efficient as Exp4.
Weakness 2: In the algorithm, they mention using interval bisection to find the normalization parameter, but the regret upper bound does not include the approximation error of the interval bisection.
Questions
Q1: Why not consider the reward of the abstention action to be non-negative instead of zero?
Q2: What is the novelty in the proofs?
Limitations
The algorithm is not as efficient as Exp4, and the upper bound does not include the approximation error of the interval bisection.
Thank you for your time and effort in reviewing our paper.
Weakness 1: This algorithm is not as efficient as Exp4.
Given that the numerical precision of our machine is bounded, our algorithm has exactly the same efficiency as Exp4. See the additional result presented in the general rebuttal for our time complexity (equal to that of Exp4 up to logarithmic factors) when the precision is unbounded.
Weakness 2: In the algorithm, they mentioned using interval bisection to find the normalization parameter, but the regret upper bound does not include the approximation error of the interval bisection.
We made the assumption that the degree of precision was bounded (the interval bisection method requires a number of steps logarithmic in the inverse precision). Thanks to your question, we attached a proof in the general rebuttal that removes this assumption. When the magnitudes of the components of the comparator u are bounded by some arbitrary value Z (which will typically be very small), we can achieve a time complexity equal to that of Exp4 up to logarithmic factors, whilst only adding an additive factor of 1 (which can be made arbitrarily small) to the regret.
Q1: Why not consider the reward of the abstention action to be non-negative instead of zero?
We can have any reward range with any endpoints, and the reward of the abstention action can be any fixed constant in that range. This can be achieved by scaling and translating the rewards (on any trial) when received, bringing them into the constraints of our paper (hence Thm 3.1 still applies, but the regret term is scaled by a constant depending on the range). Such a scaling approach is standard in online learning.
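To illustrate the rescaling argument (a sketch under the assumption that the paper's canonical setting has rewards in $[-1,1]$ and abstention reward $0$; the explicit map below is ours):

```latex
% Given rewards r in [a, b] and a fixed abstention constant c in [a, b],
% apply the affine map
\tilde r \;=\; \frac{r - c}{\max\{\, b - c,\; c - a \,\}} .
% Then \tilde r \in [-1, 1] and the abstention action receives
% \tilde r = 0, so the canonical analysis applies; undoing the map
% scales the regret term by the constant
% \max\{ b - c,\, c - a \} \le b - a.
```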
Q2: What is the novelty in the proofs?
Our proof borrows ideas from the analysis of mirror descent, EXP3, and the method of Lagrange multipliers (also with unbounded precision; see the general comment). However, we note that the main novelty of our work was the realization that making such a modification to mirror descent would lead to true reward bounds for confidence-rated predictors. We note that mirror descent itself would not work, as the constraint set is not known until after the last trial (and, even if it were known, computing the projections would be computationally hard).
I have read your response and will keep my positive score.
The paper considers the problem of prediction with expert advice under the bandit framework.
There are K arms plus a special 0-th arm, which always incurs a gain of 0; this arm may be interpreted as the action of abstaining. The learner outputs a distribution over the arms and earns the scalar-product gain. In other terms, the learner outputs a vector of non-negative components summing to a number between 0 and 1; the sum can be thought of as the measure of its confidence.
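In symbols, one way to write this prediction rule (the notation is ours, not necessarily the paper's):

```latex
% The learner's prediction on trial t: non-negative weights over the
% K non-abstention arms, summing to at most 1 (the "confidence"):
p_t \in \mathbb{R}_{\ge 0}^{K}, \qquad
c_t := \textstyle\sum_{k=1}^{K} p_{t,k} \in [0, 1].
% The residual mass 1 - c_t goes to the abstention arm, which pays 0,
% so the expected gain is the scalar product with the reward vector:
\mathbb{E}[\mathrm{gain}_t] \;=\; \langle p_t, r_t \rangle .
```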
For this setup, the problem of prediction with expert advice is considered and a regret bound is obtained. There are two terms in the regret: one involves a KL-divergence (there is a class of prediction with expert advice bounds of this kind), and the other is seemingly linear in the horizon, but it can be tuned to get sublinear growth.
The result has been applied to the framework with side information (contexts). Each expert is associated with a pair consisting of a set of contexts and an arm. Whenever the current context belongs to the set, the expert bets everything on that arm, and it abstains otherwise.
The main result of the paper is applied to this scenario with finitely many experts based on disjoint balls, and a matching lower bound is obtained.
The case where the sets are balls in a metric space is considered separately.
Strengths
I think this is a very interesting result in prediction with expert advice.
Weaknesses
No obvious weaknesses.
Questions
None
Limitations
Yes.
Thanks for the very encouraging feedback!
Thank you. This is to acknowledge the response.
This paper first considers the problem of bandits with expert advice, allowing the learner to abstain in any round. It introduces an algorithm, CBA, based on mirror descent, which can achieve a better cumulative regret bound in comparison with EXP4.
The algorithm is then applied to adversarial contextual bandits, where the learner abstains in rounds where the given context is not covered by any expert. The paper provides upper and lower bounds on the cumulative reward, and details a more computationally-efficient implementation for settings where the contexts have inherent structure.
Strengths
- Online learning with abstention is an interesting and relevant problem for the community, and the extension of CBA to adversarial contextual bandits with abstention seems novel.
- The paper presents reward bounds for both the bandits with expert advice problem and the adversarial contextual bandit problem. These bounds can improve upon those of existing algorithms. A lower bound is also provided for the latter setting.
- The experiments offer some insights into the construction of basis functions for contextual bandits with abstention.
Weaknesses
- The paper can be quite difficult to follow at times. For instance, in Section 5, the discussion about covering foreground and background classes seems very abrupt. I understand this is mentioned in the introduction, but since the problem settings are not formally introduced until later, the initial discussion is also confusing. In addition, the purpose of the experiments remains somewhat unclear to me -- they are not properly discussed, and there is a sudden change of theme from learning in bandits with abstention to constructing basis functions for various graphs. I also find the contextualization relative to prior work somewhat inadequate -- there are not enough details about the SpecialistExp algorithm to make the comparison concrete (lines 215 - 223). See also questions below.
- The improvement of CBA over EXP4 is somewhat vague and seems to largely depend on the set of experts. It would be nice if some concrete examples were provided.
- There is a gap between the upper and lower bounds in Section 5, which can be substantial since the size of the basis set can be exponential in the size of the set of contexts.
- There seems to be a lack of discussion of the intuition behind why CBA should be applied to adversarial contextual bandits with abstention in this particular way.
Questions
- Following Theorem 3.1, can you provide a concrete example where the total comparator weight attains its maximum value? The example given in the footnote seems to require all other experts to have confidence 0 in the round where one expert has confidence 1. What conditions need to be imposed on the set of experts?
- In Corollary 5.1, can the number of basis elements in the comparator be arbitrary while keeping the sequence of basis elements disjoint?
- How are the corresponding actions chosen?
- In Section 5, what are the specific benefits of allowing the learner to abstain?
Limitations
It would be nice if the authors included a separate section/paragraph that discusses the limitations in detail.
Thank you for your time and effort in reviewing our paper.
[...] gap between the upper and lower bounds [...] the size of the basis set can be exponential in the size of the set of contexts.
While the comment is true, and the basis set can indeed be exponentially large in principle, such a large basis set would defy its purpose (serving as an inductive prior on the shape of the clusters). All the basis sets we consider have size polynomial in the number of contexts; thus, the gap is just logarithmic in the number of contexts. Generally, we think of the basis as a set of “simple” stereotypical sets, which, for example, can be quantified by assuming the basis has finite VC-dimension. In that case the dependence is still logarithmic in the number of contexts, by the Sauer lemma.
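For reference, a statement of the Sauer–Shelah bound invoked here (with n the number of contexts and d the VC-dimension of the basis; this notation is ours):

```latex
% Sauer–Shelah: a set system of VC-dimension d over n points has size
|\mathcal{B}| \;\le\; \sum_{i=0}^{d} \binom{n}{i}
  \;\le\; \left(\frac{en}{d}\right)^{d},
% hence the log-size entering the regret stays logarithmic in n:
\ln |\mathcal{B}| \;\le\; d \ln \frac{en}{d} \;=\; O(d \ln n).
```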
[...] there is a sudden change of theme from learning in bandits with abstention to constructing basis functions for various graphs [...] There seems to be a lack of discussion of the intuition behind why CBA should be applied to adversarial contextual bandits with abstention in this particular way.
A natural application of experts, and therefore confidence-rated experts, is adversarial contextual bandits with a finite context space. In this setting, it is quite common in the literature (e.g., [1], [2]) to consider a graph structure over the contexts. Therefore, we decided to study a practical application of confidence-rated experts, where the experts, in our case, are the basis elements we introduce. We will improve the exposition here in the final version.
[1] Cesa-Bianchi et al. "A gang of bandits." NeurIPS 2013.
[2] Wu, Qingyun, et al. "Contextual bandits in a collaborative environment." SIGIR 2016.
[...] there are not enough details about the SpecialistExp algorithm to make the comparison concrete (lines 215 - 223). See also questions below.
We shall add the regret bound of SpecialistExp to the paper. The difference is that our regret scales with the number of disjoint basis elements needed to exactly cover the foreground (non-abstention) classes, whereas the regret of SpecialistExp scales with the number of disjoint basis elements needed to cover all of the classes (including the background/abstention class). As depicted in Figure 1, the latter can be much larger than the former.
In Section 5, what are the specific benefits of allowing the learner to abstain?
The benefits of the abstention action are clear when dealing with contexts (and therefore, in graph-structured settings, with nodes) whose behavior does not reflect the inductive bias. In the example of Figure 1, you can think of any point in white as being colored in any possible way; without abstention we would need a ball covering each of these points, as in SpecialistExp. Abstaining therefore avoids having to deal with points that exhibit “unnatural” behavior with respect to our inductive bias.
Following Theorem 3.1, can you provide a concrete example where the total comparator weight attains its maximum value? The example given in the footnote seems to require all other experts to have confidence 0 in the round where one expert has confidence 1. What conditions need to be imposed on the set of experts?
Assuming each expert has confidence 1 on at least one trial (otherwise the total comparator weight could be even higher), the maximum possible total weight equals the number of experts. The condition for attaining this value is that the confidences of all experts sum, on any trial, to no greater than 1. Such a comparator is not that interesting (it places equal weight on each expert), but it was introduced in order to show just how high the total weight can be, so that the potential degree of improvement over Exp4 is huge. Things are much more interesting at slightly lower values.
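A minimal instance consistent with this description (our construction for illustration; the notation $c_{i,t}$ for expert $i$'s confidence on trial $t$ is assumed): take $N$ experts and $T \ge N$ trials, with

```latex
% Expert i is fully confident exactly on trials congruent to i mod N:
c_{i,t} \;=\; \mathbb{1}\{\, t \equiv i \ (\mathrm{mod}\ N) \,\}
\qquad\Longrightarrow\qquad
\sum_{i=1}^{N} c_{i,t} \;=\; 1 \quad \text{for every trial } t .
```

Each expert then has confidence 1 on at least one trial, the per-trial confidences never sum to more than 1, and the comparator placing weight 1 on every expert is admissible, with total weight $N$.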
In Corollary 5.1, can the number of basis elements in the comparator be arbitrary while keeping the sequence of basis elements disjoint?
It corresponds to the number of basis elements in the comparator. Hence, as we assume the basis elements to be disjoint in Corollary 5.1, this number is only restricted by the number of contexts and the size of the basis. We will clarify this in the final version.
How are the corresponding actions chosen?
They are simply the actions in the comparator sequence. The result holds for an arbitrary such sequence.
Thank you for the responses. I updated my confidence level to 2, and intend to maintain my rating.
We thank all reviewers for their feedback.
We are happy to provide an additional result that was prompted by a question from Reviewer 3cAo. The reviewer asked how the regret bound and efficiency are affected if the normalization parameter is not determined exactly but only approximated (we previously assumed bounded precision, so we did not address this). We will now show that, even in this case, we incur a time complexity equal to that of Exp4, up to logarithmic factors, while adding only 1 to the regret. This additive factor, moreover, can be made arbitrarily small.
Let us restrict ourselves to comparators whose components are bounded by some arbitrary value Z, as in our reply above. Note that this always has to be the case when each expert has a sufficiently large confidence on some trial. Due to this restriction, we can always project, at the beginning of each trial, the weight vector into the corresponding constraint set, which simply requires clipping its components.
We then note that the normalization constraint only needs to be satisfied approximately: for any comparator there exists an approximately normalized vector whose cumulative reward is no less than that of the comparator minus 1. This means that as long as, on any trial, we project into the relaxed (approximately normalized) constraint set, we will not add more than one to the regret.
So the problem (for the projection step on a given trial, if necessary) is now to project into this relaxed set. Following our use of Lagrange multipliers, this means that we need to find a multiplier whose associated normalization value lands in the admissible band. The normalization is monotonically decreasing in the multiplier, and evaluating it at the endpoints of a suitable initial interval shows that an acceptable multiplier lies within that interval.
Moreover, bounding how fast the normalization varies with the multiplier shows that the set of acceptable values is itself an interval whose length is at most polynomially smaller than that of the initial interval. Hence interval bisection will find an acceptable multiplier in logarithmically many steps.
So we have a time complexity equal to that of Exp4 up to logarithmic factors, and we have only added 1 to the regret (although this additive factor can be made arbitrarily small).
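A minimal sketch of this bisection step, under our assumptions: the normalization f is an exponential-weights-style quantity that is monotonically decreasing in the multiplier, and any multiplier whose value of f lands in a small tolerance band is acceptable. The function name and the exact form of f are illustrative, not the paper's construction.

```python
import math

def project_by_bisection(log_w, lo, hi, target, tol, max_steps=100):
    """Find a multiplier lam such that the normalization
    f(lam) = sum_i exp(log_w[i] - lam) lands in [target, target + tol].

    f is monotonically decreasing in lam, so plain interval bisection
    applies; the number of steps is logarithmic in (hi - lo) / tol.
    """
    def f(lam):
        return sum(math.exp(lw - lam) for lw in log_w)

    for _ in range(max_steps):
        mid = 0.5 * (lo + hi)
        val = f(mid)
        if target <= val <= target + tol:  # inside the acceptable band
            return mid
        if val > target + tol:             # f too large -> increase lam
            lo = mid
        else:                              # f too small -> decrease lam
            hi = mid
    return 0.5 * (lo + hi)

# Example: renormalize weights so they sum to (approximately) 1.
log_w = [0.3, -0.1, 0.7, 0.0]
lam = project_by_bisection(log_w, lo=-10.0, hi=10.0, target=1.0, tol=1e-6)
w = [math.exp(lw - lam) for lw in log_w]
print(sum(w))  # ~1.0, within the tolerance band
```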
This paper received a significant amount of attention from the reviewers, and we appreciate the authors for making a serious effort to address their concerns and answer questions. Overall, the paper was very borderline, but the most negative reviewer was largely concerned with the difficulty of comparing the method to previous methods, due to a distinction in how the comparator set is defined. I looked into this more carefully, and I don't believe the comparator class is a problem. Indeed, the problem formulation is unique and I think it's pretty cool. So I am inclined to accept this paper.
However, I would urge the authors to prepare their final revision for the camera-ready with an eye towards clarity. The question about the comparator set is important, but the reviewers raised other issues as well. Please read their feedback carefully when updating the manuscript.