PaperHub

Average rating: 4.3 / 10 · Rejected (4 reviewers)
Individual ratings: 5, 3, 6, 3 (min 3, max 6, std 1.3)
Confidence: 3.5
Correctness: 2.3 · Contribution: 1.8 · Presentation: 2.5
ICLR 2025

Online Decision Deferral under Budget Constraints

OpenReview · PDF
Submitted: 2024-09-26 · Updated: 2025-02-05

Abstract

Keywords
Online Learning, Human-AI Collaboration

Reviews and Discussion

Review
Rating: 5

This paper introduces a new online decision-making problem that includes budget constraints and different types of feedback. The authors formalize this problem in a contextual bandit setting and propose a new algorithm for generalized linear rewards. They show that the algorithm achieves a near-optimal regret bound. They also validate its effectiveness with empirical results on real-world datasets.
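For readers less familiar with the setting, the deferral problem can be summarized in the standard budget-constrained contextual-bandit form below (illustrative notation following the usual Bandits-with-Knapsacks view; the paper's exact formulation may differ):

```latex
% Generic budget-constrained deferral formulation (illustrative notation).
% At round t the learner sees a context x_t and picks a_t in {model, human};
% deferring to the human consumes one unit of the budget B.
\[
\max_{\pi}\ \mathbb{E}\!\left[\sum_{t=1}^{T} r\big(x_t, \pi(x_t)\big)\right]
\quad \text{s.t.} \quad
\sum_{t=1}^{T} \mathbf{1}\{\pi(x_t) = \text{human}\} \le B,
\]
\[
\mathrm{Regret}(T) \;=\;
\mathbb{E}\!\left[\sum_{t=1}^{T} r\big(x_t, \pi^{\star}(x_t)\big)\right]
\;-\;
\mathbb{E}\!\left[\sum_{t=1}^{T} r\big(x_t, a_t\big)\right],
\]
% where \pi^{\star} denotes the best budget-feasible static policy.
```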

Strengths

a. The authors consider an interesting and practical problem that is well-motivated.

b. The paper provides both theoretical analysis and experimental results for the proposed algorithm. The regret bound of the algorithm is near-optimal under certain conditions. Experiments are conducted on both synthetic and real datasets, and the proposed algorithms outperform the baselines.

c. The presentation is clear and easy to follow.

Weaknesses

a. The proposed algorithm and analysis largely build on the work of Agrawal and Devanur (2016), which addresses a similar problem. Although the authors claim that they extend the algorithm to generalized linear rewards, it is unclear what specific challenges this extension presents and whether it requires new techniques.

b. The theoretical results only hold when the budget $B$ is relatively large, i.e., $B \geq d^{1/2} T^{3/4}$. This requirement may limit the applicability of the algorithm in scenarios with a limited budget.

c. The neural linear algorithm is a straightforward extension of Algorithm 1 that treats the neural network's embedding as the context, and no theoretical guarantees are provided for it (an illustrative sketch of this idea follows this list).

d. In the experiments, the authors only consider scenarios where $B$ is at least $0.25T$, which is a relatively large budget when $T$ is large. Results for smaller values of $B$ would be helpful to further validate the algorithm's performance.
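As a rough illustration of the neural-linear idea in point c, a LinUCB-style action choice on top of a frozen embedding might look like the sketch below. This is a generic reconstruction, not the authors' Algorithm 1: `embed`, the per-action statistics `A` and `b`, and the exploration constant `alpha` are assumptions, and the budget constraint is omitted for brevity.

```python
import numpy as np

def neural_linear_ucb_choice(embed, x_raw, A, b, alpha=1.0):
    """Choose 'model' or 'human' from a UCB score on the embedded context.

    embed : callable mapping a raw input to a d-dimensional feature vector
            (e.g. the penultimate layer of a pretrained network).
    A, b  : per-action ridge-regression statistics; A[a] is (d, d), b[a] is (d,).
    alpha : exploration coefficient (assumed, not tuned here).
    """
    z = embed(x_raw)  # the learned embedding plays the role of the bandit context
    scores = {}
    for a in ("model", "human"):
        A_inv = np.linalg.inv(A[a])
        theta = A_inv @ b[a]                     # ridge estimate of the reward parameter
        bonus = alpha * np.sqrt(z @ A_inv @ z)   # optimism bonus from the confidence ellipsoid
        scores[a] = z @ theta + bonus
    return max(scores, key=scores.get)
```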

Questions

a. Is it possible to obtain a theoretical guarantee (even if suboptimal) when $B$ is relatively small?

b. What is the challenge in extending the work of Agrawal and Devanur (2016) to your setting?

c. What are the empirical results of the proposed algorithm when $B < 0.25T$, for example $B = 0.1T$?

Comment

Thank you for your feedback. The general response above clarifies weaknesses a, b, and c.

The budget issue is indeed interesting. The budget is needed for the algorithm to learn the allocation function, i.e., whether or not to query the human on certain types of questions. If the budget is too small, the deferral system will not get enough signal about the human's performance on certain contexts and will not be able to make near-optimal decisions. This is a known issue in Bandits with Knapsacks, and we discuss this limitation below the theorem (lines 251-256). Our experiments (Figure 3) show that when the budget is low, our algorithm performs only slightly better than always deferring to the model, which is the "cheapest" option.
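As a rough illustration of this fallback behaviour, here is a toy sketch under assumed helper names (`choose_action`, `query_human`, `query_model`); it is not the paper's Algorithm 1. Once the human budget is spent, every remaining round is routed to the model, which is why a very small budget yields little signal about the human's skill profile.

```python
def run_deferral(contexts, choose_action, query_human, query_model, budget):
    """Toy deferral loop: the human can only be queried while budget remains."""
    remaining, total_reward = budget, 0.0
    for x in contexts:
        if remaining > 0 and choose_action(x) == "human":
            total_reward += query_human(x)   # costly expert query
            remaining -= 1                   # each human query consumes one budget unit
        else:
            total_reward += query_model(x)   # cheapest option: defer to the model
    return total_reward
```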

Your question is right on point. The cost of online learning to defer is a crucial obstacle to deploying these methods, and this is precisely the problem we tackle with bandit algorithms. This had not been proposed before, and we demonstrate empirically that, with relatively simple modifications to standard algorithms, we can achieve strong performance.

Comment

I acknowledge the contribution of implementing provable algorithms on real-world datasets. However, as the authors have stated, the algorithm's novelty is limited, and the datasets used are already publicly available. Therefore, I still have concerns regarding the overall contribution of the work and will maintain my score.

Review
Rating: 3

The paper proposes using the framework of Bandits with Knapsacks to choose a model's prediction or to defer to a skilled expert, where the expert is the limited resource available to the meta learner.

Strengths

None

Weaknesses

The paper has limited novelty and contribution.

  1. The paper simply uses an existing framework (Bandits with Knapsacks; [Badanidiyuru et al., 2018], [Agrawal and Devanur, 2016]) to choose between using a model's prediction and deferring to a skilled expert.
  2. Limited contribution:
  • The regret guarantee provided is a straightforward combination of the linear contextual bandits with knapsacks guarantee of [Agrawal and Devanur, 2016] with the generalized linear bandit analysis of [Li et al. (2017)], and the bulk of the proof in the appendix consists of re-statements of specific lemmas and corollaries from [Agrawal and Devanur, 2016] and [Li et al. (2017)].

  • The experiments are very limited. Section 5.1 is synthetic, while Section 5.2 provides results for two different problems: 1) the 0-1 knapsack problem, choosing between human solutions to the 0-1 knapsack problem and the solution of a greedy algorithm; this does not align with the original description of distribution shift leading to a decline in the ML model's prediction accuracy. 2) ImageNet, choosing between a pretrained model and human predictions; although this aligns with the original problem set-up, the evaluation is shallow and limited.

  3. Some missing references on further developments in Bandits with Knapsacks:

Questions

  • Could the authors specify their contributions, if any, beyond a simple combination of [Agrawal and Devanur, 2016] and [Li et al. (2017)] and the limited empirical evaluation on the ImageNet dataset?

  • There have been several new developments in the Neural Bandits literature since Riquelme et al. (2018) (see [1], [2] and [3]). Why have the authors not used these frameworks instead?

[1] Dongruo Zhou, Lihong Li, and Quanquan Gu. Neural contextual bandits with ucb-based exploration. In International Conference on Machine Learning, pp. 11492–11502. PMLR, 2020.

[2] Weitong Zhang, Dongruo Zhou, Lihong Li, and Quanquan Gu. Neural thompson sampling. In International Conference on Learning Representations (ICLR), 2021.

[3] Rohan Deb, Yikun Ban, Shiliang Zuo, Jingrui He, and Arindam Banerjee. Contextual bandits with online neural regression. In The Twelfth International Conference on Learning Representations.

Comment

Thank you for the references. We agree there is a lot of great work on Bandits with Knapsacks and our goal is to demonstrate to the community working on Online-Learning to Defer that there is great potential in using these methods on real data. Our benchmark is new, and the very proposal of leveraging Contextual Bandits with Knapsacks is new to this community to the best of our knowledge. Our paper is meant to draw attention to this literature through novel and thorough experiments showing promising results.

We hope this short paragraph answers your concerns about contributions.

Comment

I thank the authors for responding to my questions. As already stated in my Weakness section, I do not think the empirical evaluations are thorough enough and I maintain my score.

The authors say, "As such, your comments on "lack of novelty"(*) of our method and analysis are CORRECT, but they often and sadly sidetrack the main contribution which is to give a purpose and a concrete application to these good existing algorithms." They further say "It is not obvious, however, that such online deferral algorithm would work on real data, beyond the theoretical setting of these papers. Our paper is the first to show that one can indeed make them work in practice and the community lacks even benchmarks to test ideas. We illustrate them on novel benchmarks including recently published datasets."

I agree that using bandit frameworks for solving concrete problems would be a very good contribution, and that the community should also care about practical problems beyond regret bounds and simple simulations. However, I think the current set of experiments is too shallow for this to be considered a benchmark paper. I sincerely encourage the authors to evaluate the algorithms more comprehensively.

Review
Rating: 6

This paper studies an online decision-making problem with a human expert. The authors phrase the problem as a contextual bandit problem, where a deferral learner needs to choose whether to use an ML model or a human expert to obtain the reward. The authors propose an algorithm to solve this problem based on a UCB-like algorithm and characterize its regret guarantees. The authors further extend their algorithm using neural networks to handle complicated datasets.

Strengths

  1. The problem formulation and the algorithm design seem to be novel and of practical interest.
  2. Theoretical regret guarantees are provided for the proposed algorithms.

Weaknesses

  1. Considering the optimal static policy as the benchmark seems limited, as it can be far from the true dynamic optimal policy. Would it be possible, and how difficult would it be, to extend the current regret analysis in the paper to handle dynamic regret, which uses the dynamic optimal policy as the benchmark?
  2. The regret analysis seems to be a straightforward extension of existing works on UCB-based algorithms for bandit problems, as the authors mention in Section 4.
  3. There is no regret guarantee for the neural linear algorithm provided in the paper. Could you at least explain the specific challenges in deriving such guarantees for this approach?

Questions

  1. In line 163, what is the distribution $D$ there?
  2. Does Algorithm~1 work for the full information setting or the bandit information setting? Please clarify how the algorithm design differs when full information or bandit information is considered. It would be good to discuss the implications of using the algorithm in each setting, and how this choice affects the algorithm's performance or implementation.
  3. In general, full-information online learning algorithms typically yield smaller regret (e.g., algorithms for online convex optimization; Hazan & Levy, NIPS 2014). Could you explain why this is not the case for the algorithm studied in this paper?

Comment

Thank you for your feedback and comments. Regarding your main comments in the "Weaknesses" section, we refer to the general clarification. You are right that we do not provide all regret guarantees, in particular not in the neural linear setting. But our goal here was to explore how to build on principled algorithms and carefully use them on real data. Neural bandits have been studied separately (see e.g. http://proceedings.mlr.press/v119/zhou20a/zhou20a.pdf), but this is not our focus.

Answers to your questions:

  1. We assume that the contexts (features) are sampled from a fixed distribution. This is a standard assumption in contextual bandits (see e.g. Chapter 8.4 in Slivkins's book https://arxiv.org/pdf/1904.07272)

  2. Thank you; indeed O_t (the set of observations) is not defined in the algorithm, only in the text. O_t determines whether the deferral system observes both the chosen action's AND the model's rewards (full information) or only the chosen action's reward (bandit feedback). Here there are two actions, 'model' and 'human', so full information refers to always being able to observe the model's reward.

  3. We believe this last remark also answers your final question. You are right that full-information feedback usually yields lower regret (it saves a $\sqrt{K}$ factor, where $K$ is the number of actions). But here we only have two actions, and the two information regimes (bandit or full information) are not the main issue. We distinguish them mainly because both may occur in real-world settings, and we wanted to highlight that the algorithm works the same way in either case. Typically, querying the model may be very cheap if the decider owns the model, or expensive if it is a third-party service, so one may or may not be able to always observe the model. We will discuss this distinction further, because we agree it is slightly misleading as written.
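To make the two feedback regimes in answers 2 and 3 concrete, here is a minimal sketch of the observation set O_t; the function and argument names are assumptions for illustration, not the paper's notation.

```python
def observations(action, reward_model, reward_human, full_info):
    """Return the observations O_t revealed in one round of the deferral game.

    full_info=True : the model's reward is always observed, plus the human's
                     reward whenever the human was the chosen action.
    full_info=False: only the chosen action's reward is observed (bandit feedback).
    """
    if full_info:
        obs = {"model": reward_model}
        if action == "human":
            obs["human"] = reward_human
        return obs
    return {action: reward_model if action == "model" else reward_human}
```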

Comment

Thanks for your response. I will maintain my already positive score.

Review
Rating: 3

The manuscript introduces a framework for online learning-to-defer under a constrained budget for the human expert.

Strengths

The topic of the manuscript originates from a practical problem.

Weaknesses

The original contribution of the paper is hard to justify as it is largely based on the work of Agrawal and Devanur (2016).

Questions

See the weaknesses above.

Comment

Dear reviewers,

Thank you for your feedback. We will respond to specific concerns, but we would like to stress one simple point: this is not a conventional "bandit paper", in the sense that our contribution is NOT to design and analyse a bandit algorithm. In fact, the last paragraph of the Introduction specifically does NOT list this as a contribution, because we always refer to existing literature and existing results in Section 4 (Algorithm). As such, your comments on "lack of novelty"(*) of our method and analysis are CORRECT, but they often and sadly sidetrack the main contribution which is to give a purpose and a concrete application to these good existing algorithms. This application is Machine Learning with a Human in the Loop, the main topic of our paper.

Bandit algorithms with knapsack constraints have been extensively studied (we will add the citations suggested by Rev. MoQa) from a theoretical or algorithm-design perspective. On the other hand, the literature on (Online) Learning to Defer (see our Related Work, first two paragraphs) has not considered the problem of learning under potentially partial information. Rather than proposing a new algorithm, we show how to efficiently apply existing good ones to a hard and novel problem, and we demonstrate their effectiveness in several new experiments, including new and original datasets. We believe that we always explicitly refer to the relevant papers whenever we quote theoretical results, and we explicitly write: "The regret guarantees (Corollary 4.3 below) combine the proof of Agrawal and Devanur (2016) with results on unconstrained generalized linear bandits (Filippi et al., 2010; Li et al., 2017)".

It is not obvious, however, that such online deferral algorithm would work on real data, beyond the theoretical setting of these papers. Our paper is the first to show that one can indeed make them work in practice and the community lacks even benchmarks to test ideas. We illustrate them on novel benchmarks including recently published datasets. These findings are novel.

We hope this clarifies the main contribution. We will make sure to reflect this discussion in the main paper.

Best wishes, The authors

(*) See in particular Rev. a7cX: a, b, c; Rev. MoQa: 1, 2; Rev. ykhm: 2.

AC Meta-Review

The reviewers point out that the paper does not have algorithmic novelty. The authors clarify that this is indeed the case, and that the main point of novelty is running an existing algorithm on publicly available datasets. As such, there does not seem to be much disagreement between the two parties on the technical merit. My judgment ultimately rests on fit: the authors should consider submitting the paper to a more practice-oriented venue (ICML and NeurIPS would likely be difficult). A good choice could be the MSOM practice-based research competition (see https://pubsonline.informs.org/page/msom/practice-based-research-competition), as well as the various tracks of the journal MSOM. The authors will likely have an easier time there.

Additional Comments from the Reviewer Discussion

NA

Final Decision

Reject