Bandits with Replenishable Knapsacks: the Best of both Worlds
摘要
评审与讨论
This paper focuses on the bandits with replenishable knapsacks (BwRK), building upon a framework introduced in Kumar and Kleinberg (2022). This paper presents a "best-of-both-worlds" algorithm to address such a learning problem. Furthermore, it conducts theoretical analyses of the algorithm under two distinct scenarios: adversarial inputs and stochastic inputs. Finally, this paper presents an exploration of potential economic applications.
优点
(1) This paper presents, for the first time, a best-of-both-worlds algorithmic design for bandits with replenishable knapsacks, accompanied by the corresponding theoretical analysis. (2) This paper provides the first instance-independent regret bound under i.i.d. inputs. This complements the earlier work of Kumar and Kleinberg (2022), which primarily focused on instance-dependent analysis in the initial BwRK paper.
缺点
(1) There are some implicit assumptions that are not well justified. Firstly, the authors say that B=\omega (T) which seems not practical as it is not evident in which real-world situations a direct linear relationship exists between the budget and the total number of rounds T. A constant budget will result in a non-sublinear final upper regret bound. Secondly, the range of the replenishable resource \beta is [-1,0] which indicates that such amount of source is close to the consumption cost of other action. This, in turn, suggests that the policy will frequently favor the empty action once the budget B is depleted. It would be beneficial to offer motivating real-world scenarios that illustrate the relevance and feasibility of such a resource range. Finally, in the example of the inventory management in Section 6.2, the reward action s is r_t(s)\in [-1,0] which is not consistent with the setting in Section 2. Please give a clear justification for this variation. (2) The Meta-Algorithm proposed in this paper appears to be a trivial extension of the one presented in "Online Learning with Knapsacks: the Best of Both Worlds" (2022). The main difference is that the policy does the empty action when the budget is smaller than one. Furthermore, in terms of theoretical analysis, although the concept of a replenishable resource is introduced, the proofs for Theorem 4.1 and Theorem 4.3 do not exhibit significant advancement when compared to the proofs for Theorem 6.1 and Theorem 7.1 of the 2022’s work. Please clarify the primary contribution compared with previous literature. (3) The paper would benefit from the design of experiments to provide stronger empirical support for the theoretical findings.
Minor Comments: Page 4: note --> Note Page 9: simple simple --> simple
问题
See weakness.
On the assumption: We do not need to assume . Indeed, one of the main contributions of our paper is to show that when and the replenishable factor is constant, we can still achieve regret in the stochastic setting and constant competitive ratio in the adversarial setting. Nonetheless, we agree with the reviewer that our results are meaningful only when the budget or the replenishment factor is ``large enough''. This is common to all the previous work on BwK in which the regret depends on . We observe that this is not a limitation of our analysis but rather an intrinsic feature of the problem. Indeed, Balseiro and Gur (2019) provide a lower bound on the competitive ratio in the adversarial setting without replenishment. Hence, it is impossible to obtain a constant competitive ratio in general.
Constant budget: The observation is correct only if there is no replenishment. If the replenishment factor is constant, we obtain meaningful regret bounds (e.g., a in the stochastic setting). In the extreme case when the budget is constant and there is no replenishment our theorems hold but become meaningless. However, this is fine since any algorithm fails in this setting, as it does not have enough time to learn before depleting the budget (think about the case in which the initial budget is and it is depleted at the first round if an expensive action is played). We remark that this problem is present in all the algorithms for BwK. For instance, the regret in BwK problems depends on (see, e.g., Castiglioni et al. (2022)) while for the case of bandits with replenishable knapsacks Kumar and Kleinberg (2022) provide a regret bound that depends on the replenishment factor (as in our work).
Regarding the assumption about the replenishment range: this is without loss of generality since we can always scale the budget, costs, and replenishments to have replenishments in . So it might be the case that the costs and replenishments are similar, that costs are close to 1, and the replenishments are very small, or the opposite. Our algorithm can deal with all these scenarios. As is the case in previous works, this flexibility of the model and algorithm motivates the choice of keeping the problem at a high level of abstraction without the need to focus on specific applications.
Relation with prior work: From a high-level, many studies on BwK may appear similar in their approaches, yet they differ fundamentally in their analysis. For instance, ``Online Learning with Knapsacks: the Best of Both Worlds’’ (ICML 2022) employs almost the same algorithm of Immorlica et al. (FOCS 2019). This similarity is also present in subsequent works such as Slivkins et al. (COLT 2023), Castiglioni et al. (NeurIPS 2023). The reason why all these works might look similar at first glance is that they usually present a general primal-dual template, which is then instantiated with different subroutines. The real contribution of these works, as in our case, lies in the specific subroutines and the new analysis of the framework.
We strongly believe that our techniques introduce several important innovations which distinguish them from the algorithms and analysis used in prior related works:
- Extending primal-dual methods to adversarial BwK with non-monotone resource consumption is highly non-trivial. A non-monotone constraint is more complex than monotone one, and our constraints are also hard (i.e., they must be satisfied at each round like any budget constraint). Our proofs crucially rely on the fact that the regret minimizers have sublinear regret in different sub-intervals (see proofs of Thm. 4.1 and Thm. 4.3). Most of the existing literature relies on standard regret minimization algorithms, which would prove inadequate in our setting.
- One of the key technical challenges is bounding the Lagrangian multipliers. This can be easily done in classical BwK problems (without replenishment) since a bound on the Slater’s parameter is available. Here, the replenishment factor is unknown and, therefore, the dual space cannot be bounded a priori. To circumvent this issue, we exploit no-adaptive-regret guarantees of the primal and dual regret minimizer. Dealing with such constraints requires different techniques and a new analysis, like the division of the rounds into three phases and a novel use of the dual regret minimizer with no-adaptive-regret.
On inventory management: In the problem's original formulation, the reward can be negative. However, this problem can be trivially solved by normalizing the rewards in [0,1]. Formally, we add 1 to the reward and divide it by 2. This does not affect the results in the stochastic setting. As mentioned in the paragraph ``Challenges of the adversarial setting'', rescaling and translation introduce some challenges in the adversarial settings, which we leave as an interesting open problem.
On experiments: Since we are the first to study an adversarial BwK model with replenishment, we would have no previous algorithms to employ as a meaningful baseline. All the existing algorithms would fail in the case (existing algorithms cannot even be initialized in this case). Conducting an experimental evaluation will certainly be valuable once there is a deeper understanding of adversarial BwK problems with replenishment and more algorithms are available.
This paper studies the problem of BwK under replenishable knapsack assumption, where the budget consumption in each round could be negative. The authors propose and analyze a primal-dual framework, achieving best-of-both world performance.
优点
The problem setup is clearly motivated and introduced. The algorithm and analysis are thoroughly explained and neatly presented. Regret bounds improves on existing results.
缺点
I believe there are some related work that could be beneficial to be added to the comparison. The line of work I would like to mention is the bandits with (soft) constraints. The authors argue that . However, when allowing negative s, it is essentially allowing constraint violation in those rounds.
问题
- Any lower bound results?
- Connection and comparison with soft constraint satisfaction? It seems to me that allowing negative budget consumption is somewhat related to constraint violation in such round, so the results should be comparable.
- Optimizing the coefficients in the regret bound? In bandits these are hidden in the big O term most of the times, but it could be pretty large in many cases, especially when T is not very large. Alternatively, without theoretical justification, some experiments showing that the bounds are empirically tight would be very helpful.
We thank the Reviewer for their comments.
There is a misunderstanding. We require that the global budget constraint must be satisfied at each round, i.e., the budget cannot be negative for all . This is not the case for general (soft) constraints that can be globally violated in some rounds, i.e., for all . Of course the two problems are related (but not equivalent) to satisfying the per-round consumption constraint at each round.
Lower bounds: Our setting strictly generalizes the classical (i.e., without replenishment) BwK model. In this setting, the lower bound is (as in standard bandit problems without constraints) in the stochastic setting, and -competitive ratio (that in the BwK framework is equivalent to ) in this adversarial setting (see Balseiro and Gur (2019)). We match the lower bounds in both settings.
Q2: We partially agree with the reviewer. However, budget constraints cannot be violated (globally) at any round. This makes the problem harder. Moreover, we are unaware of any best-of-both-worlds algorithm working with general soft constraints and unknown Slater’s parameter.
Optimizing coefficients: Most recent research on BwK primarily focuses on theoretical characterization instead of experimental evaluations. Also for this reason, we tried to prioritize clear result presentation over constant optimization. This helps to simplify the presentation of the analysis and results of the paper which, as noticed by other reviewers, is already quite complex.
This paper studies an extension of Bandits with Knapsacks, called Bandits with Replenishable Knapsacks (BwRK). It proposes a primal-dual algorithm addressing both the adversarial and stochastic cases, with competitive ratio and regret upper bound analysis. The paper also discusses application examples of BwRK and its results.
优点
- Well-motived model formulation. BwRK is an interesting extension of BwK, and the authors provide application examples as well (Sec. 6).
- Algorithmic framework. The authors propose a primal-dual template (Alg. 1) that can be applied with various minimizers under different scenarios.
缺点
- The theoretical results are not clearly discussed. For example, how tight are the theoretical results compared to lower bounds?
- Lack of experiments. It would be interesting to know the empirical performance of the proposed algorithm. Especially compare the actual performance of this paper's algorithm with known ones in BwK when . Are they exactly the same algorithms?
问题
- From my aspect, "the best of both worlds" means that the proposed algorithm can achieve optimal theoretical guarantees in two different settings. Could the authors point out how their theoretical upper bounds match the lower bounds in these two settings? As BwRK is a new model, do we have such lower bounds?
We thank the Reviewer for their comments.
Lower bounds: Our setting strictly generalizes the classical (i.e., without replenishment) BwK model. In this setting, the lower bound under stochastic inputs is (as in standard bandit problems without constraints), and -competitive ratio (that in the BwK framework is equivalent to ) in the adversarial case (see Balseiro and Gur (2019)). We match the lower bounds in both settings.
Experimental Evaluations: Most studies on BwK have focused more on theoretical contributions than on experimental evaluations. Moreover, since we are the first to study an adversarial BwK model with replenishment, we would have no previous algorithms to employ as a meaningful baseline. We observe that all the existing algorithms would fail in the case (existing algorithms cannot even be initialized in this case) while ours will still provide meaningful performance. Therefore, we have chosen to prioritize theoretical characterization of the problem. An experimental evaluation will certainly be valuable once there is a deeper understanding of adversarial BwK problems with replenishment and more algorithms are available in this context.
The case of : When there is no replenishment, our algorithms would resemble existing ones, though not exactly the same, as they both utilize the same primal-dual structure. We remark that our approach requires more advanced regret minimizers to address the general case. This makes sense since we want to recover the optimal guarantees of previous works when .
Thanks for the response. The authors did not respond to the question raised by the reviewer on clarifying the best of both worlds.
We thank the reviewer for their comment and we apologise if our response was not clear enough on this point.
Our algorithm guarantees are tight in both the stochastic and adversarial case. Indeed we achieve regret in the stochastic case (which is obviously tight). In the adversarial case our result essentially depends on only one parameter and we achieve fraction of the optimal reward. This is optimal in the worst case, since in the case without replenishment (where and ) the lower bound is as shown in Balseiro and Gur (2019). We will clarify this in the final version of the paper.
We agree that it would be interesting to further characterise the "interplay" between and . We leave this as an interesting open question.
This paper studies the problem of bandits with knapsacks (BwK). In a BwK problem, there are m (m > 1) different types of cost for each action, and the agents needs to maximize the cumulative rewards (i.e., minimize the cumulative regret) before any resource runs out. In classic bandit setting, there is only one cost which is time, so the standard MAB problem is a special case of BwK.
In this paper, the cost of each action can be negative, whereas the previous BwK works only study the cases where the costs are non-negative. This new setting simulates the real-world applications (eg inventory management and Bilateral trade given in the paper) where the resources can be recovered with time.
This paper gives a general primal-dual algorithms for this new setup. In both of adversarial and stochastic bandit settings, this paper achieves nearly optimal results (linear gaps for adversarial and logarithm gaps for stochastic). To be specific, in the adversarial case, no algorithm can achieve sub-linear regrets, and the proposed algorithm only has constant gap to the lower bound. In the stochastic case, the algorithm achieves \tilde{O}(\sqrt{T \log{mT}}) regret upper bound (note that the trivial lower bound is \Omega(\sqrt{T})).
优点
-
This paper proposes and studies a novel and interesting problem. It is pretty smart to consider the case where the cost of an arm could be negative.
-
The results are rich and good. It studies multiple cases, and provides good results for them. The results look good enough from my understanding and knowledge.
缺点
This paper has one major weakness, and one minor weakness about presenting.
-
It is not well discussed or stated how the negative cost could affect the problem. Is this new setup invalidate the existing algorithms, or it creates major challenges where we cannot simply use the existing algorithms or the regret analysis techniques? The algorithm looks pretty similar to the existing works, so as the regret bounds. From the current writing, I am not able to tell how hard the new setting is.
-
This paper's writing and presenting need improvement. It is not clearly presented. For me, there are two major challenges. i) There are too many notations and many of them can be avoided. For instance, in Theorem 4.1, it defines \alpha = \nu/(1+\beta), which is not necessary at this point but makes this paper less easier to perceive. ii) I do not find the clear definition of NextElement(), which makes Algorithm seem incomplete. I did a search of "next" but did not find any definition. Probably it is hidden in some sentences, but it is better to have a clear statement of it.
问题
It is not well discussed or stated how the negative cost could affect the problem. Is this new setup invalidate the existing algorithms, or it creates major challenges where we cannot simply use the existing algorithms or the regret analysis techniques?
We thank the Reviewer for their comments.
The presence of negative costs makes existing algorithms ineffective for the following two reasons. First, previous algorithms have regret upper bounds that depend on the ratio and, therefore, these guarantees are not meaningful when is small compared to (e.g. ). Our algorithm works regardless of the initial budget . It might seem that extending the classical algorithm by playing the void action when required is sufficient. In our paper, we show that this is not the case. For instance, we show that regret minimizers with weakly-adaptive regret guarantees are a crucial requirement for obtaining meaningful guarantees.
Finally, it's important to note that although our algorithm and existing ones might appear similar due to their use of a primal-dual framework, the underlying analysis differs significantly. For example, our analysis divides the rounds into 3 separate phases, and our proofs crucially rely on the fact that the regret minimizers that we use are weakly adaptive and have sublinear regret in different sub-intervals (see the proofs of Theorem 4.1 and Theorem 4.3). Most existing literature relies on standard regret minimization algorithms, which would prove inadequate in this particular scenario.
This paper considers the online Knapscaks problem, where some resources can be bought back (by considering negative consumption). The contributions are a constant competitive ratio in the adversarial case (assuming budget of the order of horizon) and a sqrt{T} regret in he stochastic case.
The model is interesting, not completely new but sufficiently so that the contributions are interesting and not that trivial.
The 4 reviewers agree on the same score "marginally above the acceptance level" and I do agree with them. So we are 5 of us considering this paper to be interesting enough for the whole community (even though it might not be the breakthrough paper in bandits)
为何不给更高分
It's an interesting well-written paper, on a subject of interest.
The contributions are here, without being breathtaking.
It will interest the community.
为何不给更低分
N/A
Accept (poster)