PaperHub
Rating: 7.0/10 (Poster) · 4 reviewers
Scores: 8, 6, 6, 8 (min 6, max 8, std 1.0)
Confidence: 3.8 · Correctness: 3.3 · Contribution: 3.5 · Presentation: 2.5
ICLR 2025

Almost Optimal Batch-Regret Tradeoff for Batch Linear Contextual Bandits

OpenReview · PDF
Submitted: 2024-09-27 · Updated: 2025-05-02
TL;DR

We establish a near-optimal batch-regret trade-off for the contextual linear bandit problem.

Abstract

We study the optimal batch-regret tradeoff for batch linear contextual bandits. For this problem, we design batch learning algorithms and prove that they achieve the optimal regret bounds (up to logarithmic factors) for any batch number $M$, number of actions $K$, time horizon $T$, and dimension $d$. Therefore, we establish the full-parameter-range (almost) optimal batch-regret tradeoff for the batch linear contextual bandit problem. Along our analysis, we also prove a new matrix concentration inequality with dependence on their dynamic upper bounds, which, to the best of our knowledge, is the first of its kind in the literature and may be of independent interest.
Keywords
contextual linear bandit, online batch learning

Reviews and Discussion

Review
8

This paper studies batch linear contextual bandits with stochastic contexts. The authors design batch learning algorithms and prove that they achieve the optimal regret bounds, up to logarithmic factors, for any batch number, number of actions, time horizon, and dimension. They also develop new techniques that may be of independent interest.

Strengths

The paper presents near-optimal results for this type of bandit problem, with matching upper and lower bounds that may represent a definitive solution in this area. Additionally, the authors introduce a novel concentration inequality tailored to this problem, which has the potential to be valuable for addressing other bandit problems as well.

Weaknesses

It would be better if some numerical experiments were included.

Questions

  • Could the authors consider including additional experiments in this section? Additionally, what would be the key challenges in implementing this algorithm?
  • Could the authors provide more detail on the hidden constants in the upper and lower bounds? Specifically, what are their magnitudes, and what parameters do they depend on?
Comment

Thanks for the detailed review and informative comments on the numerical experiments. We present our response below.

About numerical experiments: The overall computational cost of the algorithm is $O(K^2Td^2)$, where the major challenge is the dependence on the number of arms $K$. When $K$ is small, we conducted simple experiments using random feature vectors generated by np.random.rand(). However, due to computational cost, we can at most run the configuration $T=4000$, $d=20$, $K=40$, and $M=7$. In such a parameter regime, the regret bound depends mainly on the constant. As a result, our algorithm is no better than BatchLinUCB (the batch version of LinUCB), but better than BatchPureExp (Algorithm 5 in Han et al., 2020). We have added the code and results to the supplementary materials.
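For concreteness, the kind of simulation described in this reply might look like the following minimal sketch. It is a hypothetical reconstruction, not the authors' code: it runs plain (non-batched) LinUCB on np.random.rand-style contexts, with parameters scaled down from the $T=4000$, $d=20$, $K=40$ configuration so it executes quickly.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, K = 400, 5, 10             # scaled-down stand-ins for T=4000, d=20, K=40
theta = rng.standard_normal(d)
theta /= np.linalg.norm(theta)   # unknown unit-norm parameter

def draw_contexts():
    """One i.i.d. context: K feature vectors in [0,1]^d, row-normalized."""
    X = rng.random((K, d))       # np.random.rand-style features
    return X / np.linalg.norm(X, axis=1, keepdims=True)

lam, alpha = 1.0, 1.0            # ridge parameter and exploration bonus
V = lam * np.eye(d)              # regularized Gram matrix
b = np.zeros(d)
regret = 0.0
for t in range(T):
    X = draw_contexts()
    theta_hat = np.linalg.solve(V, b)
    Vinv = np.linalg.inv(V)
    # UCB score: estimated reward plus confidence width ||x||_{V^{-1}}
    ucb = X @ theta_hat + alpha * np.sqrt(np.einsum("kd,de,ke->k", X, Vinv, X))
    a = int(np.argmax(ucb))
    reward = X[a] @ theta + 0.1 * rng.standard_normal()
    regret += np.max(X @ theta) - X[a] @ theta   # cumulative pseudo-regret
    V += np.outer(X[a], X[a])
    b += reward * X[a]

print(regret)
```

A batched variant would freeze the policy within each of the $M$ batches and update $V$ and $b$ only at batch boundaries.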

About the constants: For the constant in the upper bound, we provide a rough estimate of $C_{\mathrm{upper}} = 10000$ (see Lemma 5 and equation (30), noting that $\alpha = \sqrt{50\ln(KTd/\delta)}$ and $L = \frac{1}{200\ln(Td/\delta)}$); for the lower bound, the constant would be $C_{\mathrm{lower}} = 1/100000$ (see Lemma 18). These are absolute constants, which depend on the constants in the lemmas, e.g., the general equivalence lemma, the concentration inequalities, and the elliptical potential lemma. We remark that we also hide some logarithmic terms within the $\tilde{O}(\cdot)$ notation.

Comment

Thank you for your response. I would like to keep my high score.

Review
6

This research explores batched linear bandits, in which the learner decides which arms to pull in each batch, with rewards for the selected arms revealed only at the end of the batch.

The proposed algorithm operates within a predefined maximum number of batches, denoted $M$, and aims to minimize regret. Theorem 2 illustrates the balance between regret and $M$, indicating that the trade-off approaches a near-optimal order, as shown by Theorem 4. The findings suggest that $M = O(\log \log T)$ batches suffice to achieve asymptotically minimax-optimal regret.

Strengths

The theoretical strengths of the paper include providing a tight bound on the expected maximum width, $\mathcal{W}(m,n)$. This enhancement builds upon the results of Zanette et al. (2021) and mitigates the limitations identified in Ruan et al. (2021). Additionally, the matrix concentration inequality presented may be of independent interest.

Weaknesses

1: Theorem 4 is compared with Gao et al. (2019), yet the more pertinent lower bound by Han et al. (2020) might provide a more relevant comparison. It is unclear why this comparison was omitted.

2: The paper overlooks several significant works on batched multi-armed bandits and linear bandits in the asymptotic-$T$ regime.

3: The paper often neglects to condition on $\mathcal{F}_k$ throughout many parts of the proof.

Additionally, a minor notational correction is recommended: "$\epsilon + t$", described as an independent sub-Gaussian noise with zero mean, should be correctly noted as "$\epsilon_t$".

Questions

See weaknesses.

Comment

We thank the reviewer for the careful review. We present our response below.

About the comparison with the lower bound in [Han et al., 2020]: Thanks for the suggestion. The lower bound in Han et al. is based on Gaussian contexts. Here we present a brief comparison, and will revise accordingly. The lower bound in [Han et al., 2020] is $\Omega\big( T^{\frac{1}{2-2^{-M+1}}}d^{\frac{1-2^{-M+1}}{2-2^{-M+1}}} \big)$. In comparison, our lower bound is (assuming $K\geq d$ for simplicity) $\min\big\{\Omega\big(T^{\frac{1}{2-2^{-M+2}}}d^{\frac{1-2^{-M+2}}{2-2^{-M+2}}}\big), \Omega\big( T^{\frac{1}{2-2^{-M+1}}}d^{\frac{1}{2-2^{-M+1}}} \big) \big\}$, which is strictly better than that in [Han et al., 2020]. The reason is that the context distribution in [Han et al., 2020] is a non-mixed Gaussian distribution, which is easy to learn, while the context distribution in our case can be arbitrary. In our counterexample, we let the context distribution be a mixture of a group of hard instances for different horizons. As a result, we obtain improved lower bounds.

About $\epsilon_t$ and $\mathcal{F}_k$: Thanks for the careful review. We have fixed these in the revision.

Comment

Thanks for your response. I noticed that you did not provide feedback on the second point, so I have a follow-up question. This paper examines scenarios where actions are sampled from an unknown distribution. In contrast, the recent publication 'Optimal Batched Linear Bandits' investigates scenarios with a fixed action set and reports achieving a batch complexity of three. Does this result contradict your findings? Could the lower batch complexity be attributed to the fixed action set?

Comment

Thanks for the follow-up question. The result in (Ren et al., 2024) does not contradict our results. First, if the action set is fixed, our lower bound result is no longer true, but the lower bound in (Gao et al., 2019) still holds, so for fixed $T$ one still requires $\Omega(\log\log T)$ batches in the worst case. The 3-batch claim in (Ren et al., 2024) concerns asymptotic regret. That is, as long as the size of the second batch is sufficiently large as $T\to \infty$, all sub-optimal arms are eliminated after the second batch, so that the regret in the third batch is 0. On the other hand, to bound the regret in the second batch, they take the size of the second batch to be a function of $\log(T)$, which means that even if $T$ is very large, the regret is still bounded by a function of $\log(T)$.

Comment

Thank you for your response. Please include more discussions and references related to the asymptotic setting in batched bandits in the revised manuscript.

Comment

Thank you again; we will revise accordingly.

Review
6

The paper addresses the problem of batch linear contextual bandits, a type of online learning model where decisions are made sequentially, balancing exploration and exploitation. The authors design algorithms for both context-blind and context-aware settings of the batch linear contextual bandit problem. In the context-blind setting, the learner cannot use current batch information to make decisions, while in the context-aware setting, the learner can observe the context in the current batch, allowing potentially more adaptive decisions.

Strengths

The contributions are:

  1. Regret Bounds and Tradeoffs: They analyze the regret, or performance loss, for different batch numbers and prove that their algorithms achieve nearly optimal regret bounds (up to logarithmic factors) across varying batch sizes. For instance, they demonstrate that as few as $O(\log \log T)$ batches suffice to achieve nearly the minimax-optimal regret without constraints.

  2. Simplification of Algorithms: Compared to prior work, such as Ruan et al. (2021), the authors propose a simpler algorithm that is more practical to implement and works for all time horizons $T$ as long as $T \geq d$ (where $d$ is the context dimension). This represents an improvement over previous methods, which required large polynomial relations between $T$ and $d$.

  3. Matrix Concentration Inequality: A significant technical contribution is a new matrix concentration inequality that depends dynamically on upper bounds, which is useful for the algorithm’s exploration policy. This inequality could be valuable for other research areas in machine learning due to its novel structure.

  4. Context-Aware Performance Gains: In the context-aware setting, they show that observing the context allows the learning algorithm to achieve better regret bounds by taking advantage of additional information in each batch, demonstrating that optimal regret can be further improved compared to the context-blind setting.

Weaknesses

  1. I believe $m$ and $n$ represent the same quantities (albeit separated by batch index), so I don't see how this is an improvement over Zanette et al. (2021). See lines 276-281.

Questions

  1. The algorithms are based on restarting Lin-UCB at every epoch. How does this relate to designing forgetting-based episodic RL algorithms?
Comment

Thanks for the careful review and insightful questions. We answer your questions as follows.

About the difference between $m$ and $n$ in lines 276-281: Here we restate the difference between $m$ and $n$. The learner has $m$ samples to learn the context distribution, and then $n$ samples following this policy to conduct linear regression. Each sample here is a context $\{x_1,x_2,\ldots,x_K\}$ and a pair $\{a, r\}$ such that $r=x_{a}^{\top}\theta+\epsilon$ ($\epsilon$ is a zero-mean noise) is the reward from playing $x_a$.

Write the $i$-th sample as $(x_{1,i},x_{2,i},\ldots,x_{K,i}, a_i, r_i)$ for $1\leq i \leq m+n$.

For the first $m$ samples, we do not use the chosen arms or the reward feedback $(a_j, r_j)_{j=1}^m$; we use only the context information to learn the context distribution.

For the next $n$ samples, the learner chooses $a_j$ according to a certain policy and computes $\hat{\theta} =(\lambda \mathbf{I}+\sum_{j=m+1}^{m+n}x_{a_j,j}x_{a_j,j}^{\top} )^{-1}\sum_{j=m+1}^{m+n}x_{a_j,j} r_j$.
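As a concrete illustration, the estimate above is ridge regression restricted to the $n$ later samples. A minimal sketch, with all names and parameter values chosen for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, K, d, lam = 50, 200, 8, 4, 1.0
theta = rng.standard_normal(d)
theta /= np.linalg.norm(theta)            # unknown unit-norm parameter

X = rng.random((m + n, K, d))             # contexts {x_{1,i}, ..., x_{K,i}}
a = rng.integers(0, K, size=m + n)        # chosen arms a_i (some fixed policy)
r = X[np.arange(m + n), a] @ theta + 0.1 * rng.standard_normal(m + n)

# The first m samples contribute only their contexts (used to learn the
# context distribution); only samples m+1..m+n enter the regression.
Xa = X[np.arange(m, m + n), a[m:]]        # x_{a_j, j} for j = m+1, ..., m+n
V = lam * np.eye(d) + Xa.T @ Xa
theta_hat = np.linalg.solve(V, Xa.T @ r[m:])

print(np.linalg.norm(theta_hat - theta))  # estimation error; small for moderate n
```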

When $m$ is larger than $n$, the width of the confidence region is dominated by the uncertainty in the $n$ samples. That is, even assuming full knowledge of the context distribution, the confidence region still has width $\Omega(\sqrt{d/n})$.

When $n$ is larger than $m$, the learner cannot find the desired exploration policy (consider the case $m=1$, where the learner can only play the local optimal-design policy), and the confidence width is limited by $\Omega(d/m)$.

We remark that one can still get an $m$-free bound of $\tilde{O}(\sqrt{d^2/n})$ via the local optimal-design policy, so it remains open what the optimal width of the confidence region is.

In our problem, $m$ and $n$ are roughly the sizes of two consecutive batches, e.g., $m = \Theta(T_3)$ and $n=\Theta(T_4)$. As a result, $n$ is generally much larger than $m$, which implies an improvement over the result in [Zanette et al., 2021].

About the restarting framework: Both algorithms take advantage of the statistical independence afforded by the restarting framework, which helps simplify the analysis and remove the lower-order terms.

Review
8

The paper studies the number of batches vs regret tradeoff for contextual linear bandits. The authors propose an algorithm that achieves nearly optimal regret for a given number of batches M.

Strengths

The paper considers batch learning in stochastic contextual linear bandits, which I believe is an important problem. It proves upper and lower bounds on the regret given a number of batches $M$. Exp-Policy (Algorithm 2) is introduced to solve the distributional optimal design problem needed to design the policy in each batch. The algorithm is novel and simple, unlike the algorithms proposed in Ruan et al. To the best of my knowledge, many of the analysis techniques proposed are novel, including the matrix concentration inequality in Lemma 20. The regret upper and lower bounds are nearly matching.

Overall, the paper introduces important algorithms and analysis techniques for contextual linear bandits, as well as being the first to prove the optimal regret bound dependency on the number of batches.

Weaknesses

The computational complexity of the algorithms scales linearly with the number of actions, which can be very large. However, this is justifiable, since this is a challenging problem even without the batch learning constraints.

Questions

Could the authors include the computational complexity of the proposed algorithms as a function of K, d, T, M? Could the authors add motivation for stochastic contexts, as this is the main setting of the paper? For example, what applications would satisfy the stochastic i.i.d. context assumption?

Comment

We thank the reviewer for the constructive suggestions. We reply to the questions below.

About time cost: The algorithm is based on arm elimination, so the time cost depends polynomially on the number of arms $K$. The exact time cost is $O(TK^2d^2)$, which is dominated by arm elimination in each round.

About motivation for stochastic contexts: The problem is motivated by stable environments with expensive policy deployment and communication costs, e.g., large-scale online advertisement and recommendation systems. In these cases, the distribution of user behavior changes slowly over a certain period, which justifies the stochastic context assumption.

Comment

Thank you for the reply.

Regarding the stochastic context applications: it is well known that in recommendation systems the user activity over time slots is correlated, and hence so is the context observed by the learner. I am asking about applications that satisfy the assumption that the context received by the learner at each time slot is independent of the other slots and follows the same distribution. While the distribution could be steady in recommendation systems, the independence assumption does not hold. Is there another way to formulate the problem to satisfy the i.i.d. assumption? Are there other applications that follow these constraints?

Comment

Thanks for the reply and the follow up questions.

About recommendation systems: In recommendation systems, user behavior often follows predictable daily rhythms. For example, streaming services might see peak activity in the evenings, while e-commerce platforms may see more activity during lunch breaks or on weekends. As a result, we might regard user behavior as stochastic within a certain time period. On the other hand, to address the correlation between user activities at different time slots, we could consider a mixed model of stochastic contexts. That is, for each time period $[t_1,t_2]$, there is a known probability vector $[w_1,w_2,\ldots,w_k]$ and a group of unknown base distributions $[D_1,D_2,\ldots,D_k]$ such that the user context is drawn from the mixture $\sum_{i=1}^k w_i D_i$. We can still perform similar experimental-design methods as long as $[w_1,w_2,\ldots,w_k]$ is periodic.
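The mixed-context model described above can be simulated directly. In this hedged sketch, the base distributions $D_i$ are taken to be Gaussians purely for illustration (the model allows them to be arbitrary and unknown), while the weights $w_i$ are known to the learner:

```python
import numpy as np

rng = np.random.default_rng(2)
k, d, K = 3, 4, 6
w = np.array([0.5, 0.3, 0.2])            # known mixture weights w_1..w_k
means = rng.standard_normal((k, d))      # unknown base distributions D_i
                                         # (Gaussian here, for illustration)

def draw_mixed_context():
    """One context draw: pick component i ~ w, then K vectors from D_i."""
    i = rng.choice(k, p=w)
    return means[i] + 0.1 * rng.standard_normal((K, d))

X = draw_mixed_context()
print(X.shape)  # (6, 4): K feature vectors of dimension d
```

With periodic weights, the same draw would be indexed by the time period $[t_1, t_2]$, so the learner can schedule its experimental design per period.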

About other applications with i.i.d. assumption: Below we list some other possible applications.

Medicine and clinical trials: Contextual linear bandits can be used in medical settings where the algorithm selects the most effective treatment and helps adjust drug dosages based on patient-specific information such as age, medical history, and current health status. In this case, the contexts depend on the profiles of the individual patients, so the distribution of the contexts is roughly stochastic. On the other hand, the cost of deploying a new policy is high due to the risk in healthcare, which motivates the batch learning framework.

Decision-making in autonomous vehicles: In self-driving cars, contextual bandits can be applied to make decisions about navigation, route selection, or obstacle avoidance based on the current context, like traffic conditions or road type. As a result, the context distribution is roughly steady over a certain time period. In general, the decision-making system is deployed by a central agent while the data comes from individual vehicles, which motivates batch learning.

Comment

Thank you for your detailed response. If the authors agree, I think adding a paragraph summarizing this discussion could help increase the paper's impact from a practical perspective. I understand that this work did not introduce the stochastic context setting, but a thorough discussion of the modeling aspects for applications that fit this setting has been missing from the literature.

Comment

Thanks for the suggestion. We have revised the submission accordingly.

AC Meta-Review

A great paper that solves the remaining open problem left by Han 2020! Accept.

Additional Comments on Reviewer Discussion

NA

Final Decision

Accept (Poster)