Variance-Dependent Regret Lower Bounds for Contextual Bandits
Abstract
Reviews and Discussion
This paper provides new lower bounds for the linear contextual bandit problem. The results are two-fold. First, the authors provide a lower bound for the setting where the variances are all given to the learner at the beginning; second, they provide a similar lower bound when the variances are chosen adaptively. The lower bounds match the upper bounds in the prior literature.
Strengths and Weaknesses
Strengths:
- The paper is well-written, and the proofs appear to be correct.
- The lower bounds established in this work match the existing upper bounds, thereby providing a tight characterization of the variance-aware linear contextual bandit problem.
Weaknesses:
- The explanation of whether the learner and the adversary are in competition or collaboration is somewhat unclear. It would be very helpful if the authors could illustrate this relationship using real-world applications to provide greater intuition.
- This paper focuses exclusively on establishing a lower bound. However, in presenting the lower bound, the authors only provide a specific construction. It would be greatly beneficial if the paper could also discuss the types of settings or conditions under which this hard instance is likely to occur. Such insights would offer valuable guidance for future algorithm design, particularly in identifying and addressing worst-case scenarios.
Questions
I have the following questions regarding this paper:
- Is it possible to generalize the results to contextual bandits with general function approximation?
- This paper mentions that the lower bound does not apply to the stochastic linear bandit setting, e.g., if the variances in the first rounds are all zero, then the optimal policy can already be learned. Can you provide a lower bound for this setting as well?
- The 'high probability bound' in Theorem 5.2 appears to be in the form 'with probability at least , ...', so this means the probability of the event is fixed when is fixed. This seems a little odd. Can you prove a high-probability bound of the form 'with probability at least $1-\delta$, ...'?
Limitations
Yes.
Final Justification
This paper presents good results, but I think there is still room for improvement, e.g., in the presentation of the lower bounds with respect to the failure probability. Given the authors' answers to my questions, I will maintain my score of weak accept for this paper.
Formatting Concerns
No formatting concerns in this paper.
Thank you for your thoughtful and constructive feedback. We respond to your concerns and questions point by point below.
Q1. Learner–Adversary Relationship
“The explanation of whether the learner and the adversary are in competition or collaboration is somewhat unclear. It would be very helpful if the authors could illustrate this relationship using real-world applications to provide greater intuition.”
A1. In linear contextual bandits, two common settings are considered:
(i) the oblivious setting, where the sequence of decision sets is fixed in advance, and
(ii) the adaptive setting, where the decision set may depend on the learner’s behavior.
When incorporating variance, a similar distinction applies. The variance sequence can be either predefined or chosen adaptively. Our lower bound focuses on the more challenging adaptive case, where an adversary selects the round-dependent variance , with each action’s variance upper bounded by at that round. This flexible setup captures both adversarial and cooperative behaviors, depending on how variance is selected to influence learning. We will update the manuscript to clarify this relationship and provide practical examples to aid intuition.
Q2. Specificity of the Lower Bound Construction
“This paper focuses exclusively on establishing a lower bound... it would be greatly beneficial if the paper could also discuss the types of settings or conditions under which this hard instance is likely to occur.”
A2. Our lower bound characterizes the worst-case regret, which typically arises when the reward gap is . In such settings, distinguishing between instances requires samples, leading to total regret of .
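For intuition, here is a back-of-the-envelope version of this arithmetic; for illustration only, we assume a uniform per-round variance level $\sigma^2$, which is our own simplification rather than the exact quantities in the theorem:

$$
\Delta \asymp \sqrt{\frac{d\,\sigma^{2}}{K}},
\qquad
\underbrace{\frac{d\,\sigma^{2}}{\Delta^{2}}}_{\text{samples needed to separate instances}} \asymp\; K,
\qquad
\text{Regret} \;\asymp\; K \cdot \Delta \;\asymp\; \sqrt{d\,K\sigma^{2}} \;=\; \sqrt{d \sum_{k=1}^{K}\sigma_{k}^{2}} .
$$

The hard regime is thus exactly the one where the gap is too small to be resolved within the available variance budget.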
One important condition is whether the decision set is fixed. As we highlight in Corollary 5.7, when the decision set is fixed, early low-variance observations can significantly reduce regret, even if the cumulative variance is large. In contrast, our lower bound construction crucially relies on the adaptive decision set setting, which allows a lower bound for general variance sequences.
We acknowledge that this worst-case construction may not arise in many real-world applications. Nonetheless, it serves as a fundamental limit that any algorithm must overcome in the general setting with adaptive variance sequences and decision sets. We will emphasize these distinctions more clearly in the revised manuscript.
Q3. Generalization to Function Approximation
“Is it possible to generalize the results into contextual bandit with general function approximation?”
A3. Yes. The lower bound construction naturally extends to contextual bandits with general function classes by leveraging the notion of eluder dimension (Russo & Van Roy, 2013). Our current hard instance, defined over a linear function class of dimension , also applies to function classes with eluder dimension .
Q4. Lower Bound for the Stochastic Setting
“Can you provide a lower bound for the stochastic linear bandit setting, e.g., when the first rounds have zero variance?”
A4. The key distinction between the stochastic linear bandit and the contextual linear bandit lies in the fixed decision set. As shown in Corollary 5.7, in the stochastic setting, an algorithm may incur only regret, even with large cumulative variance, provided early rounds allow accurate estimation of the latent vector . This lower bound is tight, as it still requires informative rounds to learn under zero noise. We will clarify this contrast in the revised version.
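For a rough illustration of why a fixed decision set with an initial noiseless phase is so much easier (a sketch under our own simplifying assumptions, not the exact statement of Corollary 5.7):

$$
r_{k} = \langle x_{k}, \theta^{*}\rangle \ \text{observed noiselessly for } d \text{ linearly independent actions } x_{1},\dots,x_{d}
\;\;\Longrightarrow\;\;
\theta^{*} \text{ is identified exactly,}
$$

after which the learner can play the optimal action in every remaining round, so the total regret is at most the cost of those $d$ exploratory rounds, i.e., $O(d)$ when the per-round instantaneous regret is bounded by a constant.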
Q5. High-Probability Bound Form
“The 'high probability bound' in Theorem 5.2 appears to be in the form 'with probability at least ', which seems a bit odd. Can you prove a high probability bound in the form 'with probability at least $1-\delta$'?”
A5. Thank you for pointing this out. Extending the result to a general high-probability statement (i.e., "with probability at least $1-\delta$") is indeed possible and follows standard concentration techniques.
In our current formulation, we set to simplify the presentation and ensure that the expected regret is lower bounded by . Since most prior lower bounds in contextual bandits focus on expected regret, we adopted this fixed-probability form for clarity.
In the revision, we will update the theorem to explicitly state the bound holds with probability at least . Additionally, we will include a remark noting that by setting , the expected regret is lower bounded by , which matches the focus of prior work on expected regret in contextual bandits.
Thanks to the authors for the reply. As mentioned, it is possible to obtain a high-probability lower bound, e.g., one that holds with probability at least $1-\delta$. I wonder what this lower bound looks like, especially its dependence on $\delta$.
Thank you for your follow-up question.
As noted in Lines 378–380, our construction begins with a constant-probability lower bound for each individual sub-instance. Specifically, each sub-instance achieves the desired regret lower bound with constant probability. To boost this to a high-probability event, we replicate the construction across multiple independent instances.
The key idea is that the high-probability bound arises from the event that at least one sub-instance exhibits the desired lower bound. In our current proof, we use independent sub-instances, which yields a failure probability of at most —that is, the regret bound holds with probability at least .
To generalize this to hold with probability at least , we would increase the number of sub-instances to .
As described in Lines 210–212, the dimension assigned to each sub-instance is inversely proportional to the number of sub-instances, and therefore decreases as we split more. Consequently, the final lower bound in Theorem 5.2 would scale as:
This generalizes the current bound of , replacing the factor from boosting with to reflect the desired confidence level.
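To illustrate just the boosting arithmetic (constants and the exact form of the resulting bound are omitted; $p$ below is a generic constant standing in for the per-sub-instance success probability, and $m$ for the number of sub-instances):

$$
\Pr[\text{every sub-instance fails}] \;\le\; (1-p)^{m} \;\le\; \delta
\quad\Longleftrightarrow\quad
m \;\ge\; \frac{\log(1/\delta)}{\log\!\frac{1}{1-p}} \;=\; \Theta\!\big(\log(1/\delta)\big),
$$

so roughly $\log(1/\delta)$ independent sub-instances suffice, and, since the dimension is split evenly among them, each sub-instance operates in dimension on the order of $d/\log(1/\delta)$.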
This work investigates variance-dependent contextual bandits, where the variance can be adaptively selected by an adversary who collaborates with the learner over time. Two types of adversaries are considered: (i) a weak adversary, who selects the variance without knowledge of the learner’s decision set, and (ii) a strong adversary, who has access to the decision set prior to choosing the variance. The paper also establishes a regret lower bound for the setting where the variance sequence is fixed in advance and known. Furthermore, a regret lower bound is provided for the case involving a weak adversary.
Strengths and Weaknesses
The proposed regret lower bounds for both the pre-fixed variance sequence and the weak-adversary setting are tight and represent significant contributions to the study of variance-dependent contextual bandits. Notably, these bounds match known upper bounds up to logarithmic factors. While the theoretical contributions are strong, there are several critical concerns regarding presentation and assumptions.
1. Lack of Application Context: Although variance-dependent bandits have been studied for some time, the paper does not adequately motivate the problem with practical applications. A clear discussion of potential use cases should be included in the introduction to better frame the relevance of the problem.
2. Clarity of Presentation: The exposition requires significant improvement. Specifically, Section 1.1 ("Our Contributions") is difficult to follow on a first read and only becomes clearer after reading the full paper. It should be rewritten for better readability and accessibility. Additionally, the proof of Theorem 4.1, particularly lines 206–226, should be revised for clarity and structure.
3. Problematic Assumptions: More serious issues lie in the assumptions:
- The condition effectively requires the number of arms to exceed 1000, which excludes most practical applications. This assumption is overly restrictive and should be reconsidered.
- The assumption appears ad hoc and insufficiently justified. Since the regret scales with , a quantity that depends on the number of rounds undermines the meaningfulness of the lower bound. This dependency should be either removed or properly justified, as it currently weakens the claimed lower bounds.
Questions
The regret lower bounds presented in the paper (and those in other papers as well) turn out to depend on the simple sum of variances across time. This raises two important questions:
(i) If the lower regret bound depends only on the simple sum of variances, how fundamentally different are the problems under a fixed variance budget versus those with a general (possibly adaptive) variance sequence? From a regret perspective, are they essentially equivalent, or does adaptivity introduce additional complexity not captured by the current analysis?
(ii) A variance incurred in the early stages of learning may have a more pronounced impact on estimation and decision-making than one incurred later. In this sense, it seems natural that early-stage variances should be weighted more heavily than those in later stages. How, then, does it make sense to treat all variances equally in the regret bound? Is there a theoretical reason why the regret depends only on the total sum, without accounting for the temporal ordering of variances?
Limitations
Addressed in the weaknesses section.
Final Justification
I raised my rating by 1 point based on the authors' rebuttal.
Thanks for the authors' response.
Formatting Concerns
NA
Thank you for your thoughtful feedback. Below we address your concerns point-by-point.
Q1. Lack of Application Context
“The paper does not adequately motivate the problem with practical applications...”
A1. Our work focuses on the theoretical characterization of the fundamental difficulty in stochastic contextual linear bandits. While our lower bound construction may not always correspond directly to real-world applications, it is essential to understand worst-case regret behavior, which forms the foundation for evaluating algorithmic performance in general settings. We will update the introduction to better clarify this theoretical motivation.
Q2. Clarity of Presentation
“Section 1.1 (‘Our Contributions’) is difficult to follow... and proof of Theorem 4.1 needs revision...”
A2. Thank you for this suggestion. We will revise Section 1.1 to enhance its readability and restructure the proof of Theorem 4.1 (especially lines 206–226). The hard instance construction, elaborated in Section 4.2, partitions the variance spectrum into logarithmic bins , with a modified lower-bound instance constructed for each bin. Details are provided in Appendix C.3. We will reorganize the section for clarity and highlight the intuition behind this construction.
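As an illustration of the binning idea (the endpoints and index range below are our own illustrative choice, not the exact definition in the paper), rounds are grouped by the scale of their maximum variance,

$$
\mathcal{B}_{i} \;=\; \big\{\, k \in [K] \;:\; \sigma_{k}^{2} \in [\,2^{-i},\, 2^{-i+1}\,) \big\},
\qquad i = 1, \dots, O(\log K),
$$

and a separate hard instance is constructed at the variance scale of each sufficiently populated bin, after which the per-bin lower bounds are combined.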
Q3. Problematic Assumption on
“This requires the number of arms to exceed 1000...”
A3. This appears to be a misunderstanding. In our paper, K denotes the number of rounds, not the number of arms, since we are working in the linear bandit setting. Unlike the multi-armed bandit case, the complexity depends on the dimension d, not the number of actions. A condition like (or equivalently when ) is standard in linear bandits to overcome the trivial regret upper bound. We will clarify this in the paper.
Q4. Concern Regarding Dependency
“This dependency weakens the claimed lower bounds...”
A4. We respectfully disagree. The construction of lower bounds in contextual linear bandits is inherently dependent on the number of rounds . In particular, a regret of typically arises from a sample complexity of required to distinguish between instances with a gap of , which results in total regret on the order of . It is impossible to construct a single fixed instance that incurs regret uniformly for all . Conversely, for a fixed instance with a fixed gap, the regret is often upper bounded by as becomes large. Therefore, allowing to scale with is necessary to reflect the true minimax hardness of the learning problem. We will clarify this relationship between gap size, sample complexity, and regret in the revised paper.
Q5. Equivalence Between Fixed and Adaptive Variance Sequences
“If the lower bound depends only on the sum of variances, how different are the adaptive and fixed cases?”
A5. This is a very insightful question. Our work aims to characterize the worst-case regret under general variance sequences, and in doing so, we demonstrate that the upper bound from [1] remains nearly optimal even in the adaptive setting. While fixed variance budgets (e.g., ) provide a coarse characterization, they fail to capture structural differences.
For example, a variance sequence with for the first rounds and afterward yields the same budget as a sequence where the variance is spread uniformly. Yet, the difficulty faced by the algorithm is quite different. In the former case, a standard variance-independent lower bound of applies, but the actual regret may be significantly lower if variance is concentrated later. Our framework accounts for these subtleties.
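As a concrete example (the numbers are our own illustration, not the values referenced above), compare two sequences with the same total budget $\sum_{k}\sigma_{k}^{2} = K/2$:

$$
\text{(a)}\ \ \sigma_{k}^{2} = 1 \ \text{for } k \le K/2,\ \ \sigma_{k}^{2} = 0 \ \text{afterward};
\qquad
\text{(b)}\ \ \sigma_{k}^{2} = \tfrac12 \ \text{for all } k .
$$

Both sequences have identical budgets, yet sequence (a) front-loads all of the noise, so its first half is as hard as a standard variance-independent instance, whereas back-loading the noise instead would leave an initial near-noiseless phase that, with a fixed decision set, the learner could exploit (cf. Corollary 5.7).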
Q6. Temporal Ordering of Variance
“Why treat all variances equally? Should early-stage variance be weighted more?”
A6. Excellent point. Indeed, temporal ordering does affect learning in contextual linear bandits. However, the contextual linear bandit problem involves adaptive decision sets at each round, which makes transferring knowledge from early stages (even with low variance) difficult. Nevertheless, we highlight in Corollary 5.7 that if the decision set is fixed, early low-variance observations can significantly reduce regret, even if the cumulative variance is large. We will make this distinction clearer in the revised manuscript.
Reference
[1] Zhou, D., Simchowitz, M., Gu, Q., & Dubey, A. (2023). Variance-Dependent Regret Bounds for Linear Bandits and Reinforcement Learning: Adaptivity and Computational Efficiency. COLT 2023.
I appreciate the authors’ responses. Regarding Q3, I apologize for my earlier misunderstanding. The assumption now seems reasonable. However, I still do not see why the condition is necessary to obtain the lower bound. What if we do have only ? Could you please elaborate?
Thank you for your follow-up. We appreciate your careful reading and the opportunity to clarify.
First, if there exists a strictly positive gap independent of K, then the OFUL algorithm (Theorem 5 in [1]) already achieves a regret bound for the linear contextual bandit problem that grows only logarithmically with K. This implies that for any fixed gap, the regret is only polylogarithmic in K once K is large enough.
While it may be possible to construct a lower bound instance with constant when the total variance satisfies , this no longer holds for general variance sequences. In particular, when for all , prior variance-independent lower bounds show that obtaining a regret lower bound requires setting the gap .
Therefore, to reflect the worst-case difficulty under general variance sequences, we set in our construction. We will clarify this point in the revised version of the paper.
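A rough heuristic for the first point (a sketch up to constants, dimension, and polylogarithmic factors; not the exact statement of Theorem 5 in [1]): an optimism-based algorithm plays a suboptimal action only while its confidence width exceeds the gap $\Delta$, and the elliptic-potential argument bounds how often that can happen,

$$
\#\{\,k \le K : \text{suboptimal action played}\,\}
\;\lesssim\; \frac{d \log K}{\Delta^{2}}
\qquad\Longrightarrow\qquad
\mathrm{Regret}(K) \;\lesssim\; \Delta \cdot \frac{d \log K}{\Delta^{2}} \;=\; \frac{d \log K}{\Delta},
$$

which is only logarithmic in K for any fixed $\Delta$; hence forcing a $\sqrt{K}$-type lower bound requires letting the gap shrink with K.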
[1] Abbasi-Yadkori, Y., Pál, D., & Szepesvári, C. (2011). Improved Algorithms for Linear Stochastic Bandits. NeurIPS 2011.
The paper studies the problem of linear contextual bandits in the case in which the variance of the rewards changes at every time step. Specifically, it focuses on designing hard instances to prove lower bounds for the cases in which the sequence of variances is either given and known in advance, or chosen by an adversary either before or after having observed the decision set. In particular, the paper provides a lower bound of the order of , with K the number of rounds, when the sequence is fixed and known, and a high-probability bound of the order of when the adversary can choose the sequence adaptively without observing the decision set. Lastly, the authors prove that in the case where the adversary can observe the decision set beforehand, there is an instance in which the regret is bounded by , the dimension of the context space.
Strengths and Weaknesses
The paper studies a specific problem within the vast literature on multi-armed bandits, which is of general interest to the community. Specifically, it focuses on a less-studied version of the problem in which the noise distribution is allowed to change at every time step. Hence, the topic studied and the results provided are novel.
On the other hand, the paper could really be improved in terms of clarity: the writing is oftentimes confusing and hard to read. Specifically, alternating between the notations and to indicate the variance appears confusing, but above all, the choice of using K instead of T for the number of rounds, the latter being well established in the field, is, in my opinion, very counterintuitive.
Furthermore, I have some doubts about the soundness of the proofs that I defer to the Questions section.
Questions
- Could you further elaborate on what Theorem 5.4 is showing? I understand the proof, but to my understanding it is only showing a particularly favorable instance of the problem in which the regret obtained is very low, and I do not understand the claim made at lines 307-308 that this is meant to break the lower bound.
- It is not clear to me what the hard instances used to derive the bound of Theorem 4.1 are.
- At line 224, isn't the value provided a lower bound on the variance of the group , instead of an upper bound as stated? Furthermore, what is the value that bounds the variance in all the groups?
Limitations
yes
Final Justification
I have decided to update my score following the discussion with the authors; specifically, thanks to their comment better highlighting the significance of Theorem 5.4.
Formatting Concerns
I have not noticed any major formatting issues, but there are the following typos:
The authors left a comment at lines 585-586 that should be deleted.
At line 231, I suppose the last appearing should be indexed .
Thank you for your helpful suggestions. We address each of your questions below:
Q1. Use of K instead of T for denoting the number of rounds:
A1. We respectfully disagree with this point. While T is commonly used for the number of rounds in multi-armed bandits, our work focuses on the stochastic contextual linear bandit setting, which differs significantly. In linear bandit literature, K is frequently used to denote the number of rounds, and the problem complexity primarily depends on the dimension d, not the number of arms. We follow this convention to maintain consistency with related works in the linear bandit domain.
Q2. Clarification on Theorem 5.4:
A2. Theorem 5.4 demonstrates the necessity of the limitation imposed in Theorem 5.2—namely, that the adversary must select the variance sequence before observing the decision sets. If the adversary were allowed to observe the decision set beforehand (i.e., a stronger adversary), the lower bound construction could be circumvented. This highlights the tightness of our assumptions and clearly delineates the conditions under which the lower bound holds.
Q3. Clarification on Theorem 4.1 and the hard instance construction:
A3. The construction of hard instances is detailed in Section 4.2. Specifically, we partition the variance spectrum into logarithmic bins of the form . For each bin, we construct a hard-to-learn instance based on a modified version of the instance used in [1], adapting it to the corresponding variance level. This leads to a variance-dependent lower bound for each range. Detailed proofs and constructions can be found in Appendix C.3.
Q4. Line 224: role of and definition of :
A4. Thank you for pointing this out. The term is defined in Equation (3.1) as the maximum variance for round . In our lower bound construction, we identify the variance bin that contains , and assign the corresponding contextual instance for that bin to round . Thus, represents the variance used in the hard instance associated with bin , and it provides a lower bound on the variance for all rounds falling within that bin. We will revise the manuscript to clarify this explanation.
Q5. Typo noted.
A5. Thank you for catching this. We will correct the typo in the final version.
Reference
[1] Zhou, D., Gu, Q., & Szepesvári, C. (2021). Nearly minimax optimal reinforcement learning for linear mixture Markov decision processes. In Conference on Learning Theory (COLT).
I thank the authors for their attentive replies to the points I have raised.
I think that the paper could be improved by further highlighting the fact that the lower bounds derived are not to be interpreted in a min-max sense, which would further justify the 'collaborative' adversary considered. Apart from this, I think all my concerns were addressed, so I am willing to raise my score.
Thank you for your thoughtful follow-up and for considering raising your score.
We appreciate your suggestion and agree that clarifying the interpretation of our lower bounds would help readers better understand the result. We will revise the manuscript accordingly.
The authors investigate variance-dependent lower bounds for contextual bandits. In the stochastic setting, their results improve the dependence on the variances and the dimension compared with existing bounds. They also study adversarial settings and establish impossibility results in certain cases.
Strengths and Weaknesses
The paper addresses variance-dependent analysis in multi-armed bandits, an important topic in the field. While the contribution is solid, several points remain unclear.
- In Equation (3.1) the authors assume bounded errors. Does this assumption exclude Gaussian noise and location-shift models? In addition, how is the variance of the worst-case mean rewards defined, given that the variance may also depend on the means?
- For best-arm identification, prior work such as Lu et al. (2021), Lalitha et al. (2023), and Kato et al. (2024) provides variance-dependent lower bounds. Can those results be applied to your setting, or can your results be transferred to theirs?
- Ito et al. (2022) develop variance-tight algorithms for linear bandits. Are their techniques applicable to your framework?
Related references:
- Lu, P., Tao, C., and Zhang, X. “Variance-Dependent Best Arm Identification,” UAI 2021.
- Lalitha, A., Kalantari, K., Ma, Y., Deoras, A., and Kveton, B. “Fixed-Budget Best-Arm Identification with Heterogeneous Variances,” UAI 2023.
- Kato, M., Okumura, K., Ishihara, T., and Kitagawa, T. “Adaptive Experimental Design for Policy Learning.”
- Ito, S., Tsuchiya, T., and Honda, J. “Adversarially Robust Multi-Armed Bandit Algorithm with Variance-Dependent Regret Bounds,” COLT 2022.
Questions
See above.
Limitations
N/A.
Final Justification
I have read the authors’ rebuttal, and for the most part, they have addressed my concerns adequately. However, I am not confident in assessing the strength of the problem setting and the significance of the results. While the research is interesting, I remain uncertain about how meaningful the findings truly are. Since there do not appear to be any fatal flaws in the proofs, I will maintain my original score and vote for a borderline accept.
Formatting Concerns
N/A.
Thank you for your positive feedback. We address your comments below:
Q1. Assumption of bounded errors in Equation (3.1):
A1. In our work, we aim to establish a theoretical lower bound to characterize the fundamental difficulty of the stochastic contextual linear bandit problem. To ensure consistency and avoid gaps in assumptions, we adopt the same bounded noise assumption as the upper bound analysis in [1] Variance-Dependent Regret Bounds for Linear Bandits and Reinforcement Learning: Adaptivity and Computational Efficiency (COLT 2023). Nevertheless, our results can be extended to sub-Gaussian noise distributions, including Gaussian noise, with appropriate modifications. We will clarify this point in the revised manuscript.
Q2. How variance is defined in our setting:
A2. In our setting, we study stochastic contextual linear bandits where the decision set can change adaptively across rounds. As such, assigning a fixed variance to each action (as is common in fixed-arm bandit settings) becomes challenging. Instead, we assume that in each round, the environment generates a variance level, and all actions in that round share the same variance. This round-wise variance captures the environment's uncertainty and simplifies analysis while preserving generality. We will revise the paper to better explain this assumption.
Q3. Relation to variance-dependent best-arm identification literature:
A3. Thank you for pointing out these related works. The BAI literature (e.g., Lu et al. 2021; Lalitha et al. 2023; Kato et al. 2024) typically assumes a fixed and known action set, where each arm has a constant variance across all rounds. This allows for variance-specific instance-dependent lower bounds. In contrast, our setting considers adaptive and potentially changing action sets, where variances can vary across rounds but are identical within a round. Thus, while the overall theme of variance-dependent analysis is shared, the assumptions and techniques differ. We will include a discussion of these connections and distinctions in the related work section.
Q4. Applicability of Ito et al. (2022):
A4. We carefully reviewed Ito et al. (2022), which focuses on variance-dependent regret bounds in adversarial multi-armed bandits. While they use a linear formulation via a probability distribution over fixed arms, their setting remains within the multi-armed bandit framework. In contrast, our setting involves contextual linear bandits, where each action corresponds to a context vector and reward is linear in that context. Therefore, while the techniques share high-level ideas, such as variance-aware exploration, they are not directly applicable to our contextual formulation. We will clarify this distinction and cite their work appropriately.
Thank you for your reply.
Reply to A1 and A2
Thank you for the clarification. I had misunderstood the setting slightly, but I now correctly understand your point.
Reply to A3
I understand your explanation. That said, I still personally question the importance (or practical relevance) of considering such adaptive settings from the perspective of variances; in other words, I am not fully convinced by the motivation. However, the theoretical contributions are sound and persuasive.
Reply to A4
Thank you for your comment. I was simply curious about the applicability of your method to the setting in Ito et al. (2022). If their setting is indeed different from yours, I agree that a detailed discussion and citation may not be necessary. That said, I would appreciate some clarification comparing your framework with other known settings, since I am not familiar with your specific setup, whereas I have worked on problems like those addressed in Ito et al. (2022). Such clarification would help readers like me better situate your contribution.
Thank you for your further feedback.
For the variance-dependent regret, we have discussed existing upper and lower bounds for linear bandit settings in both the introduction and related work. That said, we agree that including a more explicit discussion of the multi-armed bandit case and clarifying how it differs from our setting would benefit readers—especially those more familiar with the MAB literature. We will incorporate this comparison in the revised version to better situate our contribution.
Thank you for your response and the helpful clarification. I now understand your point. I believe this explanation will assist readers in understanding the contribution. While I am personally not fully convinced of the importance of the problem setting, this is merely a subjective view. As the contribution of the paper is solid and well-executed, I will maintain my current score.
This paper shows lower bounds for linear contextual bandits with bounded rewards under various assumptions. The progress is solid, which is the strength of this paper.
Overall, this paper offers a correct and solid solution to an open problem whose significance is hard to gauge.
On the other hand, the paper's writing and clarity can be improved. As a minor note, (3.1) is unclear regarding "variance = " because it can be interpreted in two ways: (i) the authors are defining the symbol, or (ii) the authors are claiming that the variance happens to equal a predetermined value.
Furthermore, this paper's significance is not super clear. Specifically, given the standard lower bound of , the gap between the upper and lower bounds is only vs . (The lower bound from prior work (Theorem 1.2) focused on a generic hypothesis class rather than the linear bandit.) So, the only remaining question was whether one can achieve something better than , but of course this 'something' has to be of order when . So, it was highly unlikely that the optimal rate would involve anything other than .
On the other hand, the techniques developed in this paper could be useful for other research.
Our suggestion for the next submission is to perhaps show the lower bound for the generic hypothesis class or MDPs.