High Probability Bounds for Cross-Learning Contextual Bandits with Unknown Context Distributions
Abstract
We give a nearly optimal high-probability bound for cross-learning contextual bandits with unknown context distributions.
Reviews and Discussion
The submission studies contextual bandits with cross learning. Previously, the existing regret bound held in expectation. The submission refines the regret analysis so that the regret bound holds with high probability. The main contribution is to show how the weak dependency structure can be exploited to solve a concentration difficulty in the previous analysis.
Strengths
- The submission points out the difficulty that prevents the previous work from achieving a bound with high probability (lines 167–176).
- Identifies the weak dependency between epochs (line 386).
- Devises a new technique to resolve the unboundedness issue induced by the weak dependency (the treatment of the $\mathrm{Bias}_{5,e}$ term in lines 395–411).
Weaknesses
- (1) Section 3.2 contains several subtopics, such as the regret decomposition, the discussion of each decomposed term, and the analysis strategy for the challenging term. A better editorial layout would improve the readability.
- (2) The sentence “Notably, …” in line 111 is confusing. It seems unrealistic to be able to observe the loss for every context. It also does not match the algorithm’s (Algorithm 1) behavior.
Questions
- (1) Is the concept class a finite set? If so, what is the reason for assuming a finite concept class? Practically speaking, the contextual information would typically be a vector in a compact set, as it is very unlikely to see two identical users.
- (2) Where is the variable in Theorem 1 that characterizes the property of the concept class? How does this variable appear in the bound proved in this submission?
- (3) Could you please provide more evidence or further discussion of the applicability of the technique developed here so that we can better appreciate its potential?
Dear reviewer 8LjP:
We sincerely thank the reviewer for their suggestions and positive feedback. Below, we provide detailed responses to the reviewer’s comments.
Question: What is the reason for assuming a finite concept class?
Response:
We would like to point out that assuming a finite concept class is a common practice in contextual bandit research. While the real context space is often continuous, it can be discretized into a finite set. In this way, the finite concept class serves as a foundational model, much like the tabular MDP framework in reinforcement learning.
Of course, the discretization process involves a tradeoff: finer discretization reduces discretization error but increases the size of the concept class, leading to larger regret. However, the cross-learning structure in our setting entirely eliminates this issue, providing further justification for focusing on finite concept classes. Please see our response to the next question for more details.
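The discretization tradeoff mentioned above can be made concrete with a small sketch (purely illustrative; the one-dimensional context space, the `discretize` helper, and all parameters are assumptions of this sketch, not from the paper): mapping a continuous context in [0, 1] to one of m bins yields a finite concept class of size m, and finer grids shrink the discretization error while growing the class.

```python
import numpy as np

def discretize(context: float, m: int) -> int:
    """Map a context in [0, 1] to one of m bins (the finite concept class)."""
    return min(int(context * m), m - 1)

rng = np.random.default_rng(0)
contexts = rng.uniform(size=5)
for m in (4, 16):
    # Larger m: smaller discretization error (roughly 1/m for Lipschitz
    # losses) but a concept class of size m, which is the tradeoff the
    # response refers to. A size-independent regret bound removes it.
    bins = [discretize(c, m) for c in contexts]
    assert all(0 <= b < m for b in bins)
```

With a regret bound that is independent of the class size, m can be taken arbitrarily large, which is the point the response makes.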
Question: Where is the variable in Theorem 1 that characterizes the property of the concept class?
Response:
We thank the reviewer for raising this excellent question, which touches on one of the most interesting aspects of cross-learning bandits. Indeed, in most cases, the results depend on the size of the concept class. In the vanilla contextual bandit setting, the results typically have a polynomial dependence on the size of the concept class.
However, in our problem, thanks to the cross-learning structure, this polynomial dependence on the size of the concept class is entirely eliminated. As a result, our final regret bound is completely independent of the size of the concept class, which is why Theorem 1 does not need to explicitly characterize this property.
As mentioned in the response to the previous question, the cross-learning structure allows us to bypass the discretization issue for finite concept classes. This is because our result is completely independent of the size of the concept class, enabling arbitrarily fine discretization and resolving this concern entirely.
Question: The statement in line 111 that we can observe the loss for every context seems confusing and does not match the behavior of the algorithm.
Response:
We would like to clarify that this is precisely the core of the cross-learning structure. The cross-learning structure explicitly assumes that we can observe the loss for every context. As discussed in the related works section, this structure is common in practice, with examples including bidding in online auctions, sleeping bandits, repeated Bayesian games, and dynamic pricing.
Regarding the algorithm’s behavior, we respectfully disagree with the reviewer’s assessment. The algorithm indeed matches this assumption since it is explicitly designed for the cross-learning structure.
Question: Readability of Section 3.2
Response:
We thank the reviewer for pointing this out. To improve readability, we have restructured the paper. In the revised version, Section 3.2 has been expanded into a standalone chapter, further divided into three subsections, each focusing on a single topic. This restructuring aims to make the paper more organized and easier to follow. The updated manuscript will be uploaded shortly.
We once again thank the reviewer for their valuable suggestions and positive feedback. Your comments have helped us improve the structure and readability of our paper. We hope our responses have addressed your concerns.
Thank you for the feedback. After going through the reviews and all feedback replies, I will keep my score for now. Thank you!
The paper studies cross-learning in contextual adversarial bandits, where the learner observes the losses of all contexts in each round. Recent work by Schneider et al. proposed an algorithm with a regret upper bound that holds only in expectation. The paper studies the same algorithm and proves that the regret upper bound holds with high probability.
Strengths
- The paper proves a high-probability regret bound, which is stronger than the in-expectation bound in the literature.
- The analysis uses a nice observation that the different epochs in the algorithm are only weakly dependent, which makes it possible to prove a small bound on the cumulative bias across all epochs.
- While standard martingale inequalities cannot directly upper bound the cumulative bias, a novel technique is proposed to address this.
Weaknesses
Can the reduction in [1] be used to map the multi-context problem to a single-context problem? The technique is proposed for non-adversarial losses; however, the action-set map from distributional to fixed should not be affected by that. I understand that the paper only focuses on analyzing an existing algorithm. However, a comparison with such a technique in the related work is needed to justify the use of this algorithm or to suggest alternative techniques for addressing the problem.
[1] "Contexts can be cheap: Solving stochastic contextual bandits with linear bandit algorithms." The Thirty Sixth Annual Conference on Learning Theory. PMLR, 2023.
Questions
please see weaknesses
Dear Reviewer uDyz,
Thank you for your positive and encouraging feedback. We greatly appreciate you bringing to our attention an interesting piece of work that we had previously overlooked—it has truly broadened our perspective.
Unfortunately, the mentioned work cannot be directly applied to our problem. However, whether it could be adapted through certain modifications to address our problem is an intriguing question. We have discussed this point in detail in the related works section of the revised version of our paper, which will be uploaded shortly.
Once again, we sincerely thank you for your positive comments and for pointing out this valuable reference.
Thank you for your response.
This paper addresses the challenge of achieving high-probability regret bounds in the adversarial contextual bandit framework, where the learner encounters varying contexts and must minimize cumulative loss over time. The focus is on "cross-learning" contextual bandits, where learners can observe losses for all possible contexts, not just the current one.
The results leverage weak dependencies between epochs and refine existing martingale inequalities by exploiting interdependencies in observations. This analysis ultimately shows that the algorithm is effective in adversarial settings, even with unknown context distributions.
Strengths
The paper proposes a new look at an existing problem and provides a completely novel analysis in their work. The ideas and techniques proposed are completely new and can be of independent interest. This is particularly true for the martingale concentration result.
Weaknesses
See questions.
Questions
While the result is good, I am uncertain about its significance because neither the problem nor the algorithm is new. There are no extensions of this result and no experiments. I am not sure if this standalone result is significant enough to be published at a premier conference.
Dear reviewer NZtQ:
We sincerely thank the reviewer for their valuable suggestions. Below, we address the reviewer’s concerns in detail.
Question: Significance of our results
Response:
We appreciate the reviewer’s accurate understanding of our results and their concerns about our work's significance. We would like to highlight that it is a common practice in adversarial bandit research to focus solely on providing high-probability bounds, as this itself constitutes a significant contribution (e.g., [1,2,3]).
From a technical perspective, as noted by reviewer Sstz, deriving high-probability bounds for existing algorithms in bandit research often requires neat observations and techniques. From a results perspective, high-probability bounds are particularly important in adversarial bandit settings because these scenarios focus on worst-case outcomes, where even low-probability events cannot be ignored. Thus, providing high-probability bounds, rather than just expected regret bounds, is critical for addressing the worst-case nature of adversarial bandits.
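The point above about worst-case outcomes can be illustrated with a toy regret distribution (purely illustrative, not from the paper): a policy whose regret is 0 with probability $1-1/\sqrt{T}$ and $T$ otherwise has a sublinear expected regret, yet suffers linear regret with non-negligible probability.

```python
import math

T = 10_000
p_bad = 1 / math.sqrt(T)        # probability of the bad event: 1%
expected_regret = p_bad * T     # = sqrt(T) = 100, looks sublinear

# The expected regret is only sqrt(T), yet with probability 1% the regret
# is linear in T. This is exactly the kind of low-probability event that
# a worst-case adversarial guarantee cannot ignore, and that a
# high-probability bound rules out.
```

A high-probability bound of the form "regret at most $B$ with probability $1-\delta$" excludes such distributions, while an in-expectation bound does not.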
Regarding the lack of experiments, we acknowledge that this is a limitation. However, we respectfully note that in theoretical works on adversarial bandits, especially those focusing on high-probability bounds, it is common practice to omit experiments. This is because our proofs are based on rigorous mathematical arguments without relying on any unrealistic assumptions or approximations, and they remain effective under worst-case scenarios. We hope the reviewer will consider this aspect.
To further emphasize our contributions, we have restructured the paper and added a Technical Overview section in the introduction to discuss our technical contributions in detail. The revised version of the manuscript will be uploaded shortly.
Once again, we thank the reviewer for their insightful suggestions, which have helped us improve the structure of our paper and better highlight its contributions. We hope our response has addressed the reviewer’s concerns and that you will consider increasing your support for the paper.
References:
[1] Luo, H., Tong, H., Zhang, M., & Zhang, Y. (2022). Improved High-Probability Regret for Adversarial Bandits with Time-Varying Feedback Graphs. International Conference on Algorithmic Learning Theory.
[2] Neu, G. (2015). Explore no more: Improved high-probability regret bounds for non-stochastic bandits. Neural Information Processing Systems.
[3] Bartlett, P.L., Dani, V., Hayes, T.P., Kakade, S.M., Rakhlin, A., & Tewari, A. (2008). High-Probability Regret Bounds for Bandit Online Linear Optimization. Annual Conference Computational Learning Theory.
Dear Reviewer NZtQ,
We sincerely appreciate your time and attention. We hope our responses have addressed your concerns and that you might consider increasing your support for our paper. If you have any further questions, please don't hesitate to ask us!
The paper studies adversarial contextual bandits in a special setting where, when the algorithm plays an arm, its loss can be observed under all contexts. The goal, as in classical adversarial bandit problems, is to minimize the regret against the loss of the best arm in hindsight. The paper focuses on the setting where the loss sequence is adversarial and the context is stochastic with an unknown distribution. A recent work of SZ [NeurIPS’23] designed an algorithm with a near-optimal expected regret bound in this setting, with a dependence on the number of arms. This paper conducts a renewed analysis of the algorithm in SZ [NeurIPS’23], and the main result is that the algorithm actually achieves the same regret bound with high probability.
The main technique of the paper is heavily influenced by the previous work of SZ [NeurIPS’23]. In a nutshell, the low-regret guarantee of the algorithm crucially relies on the concentration of an unbiased estimate of the loss. Here, this quantity cannot be computed exactly since the distribution of the context is unknown. The key idea of SZ [NeurIPS’23] is to commit to two steps for each EXP3 step and use one of them to estimate the distribution of the context. On top of that, this paper further utilizes the weak dependency between epochs and derives a martingale argument to get a high-probability regret bound.
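The two-step idea can be caricatured in a few lines. This is a toy sketch under many simplifying assumptions (uniform random losses, a loss vector shared across contexts, no epoch schedule, and all parameter names invented here), not the actual SZ [NeurIPS’23] algorithm: alternate rounds are spent purely on estimating the context distribution, while the remaining rounds run importance-weighted EXP3 updates that, thanks to cross-learning, touch the weights of every context.

```python
import numpy as np

rng = np.random.default_rng(1)
K, C, T = 3, 4, 2000                          # arms, contexts, horizon
true_dist = np.array([0.4, 0.3, 0.2, 0.1])    # unknown context distribution
eta = np.sqrt(np.log(K) / (K * T))            # standard EXP3 learning rate

# One EXP3 weight vector per context; cross-learning means each round's
# loss estimate can update every context's weights at once.
weights = np.ones((C, K))
counts = np.zeros(C)   # empirical context counts from "estimation" rounds

for t in range(T):
    ctx = rng.choice(C, p=true_dist)
    if t % 2 == 0:
        # Estimation step: spend this round only on learning the
        # context distribution (the "commit two steps" idea).
        counts[ctx] += 1
        continue
    # Play step: sample an arm from the current context's EXP3 distribution.
    p = weights[ctx] / weights[ctx].sum()
    arm = rng.choice(K, p=p)
    loss = rng.uniform(size=K)         # stand-in for the adversary's losses
    est = np.zeros(K)
    est[arm] = loss[arm] / p[arm]      # importance-weighted loss estimate
    # Cross-learning: the single observation updates all contexts' weights.
    weights *= np.exp(-eta * est)

emp_dist = counts / counts.sum()       # estimate of the context distribution
```

The real algorithm uses the estimated distribution inside the loss estimator; the sketch only shows how the rounds are split and how one observation propagates to every context.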
Strengths
In general, my opinion of this paper is positive. The paper appears to require a great deal of background to be able to parse. Despite this, I believe the paper did reasonably well in terms of explaining the existing work and its techniques. Getting high probability bounds in adversarial bandits usually requires some neat observations and technical steps. Although I’m not able to follow all the steps in the short time frame, I do think the paper contains some nice technical observations and ideas.
Weaknesses
Although I'm mostly supportive, I think I couldn’t strongly champion the paper due to the following reasons:
- The scope of contribution: although the paper does contain some neat technical observations, the contribution appears to be somewhat incremental. After all, this is a new analysis of an existing algorithm, and the new analysis does not improve the previous bound (but instead obtains a high-probability bound). Again, I do acknowledge that such contributions are non-trivial. However, I do not think it’s enough for me to champion the paper.
- If the paper is going to be mainly accepted due to the techniques: then, I do not think the paper contains a substantial amount of new ideas. I appreciate the technical observations, and I agree that the steps are non-trivial. However, if the conceptual message is not as strong, and the merits of the paper mainly lie in the techniques, then the bar would inevitably be higher.
- For a conference like ICLR, the lack of experiments could be an issue. I am not letting this affect my score since I often advocate learning theory papers. However, I do want to raise this point since it is common for ML conferences to ask for experiments.
Questions
- Should the definition of regret on page 3 be reversed? As in, you seem to be subtracting the loss of a policy from the loss of the best arm, which would be a negative value (for an algorithm with positive regret). This would change the decomposition of the regret on page 5 as well, but it seems nothing would affect correctness.
- I think the full algorithm description of the algorithm in SZ [NeurIPS’23] (or some simpler version of the description) could be shown much earlier in the paper. This would be helpful for readers who are not familiar with the previous algorithm.
- Also, stating the main theorem in a preliminary section looks very non-standard to me. I’m not letting this affect my score, but please consider re-organizing this.
- The meaning of 'with high probability' was never explained in the paper -- as in, it could mean with probability $1-\delta$ for an arbitrary $\delta$, or only with some fixed polynomially small failure probability. I think your bound gives the former, and this should be stated explicitly.
Dear Reviewer Sstz,
Thank you for your positive, thorough, and thoughtful review. Your feedback has greatly helped us improve our paper. Below, we provide detailed responses to your comments:
Question: Should the definition of regret on page 3 be reversed?
Response:
We respectfully point out that our definition is correct. Our regret definition subtracts the loss of the comparator policy from the loss of the arms chosen by the algorithm, rather than subtracting the algorithm's loss from that of the best policy. This ensures that the regret is typically non-negative.
Question: Suggestions on improving the writing of the paper
You suggested that we describe the algorithm earlier, state the main theorem outside the preliminaries, and explicitly explain the meaning of high probability.
Response:
We greatly appreciate your detailed suggestions regarding the paper's structure and presentation, as they are immensely helpful for improving the clarity of our writing. We have carefully considered your suggestions and revised the manuscript as follows:
- Reorganized the algorithm description: The algorithm description is now a standalone section, divided into two subsections. The first subsection revises Section 2.2 of the original paper to explain the intuition behind the algorithm. The second subsection revises Section 3.1 of the original paper to formally describe the algorithm.
- Moved the theorem statement out of the preliminaries: In the revised manuscript, we move the theorem statement out of the preliminaries. Instead, we provide an informal version of the theorem in the Introduction section for early context, and the formal version in the Main Result and Analysis section.
- Clarified the meaning of high probability: We restructured the theorem's statement to explicitly define the high-probability guarantee. Specifically, our result means that, for any probability parameter $\delta$, the algorithm achieves the regret bound with probability at least $1-\delta$, where the bound depends on $\delta$ only through a $\log(1/\delta)$ factor.
The revised manuscript incorporating these changes will be uploaded shortly.
Once again, we sincerely thank you for your positive, thorough, and thoughtful review. Your suggestions have significantly improved the structure and readability of our paper. We hope our response addresses your concerns.
Thanks for the response and the updated paper. I took a look at the revised paper, and I believe the presentation of the paper has been improved. The paper is now much more accessible for readers unfamiliar with the work of SZ [NeurIPS’23].
Due to my comments on the combined novelty for scope + techniques, I am keeping my overall evaluation as it is. However, I increased my confidence score by 1 to give better support for the paper.
The paper proposes an algorithm that achieves high probability regret bound (which is stronger than the expected regret bound) for the cross-learning contextual bandits under unknown context distribution by developing refined martingale inequalities.
Strengths
- The paper clearly presents the challenging point with detailed technical expressions and the novelty of the analysis.
Weaknesses
- The explanation focuses only on the technical side, without an explanation of the algorithm. I suggest the authors spend more time adding explanations and organizing the paper.
Questions
- What is the intuition behind the algorithm?
- How does the indicator function resolve the unbounded martingale inequalities?
Dear reviewer NaxM:
Thank you for your valuable feedback. We address the comments below in detail:
Question 1: Lack of explanation of the algorithm
The reviewer suggested that we proposed an algorithm, but did not elaborate on its intuition, making it challenging to understand.
Response:
We would like to clarify that we did not propose a new algorithm. Instead, we provided a new and in-depth analysis of an existing algorithm, strengthening its result from an expected regret bound to a high-probability regret bound. This point was explicitly stated multiple times in the manuscript and was well understood by other reviewers.
We are grateful for the reviewer’s feedback. To address this concern, we have rewritten the paper to make it clearer that our contribution lies in the re-analysis of an existing algorithm. A revised version of the paper will be uploaded soon. In the new version, we have included a dedicated section to introduce the existing algorithm in detail. Furthermore, to better clarify the intuition behind the work, we provide an explanation of the algorithm's underlying principles at the beginning of this section.
Question 2: How the indicator function resolves unbounded martingale inequalities
Response: As noted near the end of the original manuscript, we use the indicator function to associate the original random variable sequence with a new random variable sequence. This new sequence forms a bounded martingale, allowing us to apply standard martingale concentration inequalities. Furthermore, through this indicator-based association, we demonstrate that the original and new random variable sequences coincide with high probability. This enables us to transfer the concentration inequalities from the new sequence back to the original one.
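The indicator-based association described above can be illustrated numerically. This is a toy stand-in using Gaussian increments, not the paper's actual bias terms: truncating each increment at a threshold M yields a bounded sequence to which standard concentration (e.g. Azuma-Hoeffding) applies, while a union bound over the tail events shows that the truncated and original sequences coincide on most sample paths.

```python
import numpy as np

rng = np.random.default_rng(2)
n, M, trials = 500, 5.0, 200  # sequence length, truncation level, sample paths

coincide = 0
for _ in range(trials):
    X = rng.normal(size=n)                 # unbounded increments
    Y = np.where(np.abs(X) <= M, X, 0.0)   # indicator-truncated version
    # (Y_e) is bounded by M, so Azuma-Hoeffding applies to its partial sums;
    # P(any |X_e| > M) <= n * P(|N(0,1)| > M) is tiny, so X and Y agree
    # on the entire path with high probability.
    if np.all(X == Y):
        coincide += 1

rate = coincide / trials  # fraction of paths where truncation never fired
```

Concentration proved for the bounded sequence then holds for the original sequence up to the small probability that any truncation event occurs.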
Once again, we sincerely thank the reviewer for the constructive comments, which have helped us improve the clarity and readability of the manuscript. The revised version will be uploaded soon. We hope our response has addressed the reviewer’s concerns and that you will consider increasing your support for the paper.
Thank you for the detailed response. I increase my score to 5 for the revised manuscript, but unfortunately, I have concerns about raising the score further due to limited theoretical novelty. While deriving a high-probability bound is an interesting problem, the technical novelty needed to derive the bound is limited. The revised paper heavily relies on the results from Schneider & Zimmert 2023, and the novel part is limited to replacing the random variable with the conditionally expected term with high probability. Specifically, the paper replaces $\mathrm{Bias}_{5,e}$ with $\widetilde{\mathrm{Bias}}_{5,e}$ (a step which can be found in other literature with more refined analysis; see, e.g., Theorem 7 in [1]) and follows the analysis in Schneider & Zimmert 2023. Because the paper focuses on theoretical analysis without any novel algorithms or experiments, I believe it should include at least one newly developed lemma that may apply to many settings, not only cross-learning contextual bandits, to be published at a top-tier conference.
[1] Shamir, Eli, and Joel Spencer. "Sharp concentration of the chromatic number on random graphs $G_{n,p}$." Combinatorica 7 (1987): 121–129.
Dear Reviewer,
Thank you very much for your time, attention, and for improving our score.
However, we respectfully disagree with your comments regarding the theoretical novelty of our work. You stated that the novelty of our paper is "limited to replacing the random variable with the conditionally expected term with high probability ... replaces $\mathrm{Bias}_{5,e}$ with $\widetilde{\mathrm{Bias}}_{5,e}$." We would like to clarify that this is not the core theoretical novelty of our work. As we emphasized in the abstract, the core novelty of our paper lies in "making extensive use of the weak dependency structure between different epochs." Our primary contribution is in recognizing the weak dependency of the estimators across different epochs, which allows us to effectively bound the cumulative bias. The point you mentioned, replacing the random variable with its conditional expectation with high probability, is an additional novelty that supports this main argument.
This core novelty, "making extensive use of the weak dependency structure between different epochs," stems from an in-depth analysis of the cross-learning contextual bandit problem itself. We hope you will take this into account and reevaluate the theoretical novelty of our paper. We also hope that this will encourage you to increase your support for the paper.
Thank you again for your valuable feedback.
Dear Reviewers,
We sincerely thank all the reviewers for their valuable suggestions. Based on your feedback, we have restructured the paper, and the revised version has been uploaded. In the new version, we have made the following changes:
- Section 1: We added a Technique Overview subsection to provide a clearer explanation of our technical contributions. Additionally, we discussed the paper mentioned by reviewer uDyZ in this section and included the informal version of the theorem.
- Section 2: The new Section 2 now focuses solely on the problem formulation. Specifically, we have completely removed the theorem statement from this section.
- Section 3: A new Section 3 has been created, incorporating content from the old Section 2.2 and Section 3.1. This section provides a clearer explanation of the existing algorithm.
- Section 4: The new Section 4 is a revised version of the old Section 3. It has been divided into three new subsections, each dedicated to addressing a specific technical issue, thereby improving readability. Notably, the formal version of the theorem is now included in this section.
We sincerely thank the reviewers once again for their valuable feedback.
This paper explores adversarial contextual bandits, specifically focusing on a scenario where the loss of an arm is observable under all contexts when the algorithm selects that arm. The objective is to minimize regret by comparing the algorithm's performance to the best arm in hindsight, similar to classical adversarial bandit problems. The study examines a setting where the loss sequence is adversarial and the context is stochastic with an unknown distribution.
A prior work by SZ (NeurIPS'23) introduced an algorithm that achieved a near-optimal expected regret bound in this setting, with a dependence on the number of arms. The main contribution of this paper is a refined analysis of the SZ algorithm, demonstrating that it achieves the same regret bound with high probability. The paper builds on SZ's approach, which hinges on the concentration of unbiased estimates of the loss despite the unknown context distribution. SZ's key idea involves a two-step process for each EXP3 step, with one step dedicated to estimating the context distribution. This paper extends SZ's method by leveraging weak dependencies between epochs and incorporating a martingale argument to achieve high-probability regret bounds.
The paper provides a new analysis of an existing algorithm, focusing on obtaining high-probability regret bounds. While the technical observations are non-trivial, the contribution is incremental, as it does not improve prior bounds or introduce substantial new ideas. Reviewers highlighted that the paper's primary merits lie in its techniques, which, while appreciated, may not meet the bar for ICLR without a stronger conceptual message. Additionally, concerns were raised about the lack of experimental validation, unconventional organization (e.g., the placement of the main theorem in the preliminary section), and unclear definitions (e.g., "with high probability"). Although these issues do not compromise correctness, they detract from the paper’s clarity and impact, leading to hesitations about its suitability for acceptance. Based on the common consensus, therefore, I would recommend that the authors submit to the next suitable venue to address all the concerns once all the necessary modifications are incorporated.
Additional Comments from Reviewer Discussion
See "The revised version of the paper"
Reject