Catoni Contextual Bandits are Robust to Heavy-tailed Rewards
Abstract
Reviews and Discussion
This paper studies contextual bandits with heavy-tailed rewards. If the variance is known, the authors use the Catoni estimator to achieve a near-optimal regret upper bound. If the variance is unknown, but the variance of the variance is bounded by the variance times a constant factor, the authors propose another method that first uses a Catoni estimator to estimate the variance and then applies a second Catoni estimator based on the estimated variance. They prove that this also achieves a near-optimal regret upper bound.
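For readers less familiar with the estimator at the core of the paper, the following is a minimal sketch of the classical (unweighted) Catoni mean estimator from robust statistics; the paper's algorithms use a weighted variant inside confidence sets, and the tuning of alpha below and the test data are generic illustrative choices, not the paper's.

```python
import numpy as np

def catoni_psi(x):
    # Catoni's influence function: psi(x) = sign(x) * log(1 + |x| + x^2 / 2).
    return np.sign(x) * np.log1p(np.abs(x) + 0.5 * x ** 2)

def catoni_mean(samples, variance, delta=0.05, iters=100):
    """Classical Catoni mean estimator: the root theta of
    sum_i psi(alpha * (x_i - theta)) = 0, found by bisection
    (the sum is nonincreasing in theta since psi is nondecreasing)."""
    x = np.asarray(samples, dtype=float)
    n = len(x)
    # Generic tuning for a known variance proxy; constants differ across papers.
    alpha = np.sqrt(2.0 * np.log(1.0 / delta) / (n * (variance + 1e-12)))
    lo, hi = x.min() - 1.0, x.max() + 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if catoni_psi(alpha * (x - mid)).sum() > 0:
            lo = mid  # root lies to the right of mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Example: heavy-tailed (Student-t, df=3) rewards with true mean 1.0.
rng = np.random.default_rng(0)
data = 1.0 + rng.standard_t(df=3, size=2000)
print(data.mean(), catoni_mean(data, variance=3.0))
```

Because psi grows only logarithmically in its argument, a few extreme samples cannot move the root by much, which is the informal reason this estimator tolerates heavy tails.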
=====After rebuttal=====
I read the rebuttal. My score remains.
Questions for Authors
- In Table 1, it seems that DistUCB has a better regret bound. The footnote explains that "DistUCB needs to estimate full reward distribution rather than just the mean"; does this mean it will have a larger ? If the function class is restricted to Gaussian or exponential, what is the difference between and ?
- In Assumption 4.1, it is assumed that . Is it common to have a small c in practice? For example, if the noise distribution is exponential, this could be very large.
- In Theorem 3.1, the regret lower bound depends on , which depends on , i.e., the policy used. This is a little strange. In my opinion, the regret lower bound should be independent of the policy used. Can you explain more about this?
- Why does Eq. (4) hold? Is a "[ ]" missing?
- I can understand why equation (3) makes sense. But why do we need to use this kind of "indirect" estimator? What if we just let be the one with minimum ? Could you explain more about this?
Claims and Evidence
Yes
Methods and Evaluation Criteria
Yes
Theoretical Claims
I did not check the detailed proofs in the appendix.
Experimental Design and Analysis
N/A
Supplementary Material
I did not check the detailed proofs in the appendix.
Relation to Existing Literature
It could be helpful in real-world applications with heavy-tailed noise, for example, the waiting time in routing systems. But no such experiments are provided in the paper.
Missing Important References
No.
Other Strengths and Weaknesses
Strengths:
- The paper's algorithms do have better regret upper bounds when the variances of the arms are small.
- The concentration result for the Catoni estimator is a novel contribution.
Weaknesses:
- In Algorithm 1, choosing the action and estimating the seem to require a double oracle, which can be inefficient. Even if Algorithm 3 makes some improvement on estimating the , it can still be inefficient when choosing .
- There are no experiments, and I wonder about the applicability of these results.
- Some parts are not clear enough; please see the questions below.
Other Comments or Suggestions
N/A
Thanks for your insightful suggestions.
Q1: In Algorithm 1, choosing the action and estimating the can be inefficient.
A1: The double oracle is a standard way to implement optimism in online RL and bandits with general function approximation [1,2,3]. In linear function approximation, we can compute the bonus efficiently [4] and choose . However, there is no counterpart for general function approximation, even when the reward range has a small bound. How to calculate the bonus efficiently is an open question even with bounded rewards.
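As an illustration of the linear special case mentioned above, here is a minimal sketch of standard OFUL/LinUCB-style optimistic action selection [4]; this is not the paper's Catoni-OFUL, and the confidence radius beta is a generic placeholder.

```python
import numpy as np

def oful_step(theta_hat, gram, actions, beta):
    """Optimistic action selection with an elliptical bonus:
    ucb(x) = <theta_hat, x> + beta * sqrt(x^T gram^{-1} x)."""
    gram_inv = np.linalg.inv(gram)
    ucb = [x @ theta_hat + beta * np.sqrt(x @ gram_inv @ x) for x in actions]
    return int(np.argmax(ucb))

def oful_update(gram, b, x, reward):
    """Ridge-regression statistics update after observing (x, reward)."""
    gram = gram + np.outer(x, x)
    b = b + reward * x
    theta_hat = np.linalg.solve(gram, b)
    return gram, b, theta_hat
```

In the linear case the bonus is a closed-form quadratic form, which is why the optimistic action can be computed efficiently; for a general function class the analogous bonus is only defined implicitly through a confidence set, which is the source of the inefficiency discussed above.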
Q2: There are no experiments. Applicability?
A2: Our work primarily focuses on theoretical analysis and serves as a first step in using the Catoni estimator for nonlinear function approximation. We provide rigorous proofs demonstrating that the weighted Catoni estimator is robust against heavy-tailed noise. Although the proposed theoretical algorithm is not computationally efficient, our analysis serves as a first step toward designing robust algorithms for general function classes. We believe the Catoni estimator and the variance-aware weighting technique offer valuable insights for practical use. Developing an efficient, practical version of the algorithm is a question for future work.
Q3: In Table 1, it seems that DistUCB has a better regret. Does the footnote mean it has a larger ? If the function class is restricted to Gaussian or exponential, what is the difference between and ?
A3: Note that DistUCB does not achieve a better regret bound, due to its polynomial dependence on the reward range . Moreover, the order on is generally incomparable to . DistUCB assumes realizability of the reward distribution, i.e., the true reward distribution belongs to the considered class of distributions. In contrast, we only assume realizability of the reward mean . Therefore, the footnote indicates that their assumption is stronger than ours.
According to Definition 5.2 in the DistUCB paper, measures the complexity of a class of distributions, while measures the complexity of a class of mean functions. These two notions are incomparable. Even in the Gaussian case, the total variation distance between and is incomparable to . Further, the noise distributions may be complicated and may not even satisfy realizability.
Q4: In Assumption 4.1, is it common to have a small c in practice?
A4: The fourth moment bound is required in most prior works (e.g., Li et al. and Huang et al.) that handle the unknown variance case. Our assumption is equivalent to theirs when is bounded. Additionally, bounded second and fourth moments are already a substantially weaker condition than a bounded range within .
Q5: In Theorem 3.1, the regret lower bound depends on the used policy.
A5: The intuition is as follows. Consider two bandit problems, each with two arms. These two bandits differ only in the reward distribution of the second arm. In both bandits, the first arm has zero variance, and the second arm has variance . The maximum regret between these two instances depends on how often the second arm is chosen, which is given by . Therefore, we obtain a lower bound directly related to the variances of the selected actions. The construction is not policy-dependent; rather, it is a simple argument that there are two problem instances and any algorithm incurs a large regret on one or the other.
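To make the two-instance intuition concrete, the following is a generic sketch of the standard two-point argument with Gaussian instances; the gap parameter Delta, the Gaussian form, and the constants are illustrative assumptions and may differ from the paper's exact construction.

```latex
% Two instances differing only in arm 2:
%   Instance 1: arm 1 gives reward 0, arm 2 ~ N(-\Delta, \sigma^2)
%   Instance 2: arm 1 gives reward 0, arm 2 ~ N(+\Delta, \sigma^2)
% Let N_2 be the number of pulls of arm 2 over T rounds. Then
\[
  \mathrm{Reg}_1 \ge \Delta\,\mathbb{E}_1[N_2],
  \qquad
  \mathrm{Reg}_2 \ge \Delta\,\bigl(T - \mathbb{E}_2[N_2]\bigr).
\]
% Only pulls of arm 2 carry information about which instance is active, so
\[
  \mathrm{KL}\bigl(\mathbb{P}_1,\mathbb{P}_2\bigr)
  = \mathbb{E}_1[N_2]\cdot
    \mathrm{KL}\bigl(\mathcal{N}(-\Delta,\sigma^2),\mathcal{N}(\Delta,\sigma^2)\bigr)
  = \frac{2\Delta^2}{\sigma^2}\,\mathbb{E}_1[N_2].
\]
% Choosing \Delta of order \sigma / \sqrt{\mathbb{E}_1[N_2]} keeps this KL divergence
% of constant order, so the two instances cannot be reliably distinguished and any
% algorithm must pay \Omega(\Delta) regret on a constant fraction of rounds in one
% of them; the resulting bound therefore scales with the variance of the actions
% the algorithm actually selects.
```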
Q6: Why does Eq. (4) hold? Is a "[ ]" missing?
A6: Yes, a "[ ]" is missing, and we will correct it in the revision.
Q7: Why the "indirect" estimator in (3)? What if we just let be the one with minimum ?
A7: We have explained the intuition in Lines 203-219 (right column). Specifically, the reason is that directly solving
does not work well. If we use the optimization above
and apply Lemma 3.2 with that is of second order in the noise, then we need to deal with the fourth moment of the noise in the analysis, even in the known variance case, since Lemma 3.2 needs a bounded variance of . In our approach, the indirect estimator helps to cancel the higher order of and makes the term easier to analyze.
[1] Jin, C., et al. Bellman eluder dimension: New rich classes of RL problems, and sample-efficient algorithms. NeurIPS, 2021.
[2] Liu, Q., et al. When is partially observable reinforcement learning not scary? COLT, 2022.
[3] Agarwal, A., et al. VOQL: Towards Optimal Regret in Model-free RL with Nonlinear Function Approximation. COLT, 2023.
[4] Abbasi-Yadkori, Y., et al. Improved algorithms for linear stochastic bandits. NeurIPS, 2011.
This paper studies the setting of variance-aware contextual bandits (or second-order bandits). Specifically, it aims to develop algorithms whose regret is upper bounded in terms of the variance of the noise. Suppose the noise lies in ; all previous works that studied this question obtained regret bounds with polynomial dependence on . This paper adopts the Catoni estimator from the robust statistics literature and develops an algorithm whose regret upper bound depends only logarithmically on .
Questions for Authors
I have the following questions:
- In the case where the variances are all known, the algorithm in the literature I referred to above can achieve regret . However, the setting therein requires the noise scale to be bounded by 1. Is it possible to improve your results so that the dominating term also scales with , while keeping the dependence on the noise scale logarithmic?
- In the current setup, the variance cannot depend on the action chosen. If the variance depends on the action chosen at each time step, will the results still work?
Claims and Evidence
Yes.
Methods and Evaluation Criteria
Yes.
Theoretical Claims
Yes, I checked the proofs for the theoretical claims. All arguments seem to be correct, except for typos.
Experimental Design and Analysis
No experiments in this paper.
Supplementary Material
Yes, I reviewed the supplementary materials, which contain the proofs.
Relation to Existing Literature
Previous works that studied the same problem all have regret bounds depending polynomially on the noise scale; this paper develops an algorithm whose regret upper bound depends only logarithmically on the noise scale.
Missing Important References
The following reference is also related to this paper. I suggest the authors provide a comparison to the results therein.
[1] Jia, Z., Qian, J., Rakhlin, A., and Wei, C.-Y. How Does Variance Shape the Regret in Contextual Bandits? NeurIPS, 2024.
Other Strengths and Weaknesses
Strengths:
- The algorithm and analysis technique allow us to obtain a regret bound that has only logarithmic dependence on the noise scale. This provides a different perspective on variance-aware algorithms for contextual bandits.
- This paper is well-written, except for a few typos.
Weaknesses:
- The techniques used in this paper are not new. This paper is not the first paper that adopts the Catoni estimator in designing bandit algorithms.
Other Comments or Suggestions
Typos:
- Line 272, left column: to
- Line 754, right-hand side: to
- In the description of Lemma 3.6 (also Lemma B.5 and the proof therein): "the minimizer" to "the maximizer"
Suggestions:
- I suggest the authors use a different notation in , since it is confusing to
Thanks for your constructive advice!
Q1: Jia et al. [1] is also related to this paper. I suggest the authors provide a comparison to the results therein.
A1: We will cite their work and provide a comparison in the revision, and we want to mention that Jia et al. is contemporaneous with our work. The main difference is that they assume the noise is bounded such that , whereas we consider heavy-tailed rewards. Besides, although both works obtain variance-dependent bounds, the focuses are distinct: they aim to obtain better regret bounds when the eluder dimension is larger than the number of actions, while we aim to obtain logarithmic dependence on the reward range. Regardless of the dependence on the reward range, for the weak adversary with revealed variance, our upper bound is incomparable to theirs . For the strong adversary, both works use the peeling technique, and thus our bound is superior in its dependence on the reward range.
Q2: The techniques used in this paper are not new. This paper is not the first paper that adopts the Catoni estimator in designing bandit algorithms.
A2: Although the Catoni estimator has previously been applied in bandit problems, it has not yet been used with nonlinear function approximation, so we provide new methods and analysis.
To the best of our knowledge, our newly designed estimator in (3), which is based on excess risk, is the first one capable of handling heavy-tailed noise in a nonlinear function class. Other robust estimators, such as Huber regression and the median-of-means estimator (Lugosi and Mendelson, 2019), are limited to linear vector space structures.
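For reference, a generic sketch of the median-of-means estimator mentioned above is given below (a standard construction, not tied to this paper); it highlights how such estimators are built around averaging raw samples in a vector space, which is what makes them hard to lift to excess-risk-based estimation over a general function class.

```python
import numpy as np

def median_of_means(samples, num_blocks=8, seed=0):
    """Generic median-of-means estimator (see e.g. Lugosi and Mendelson, 2019):
    randomly partition the data into blocks, average each block, and return
    the median of the block means. Robust under only a finite variance."""
    x = np.asarray(samples, dtype=float)
    x = np.random.default_rng(seed).permutation(x)  # random block assignment
    block_means = [block.mean() for block in np.array_split(x, num_blocks)]
    return float(np.median(block_means))
```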
Additionally, we introduce a novel analysis method to establish concentration results. Existing analyses for linear heavy-tailed bandits rely explicitly on linear structures and do not generalize to the nonlinear case considered here. Furthermore, for the unknown variance scenario, we employ a peeling technique, which allows our approach to avoid the need for approximating the noise variance using a function class.
Q3: In the case where the variances are all known, the algorithm in the literature I referred to above can achieve regret . However, the setting therein requires the noise scale to be bounded by 1. Is it possible to improve your results so that the dominating term also scales with , while keeping the dependence on the noise scale logarithmic?
A3: That is an interesting question for future work. Currently, we develop algorithms based on the OFUL structure, while Jia et al. build on the SquareCB approach. It might be feasible to connect the concentration arguments in future work. We will add this discussion to the final version.
Q4: In the current setup, the variance cannot depend on the action chosen. If the variance depends on the action chosen at each time step, will the results still work?
A4: We would like to clarify that our regret bounds in both Theorems 3.4 and 4.2 depend on the variance of the chosen actions at each time step. is defined in Line 128 (left column) and is the variance of the noise of the chosen actions.
Typos: We will correct them in the revision.
Suggestions: Thanks for the suggestions. We will use a different notation to avoid confusion.
[1] Jia, Zeyu, et al. How Does Variance Shape the Regret in Contextual Bandits? NeurIPS, 2024.
In this work, the authors propose a novel algorithm to tackle the contextual bandit problem under heavy-tailed noise assumptions. In particular, they assume that the variance of the noise is finite and deal with the scenarios in which (i) the noise variance is known to the learner, improving existing bounds in the literature, and (ii) the noise variance is unknown to the learner, obtaining nearly matching regret bounds.
Questions for Authors
I don't have any relevant questions, except for the one already stated in the related works box.
Claims and Evidence
Yes, every claim is supported by a proof.
Methods and Evaluation Criteria
There is no experimental evaluation in the paper.
Theoretical Claims
I went through some of the proofs. Each of them seems correct to me.
Experimental Design and Analysis
There is no experimental evaluation in the paper.
Supplementary Material
I quickly went through some of the proofs.
Relation to Existing Literature
As the authors discuss, their results strictly improve the ones from previous literature, especially in the known variance scenario. Regarding the adaptation to variance, it is less clear what different assumptions were made by previous works. In this work, the authors make a reasonable fourth-moment assumption, which already exists in the bandit literature [1]. Moreover, in heavy-tailed bandits, a dedicated sub-literature on adaptation to the unknown noise variance/1+epsilon moment exists [2,3,4]. I would be interested in knowing how this work relates to them: here, the focus is on the contextual scenario, but what happens if I cast these results to the unstructured MAB setting? Will they improve over the existing works?
References:
[1] Lattimore, T. A scale free algorithm for stochastic bandits with bounded kurtosis. NeurIPS, 2017.
[2] Huang, J., Dai, Y., and Huang, L. Adaptive best-of-both-worlds algorithm for heavy-tailed multi-armed bandits. ICML, 2022.
[3] Genalti, G., et al. -Adaptive Regret Minimization in Heavy-Tailed Bandits. COLT, 2024.
[4] Chen, Y., et al. uniINF: Best-of-Both-Worlds Algorithm for Parameter-Free Heavy-Tailed MABs. arXiv preprint arXiv:2410.03284, 2024.
Missing Important References
All the essential and related literature on contextual MABs has been discussed. For some additional literature on adaptation to the noise variance in heavy-tailed MABs, see the previous box.
Other Strengths and Weaknesses
The paper is well-written, and the approach's strength is clear and provable. The authors make sure to compare with the existing literature and highlight the extent of their improvement. Moreover, the proofs seem correct.
Other Comments or Suggestions
There is a typo in Table 3, parameter : "prameter" -> "parameter".
Thanks for your helpful advice!
Q1: In heavy-tailed bandits, a dedicated sub-literature on adaptation to the unknown noise variance/1+epsilon moment exists [2,3,4]. I would be interested in knowing how this work relates to them: here, the focus is on the contextual scenario, but what happens if I cast these results to the unstructured MAB setting? Will they improve over the existing works?
A1: We will cite [2,3,4] and include a discussion in the revised version. However, we want to highlight that our formulations and techniques differ significantly. Our work focuses on the contextual setting with general function approximation, whereas [2,3,4] consider the standard multi-armed bandit (MAB) setting. It is definitely possible to obtain sharper gap-dependent bounds in the MAB setting, but in the most general case, we can get bounds in the MAB literature, while one has lower bounds even in the linear setting with large action spaces [1]. Consequently, the results are not directly comparable. We will add this discussion to the final version.
Regarding algorithms, we adopt the OFUL framework combined with weighted Catoni estimators and peeling techniques. In contrast, [2] uses a skipping method based on Follow-the-Regularized-Leader, and [3,4] design adaptive algorithms capable of handling unknown and unknown moment bounds.
Q2: There's a typo in Table 3, parameter : "prameter" -> "parameter".
A2: Thanks for the correction. We will correct it in the revision.
This paper introduces contextual bandit algorithms that are robust to heavy-tailed rewards by leveraging Catoni's mean estimator from robust statistics. The authors propose two algorithms: Catoni-OFUL for the known-variance setting and VACB for the unknown-variance setting. Both algorithms achieve regret bounds that depend on the cumulative reward variance and scale only logarithmically with the reward range , improving upon prior work that exhibits polynomial dependence on . In the unknown-variance case, the authors avoid direct variance estimation by employing a peeling-based approach combined with a plug-in estimator derived from Catoni's method. A matching lower bound is established in the known-variance setting, demonstrating the optimality of the leading-order term in the regret bound.
Questions for Authors
- In the linear reward setting with known variances, does Catoni-OFUL incur slightly worse regret compared to existing algorithms such as AdaOFUL (as shown in Row 2 of Table 1)? Could the authors clarify whether this is due to the generality of the function class or an artifact of the analysis?
- Theorem 4.2 achieves a variance-dependent regret bound that matches the known-variance case (Theorem 3.4) up to a slightly worse dependence on the eluder dimension. Could the authors provide more insight into whether this additional dependence is intrinsic to the peeling-based approach, or whether it might be improved with a different algorithmic or analytical technique?
Claims and Evidence
Most of the claims are well-supported.
- Extensive comparisons (see Table 1) with existing algorithms highlight the advantages of the proposed methods in heavy-tailed reward settings.
- The concentration results for Catoni’s estimator are carefully stated and integrated into the analysis with appropriate rigor.
- The proof sketches and supporting lemmas provide a clear logical path from algorithm design to the stated regret bounds.
One minor concern: the challenges of solving the min-max optimization in Eq. (3) are acknowledged but not explored in depth. While Algorithm 3 is proposed as a more efficient alternative, the paper does not discuss its trade-offs in detail. For example, what are the theoretical or empirical pros and cons of this variant? Could a similar approach be extended to the unknown-variance setting?
Methods and Evaluation Criteria
N/A
Theoretical Claims
Overall, the proofs are technically sound based on the provided sketches and lemmas.
For the known-variance case: The authors leverage a uniform concentration inequality (Lemma 3.2) for Catoni’s estimator. This result is a non-trivial extension of prior Catoni bounds and is central to constructing valid confidence sets.
For the unknown-variance case: The authors show how to control variance-normalized excess loss without exact variances. Lemma 4.4 demonstrates that the plug-in estimator for the cumulative variance (based on Catoni’s method) remains accurate up to logarithmic factors. The authors then carefully control the contribution to regret from different uncertainty levels using a peeling argument.
Experimental Design and Analysis
There is no empirical evaluation in this paper. While acceptable for a theory-focused paper, a small experiment could have illustrated the practical benefits of robustness. Nevertheless, the theoretical evaluation is comprehensive and grounded.
Supplementary Material
I briefly went through Appendix B (Proofs for the Known Variance Setting).
Relation to Existing Literature
N/A
Missing Important References
No major omissions.
Other Strengths and Weaknesses
Novel and principled use of the Catoni estimator in a contextual bandit setting.
Other Comments or Suggestions
See above.
Thanks for your constructive suggestions!
Q1: The challenges of solving the min-max optimization in Eq. (3) are acknowledged but not explored in depth. While Algorithm 3 is proposed as a more efficient alternative, the paper does not discuss its trade-offs in detail. For example, what are the theoretical or empirical pros and cons of this variant? Could a similar approach be extended to the unknown-variance setting?
A1: Theoretically, the two algorithms have the same regret bound. Computationally, Algorithm 3 is more efficient since it picks one function from the constructed candidate set instead of solving a min-max optimization. Despite this advantage, we chose to present Algorithm 1 in the main text because it is simpler, clearer, and easier to explain, and its design is better aligned with OFUL-style algorithms, which readers might find familiar.
Algorithm 3 can be extended to the unknown-variance case with similar techniques and obtains almost the same result. We will add this extension in the revision.
Q2: There is no empirical evaluation in this paper. While acceptable for a theory-focused paper, a small experiment could have illustrated practical benefits of robustness. Nevertheless, the theoretical evaluation is comprehensive and grounded.
A2: Our work primarily focuses on theoretical analysis and serves as a first step in using the Catoni estimator for nonlinear function approximation. We provide rigorous proofs demonstrating that the weighted Catoni estimator is robust against heavy-tailed noise. Although the proposed theoretical algorithm is not computationally efficient, our analysis serves as a first step toward designing robust algorithms for general function classes. We believe the Catoni estimator and the variance-aware weighting technique offer valuable insights for practical use. Developing an efficient, practical version of the algorithm is an important question for future work.
Q3: In the linear reward setting with known variances, does Catoni-OFUL incur slightly worse regret compared to existing algorithms such as AdaOFUL (as shown in Row 2 of Table 1)? Could the authors clarify whether this is due to the generality of the function class or an artifact of the analysis?
A3: The regret bounds match in the dominant term and are worse in for the non-dominant term. The reason for this difference is that the linear vector space allows a smaller variance-based weight for regression, with in the denominator according to Lemmas B.1 and B.2 of Li et al. Their arguments do not generalize to the nonlinear structure, however, and we can only use () in the weights to balance the orders in Lemma B.2. The expense of using a larger weight is to get an additional dependence on the non-dominating term (Lines 1013-1026). How to match this non-dominating term is an interesting direction for future work.
Q4: Theorem 4.2 achieves a variance-dependent regret bound that matches the known-variance case (Theorem 3.4) up to a slightly worse dependence on the eluder dimension. Could the authors provide more insight into whether this additional dependence is intrinsic to the peeling-based approach, or whether it might be improved with a different algorithmic or analytical technique?
A4: This additional dependence comes from the peeling approach in general function approximation and also appears in [1,2]. The linear analysis for peeling in Zhao et al. is restricted to the linear vector space structure and a special form of uncertainty and thus cannot be extended to nonlinear spaces. Hence, in our analysis (Lines 332-337), we can only bound by , thus leading to the worse order, while is upper bounded by without the additional order on d.
[1] Pacchiano, A. Second order bounds for contextual bandits with function approximation, 2024.
[2] Jia, Zeyu, et al. How Does Variance Shape the Regret in Contextual Bandits? NeurIPS, 2024.
This paper provides a novel use of the Catoni estimator to adapt to variances in contextual bandits. While using the Catoni estimator is not new in bandits, using it in nonlinear function approximation settings is new, and it was done in nontrivial and interesting ways. This paper is well-written as well.
For the final version, please address comments from the reviewers including typos.