PaperHub
Score: 6.8/10
Decision: Poster · 4 reviewers
Reviewer ratings: 5, 4, 4, 4 (min 4, max 5, std 0.4)
Confidence: 3.8
Novelty: 3.0 · Quality: 3.0 · Clarity: 2.8 · Significance: 2.8
NeurIPS 2025

Improved Regret Bounds for Linear Bandits with Heavy-Tailed Rewards

OpenReview · PDF
Submitted: 2025-05-11 · Updated: 2025-10-29

Abstract

We study stochastic linear bandits with heavy-tailed rewards, where the rewards have a finite $(1+\epsilon)$-absolute central moment bounded by $\upsilon$ for some $\epsilon \in (0,1]$. We improve both upper and lower bounds on the minimax regret compared to prior work. When $\upsilon = \mathcal{O}(1)$, the best prior known regret upper bound is $\tilde{O}(d T^{\frac{1}{1+\epsilon}})$. While a lower bound with the same scaling has been given, it relies on a construction using $\upsilon = d$, and adapting the construction to the bounded-moment regime with $\upsilon = \mathcal{O}(1)$ yields only an $\Omega(d^{\frac{\epsilon}{1+\epsilon}} T^{\frac{1}{1+\epsilon}})$ lower bound. This matches the known rate for multi-armed bandits and is generally loose for linear bandits, in particular being $\sqrt{d}$ below the optimal rate in the finite-variance case ($\epsilon = 1$). We propose a new elimination-based algorithm guided by experimental design, which achieves regret $\tilde{\mathcal{O}}(d^{\frac{1+3\epsilon}{2(1+\epsilon)}} T^{\frac{1}{1+\epsilon}})$, thus improving the dependence on $d$ for all $\epsilon \in (0,1)$ and recovering a known optimal result for $\epsilon = 1$. We also establish a lower bound of $\Omega(d^{\frac{2\epsilon}{1+\epsilon}} T^{\frac{1}{1+\epsilon}})$, which strictly improves upon the multi-armed bandit rate and highlights the hardness of heavy-tailed linear bandit problems. For finite action sets of size $n$, we derive upper and lower bounds of $\tilde{\mathcal{O}}(\sqrt d (\log n)^{\frac{\epsilon}{1+\epsilon}}T^{\frac{1}{1+\epsilon}})$ and $\tilde\Omega(d^{\frac{\epsilon}{1+\epsilon}}(\log n)^{\frac{\epsilon}{1+\epsilon}} T^{\frac{1}{1+\epsilon}})$, respectively. Finally, we provide action-set-dependent regret upper bounds and show that for some geometries, such as $l_p$-norm balls for $p \le 1 + \epsilon$, we can further reduce the dependence on $d$.
Keywords
linear bandits · heavy-tailed · experimental design · regret analysis · online learning

Reviews and Discussion

Review
Rating: 5

The paper studies the stochastic linear bandit problem and improves both the upper and lower bounds on regret (for the general and finite-action cases) via a geometry-aware experimental design and finer reward constructions.

Strengths and Weaknesses

Strengths:

  1. The paper is generally well-written with clear definitions of constraints and objectives.

  2. The geometry-aware design is novel and improves the regret upper bounds in terms of the dimension of the enclosing action space.

Weakness:

  1. Numerical study is limited.

Questions

  1. Why not provide numerical studies comparing the actual regret with those of the previous algorithms listed on Page 3, in terms of both d and T? (Sometimes tighter theoretical bounds do not necessarily mean better real-time performance.) (answered)

  2. What is the regret upper bound of the proposed algorithm for the MAB model (studied by Bubeck et al., 2013)? If it is worse than that in Bubeck et al. (2013), please explain. (answered)

Limitations

Yes.

Final Justification

The authors clarified all concerns raised by the reviewer and provided some numerical studies.

Formatting Concerns

N/A

Author Response

We thank the reviewer for their feedback. Our responses to the main concerns are given as follows.

On the Specialization to Multi-Armed Bandits

The reviewer appears to have overlooked Corollary 2, where we already consider the MAB setting via the simplex action set. (Note that while choosing an action from the simplex in $\mathbb{R}^n$ may appear different from selecting one of $n$ arms, the two are equivalent when randomized decisions are allowed; see line 251.) This yields a regret bound of $\tilde{\mathcal{O}}(n^{\frac{\epsilon}{1 + \epsilon}} T^{\frac{1}{1 + \epsilon}})$, which is optimal as it matches the lower bound from [1]. To our knowledge, this is the first work on linear bandits with heavy-tailed distributions that adapts optimally to the MAB action set. We will make this point more explicit in the paper.

On Lack of Numerical Experiments

This paper is theoretical in nature and aims to understand the information-theoretic hardness of the linear bandit problem under heavy-tailed noise. Our focus is not on proposing a new algorithm for improving empirical performance over existing baselines, but rather on characterizing the fundamental statistical limits of regret minimization in this setting (similar to other important work in the area such as [1, 2, 3]). We show that sublinear regret is achievable even in high-dimensional regimes ($d \ge T^{\frac{\epsilon}{1 + \epsilon}}$), where prior work provided no non-trivial ($o(T)$) guarantees. Similarly, our application to infinite-dimensional kernel bandits (see Appendix C of the supplementary material) provides major improvements over existing works that only attained sublinear regret in highly restricted parameter regimes.

Nonetheless, to address the reviewer's concern, we conducted a simple numerical experiment with $T = 100{,}000$ and $N = 2d$, varying the dimension $d$. We set the true parameter vector as $\theta = \frac{1}{\sqrt{d}}\mathbf{1}$ and used an action set formed from a subset of the normalized hypercube. The reward noise was drawn from a shifted Pareto Type II distribution with parameters $\alpha = 2$ and $\sigma = 1$. In the table below, we report average regret over 10 repetitions of our algorithm, compared with the CRTM algorithm from [4] as a baseline, using $\epsilon = 0.5$ as input. The results indicate that as the dimension increases, our algorithm consistently outperforms the baseline. We will include the plots corresponding to this experiment in the appendix.

| d   | MED-PE Regret (Our Algorithm) | CRTM Regret (Baseline) |
|-----|-------------------------------|------------------------|
| 10  | 2429.26                       | 2030.47                |
| 20  | 4275.89                       | 3792.00                |
| 40  | 7450.68                       | 8300.18                |
| 80  | 13477.56                      | 15752.45               |
| 160 | 23009.35                      | 28878.34               |
| 320 | 44490.31                      | 59833.02               |
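For concreteness, here is a minimal sketch of this setup in Python; the centering of the Lomax (Pareto Type II) noise to zero mean and the particular hypercube subset are our assumptions for illustration, and neither MED-PE nor CRTM is reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 10, 100_000
N = 2 * d                        # number of actions
alpha, sigma = 2.0, 1.0          # Pareto Type II (Lomax) shape and scale
eps = 0.5                        # moment parameter fed to both algorithms

theta = np.ones(d) / np.sqrt(d)                        # true parameter
# Assumed action set: N random sign vectors from the normalized hypercube.
actions = rng.choice([-1.0, 1.0], size=(N, d)) / np.sqrt(d)

def pull(a):
    """One reward observation: linear mean plus centered Lomax noise."""
    noise = sigma * rng.pareto(alpha) - sigma / (alpha - 1)  # Lomax(2,1) has mean 1
    return a @ theta + noise

best_value = np.max(actions @ theta)                   # benchmark for regret

def instant_regret(a):
    return best_value - a @ theta
```

With $\alpha = 2$ the noise has infinite variance but a finite $1.5$-th absolute moment, consistent with the choice $\epsilon = 0.5$.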

References

[1] Bubeck et al. "Bandits with heavy tail". IEEE Transactions on Information Theory, 2013.

[2] Hsu and Sabato. "Heavy-tailed regression with a generalized median-of-means". ICML, 2014.

[3] Chen et al. "uniINF: Best-of-both-worlds algorithm for parameter-free heavy-tailed MABs". ICLR, 2025.

[4] Xue et al. "Efficient algorithms for generalized linear bandits with heavy-tailed rewards". NeurIPS, 2023.

Comment

Thanks for the clarification!

Comment

We thank the reviewer for the time and effort spent reviewing our paper. As the discussion phase is about to end, we would like to make sure our responses have sufficiently addressed your concerns. We look forward to your feedback.

Review
Rating: 4

Problem Setting: This paper considers the standard heavy-tailed linear bandit setting, where the noise $\eta_t$ has a bounded $(1+\epsilon)$-moment. In this scenario, the goal is to design a policy that achieves low regret despite the presence of heavy-tailed noise.

Methods: Unlike previous approaches, which mainly adopt a robust estimator (truncation/MOM/Huber regression) + UCB strategy, the strategy proposed in this paper is novel. The algorithm is divided into four main components: i) Pure Exploration: Construct a sampling distribution over the current arm set and collect a budget of samples based on this distribution; ii) Dealing with Heavy-Tailed Noise: For each arm that is sampled multiple times, apply a robust mean estimator to obtain a stable estimate of its reward; iii) Parameter Estimation: Using the estimated rewards of the arms, fit the parameter vector $\theta$ by minimizing the estimation error; iv) Exploitation: Based on the estimated parameter, perform elimination over the arm set to ensure that only arms with sufficiently high estimated rewards are retained.

Theoretical Guarantee: The paper provides regret guarantees in three different scenarios: i) General Setting: Without additional assumptions, the regret is improved from $\mathcal{O}(d T^{\frac{1}{1+\epsilon}})$ to $d^{\frac{3\epsilon+1}{2\epsilon+2}} T^{\frac{1}{1+\epsilon}}$, and a matching lower bound is also provided; ii) Finite Arm Setting: For a finite action set, the regret is improved from $\sqrt{d \log n}\, T^{\frac{1}{1+\epsilon}}$ to $\sqrt{d}(\log n)^{\frac{\epsilon}{1+\epsilon}} T^{\frac{1}{1+\epsilon}}$, tightening the dependence on the number of arms $n$; iii) Geometry-Dependent Bounds: The paper further provides geometry-dependent regret bounds, making the regret sensitive to the structure of the arm set.

Strengths and Weaknesses

Strengths:

  1. The theoretical analysis is solid, with concrete improvements in regret bounds.

  2. The paper proposes a novel adaptive regret bound for heavy-tailed linear bandits that adapts to the geometry of the arm set. Prior adaptive methods have primarily focused on the variance of the noise.

  3. The algorithm design and theoretical insights are intuitive. Previous approaches to heavy-tailed linear bandits have largely relied on techniques like truncation, median-of-means, or Huber regression, which focus on mitigating outliers in observed rewards or residuals. However, these methods generally do not take into account the effects of high-dimensional feature spaces. In contrast, the approach in this paper explicitly incorporates high-dimensional geometry into both algorithm design and analysis.

Weaknesses:

  1. While the theoretical improvements are concrete, they mainly focus on factors other than the core difficulty of heavy-tailed noise itself. Specifically, the contributions emphasize improving the dependence of the regret on the dimension $d$, the number of arms $n$, or the geometry of the arm set, rather than addressing new challenges posed by heavy-tailed noise.

  2. The paper does not include any experiments. Although the main contribution lies in theoretical regret bounds, prior work on heavy-tailed linear bandits typically includes empirical validation. Given that this paper proposes a new algorithmic framework with novel design elements, it would be particularly valuable to empirically demonstrate its effectiveness.

Questions

  1. The estimation procedure in this work can essentially be viewed as a MOM-style approach. It relies on repeated sampling of each arm to construct a robust mean estimate of the corresponding reward, resulting in a set of clean data pairs $(a, W^{(a)})$. Given that the heavy-tailed noise has already been mitigated through this robust estimation step, one natural question is: why not directly apply a standard least-squares procedure to estimate $\theta$ from the denoised data? Instead, the paper solves a minimax objective: $\widehat{\theta}_{\ell} \leftarrow \arg\min_\theta \max_{a \in \mathcal{A}} \left|\theta^{\top} a - W^{(a)}\right|$. It would be helpful for the authors to clarify the motivation behind this choice. For example, is this formulation more favorable from a theoretical analysis perspective, or does it offer improved robustness even after reward-level noise has been smoothed?

  2. The primary theoretical contribution of the paper lies in improving the regret's dependence on the dimension $d$, which suggests that the setting of large or high-dimensional $d$ is of central interest. However, many components of the algorithm (especially in the infinite-arm case) have computational complexity that depends heavily on $d$.

In particular, I suggest the authors discuss the complexity (in terms of both $d$ and $T$) of the following core subroutines:

  • learning the exploration distribution over the arm set (e.g., solving the experimental design optimization),
  • computing the robust mean estimators for all active arms,
  • solving the bilevel minimax problem for parameter estimation, and
  • executing the arm elimination step.

Such a complexity discussion would help assess the practical feasibility of the method, particularly in high-dimensional regimes where the algorithm is theoretically most attractive.

Limitations

As I mentioned in the Weaknesses section, this work doesn't include experiments, which I believe is not entirely appropriate. The heavy-tailed linear bandit problem is not a purely theoretical topic—many prior works in this area (especially those involving robust estimators like truncation, MOM, or Huber regression) include experimental validation to support their theoretical claims.

I strongly suggest including at least a small set of experiments. In particular, it would be valuable to empirically demonstrate that as the dimension $d$ increases, the proposed algorithm exhibits a significant advantage over baseline methods. Such results would not only strengthen the paper's contribution but also help validate the practical impact of the improved regret bounds in high-dimensional settings.

Final Justification

This work presents solid theoretical analysis and clear improvements in theoretical results. Although it initially lacked experimental validation, the authors have added experiments in the rebuttal and committed to including them in the final paper. Therefore, I will maintain my score and recommend acceptance.

Formatting Concerns

N/A

Author Response

We thank the reviewer for their feedback. Our responses to the main concerns are given as follows.

On Impact of Heavy-Tailed Noise on Regret Bounds

The reviewer notes that our work focuses on dependencies on $d$, $n$, and the geometry, rather than addressing new challenges introduced by heavy-tailed noise. The parameter $\epsilon \in (0, 1]$ serves as the primary complexity measure for heavy-tailed noise, and it appears explicitly in both the estimator error and the regret bounds. Understanding how these dependencies vary with $\epsilon$ is in fact one of the main fundamental challenges posed by heavy-tailed noise. Addressing this required a significant rethinking of the experimental-design-based strategy (see $\lambda_{\ell}^*$ in Algorithm 1) and its analysis, beyond the previously studied finite-variance case ($\epsilon = 1$).

On Lack of Numerical Experiments

This paper is theoretical in nature and aims to understand the information-theoretic hardness of the linear bandit problem under heavy-tailed noise. Our focus is not on proposing a new algorithm for improving empirical performance over existing baselines, but rather on characterizing the fundamental statistical limits of regret minimization in this setting (similar to other important work in the area such as [1, 2, 3]). We show that sublinear regret is achievable even in high-dimensional regimes ($d \ge T^{\frac{\epsilon}{1 + \epsilon}}$), where prior work provided no non-trivial ($o(T)$) guarantees. Similarly, our application to infinite-dimensional kernel bandits (see Appendix C of the supplementary material) provides major improvements over existing works that only attained sublinear regret in highly restricted parameter regimes.

Nonetheless, to address the reviewer's concern, we conducted a simple numerical experiment with $T = 100{,}000$ and $N = 2d$, varying the dimension $d$. We set the true parameter vector as $\theta = \frac{1}{\sqrt{d}}\mathbf{1}$ and used an action set formed from a subset of the normalized hypercube. The reward noise was drawn from a shifted Pareto Type II distribution with parameters $\alpha = 2$ and $\sigma = 1$. In the table below, we report average regret over 10 repetitions of our algorithm, compared with the CRTM algorithm from [4] as a baseline, using $\epsilon = 0.5$ as input. The results indicate that as the dimension increases, our algorithm consistently outperforms the baseline. We will include the plots corresponding to this experiment in the appendix.

| d   | MED-PE Regret (Our Algorithm) | CRTM Regret (Baseline) |
|-----|-------------------------------|------------------------|
| 10  | 2429.26                       | 2030.47                |
| 20  | 4275.89                       | 3792.00                |
| 40  | 7450.68                       | 8300.18                |
| 80  | 13477.56                      | 15752.45               |
| 160 | 23009.35                      | 28878.34               |
| 320 | 44490.31                      | 59833.02               |

On Choice of the Regression Estimator

Standard least squares estimation is not robust to heavy-tailed noise. While mean estimation under heavy-tailed noise is a well-developed area with tight theoretical guarantees, regression in this setting remains less understood. Accordingly, we estimate the scalar value $a^\top \theta$ for each arm $a$ separately using robust mean estimators, and then construct $\hat{\theta}$ to minimize the worst-case distance to these estimates.
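For illustration, the minimum-distance step can be cast as a small linear program in $(\theta, t)$: minimize $t$ subject to $\pm(a^\top\theta - W^{(a)}) \le t$ for every active arm. A minimal sketch (our own variable names and solver choice, not the paper's implementation):

```python
import numpy as np
from scipy.optimize import linprog

def minimax_fit(A, w):
    """min over theta of max_a |a^T theta - w_a|, solved as an LP in (theta, t).

    A : (n, d) array whose rows are the active actions.
    w : (n,) array of robust per-arm reward estimates W^(a).
    """
    n, d = A.shape
    c = np.concatenate([np.zeros(d), [1.0]])        # objective: minimize t
    A_ub = np.block([[A, -np.ones((n, 1))],         #  a^T theta - t <= w_a
                     [-A, -np.ones((n, 1))]])       # -a^T theta - t <= -w_a
    b_ub = np.concatenate([w, -w])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * d + [(0, None)])
    return res.x[:d]                                # the fitted theta_hat
```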

Computational Complexity Considerations

As stated in Lemma 2, computing a design for general action sets is computationally efficient using algorithms such as Frank–Wolfe, with complexity $\mathcal{O}(d \log \log d)$. In addition, computing the estimator for each arm using the truncated sample mean has linear complexity. Finding the minimum-distance estimator $\hat\theta$ involves solving a linear optimization problem, which can be done efficiently (in the infinite-dimensional setting, this can be handled through a dual problem formulation; see Appendix C and [5], Section 2.1, for more details). The main computational bottleneck is iterating over the set of active arms at each round to update estimates and perform elimination, which is standard in finite-arm algorithms. We will expand the discussion of computational complexity in the final version for greater clarity.
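As a point of reference, a minimal Frank–Wolfe iteration for the classical $G$-optimal design looks as follows; this is the standard textbook scheme for the surrogate criterion $\max_a a^\top A(\lambda)^{-1} a$, not the paper's moment-based design $M_{1+\epsilon}$, and the step size and iteration count are illustrative assumptions:

```python
import numpy as np

def g_optimal_design(X, n_iter=200, reg=1e-6):
    """Frank-Wolfe for the G-optimal design over the rows of X (shape (n, d))."""
    n, d = X.shape
    lam = np.full(n, 1.0 / n)                            # start from the uniform design
    for k in range(n_iter):
        A = X.T @ (lam[:, None] * X) + reg * np.eye(d)   # design matrix A(lam)
        A_inv = np.linalg.inv(A)
        norms = np.einsum('ij,jk,ik->i', X, A_inv, X)    # a^T A(lam)^{-1} a per action
        i = int(np.argmax(norms))                        # most uncertain action
        gamma = 2.0 / (k + 2)                            # standard Frank-Wolfe step size
        lam = (1 - gamma) * lam
        lam[i] += gamma
    return lam
```

At the optimum, the Kiefer–Wolfowitz theorem guarantees $\max_a a^\top A(\lambda^*)^{-1} a = d$.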

References

[1] Bubeck et al. "Bandits with heavy tail". IEEE Transactions on Information Theory, 2013.

[2] Hsu and Sabato. "Heavy-tailed regression with a generalized median-of-means". ICML, 2014.

[3] Chen et al. "uniINF: Best-of-both-worlds algorithm for parameter-free heavy-tailed MABs". ICLR, 2025.

[4] Xue et al. "Efficient algorithms for generalized linear bandits with heavy-tailed rewards". NeurIPS, 2023.

[5] Camilleri et al. "High-dimensional Experimental Design and Kernel Bandits". ICML, 2021.

Comment

Thank you for your detailed response. I suggest including experimental results in the revised version. I will maintain my current score.

Review
Rating: 4

This paper studies linear bandits with heavy-tailed rewards. In particular, it shows that the previous minimax lower bound is not constructed under the same assumption as the upper bound. Therefore, although existing studies claim that the upper and lower bounds already match, there is still a gap that needs to be further closed. This paper further provides an improved algorithm.

Strengths and Weaknesses

Strengths:

Although existing studies claim that linear bandits with heavy-tailed rewards have already been solved, in the sense that the upper and lower bounds match, this paper shows that the previous analysis is actually not tight. I think that this is novel.

The paper is well-written in general

Weaknesses:

The proof skips many steps. For example, in the proof of Theorem 1, although I tried to derive these equations by myself and finally managed to validate their correctness, it took me a while. I suggest writing out the full proofs in the appendix.

The upper and lower bounds do not match (although the result is already a significant improvement).

Questions

I am wondering whether it is the lower bound or the upper bound that is still not tight. My guess is that the upper bound is not tight. Let $\epsilon = 0$; then your upper bound $\mathcal{O}(d^{\frac{1+3\epsilon}{2(1+\epsilon)}} T^{\frac{1}{1+\epsilon}})$ becomes $\mathcal{O}(d^{\frac{1}{2}} T)$. However, according to your problem setting in Section 1.1, the true expected reward is bounded, so even if we select the action with minimum reward (the worst policy), the regret is still $\mathcal{O}(T)$. Therefore I think that the upper bound is not tight and can be further improved. I hope the authors can provide some ideas on further improvements.

Limitations

I do not have comments on limitations. In general, this is a good paper.

Final Justification

After reading the authors' rebuttal, I have decided to maintain my score. In general, this paper makes an improvement compared to existing works.

Formatting Concerns

No concerns

Author Response

We thank the reviewer for their feedback. Our responses to the main concerns are given as follows.

On Compact Proof Steps

We believe that the reviewer may have missed the uploaded supplementary document, which contains the full proofs. In other words, the suggestion to "write the proof in appendix" has already been implemented. Please let us know if we misunderstood your meaning.

On Intuition for Tightness/Looseness of Bounds

We agree that the upper bound is likely to still have room for improvement, though we also expect that attaining such improvements would require major and challenging modifications to both the algorithm and the analysis. In Lemma 2 we upper bound the moment-based design criterion $M_{1+\epsilon}(\lambda;\mathcal{A},\gamma,\beta)$ by comparing it with the classical $G$-optimal design value, which geometrically corresponds to the minimum-volume enclosing $\ell_{2}$ ellipsoid. This surrogate introduces the factor $d^{\frac{1+\epsilon}{2}}$. However, because the natural geometry for heavy-tailed rewards is determined by the $\ell_{1+\epsilon}$ norm, the appropriate covering body is the minimum-volume enclosing $\ell_{1+\epsilon}$ superellipsoid. Translating this geometry into a design argument suggests that the optimal dimension factor in the regret should scale as $d^{\epsilon}$, which would match the lower bound. However, we leave this challenging direction for possible future work, and reiterate that our paper advances both the best known upper bounds and the best known lower bounds. We will further highlight the above open problem in the revised paper.
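A back-of-the-envelope view of where this $d^{\frac{1+\epsilon}{2}}$ factor can arise (our heuristic reading, not the statement of Lemma 2):

```latex
% Kiefer--Wolfowitz: a G-optimal design \lambda^* satisfies
%   \max_{a \in \mathcal{A}} \|a\|_{A(\lambda^*)^{-1}}^{2} = d .
% Raising this \ell_2-type guarantee to the (1+\epsilon)-th power gives
\max_{a \in \mathcal{A}} \|a\|_{A(\lambda^*)^{-1}}^{1+\epsilon}
  \;\le\; d^{\frac{1+\epsilon}{2}} ,
% i.e., the extra dimension factor incurred by using the \ell_2 geometry
% as a surrogate for the natural \ell_{1+\epsilon} geometry.
```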

Comment

Thanks for your response. I missed the supplementary material. I also agree with your intuition about tightness/looseness of bounds. Therefore, I would like to maintain my score.

Review
Rating: 4

This paper focuses on stochastic linear bandits with heavy-tailed rewards, where rewards have a finite $(1+\epsilon)$-absolute central moment. It addresses the limitations of prior work by improving both upper and lower bounds on minimax regret. The authors propose a new elimination-based algorithm (MED-PE) guided by experimental design, achieving a better regret bound with improved dependence on dimension $d$ for $\epsilon \in (0,1)$ and recovering optimal results for $\epsilon = 1$. They also establish stricter lower bounds, revealing a dimension-dependent gap between multi-armed and linear bandits. Additionally, results for finite action sets and specific geometries (e.g., $l_p$-norm balls) are provided, narrowing the gap in existing literature.

Strengths and Weaknesses

Strengths: The paper provides new lower bounds for the general case and the finite-arm case under the heavy-tailed linear bandit setting. This work also gives a new phased elimination-style algorithm that achieves a better regret rate. Overall, the paper is technically sound.

Weakness: The paper is mathematically heavy. Although the introduction is generally well-written, the presentation of the main text can be greatly improved. The current version is a bit hard to follow.

问题

  1. It would be better if the authors could provide an intuitive explanation of the lower bound for the general case. Why is its dependence on $d$ equal to $d^{2\epsilon/(1+\epsilon)}$? Where does this "2" come from? The lower bound for the finite-arm case also needs an intuitive explanation: where does the extra $(\log n)^{\epsilon/(1+\epsilon)}$ come from?

  2. Intuition is also needed on why the new algorithm can achieve an upper bound of order $d^{(3\epsilon+1)/(2\epsilon+2)}$. Currently, I cannot get any insight into this.

  3. Quantities like $\mathcal{M}(\mathcal{A})$ and $M_{1+\epsilon}(\lambda; \mathcal{A}, \gamma, \beta)$ are not well explained. What is the intuition behind minimizing the moment of $a^\top A^{(\gamma)}(\lambda)^{-1} x$?

  4. Algorithm 1 is not well explained in the main text. What does $a^\top A^{(\gamma)}(\lambda_\ell)^{-1} x_s y_s$ stand for?

  5. In the appendix, should the probability of $y(x) = 0$ be $1 - \ldots - 2\sqrt{d}\Delta$ instead of $1 - \ldots - 3\sqrt{d}\Delta$ after line 736? Also, the second line of the eqnarray after line 739 seems to be missing $\gamma^{1/\epsilon}$ in the first term.

Limitations

Yes

Final Justification

The authors have addressed most of my questions, so I increased my rating.

Formatting Concerns

NO.

Author Response

We thank the reviewer for their feedback. Our responses to the main concerns are given as follows.

Intuition on Regret Terms

While the expressions $d^{\frac{1+3\epsilon}{2(1+\epsilon)}}$ and $d^{\frac{2\epsilon}{1+\epsilon}}$ may seem complicated or arbitrary, they can be understood as what we obtain by substituting $n = 2^{\mathcal{O}(d)}$ into simpler $n$-dependent expressions, where $n$ is the number of arms. For the upper bound with $d^{\frac{1+3\epsilon}{2(1+\epsilon)}}$, the $n$-dependent expression is given in Corollary 1, which is easier to interpret: the $\sqrt{d}$ factor is analogous to a matching term in the sub-Gaussian case, and the $(\log n)^{\frac{\epsilon}{1+\epsilon}} T^{\frac{1}{1+\epsilon}}$ term captures the increased difficulty compared to the usual $\sqrt{T \log n}$ term (i.e., $\epsilon = 1$). A similar discussion applies to the lower bound (via Theorem 2), as we elaborate on in the next response item.
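Concretely, substituting $\log n = \Theta(d)$ (the regime $n = 2^{\mathcal{O}(d)}$) into the two $n$-dependent rates and keeping only the dimension factors gives:

```latex
\sqrt{d}\,(\log n)^{\frac{\epsilon}{1+\epsilon}}
  \;\asymp\; d^{\frac{1}{2} + \frac{\epsilon}{1+\epsilon}}
  \;=\; d^{\frac{(1+\epsilon) + 2\epsilon}{2(1+\epsilon)}}
  \;=\; d^{\frac{1+3\epsilon}{2(1+\epsilon)}},
\qquad
d^{\frac{\epsilon}{1+\epsilon}}\,(\log n)^{\frac{\epsilon}{1+\epsilon}}
  \;\asymp\; d^{\frac{2\epsilon}{1+\epsilon}} .
```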

Further Discussion on Dimension Dependence in the Lower Bound

The standard sub-Gaussian noise setting provides further intuition for our lower bound under heavy-tailed noise. In the sub-Gaussian case, the regret lower bound for linear bandits with a finite action set of size $n \le 2^{\mathcal{O}(d)}$, ignoring the time horizon, is $\tilde\Omega(\sqrt{d \log n})$, which arises from the fact that the KL divergence between two Gaussian distributions with a mean gap of $\Delta$ and constant variance is approximately $\Delta^2$. Optimizing over $\Delta$ yields the stated lower bound (see [1], Chapter 24).

In the heavy-tailed noise setting, a simple calculation shows that the KL divergence scales as $\Delta^{\frac{1+\epsilon}{\epsilon}}$ under a bounded $(1+\epsilon)$-absolute moment. Substituting this into the standard information-theoretic lower bound argument and optimizing over $\Delta$ yields a dependence of $d^{\frac{\epsilon}{1+\epsilon}} (\log n)^{\frac{\epsilon}{1+\epsilon}}$ for finite action sets. Taking $n = 2^{\mathcal{O}(d)}$ then recovers the general dimension-dependent lower bound presented in the paper.
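A compressed version of this optimization, as a sketch of the standard change-of-measure argument rather than the paper's exact proof: with per-round KL of order $\Delta^{\frac{1+\epsilon}{\epsilon}}$, the learner must either spend about $\log(n)/\Delta^{\frac{1+\epsilon}{\epsilon}}$ rounds to distinguish the $n$ alternatives or incur regret of order $\Delta T$, so

```latex
\text{Regret}
  \;\gtrsim\; \max_{\Delta}\, \Delta \cdot
    \min\!\Big( T,\; \tfrac{\log n}{\Delta^{(1+\epsilon)/\epsilon}} \Big)
  \;\asymp\; \Delta^{\star} T
  \quad\text{with}\quad
  \Delta^{\star} \asymp \Big( \tfrac{\log n}{T} \Big)^{\frac{\epsilon}{1+\epsilon}},
  \qquad\text{i.e.}\quad
  \text{Regret} \;\gtrsim\; (\log n)^{\frac{\epsilon}{1+\epsilon}}\, T^{\frac{1}{1+\epsilon}} .
```

For $\epsilon = 1$ this recovers the familiar $\sqrt{T \log n}$ rate.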

Further Discussion on the Upper Bound and Algorithm

The moment-based quantity $M_{1+\epsilon}(\lambda; \mathcal{A}, \gamma, \beta)$ is the central action-set-dependent term appearing in both our estimator and regret bounds. It arises naturally from the assumption that the noise has a bounded $(1+\epsilon)$-absolute moment. Some intuition for this quantity is discussed in lines 198–202. The objective of minimizing the moment of $a^\top A^{(\gamma)}(\lambda)^{-1} x$ in the experimental design is motivated by Lemma 1, where a smaller $(1+\epsilon)$-moment directly leads to a tighter estimator error bound (see line 213).

Furthermore, $M^*_{1+\epsilon}$ denotes the worst-case moment over all subsets $\mathcal{V}$ of the action set (via $\max_{\mathcal{V}}$), and the best design $\lambda$ for that subset (via $\min_{\lambda}$), as defined in line 223. The matrix $A^{(\gamma)}(\lambda)$ is introduced in line 5 of Algorithm 1 and again in line 221 of the main text. The expression $A^{(\gamma)}(\lambda)^{-1} x y$ can be interpreted as a single-sample least squares estimator, which is then robustified through a robust mean estimation subroutine (e.g., truncated means) for each arm. We will expand the discussion of these moment-based quantities in the final version.
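To see the single-sample least squares interpretation (a heuristic that ignores the $\gamma$-regularization, writes $A(\lambda) = \mathbb{E}_{x \sim \lambda}[x x^\top]$, and assumes zero-mean noise $y = x^\top \theta + \eta$):

```latex
\mathbb{E}_{x \sim \lambda}\big[\, a^\top A(\lambda)^{-1} x \, y \,\big]
  \;=\; a^\top A(\lambda)^{-1}\, \mathbb{E}_{x \sim \lambda}\big[ x x^\top \big]\, \theta
  \;=\; a^\top A(\lambda)^{-1} A(\lambda)\, \theta
  \;=\; a^\top \theta ,
```

so each sample gives an unbiased (but heavy-tailed) estimate of $a^\top \theta$, which the robust mean subroutine then stabilizes.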

On Questions Regarding Appendix B

The inequality following line 739 uses the fact that $\gamma \le 1$, implying $\gamma^{\frac{1}{\epsilon}} \le 1$. We thank the reviewer for pointing out the typo in line 736; it should read $-2\sqrt{d}\Delta$.

References

[1] Lattimore and Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.

Comment

Thank you for your time and your response. The response helps. I choose to increase the score to 4.

Final Decision

The paper analyzes the linear bandit problem with heavy-tailed rewards, and tightens the existing regret lower bounds by refining previous analyses. The paper also proposes a new algorithm which operates in phases, selecting arms according to a newly proposed experimental design (which minimizes the confidence interval around a robust estimator of the parameter) and eliminating suboptimal arms at the end of each phase. The regret upper bound of the proposed algorithm nearly matches the minimax rate.

Overall, reviewers agree that the paper tackles an important problem by refining existing lower bounds for the linear bandit problem with heavy-tailed rewards, and proposing an algorithm with a worst-case regret that nearly matches this bound.