PaperHub

Rating: 7.3/10 · Poster · 4 reviewers
Scores: 5, 4, 5, 4 (min 4, max 5, std 0.5)
Average confidence: 2.8
Sub-scores: novelty 2.3, quality 3.3, clarity 3.5, significance 2.5

NeurIPS 2025

Sparse Optimistic Information Directed Sampling

OpenReview · PDF
Submitted: 2025-05-11 · Updated: 2025-10-29

Abstract

Keywords: regret minimization, sparse linear models, Information Directed Sampling, Bayesian methods

Reviews and Discussion

Review (Rating: 5)

The authors propose a new method (SO-IDS) that introduces sparsity to the IDS framework. It is an adaptation of the OIDS algorithm in Neu, Papini, and Schwartz [2024]. The key idea is to reduce the per-round computational burden by restricting sampling to a small, adaptively chosen subset of arms, rather than computing over the full action set. The authors propose two sparse variants: Uniform SO-IDS and Greedy SO-IDS. They prove that SO-IDS preserves the core theoretical performance guarantees of IDS in terms of regret bounds while significantly improving computational efficiency. The approach is validated empirically on synthetic and real-world bandit problems.

Strengths and Weaknesses

Strengths: (1) Very solid technical paper. The authors provide solid theoretical regret guarantees and proofs to justify the soundness of the sparse sampling design. (I should mention the caveat that I did not read the full proof, and I am not very familiar with the technical details of the prior work by Neu, Papini, and Schwartz.) (2) A new perspective on improving IDS efficiency through sparsity. (3) The algorithm shows significantly better computational efficiency without sacrificing performance -- this is highly relevant for large-scale online learning applications, where sparsity can allow more efficient design.

Weaknesses: (1) The proof part can be further streamlined to improve readability. (2) The greedy SO-IDS algorithm is empirically strong, but a better (more sound) motivation could be provided. (3) It is unclear how the framework generalizes to structured or infinite action spaces (e.g., linear or contextual bandits).

Questions

  1. Can the sparse sampling strategy extend to linear or contextual bandit settings, where the action space is large or continuous? If so, how would sparsity be enforced?

  2. Can the authors better explain how the framework developed in this paper depends on & differs from the prior work by Neu, Papini, and Schwartz 2024?

  3. How sensitive is the performance to the size of the subset of arms considered each round? Could the authors suggest a principled or adaptive way to select this size?

Limitations

This paper is theoretical and there is no specific discussion on limitations or potential societal impacts.

Final Justification

I’ll retain my positive rating.

Formatting Issues

NA.

Author Response

The authors thank the reviewer for their insightful comments and questions.

Regarding the comment about the motivation for the greedy SO-IDS algorithm, could the reviewer please provide us with more detail about what they mean by the greedy SO-IDS algorithm?

Regarding the comments about how the framework generalizes to structured or infinite action spaces, such as linear or contextual bandits, the setup we consider directly extends to linear bandits without sparsity and contextual bandits (see Neu, Papini and Schwartz, 2024 for instance). The definition of the information ratio naturally extends to contextual bandits by replacing all losses by their contextual counterparts. For linear bandits, the 2-information ratio can be shown to be bounded by d. The other difference is the choice of the prior and the maximum likelihood estimator. In the absence of sparsity, a uniform prior is suitable, and for the MLE analysis, the same type of covering number argument will apply.

The framework we develop in this work builds upon the analysis of the optimistic posterior and the OIDS algorithm from Neu, Papini and Schwartz (2024). The main differences are as follows:

  • We specialize their result to the sparse linear bandit setting. This requires adapting some of the arguments (choice of the priors, covering of the sparse parameter space). In this setting, we are able to match the performance of Bayesian SIDS and obtain the first frequentist algorithm that achieves the best worst-case regret in both the data-rich and data-poor regimes.
  • We replace the Best-loss-gap by the difference between the regret and the surrogate regret. Conceptually, it reflects the fact that we are not interested in our posterior placing a lot of mass on parameters with high potential rewards but on making sure that the surrogate regret is an upper bound on the true regret. Moreover, this simplifies the analysis and allows us to more easily prove data-dependent bounds.
  • We provide an analysis of the optimistic posterior valid with time-dependent (and data-dependent) learning rates. This is one of the main limitations of the work of Neu, Papini and Schwartz (2024) (and Zhang, 2022), which we remove. While the results presented here are specialized to sparse linear bandits, this analysis is very general and can easily be extended to contextual bandits or even to decision-making with side observations.

Would the reviewer be able to provide us with more details regarding the last question about the size of the subset of arms considered at each round? The algorithm considers all possible arms at each round. On the computational side, it is known that the IDS policy is supported on at most two actions, so a search algorithm can identify the IDS policy with computational complexity O(K^2). Moreover, one can approximately optimize the IDS policy by playing a mixture of an informative and a greedy action, which only requires O(K) complexity to compute.
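As a concrete illustration of the computational remark above, here is a minimal sketch (a hypothetical helper, not the paper's implementation) of the O(K^2) pairwise search for an IDS-style policy, assuming per-action regret and information-gain estimates are already given:

```python
import numpy as np

def ids_policy_two_actions(gaps, info_gains, grid=101):
    """O(K^2) search for an IDS-style policy.

    The IDS policy minimizes the information ratio
    (expected regret)^2 / (expected information gain) and is known
    to be supported on at most two actions, so it suffices to
    search over action pairs and their mixing probability.
    `gaps` and `info_gains` are per-action estimates of regret and
    information gain (placeholder inputs for illustration).
    Returns (best ratio, action i, action j, probability of i).
    """
    K = len(gaps)
    best = (np.inf, 0, 0, 1.0)
    for i in range(K):
        for j in range(K):
            # Minimize over the mixing probability on a grid; the
            # one-dimensional problem could be solved exactly instead.
            for p in np.linspace(0.0, 1.0, grid):
                reg = p * gaps[i] + (1 - p) * gaps[j]
                info = p * info_gains[i] + (1 - p) * info_gains[j]
                if info <= 0:
                    continue
                ratio = reg ** 2 / info
                if ratio < best[0]:
                    best = (ratio, i, j, p)
    return best
```

The grid over the mixing probability stands in for an exact one-dimensional minimization; the point is only that the outer search is quadratic in K.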

Comment

The authors have addressed my main question, which is the difference from the prior work. Regarding the size of the arm subset, the comment from the computational perspective is helpful - that was something I intended to ask. To reiterate: as I indicated in my confidence level, I am not an expert in this field and there are certain parts of the paper that I cannot evaluate confidently. My original rating is high compared to other reviewers, so I will retain this positive rating.

Review (Rating: 4)

This paper applies the OIDS algorithm by Neu et al. (2024) to the sparse linear bandit problem and establishes a cumulative regret bound of \tilde{\mathcal{O}}(\min\{\sqrt{sdT}, sT^{2/3}\}).

Strengths and Weaknesses

Strengths

  1. The algorithm achieves both \tilde{\mathcal{O}}(\sqrt{T}) and \tilde{\mathcal{O}}(T^{2/3}) regret bounds, making it the first algorithm to do so in the sparse linear bandit setting.

  2. The paper proposes adaptive learning rate schedules that do not require knowledge of the time horizon, which is not common in optimization-based sequential decision-making algorithms.

Weaknesses

  1. In Appendix E.3, it appears that the bound for the 3-surrogate information ratio only holds under the sparse optimal action assumption. While the authors argue that the regret lower bound is invariant under this assumption, it is unclear why the upper bound on the information ratio should also remain the same. Without this assumption, it seems the s factor must be replaced with a d factor, resulting in a regret bound of \tilde{\mathcal{O}}(dT^{2/3}), significantly worse than the stated \tilde{\mathcal{O}}(sT^{2/3}) bound. Additionally, the \tilde{\mathcal{O}}(sT^{2/3}) bound is still looser than the previously known \tilde{\mathcal{O}}((sT)^{2/3}) bound [Hao et al. 2020] by a factor of s^{1/3}.

  2. The paper’s independent contribution is not sufficiently explained. The summary of contributions at the end of the introduction gives the impression that the main contribution is setting an appropriate prior and computing the information ratio for the sparse linear bandit problem, and that the rest follows directly from Neu et al. (2024). The method of choosing history-dependent learning rates in Theorem 3, another emphasized contribution, is the same as the analysis of AdaGrad in Orabona (2019), so it is hard to call it novel, although I admit it was never explicitly mentioned in the sequential decision-making context.

Questions

  1. In Appendix C.1, why does the prior assign additional probability to k-sparse vectors for k = 1, \ldots, s-1? If the sparsity level s is known, wouldn't a uniform distribution over \Theta be more natural?

  2. In the proof outline of Theorem 1 in the main text, the equation after line 245 contains a term \mathbb{E}\left[\frac{\eta}{\lambda_T}(L_T^{(1)}(\theta_0) - L_T^{(1)}(\theta_T))\right]. While this term is not bounded in the outline, it seems that it should be included in the statement of Lemma 4. Could the authors clarify this point?

Limitations

yes

Final Justification

I had two concerns about the original submission, and they are mostly resolved by the authors' answers.

  1. The analysis requires an assumption (the sparse optimal arm assumption) to achieve the O(\mathrm{poly}(s)T^{2/3}) regret bound, but it was not introduced in the main paper and not properly discussed even in the appendix. The authors promised that they would make the assumption explicit and, instead of presenting their result as a best-of-both-worlds bound, present it as an extension of Hao et al. (2021) from Bayesian regret to frequentist regret. I am satisfied with this change and have no further concerns regarding this point. I only hope the other reviewers are aware of this change, as one of the main contributions (or at least its presentation) has been retracted.

  2. The independent contribution of this paper was vague, which was a question asked by other reviewers as well. The authors have provided kind explanations and promised revision. To this end, due to the promised change of presentation mentioned above, I think the authors would have to be careful and explicit about describing what was achieved by Hao et al. (2021), Neu et al. (2024), and what is newly achieved in this paper. While the main contribution seems to follow from combining the analysis of the two papers, there are still additional contributions as the authors have clarified, which I think is sufficient for acceptance.

In conclusion, assuming that the contribution of the paper is explained well in the revision without any overstatements or understatements, I believe that the paper merits acceptance and have increased my score accordingly.
That said, although the authors might disagree, my current impression is that the contribution consists of several minor (yet meaningful) improvements rather than having a major one, and therefore I have not raised my score further.

Formatting Issues

None.

Author Response

The authors thank the reviewer for their insightful comments and questions.

It is true that our bound on the 3-information ratio requires the sparse optimal action property. We will make this clear in the main text of the paper. It might be the case that the sparse optimal action condition is actually not necessary for obtaining dimension-independent bounds on the 3-information ratio. However, as far as we are aware, it is not known whether this is an artifact of the current analysis or a fundamental property of the 3-information ratio.

Thank you for pointing out the sub-optimal scaling in s in the data-poor regime. We have inherited this scaling from the work of Hao, Lattimore and Deng (2021), and in particular, from their s^2 bound on the 3-information ratio. Note that this is also where the sparse optimal action property is required. This leads to a data-poor regret bound scaling with s rather than s^{2/3}. Since the initial submission of our work, we were able to obtain a tighter bound of order s for the 3-information ratio. This directly translates to a (sT)^{2/3} regret bound by setting \lambda accordingly (not only improving the results in the present paper but also the rates for the Bayesian algorithm of Hao, Lattimore and Deng). We will update the paper accordingly for the final version.

We will try to highlight here the independent contribution of this work. One of our main contributions is to allow the use of time-varying and data-dependent learning rates for the optimistic posterior. This is not possible in Neu, Papini and Schwartz (2024) or Zhang (2022). All existing bounds derived from the analysis of the optimistic posterior require prior knowledge of the time horizon (or the best loss in hindsight). This result is not a direct application of the methods highlighted in the AdaGrad paper and requires additional work. One could think that the regular analysis of FTRL could be applied to allow for time-varying learning rates, using the monotonicity of the dual to perform the usual telescoping of the potential. Unfortunately, this is not possible, as only part of the loss is scaled by the time-varying learning rate (compared to the whole loss in the usual analysis). This seemingly simple fact completely breaks the analysis and makes it unclear how to proceed. Our solution is to replace the monotonicity of the dual (which doesn't hold here) by Lemma 9, which still allows the telescoping of the potential. To apply Lemma 9, we need one of the losses to be non-negative. To ensure that this is the case, we introduce the maximum likelihood estimator at every time step. None of these results are standard, and they required significant work to prove. We completely agree with the reviewer that going from Theorem 2 to Theorem 3 relies on setting up the learning rate to scale with the information ratio in a way that is similar to the analysis of AdaGrad. We note that this step is only possible because we are able to use time-dependent and data-dependent learning rates.

The analysis of Lemma 3 doesn't require prior knowledge of the sparsity, and the result continues to hold even if we don't restrict the prior to a given sparsity level. As such, if we only had an upper bound on the sparsity level, we could use it to compute the prior and still adapt to the true sparsity. The true sparsity is only needed in computing the 3-information ratio, which is why we restrict the prior to s-sparse parameters. If we only wished to use the 2-information ratio, we could use the prior without any sparsity-level restriction. We agree with the reviewer that the bound of Lemma 3 would continue to hold with a uniform prior on the sparsity level s, but we would lose the nice properties pointed out before.

Thank you for the great catch on line 245; this likelihood term is indeed controlled by Lemma 4, as can be seen in its proof in Appendix B, but the term is missing from the statement of Lemma 4. We will update it for the final submission.

Comment

I deeply appreciate the authors' detailed response. While most of my concerns regarding the paper have been resolved, there is still one major concern I would like to discuss further.


I remain concerned about the requirement of the sparse optimal action assumption for achieving the \mathcal{O}(sT^{2/3}) bound. While I understand that the assumption is inherited from Hao et al. [2021], simply mentioning it in the theorem statement does not seem sufficient, as it appears to undermine the core contribution highlighted in the introduction.

The paper frequently emphasizes that SOIDS achieves a best-of-both-worlds regret bound, simultaneously matching the \mathcal{O}(\sqrt{sdT}) bound by Abbasi-Yadkori et al. [2012] in the data-rich regime and the \mathcal{O}(\mathrm{poly}(s)T^{2/3}) bound by Hao et al. [2020] and Jang et al. [2022] in the data-poor regime. However, both Hao et al. [2020] and Jang et al. [2022] do not require the sparse optimal action assumption, so I do not believe this is a fair comparison. The fact that the emphasized optimality only holds under a certain setting is never mentioned in the paper, which could be slightly misleading. Some might even question whether it is appropriate to call the result a "best-of-both-worlds" bound, since the proposed bound in the data-poor regime does not fully recover the previously known bounds, scaling with \mathcal{O}(dT^{2/3}) without the assumption.

I respectfully ask for the authors’ thoughts on this point and how they plan to make the necessity of the sparse optimal action assumption clear in the revised version.


As for my other questions, I am happy to say that the authors’ rebuttal has sufficiently addressed them.
I am glad to hear that the bound on the 3-IR has been improved. I am curious how such a bound is achieved and how the new bound depends on C_{\min}. Could the authors share the improved analysis, or at least provide a sketch of it?

I am grateful for the detailed summary of the contributions. I believe further emphasizing these contributions in the main paper would greatly strengthen the work.

I also appreciate the explanation about the prior and its dependence on the sparsity level s. I was slightly confused because, while the main paper clearly states that s is assumed to be known, the prior construction did not appear to reflect that. I have read the rebuttal to the other reviewers, and I understand what can be achieved when s is unknown. It might be helpful if the authors explicitly clarify which parts of the algorithm or parameter choices require knowledge of s and which do not, although the current presentation is already reasonably clear.

Comment

Thank you for your response. We’re pleased to hear that most of your concerns have been resolved.

Regarding your remaining concerns about the sparse optimal action condition, we would propose to make the following changes. First, rather than describing our regret bound in Theorem 2 as a best-of-both-worlds result, we will instead describe it as an extension of the Bayesian regret bound for sparse IDS in Hao, Lattimore and Deng (2021), which does not require Bayesian assumptions. We feel that this is reasonable, since the Bayesian regret bound in Hao, Lattimore and Deng (2021) also requires the sparse optimal action condition. Second, in Section 2, we will add a numbered definition/assumption block that states the definition of the sparse optimal action condition. We will move our discussion about how restrictive this assumption is to Section 2 as well. In all of our theorem/lemma/etc. statements that use the sparse optimal action condition, we will reference this numbered definition/assumption. Finally, we remark that the sparse optimal action condition is not required for Theorem 1, so the first core contribution listed in the introduction (analysis of the optimistic posterior with time-dependent and/or data-dependent learning rates) is not undermined.

The improvement to the 3-IR bound is very simple. The first equation on page 18 in Hao, Lattimore and Deng (2021) uses the bound \Vert a\Vert_2 \leq s for any s-sparse action a with infinity norm at most 1. However, any such action also satisfies the tighter bound \Vert a\Vert_2 \leq \sqrt{s}. Replacing the looser bound results in a bound on the 3-IR of order s instead of s^2. The dependence on C_{\min} is the same.
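The norm bound behind this improvement is easy to check numerically; the sketch below (illustrative, not from the paper) verifies that random s-sparse vectors with entries in [-1, 1] satisfy the tighter sqrt(s) bound:

```python
import numpy as np

def check_sparse_norm_bound(d=50, s=5, trials=1000, seed=0):
    """Numerically verify that any s-sparse vector with entries in
    [-1, 1] has Euclidean norm at most sqrt(s): the squared norm is a
    sum of s squares, each at most 1. This is the tighter bound that
    improves the 3-information ratio from order s^2 to order s."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        a = np.zeros(d)
        support = rng.choice(d, size=s, replace=False)
        a[support] = rng.uniform(-1.0, 1.0, size=s)
        # Looser bound used originally: ||a||_2 <= s; tighter: sqrt(s).
        if np.linalg.norm(a) > np.sqrt(s):
            return False
    return True
```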

Comment

Thank you for the detailed explanation.

Presenting the current work as a frequentist version of Hao et al. (2021) would indeed be valid and meaningful. I also see that the improvement to the 3-IR is already in the paper, and only the final bound was loose for no reason. I am satisfied with the answers and have no questions left about the paper. The authors agreed with me that a fundamental shift of perspective is necessary in the presentation, and fortunately, the technical results and proofs remain intact and solid.

I would like to make one more remark. (Please note that this is not a request for additional clarification; the authors have sufficiently addressed these points during the discussion period, and my point is that including them in the paper would be highly necessary.)
If the paper had originally been submitted with the explained presentation, I think I -- and probably most of the other reviewers -- would have been curious about the difficulty involved in extending the Bayesian guarantee of IDS to the frequentist guarantee of OIDS under the same problem setting. In other words, the authors would have had to explain that the work is more than simply replacing the IDS analysis in Hao et al. (2021) with the OIDS analysis of Neu et al. (2024). In addition, I don't think the original manuscript makes it clear that many parts of the analysis come from Hao et al. (2021) (e.g., using both 2- and 3-IRs, bounding the 3-IR, etc.), but it would become inevitable to clarify this if the authors revise the presentation as described in their last comment.

I will update my score after the discussion, assuming that the paper's independent contribution will be clearly explained in the revised version.

Review (Rating: 5)

This paper addresses the sparse linear bandit problem. The minimax regret in this setting depends on whether the problem instance falls within the data-rich regime or the data-poor regime. In the former, the time horizon T is significantly larger than the dimension d and the optimal regret is of order \sqrt{sdT}, with s being the sparsity level, while the latter concerns high-dimensional problems, where the optimal regret has a worse T^{2/3} rate but can be dimension-independent in some cases. While achieving near-optimal regret in both regimes with one adaptive algorithm has only been shown in the Bayesian setting (via the information directed sampling approach), this paper presents the first algorithm that achieves this feat in the non-Bayesian (frequentist) setting. The adopted approach extends the optimistic information-directed sampling algorithm of Neu et al. (2024) to the sparse linear bandit setting. It features a modified choice of posterior distribution and employs a time-dependent (or history-dependent) learning rate.

Strengths and Weaknesses

The primary contribution here is achieving non-Bayesian analogues of the regime-adaptive Bayesian bounds of Hao et al. (2021). In this manner, the paper addresses a more practically relevant setup and shows that regime adaptivity is also possible there. Additionally, the adopted time-dependent and history-dependent learning rates lend the algorithm more flexibility compared to similar approaches.

On the other hand, I feel that little has been highlighted in terms of the shortcomings of the results. For example, the data-poor bound scales with s instead of s^{2/3} as in the upper bound of Hao et al. (2020), or s^{1/3} as in the lower bound presented in the same paper. This is not a huge drawback, as the focus here is on dimension-independence (indeed, this is also the case in Hao et al. 2021); however, this should still be mentioned to clarify what is meant by 'near-optimality'. Likewise, it is important to emphasize that C_{\min} (featured in the data-poor bounds) can be dimension-dependent, depending on the geometry of the action set. Concerning the instance-dependent guarantee of Theorem 3, I find its significance a bit questionable; even if it improves upon the worst-case bound, which (interpretable/relevant) property of the instance can be exploited by this result? Overall, even if novelty is present in some parts of the algorithm and the analyses, the obtained results arguably constitute a modest leap from the current state of the art.

Minor correction: in line 94, I believe it should be C_{\min}^{-1/3}.

Questions

  • Aside from the time-dependent learning rate, was the new adjustment in the definition of the optimistic posterior (compared to Neu et al. 2024) crucial for obtaining regime-adaptive bounds in the present setup? You mentioned that it simplifies the analysis; could you expand more on that? I also wonder if the 'optimistic' label is still accurate here, as the posterior seems to lend importance to parameters under which the learner would have incurred large regret up to the present time.
  • How strict is the assumption that the sparsity level is known? Can it be dropped at the cost of a slightly worse bound as in Hao et al. (2020)?

Limitations

Aside from the brief discussion on computational challenges in the last two sections, limitations of this work are scarcely discussed as I mentioned in the strengths/weaknesses section.

Final Justification

After reading the other reviews and the rebuttal, my evaluation remains on the positive side. However, I think it is important that the authors highlight in the revised version the shortcomings that have been raised so far: namely, the gap between your bound and the lower bound in terms of the dependence on s (or how to improve it, as claimed), and the sparse optimal action assumption and its implications. Also, I think it is important to detail rigorously in the paper how Theorem 3 implies a first-order regret bound, as you claimed in the rebuttal, or some interpretable data-dependent improvement; otherwise, the significance of that result would remain unsupported. Finally, the paper can also benefit from including the discussion you provided in your rebuttal regarding the case when s is unknown.

Formatting Issues

No formatting concerns.

Author Response

The authors thank the reviewer for their insightful comments and questions.

Thank you for pointing out the sub-optimal scaling in s in the data-poor regime. We have inherited this scaling from the work of Hao, Lattimore and Deng (2021), and in particular, from their s^2 bound on the 3-information ratio. This leads to a data-poor regret bound scaling with s rather than s^{2/3}. Note that the data-poor regret bound in Hao, Lattimore and Deng (2021) also scales with s whenever the number of actions is large or infinite, which is arguably the most interesting situation. Though the lower bound in Hao, Lattimore and Wang (2020) indeed scales with s^{1/3}, the best achievable scaling is s^{2/3}. This follows from the lower bound in Jang, Zhang and Jun (2022). Since the initial submission of our work, we were able to obtain a tighter bound of order s for the 3-information ratio. This directly translates to a (sT)^{2/3} regret bound by setting \lambda accordingly (not only improving the results in the present paper but also the rates for the Bayesian algorithm of Hao, Lattimore and Deng). We will update the paper accordingly for the final version.

We will comment on the fact that the dependence of the data-poor regret bound on C_{\min} can introduce undesirable dimension dependence.

Regarding the instance-dependent bounds that we provide, it is well known that an algorithm tuned to optimize a worst-case regret bound might perform poorly in practice, since worst-case bounds are often very conservative. Conversely, a bound scaling with the actual information ratio might yield a more practical algorithm. Though it is not the subject of the present work, it can be shown that OIDS with data-dependent learning rates enjoys first-order bounds for contextual bandits without prior knowledge of the best cumulative loss. In the contextual bandit setting, it is known from the work of Neu, Papini and Schwartz (2024) that the information ratio scales with the surrogate loss of the selected action, and adapting to that scaling gives a first-order bound in which the time horizon is replaced by the cumulative loss of the best action. This can lead to significant improvements in "easy" instances in which the cumulative best loss is small. Previous work requires knowledge of this quantity to tune the learning rate. On the contrary, with our instance-dependent analysis, we can adapt to this quantity on the fly without prior knowledge.
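The kind of data-dependent schedule described above can be illustrated with a short sketch (an AdaGrad-style toy under stated assumptions, not the paper's exact lambda_t, which also involves the information ratio):

```python
import math

def adaptive_learning_rate(past_values, scale=1.0):
    """AdaGrad-style data-dependent schedule: the learning rate
    scales with the inverse square root of the accumulated observed
    quantity (e.g. surrogate losses), so no prior knowledge of the
    time horizon or of the best cumulative loss is needed.
    Illustrative only; `past_values` and `scale` are hypothetical
    inputs for this sketch."""
    return scale / math.sqrt(1.0 + sum(past_values))
```

On "easy" instances where the accumulated quantity stays small, the learning rate stays large, which is exactly the on-the-fly adaptation the rebuttal describes.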

There is indeed a typo on line 94; it should be C_{\min}^{-1/3}. Thank you for spotting this.

The new adjustment to the optimistic posterior allows us to deal with the "Bellman error/underestimation error" and the "feel-good term/best-loss gap" in a unified manner. For the data-independent bound (Theorem 2), it is possible to directly bound the underestimation error as in Neu, Papini and Schwartz (2024), which would lead to the same scaling as obtained here (and allow us to drop the requirement that the learning rate be smaller than 1/2). For the instance-dependent version, it might be possible to get a similar bound without this adjustment to the optimistic posterior, but we don't know for sure. In any case, we find it quite insightful to frame everything this way, so that Theorem 1 directly relates the true regret to the surrogate regret.

The assumption of known sparsity can be relaxed somewhat. If there is a known upper bound on the sparsity, then the sparsity level can be replaced by this upper bound, and the regret bounds would then scale with this upper bound. If there is no known upper bound on the sparsity, the best regret bound that we are aware of is of order s\sqrt{dT} (Jin, Jang and Cesa-Bianchi, 2024). We can match this regret bound by making two modifications to SOIDS. First, the subset selection prior can be modified so that it has support for subsets of all sizes and does not require knowledge of s. This can be achieved by replacing the sum in the denominator by \sum_{k=1}^d 2^{-k} and verifying that Lemma 3 still holds. The only other place in which our \sqrt{sdT} regret bound requires knowledge of the sparsity is in the value of \lambda_t. If one chooses \lambda_t = 1/\sqrt{dT}, then we no longer require knowledge of s and get a regret bound of order s\sqrt{dT}. The situation for the (sT)^{2/3} regret bound is more complicated. If the subset selection prior is extended to have support for subsets of all sizes, then samples from the prior are no longer s-sparse with probability 1, and so the sparse optimal action property might not be satisfied. Since samples from the "full" prior are still sparse with high probability, one can perhaps adapt our bounds on the 3-information ratio so that when \lambda_t is set independently of s, one has a regret bound of order sT^{2/3}. The situation is the same for the Bayesian version of sparse IDS (Hao, Lattimore and Deng, 2021).
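The modified subset-selection prior described above can be sketched as follows (a hypothetical helper for illustration; the paper's actual prior also draws parameter values on the chosen support):

```python
import random

def sample_support(d, rng=None):
    """Sample a support set from a sparsity-agnostic prior that puts
    probability proportional to 2^{-k} on supports of size k, with
    normalizing constant sum_{k=1}^d 2^{-k}, then picks the support
    uniformly among size-k subsets. No knowledge of the true sparsity
    s is needed, yet small supports carry most of the prior mass."""
    rng = rng or random
    Z = sum(2.0 ** -k for k in range(1, d + 1))
    weights = [(2.0 ** -k) / Z for k in range(1, d + 1)]
    k = rng.choices(range(1, d + 1), weights=weights)[0]
    return sorted(rng.sample(range(d), k))
```

Because roughly half the prior mass sits on size-1 supports and the mass halves with each size, samples are sparse with high probability, matching the remark about the "full" prior.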

Comment

Thank you for your responses.

Taking the rest of the discussion into account, my evaluation remains on the positive side. However, I think it is important that the authors highlight in the revised version the shortcomings that have been raised so far: namely, the gap between your bound and the lower bound in terms of the dependence on s (or how to improve it, as claimed), and the sparse optimal action assumption and its implications. Also, I think it is important to detail rigorously in the paper how Theorem 3 implies a first-order regret bound, as you claimed in the rebuttal, or some interpretable data-dependent improvement; otherwise, the significance of that result would remain unsupported. Finally, the paper can also benefit from including the discussion you provided in your rebuttal regarding the case when s is unknown.

Review (Rating: 4)

This paper presents Sparse Optimistic Information Directed Sampling (SOIDS), an extension of the Optimistic Information Directed Sampling (OIDS) algorithm for sparse linear bandits. The key contribution is achieving optimal worst-case regret bounds in both the data-rich O(\sqrt{sdT}) and data-poor O(sT^{2/3}) regimes simultaneously, without Bayesian assumptions. The authors adapt the optimistic posterior framework to allow time-dependent learning rates and provide theoretical guarantees showing SOIDS is the first frequentist algorithm to achieve best-of-both-worlds performance in sparse linear bandits. The paper includes experimental validation demonstrating competitive performance against specialized algorithms in both regimes.

Strengths and Weaknesses

Strengths:

  • Solid Theoretical contribution: First frequentist algorithm achieving optimal regret in both data-rich and data-poor regimes for sparse linear bandits. Provides both worst-case bounds and instance-dependent guarantees.
  • Novel technical approach: Extends optimistic posterior framework to handle time-dependent learning rates, enabling adaptive algorithms.
  • Clear presentation: Paper is well-written with good intuition for the main ideas.

Weaknesses:

  • Limited experimental validation: Only basic synthetic experiments with small dimensions and specific parameter structures.
  • Restrictive assumptions: The assumption of known sparsity level s is restrictive in practice. The authors should discuss potential adaptations for unknown sparsity.

Questions

  1. In Theorem 2, the learning rate $\lambda_t = \min(1/2, \max(\lambda_t^{(2)}, \lambda_t^{(3)}))$ switches between two regimes. Could you provide more intuition about why this specific switching mechanism is necessary? How sensitive is the algorithm's performance to the exact switching point?
  2. Theorem 3 provides an instance-dependent bound for the data-rich regime. Is it possible to obtain a unified instance-dependent bound that smoothly interpolates between both regimes, as mentioned in Section 4.3?

Limitations

Yes.

Formatting Issues

No.

Author Response

The authors thank the reviewer for their insightful comments and questions.

Regarding known sparsity: this is a common assumption to make in this setting. The sparsity level could be replaced by a known upper bound on the sparsity, and the bounds would then scale with this upper bound. If there is no known upper bound on the sparsity, the best regret bound that we are aware of is of the order sdTs\sqrt{dT} (Jin, Jang and Cesa-Bianchi, 2024). We can match this regret bound by making two modifications to SOIDS. First, the subset selection prior can be modified such that it has support for subsets of all sizes and does not require knowledge of ss. One can simply replace the sum in the denominator by k=1d2k\sum_{k=1}^d2^{-k} and verify that Lemma 3 still holds. The only other place in which our sdT\sqrt{sdT} regret bound requires knowledge of the sparsity is in the value of λt\lambda_t. If one chooses λt=1/dT\lambda_t = 1/\sqrt{dT}, then we no longer require knowledge of ss, and get a regret bound of the order sdTs\sqrt{dT}. The situation for the (sT)2/3(sT)^{2/3} regret bound is more complicated. If the subset selection prior is extended to have support for subsets of all sizes, then samples from the prior are no longer ss-sparse with probability 1, and so the sparse optimal action property might not be satisfied. Since samples from the "full" prior are still sparse with high probability, one can perhaps adapt our bounds on the 3-information ratio such that when λt\lambda_t is set independently of ss, one has a regret bound of the order sT2/3sT^{2/3}. The situation is the same for the Bayesian version of sparse IDS (Hao, Lattimore and Deng, 2021).

Regarding the specific choice of the switching mechanism, the basic idea can be understood intuitively as follows. The optimal learning rate for the data-rich regime ($T \gg \mathrm{poly}(\frac{d}{s})$) is $\lambda_t^{(2)}$, and the optimal learning rate for the data-poor regime is $\lambda_t^{(3)}$. Therefore, for sufficiently large rounds $t$ we should use $\lambda_t^{(2)}$, and for all other rounds we should use $\lambda_t^{(3)}$. Using the expressions for $\lambda_t^{(2)}$ and $\lambda_t^{(3)}$ in Theorem 2, the inequality $\lambda_t^{(2)} \geq \lambda_t^{(3)}$ can be rearranged to obtain (roughly) $t \geq d^3/s$. This means that if we take $\lambda_t = \max(\lambda_t^{(2)}, \lambda_t^{(3)})$, we will be using the correct learning rate in each regime.
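The claimed crossover point can be checked numerically. This is only a sketch: the constants from Theorem 2 are dropped, so $\lambda^{(2)}(t) = (sdt)^{-1/2}$ and $\lambda^{(3)}(t) = (st)^{-2/3}$ below are merely proportional to the actual learning rates; with the constants set to 1, the two curves cross exactly at $t = d^3/s$.

```python
def lam2(t, s, d):
    # learning rate optimizing the 2-information-ratio term, up to constants
    return 1.0 / (s * d * t) ** 0.5

def lam3(t, s, d):
    # learning rate optimizing the 3-information-ratio term, up to constants
    return 1.0 / (s * t) ** (2.0 / 3.0)

s, d = 4, 20
t_star = d ** 3 / s  # predicted crossover t = d^3 / s = 2000

# before the crossover the data-poor rate lam3 dominates, after it lam2 does
assert lam3(t_star / 10, s, d) > lam2(t_star / 10, s, d)
assert lam2(t_star * 10, s, d) > lam3(t_star * 10, s, d)
```

Hence $\max(\lambda_t^{(2)}, \lambda_t^{(3)})$ equals $\lambda_t^{(3)}$ for $t \lesssim d^3/s$ and $\lambda_t^{(2)}$ afterwards, matching the intuition above.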

For a more technical explanation, the following might help. The idea is that we want our regret bound to adapt to each possible regime. In the data-poor regime, we want to obtain the rate $R_T^{(3)} \propto (sT)^{\frac{2}{3}}$, and in the data-rich regime, we want to match the $R_T^{(2)} \propto \sqrt{sdT}$ rate. The bound obtained at the beginning of the analysis of Theorem 2 (Equation (12), line 258) scales with both the 3-information ratio and the 2-information ratio. The tricky part is now to choose the learning rate. If we choose the learning rate $\lambda^{(2)} \propto \frac{1}{\sqrt{sdT}}$ to optimize the term with the 2-information ratio, we obtain the data-rich rate $R_T^{(2)} \propto \sqrt{sdT}$. If we choose the learning rate $\lambda^{(3)} \propto \left(\frac{1}{sT}\right)^{\frac{2}{3}}$ to optimize the term with the 3-information ratio, we obtain the data-poor rate $R_T^{(3)} \propto (sT)^{\frac{2}{3}}$. The relevant question is now whether there is a way to choose the learning rate to get both rates at the same time. The essential observation is that the larger learning rate always matches the better regret rate. More specifically, we have that $\lambda^{(2)} \lessapprox \lambda^{(3)}$ if and only if $R_T^{(2)} \gtrapprox R_T^{(3)}$. This can be seen by comparing the ratios of the learning rates and the regret rates. As a result, to obtain the better regret rate, we should always pick the larger of $\lambda^{(2)}$ and $\lambda^{(3)}$, which is done by taking the maximum. When a time-varying learning rate is used, the story is similar, with a slightly more complicated analysis of the regret incurred from having used the wrong learning rate in the past.
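The observation that the larger learning rate always matches the better regret rate can also be verified numerically. Again this is a sketch with constants dropped, under which $\lambda^{(2)} = 1/R_T^{(2)}$ and $\lambda^{(3)} = 1/R_T^{(3)}$, which makes the equivalence transparent.

```python
def rates(s, d, T):
    # learning rates and worst-case regret rates, up to constants
    lam2 = (s * d * T) ** -0.5        # optimizes the 2-information-ratio term
    lam3 = (s * T) ** (-2.0 / 3.0)    # optimizes the 3-information-ratio term
    R2 = (s * d * T) ** 0.5           # data-rich rate sqrt(sdT)
    R3 = (s * T) ** (2.0 / 3.0)       # data-poor rate (sT)^{2/3}
    return lam2, lam3, R2, R3

# the larger learning rate always corresponds to the smaller regret bound,
# so taking the maximum of the two learning rates tracks the better rate
for s, d, T in [(2, 50, 10**3), (2, 50, 10**7), (5, 10, 10**4), (1, 100, 100)]:
    lam2, lam3, R2, R3 = rates(s, d, T)
    assert (lam2 >= lam3) == (R2 <= R3)
```

In words: whenever the data-poor bound $(sT)^{2/3}$ is smaller, the data-poor learning rate is larger, and vice versa, so $\lambda_t = \max(\lambda_t^{(2)}, \lambda_t^{(3)})$ automatically selects the rate corresponding to the better bound.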

The question of a unified instance-dependent bound is extremely interesting. The analysis of Theorem 2 relied on the fact that the learning rates only switch once, starting with the learning rate $\lambda_t^{(3)}$ and then at some point switching to $\lambda_t^{(2)}$. Such behavior cannot be guaranteed when $\lambda_t^{(2)}$ and $\lambda_t^{(3)}$ are data-dependent, because the information ratios might behave differently from their respective upper bounds. It turns out that this switching argument is not necessary and can be replaced by considering the last time that either of the two learning rates is used. Using this approach, one can simplify the proof of Theorem 2 and prove a unified instance-dependent bound that smoothly interpolates between the two regimes. We discovered this technique shortly after submission and will include it in the final version.

Comment

Thank you for your responses. My evaluation remains positive, but I strongly agree with Reviewer swR4 that the revision should clearly address the gap between the upper and lower bounds in terms of the dependence on $s$.

Final Decision

This work extends the OIDS algorithm to the sparse linear bandit setting, achieving a dimension-dependent regret of $O(\sqrt{dsT})$ and a dimension-independent regret of $O((sT)^{2/3})$ simultaneously. This is the first work to do so in a non-Bayesian setting.

All reviewers agree that achieving the regret bounds simultaneously for the data-rich and data-poor setting is a great contribution. Further, reviewers were happy with the extension of the optimistic posterior framework to handle time-dependent learning rates.

Multiple reviewers had expressed concerns about the scaling of ss in the dimension independent regret, the novelty of the technical contributions and the requirement that ss is known in advance by the algorithm. After a careful rebuttal these concerns have been addressed and I am happy to recommend the paper for acceptance. I would urge the authors to implement the updated regret bound sketched out in the rebuttal, together with a more careful discussion about the novelty of the technical contributions.