PaperHub
5.5/10
Poster · 4 reviewers (min 3, max 7, std. dev. 1.7)
Scores: 7, 5, 7, 3
Confidence: 4.0
Correctness: 2.5
Contribution: 2.5
Presentation: 3.0
NeurIPS 2024

Improved Regret of Linear Ensemble Sampling

OpenReview · PDF
Submitted: 2024-05-14 · Updated: 2025-01-13
TL;DR

We prove a $O(d^{3/2}\sqrt{T})$ frequentist regret bound for linear ensemble sampling.

Abstract

In this work, we close the fundamental gap of theory and practice by providing an improved regret bound for linear ensemble sampling. We prove that with an ensemble size logarithmic in $T$, linear ensemble sampling can achieve a frequentist regret bound of $\tilde{\mathcal{O}}(d^{3/2}\sqrt{T})$, matching state-of-the-art results for randomized linear bandit algorithms, where $d$ and $T$ are the dimension of the parameter and the time horizon respectively. Our approach introduces a general regret analysis framework for linear bandit algorithms. Additionally, we reveal a significant relationship between linear ensemble sampling and Linear Perturbed-History Exploration (LinPHE), showing that LinPHE is a special case of linear ensemble sampling when the ensemble size equals $T$. This insight allows us to derive a new regret bound of $\tilde{\mathcal{O}}(d^{3/2}\sqrt{T})$ for LinPHE, independent of the number of arms. Our contributions advance the theoretical foundation of ensemble sampling, bringing its regret bounds in line with the best known bounds for other randomized exploration algorithms.
Keywords
Linear Bandit · Ensemble Sampling

Reviews & Discussion

Review (Score: 7)

This paper proposes a simple ensemble sampling (ES) approach and its frequentist analysis for the linear bandit problem. Specifically, it shows that the proposed algorithm achieves $\tilde{O}(d^{3/2}\sqrt{T})$ regret when the ensemble size $m$ is $\Omega(K \log T)$. This regret bound improves the dependency on $d$ and the ensemble size compared to existing regret bounds for ES. In the regret analysis, it proposes a framework that can comprehensively handle various algorithms and a novel analytical technique to ensure the optimism of ES. It also shows that linear perturbed-history exploration (LinPHE) is a special case of ES.
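For readers less familiar with the setup, the procedure summarized above can be sketched roughly as follows (our own illustrative implementation, not the paper's exact Algorithm 1: the perturbation scheme is simplified to a Gaussian prior term plus fresh Gaussian reward noise per member, and `reward_fn` is a hypothetical environment callback):

```python
import numpy as np

def linear_ensemble_sampling(arms, T, m, reward_fn, lam=1.0, sigma=1.0, rng=None):
    """Illustrative sketch of linear ensemble sampling.

    arms: (K, d) array of fixed arm feature vectors.
    Each of the m ensemble members maintains a regularized least-squares
    estimator on its own perturbed copy of the data; at each round an index
    J_t is drawn uniformly and the greedy arm of that member is played.
    """
    rng = np.random.default_rng(rng)
    K, d = arms.shape
    V = lam * np.eye(d)                       # shared Gram matrix
    # each member starts from its own Gaussian perturbation of the prior
    b = sigma * np.sqrt(lam) * rng.standard_normal((m, d))
    rewards = []
    for t in range(T):
        theta = np.linalg.solve(V, b.T).T     # (m, d) ensemble estimators
        j = rng.integers(m)                   # J_t ~ Uniform{0, ..., m-1}
        x = arms[int(np.argmax(arms @ theta[j]))]  # greedy arm of member J_t
        y = reward_fn(x)
        V = V + np.outer(x, x)
        # every member absorbs the reward plus its own fresh perturbation
        b = b + x[None, :] * (y + sigma * rng.standard_normal((m, 1)))
        rewards.append(y)
    return np.array(rewards)
```

With $m = \Omega(K \log T)$ members, the paper proves this style of algorithm attains $\tilde{O}(d^{3/2}\sqrt{T})$ regret; the sketch only conveys the mechanics, not the exact perturbation law or constants.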

Strengths

  • This paper proposes a simple ES, demonstrating that the proposed algorithm achieves $\tilde{O}(d^{3/2}\sqrt{T})$ regret when the ensemble size $m$ is $\Omega(K \log T)$.
    • This regret bound improves the dependency on $d$ compared to existing work.
    • Furthermore, this bound matches the regret bound achieved by Thompson sampling.
  • This paper proposes a new analytical framework that can uniformly handle algorithms like ES, LinPHE, LinUCB, and LinTS (Theorem 2).
  • This paper introduces a novel theoretical analysis to ensure the optimism of ES (Lemma 2).
  • This paper shows for the first time the theoretical connection that LinPHE is a special case of ES.
  • This paper provides a detailed and comprehensive comparison with existing work.

Weaknesses

  • The theoretical analysis in this paper does not derive the regret upper bound for ensemble sampling in linear contextual bandits.
    • Additionally, the reason for this is not discussed in detail.

Questions

While I did not verify all the proofs in detail, I believe this paper is deserving of acceptance.

Questions

  • In my understanding, applying the theoretical analysis in this paper to linear contextual bandits is challenging due to the difficulty in extending Lemma 2, making it hard to derive the regret bound for ES. Is my understanding correct?

Minor comments

  • Algorithm 2: $Z_{t.1}$ -> $Z_{t,1}$

Limitations

The authors discuss the limitations of this paper in Appendix K.

Author Response

We appreciate your positive feedback and particularly recognizing the theoretical value of our work. We are more than happy to address any questions.

Linear Contextual Bandits: Please note that none of the existing linear ensemble sampling literature [15, 20, 8] has discussed linear contextual bandits with a changing context in each round, if that is what you mean. Additionally, many linear bandit papers, such as [1, 3, 11], do not separately address the contextual case, even though their analyses are applicable to contextual cases. Therefore, we first addressed the problem setting that the previous ensemble sampling literature has focused on — the linear bandit. Your comment about Lemma 2 is correct, as it would not be directly applicable when the arm set, specifically the optimal arm, is given adversarially. Techniques in our work are developed to derive a sharper regret bound for linear bandits. It would be an interesting future direction to derive regret for the contextual setting. We will include a discussion on this. Thank you for pointing it out.

Comment

Dear Reviewer vWv8,

We again thank you for recognizing the value of our work. To put our contributions in perspective, let's compare the face value of our results with those in the relevant literature:

| Paper | Regret Bound | Ensemble Size | Publication |
| --- | --- | --- | --- |
| Lu and Van Roy [15] | Invalid | Invalid | NeurIPS 2017 |
| Phan et al. [A] | Invalid (Lemma 13) | Invalid | NeurIPS 2019 |
| Qin et al. [20] | $O(\sqrt{dT\log K})$ Bayesian regret | $O(KT)$ | NeurIPS 2022 |
| Janz et al. [8] | $O(d^{5/2}\sqrt{T})$ Frequentist regret | $O(d\log T)$ | - |
| Our work | $O(d^{3/2}\sqrt{T})$ Frequentist regret | $O(K\log T)$ | Currently under review at NeurIPS 2024 |
(Phan et al. [A] contains an extended result on ensemble sampling in their appendix using the results from Lu and Van Roy [15], but since the latter is incorrect, so is the former).

Even when considering just the face value of the results presented in each paper, overlooking our work would discourage further progress in the field of ensemble sampling.

Key Points:

  • Prior to our work, none of the previous studies in ensemble sampling (Lu and Van Roy [15], Phan et al. [A], Qin et al. [20], or Janz et al. [8]) had achieved the $O(d^{3/2}\sqrt{T})$ frequentist regret bound with sublinear $T$ dependence on ensemble size.

  • Simply plugging an ensemble size of $O(K \log T)$ into previous results (e.g., instead of $O(d \log T)$ as in Janz et al. [8]) does not trivially achieve the $O(d^{3/2}\sqrt{T})$ frequentist regret bound. In fact, regardless of how the ensemble size is chosen (whether $O(d \log T)$, $O(K \log T)$, or even $O(KT)$) and what algorithmic tweaks are applied (such as symmetrization in Janz et al. [8]), our $O(d^{3/2}\sqrt{T})$ bound remains the sharpest frequentist regret bound.

  • This is the key contribution of our work: no prior analysis in ensemble sampling had successfully achieved the $O(d^{3/2}\sqrt{T})$ frequentist regret bound (with ensemble size sublinear in $T$). We presented and proved a novel method to reach this bound for the first time.

Even at face value, our results clearly stand out compared to the existing literature. We sincerely and kindly ask you to reconsider these points when evaluating our submission.

Beyond the face value of our primary results, which we believe are already significant, we also introduce a general regret analysis framework (Theorem 2) for linear bandit algorithms. This framework not only generalizes the regret analysis for randomized algorithms like ensemble sampling but also applies to other optimism-based deterministic algorithms. This could be of significant interest beyond ensemble sampling, which we would like to bring your attention to.

Considering the significant value of these contributions, we strongly believe in the value of our work and its potential impact on future research. Thank you for your support. If you have any questions, please let us know.

Sincerely,
Authors


Reference:

  • Phan et al. [A]: Phan, M., Abbasi Yadkori, Y., & Domke, J. (2019). Thompson sampling and approximate inference. Advances in Neural Information Processing Systems, 32.

Comment

I appreciate the authors' responses.

I maintain my opinion that the paper deserves to be accepted because it makes a significant theoretical contribution, but after reading the other reviews, I believe that explicitly adding the following perspectives will help a broader readership understand this topic.

Comparison to Janz et al. [8]

  • While this paper assumes a finite number of arms, Janz et al. [8] can handle an infinite number of arms.
  • When $d < K$, the upper bound on the required ensemble size obtained in Janz et al. [8] can be smaller than that in this paper.

Comparison to LinTS

  • While this paper assumes a finite number of arms, LinTS (e.g., Abeille and Lazaric [2]) can handle an infinite number of arms.
  • When focusing on $d$ and $T$, the regret bound of LinTS is $O(d^{3/2}\sqrt{T \log d}\log T)$, while the regret bound shown in this paper is $O((d \log T)^{3/2}\sqrt{T})$.
    • Since this is a minor difference, I believe that there is no problem in claiming that the regret bound obtained in this paper "matches" that of LinTS.
  • Reasons for the above differences:
    • In my understanding, the difference is caused by the decoupling technique (a novel technical tool by the authors) discussed in lines 222-230.
Comment

Thank you very much for your continued support! We will definitely add the discussion as suggested in the revision. We appreciate you recognizing the contributions of our work.

Review (Score: 5)

The paper explores ensemble sampling for linear bandit problems and enhances the existing regret bounds by a factor of $d$, aligning the scaling with respect to $d$ with that of Thompson sampling algorithms. The algorithm is somewhat simpler than the existing work [8], permitting any policy for selecting the estimator and not requiring the symmetry needed in [8].

Strengths

The paper improves the existing regret bounds for ensemble sampling by a factor of $d$.

The analytical details are effectively presented, supplemented by several insightful remarks.

Weaknesses

The scaling of the ensemble size with $K$ presents some issues. Utilizing a discretization argument over a continuous domain with a fine grid, where the distance between points is $O(1/\sqrt{T})$ to preserve the regret bound, results in $K = O(T^{d/2})$ arms. This is polynomial in $T$ and exponential in $d$. Assuming a finite, small number of arms could be crucial for achieving the improved results. Therefore, it might not be fair to directly compare these results with those in [8], which seems applicable to continuous domains with many arms.
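For concreteness, the arithmetic behind this estimate is the standard covering-number bound (a sketch, with $c$ an absolute constant depending on the domain): an $\epsilon$-net of a bounded $d$-dimensional arm set has $K = (c/\epsilon)^d$ points; preserving the $\tilde{O}(\sqrt{T})$ regret rate under discretization forces $\epsilon = \Theta(1/\sqrt{T})$, hence $K = O(T^{d/2})$, and the ensemble-size requirement $\Omega(K \log T)$ would then read $\Omega(T^{d/2} \log T)$.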

The abstract is challenging to read and understand because terms like 'K' and 'LinPHE' are used without prior introduction.

Questions

Could the authors please comment on the scaling of the ensemble size with $K$? Is it appropriate to compare your results with those presented in [8]? Is a small $K$ crucial for obtaining the improved results? Additionally, what implications would there be in continuous domains, or in general where $K$ is large?

Limitations

Yes, some limitations are discussed in Appendix K.

Author Response

We appreciate your overall positive feedback and particularly recognizing the theoretical value of our work. We sincerely hope you take our response into account in reassessing our work.


Scaling of $K$

We respectfully believe there is a misunderstanding in your comment. We are more than happy to clarify any confusion. We emphasize that our regret bound does NOT depend on the number of arms $K$ at all, even logarithmically. More crucially, our regret does not have any dependence on the ensemble size $m$ at all (not even logarithmic). We do not assume the number of arms to be small, as long as it is finite (which is only needed for the ensemble size requirement). Even for instances with $K = O(T^{d/2})$ arms, as you mention, our improved regret bound remains valid with the same order.

Assuming you understand the key statistical efficiency (regret guarantees) independent of $K$ or $m$, which is our main focus, let us compare our result with Janz et al. [8].

The ensemble size shown as sufficient by Janz et al. [8] was $\Omega(d \log T)$ to derive the unfavorable $\tilde{O}(d^{5/2} \sqrt{T})$. Their regret bound depends super-linearly on the ensemble size $m$ (see Remark 3), which is counter-intuitive. Sure, Janz et al. [8]'s proposed ensemble size does not scale with $K$ (but still scales with the dimensionality $d$, in contrast with ours). However, what does it get you? Only a very sub-optimal and counter-intuitive result, despite a more complicated algorithm with symmetrization. Furthermore, when $d$ is higher, the situation gets even worse, both for the ensemble size and especially for the regret bound, with a much worse super-linear dependence $O(d^{5/2})$.

On the other hand, we present a new way of treating ensembles with $\Omega(K \log T)$ (using novel analysis techniques) and achieve $\tilde{O}(d^{3/2} \sqrt{T})$, matching the worst-case regret of LinTS for the first time! We propose a simple algorithmic approach (with a new way of looking at the ensemble) and significantly novel analysis (compared to [8]) to improve the regret of ensemble sampling. Therefore, it is absolutely appropriate to compare our results with those in [8].

We believe we should welcome any new way of implementing ensemble sampling if it can provide improved regret guarantees.


Thank you for your feedback on the abstract. We will make the necessary edits in the final version.

Comment

Dear Reviewer TPGY,

To put our contributions in perspective, let's compare the face value of our results with those in the relevant literature:

| Paper | Regret Bound | Ensemble Size | Publication |
| --- | --- | --- | --- |
| Lu and Van Roy [15] | Invalid | Invalid | NeurIPS 2017 |
| Phan et al. [A] | Invalid (Lemma 13) | Invalid | NeurIPS 2019 |
| Qin et al. [20] | $O(\sqrt{dT\log K})$ Bayesian regret | $O(KT)$ | NeurIPS 2022 |
| Janz et al. [8] | $O(d^{5/2}\sqrt{T})$ Frequentist regret | $O(d\log T)$ | - |
| Our work | $O(d^{3/2}\sqrt{T})$ Frequentist regret | $O(K\log T)$ | Currently under review at NeurIPS 2024 |
(Phan et al. [A] contains an extended result on ensemble sampling in their appendix using the results from Lu and Van Roy [15], but since the latter is incorrect, so is the former).

Even when considering just the face value of the results presented in each paper, overlooking our work would discourage further progress in the field of ensemble sampling.

Key Points:

  • Prior to our work, none of the previous studies in ensemble sampling (Lu and Van Roy [15], Phan et al. [A], Qin et al. [20], or Janz et al. [8]) had achieved the $O(d^{3/2}\sqrt{T})$ frequentist regret bound with sublinear $T$ dependence on ensemble size.

  • Simply plugging an ensemble size of $O(K \log T)$ into previous results (e.g., instead of $O(d \log T)$ as in Janz et al. [8]) does not trivially achieve the $O(d^{3/2}\sqrt{T})$ frequentist regret bound. In fact, regardless of how the ensemble size is chosen (whether $O(d \log T)$, $O(K \log T)$, or even $O(KT)$) and what algorithmic tweaks are applied (such as symmetrization in Janz et al. [8]), our $O(d^{3/2}\sqrt{T})$ bound remains the sharpest frequentist regret bound.

  • This is the key contribution of our work: no prior analysis in ensemble sampling had successfully achieved the $O(d^{3/2}\sqrt{T})$ frequentist regret bound (with ensemble size sublinear in $T$). We presented and proved a novel method to reach this bound for the first time.

Even at face value, our results clearly stand out compared to the existing literature. We sincerely and kindly ask you to reconsider these points when evaluating our submission.

Beyond the face value of our primary results, which we believe are already significant, we also introduce a general regret analysis framework (Theorem 2) for linear bandit algorithms. This framework not only generalizes the regret analysis for randomized algorithms like ensemble sampling but also applies to other optimism-based deterministic algorithms. This could be of significant interest beyond ensemble sampling, although it seems to have been overlooked in the reviews.

Considering the significant value of these contributions, we strongly believe our work deserves more recognition than just a "Borderline accept." We are eager to clarify any points further. Please feel free to reach out with any questions.

Sincerely,
Authors


Reference:

  • Phan et al. [A]: Phan, M., Abbasi Yadkori, Y., & Domke, J. (2019). Thompson sampling and approximate inference. Advances in Neural Information Processing Systems, 32.

Review (Score: 7)

This paper proposes a neater version of linear ensemble sampling and streamlines the analysis of OFUL-inspired algorithms for linear bandits. The authors proved that this version of linear ensemble sampling has a high-probability regret upper bound of the same order as LinTS, i.e., $\tilde{O}(d^{3/2}\sqrt{T})$, so long as the ensemble size is at least linear in $K$ (the cardinality of the arm set) and logarithmic in $T$. Their analysis introduces a lightweight way for the community to deal with the dependency between the sequence of perturbations and the sequence of selected arms while preserving a sublinear ensemble size. This paper also tries to subsume linear perturbed-history exploration under the streamlined analysis framework.

Strengths

  1. The proposed algorithm is the first to achieve the same statistical performance as LinTS up to logarithmic factors.
  2. The proposed analysis framework in Theorem 2 identifies key components in OFUL-inspired algorithms, which is of broader interest to the community of online learning.
  3. The reformulations (of Algorithm 1) made in the analysis are insightful and easy to follow, especially the reindexed viewpoint of the sequence of selected arms in the proof of Claim 1.

Weaknesses

  1. The presentation in Section 6 is very confusing. As far as I can tell, Theorem 1 requires the ensemble size to be at least linear in $K$, and the justification of Claim 1 in the proof of Theorem 1 relies on $K \leq \infty$ at least superficially. However, the authors claim that Corollary 1, as a corollary of Theorem 1, can imply a regret bound for LinPHE "under the infinite arm setting" in Line 349 and even in the Abstract. It could well be that both Theorem 1 and Corollary 1 are correct, but the authors should further clarify the relationship between the two results. I am willing to reconsider my evaluation if the authors can provide a more detailed and satisfactory explanation.
  2. Though this paper is mathematically significant, it is not rigorously clear why should we prefer linear ensemble sampling over LinTS if we focus on the linear bandits setting (instead of more complex models like neural networks). For example, the authors mention that it might be intractable to compute the posterior of LinTS and Theorem 1 should be applicable to perturbations following "any symmetric non-degenerate subGaussian distribution"; then mathematically speaking, linear ensemble sampling should be preferred over LinTS only if there does exist a symmetric non-degenerate subGaussian distribution (serving as the perturbation in linear ensemble sampling) whose counterpart in LinTS has an intractable posterior. However, the authors do not provide any concrete examples or discussions on this point. I am willing to reconsider my evaluation if the authors can provide a more detailed and satisfactory explanation.
  3. Line 244: There seems to be a typo in the index range for $\eta_t$.
  4. The presentation between Line 527 and Line 528 is inconsistent with the definition of $N_{k,t}$. (Back to front)
  5. Minor typo in Line 240: "lienar" -> "linear".
  6. Minor typo between Line 480 and Line 481: $\beta_t$ -> $\beta_{t-1}$.
  7. Minor typo in the proof of Lemma 4: $\sqrt{d} + \sqrt{...}$ -> $\sqrt{d} + \sqrt{2} \cdot \sqrt{...}$.

Questions

  1. The ensemble size $m$ in Theorem 1 depends on $T$, which seemingly makes the algorithm inherently incompatible with doubling-trick-style techniques and thus less practical, right?
  2. See the second point in the Weaknesses section.

Limitations

See Weaknesses.

Author Response

We appreciate your overall positive feedback and particularly recognizing the theoretical value of our work. We strongly and sincerely believe that the first two points you mentioned as weaknesses are not weaknesses. We sincerely hope you take our response into account in reassessing our work.


Corollary 1 and Theorem 1

Although Theorem 1 shows that an ensemble size of $\mathcal{O}(K \log T)$ is sufficient to guarantee an $\mathcal{O}(\sqrt{T})$ regret bound, it does not state that an ensemble size smaller than $\mathcal{O}(K \log T)$ must fail. This is a sufficient condition, not a necessary one.

Please note that Corollary 1 derives a regret bound for LinPHE that is independent of the number of arms. What Corollary 1 highlights is that an ensemble of size $T$ with a round-robin selection rule is also sufficient to attain the same regret bound, following the same logic as Theorem 1. We strongly believe this is a novel finding.

The proof of Corollary 1 follows the analysis framework of Theorem 1 but skips the latter part, including the use of Claim 1. This adjustment makes it applicable to the infinite arm setting for LinPHE (or the arm-independent regret bound of LinPHE even for the finite arm setting). The latter part of the proof in Theorem 1, particularly Claim 1, focuses on decoupling the dependency between the sequence of selected arms and perturbations, which are already independent in the case of LinPHE.
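To illustrate the correspondence in question, here is a small numerical check (our own sketch with simplified notation: `X`, `Y` are an arbitrary played history, `Z[j, s]` is member `j`'s perturbation of round `s`, and the paper's prior-perturbation term is omitted): with ensemble size $m = T$ and round-robin selection $J_t = t$, the member consulted at round $t$ carries exactly a freshly, independently perturbed history, i.e., the LinPHE estimator.

```python
import numpy as np

d, T, lam = 3, 12, 1.0
rng = np.random.default_rng(0)
X = rng.standard_normal((T, d))   # the arms actually played (any history works)
Y = rng.standard_normal(T)        # the observed rewards
Z = rng.standard_normal((T, T))   # Z[j, s]: member j's perturbation of round s

V = lam * np.eye(d)               # regularized Gram matrix
b = np.zeros((T, d))              # incremental per-member perturbed statistics
for t in range(T):
    # Round t consults member J_t = t (round-robin). Its incrementally
    # maintained estimator ...
    theta_es = np.linalg.solve(V, b[t])
    # ... equals the batch LinPHE estimator that re-perturbs the whole
    # history, because member t's noise row was never consulted before and
    # is therefore exactly such a fresh, independent sample.
    theta_phe = np.linalg.solve(V, X[:t].T @ (Y[:t] + Z[t, :t]))
    assert np.allclose(theta_es, theta_phe)
    # play X[t], observe Y[t]; every member absorbs its own perturbed copy
    V = V + np.outer(X[t], X[t])
    b = b + X[t][None, :] * (Y[t] + Z[:, t])[:, None]
```

The decoupling argument (Claim 1) is only needed when a member is consulted more than once; in the round-robin view above it never is, which is why the LinPHE corollary sidesteps it.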

We refer to the discussion in Corollary 1, especially the last few sentences. Appendix G, where we present a concise proof of Corollary 1 and provide additional discussion, should also help resolve your concerns. We are more than happy to include a more detailed discussion about this point.


Ensemble Sampling and LinTS

First of all, thank you for acknowledging the mathematical significance of our work. Regarding your question on whether one should prefer ensemble sampling over LinTS, let's review what was known prior to our work, which we believe the reviewer is well aware of.

  • Lu and Van Roy [15] attempted to provide a regret analysis of ensemble sampling but admitted to an erroneous analysis, invalidating their results.
  • Qin et al. [20] analyzed the Bayesian regret, a weaker notion of regret than the frequentist regret analyzed in our work. Moreover, their result required the ensemble size to be linear in $T$, which is highly prohibitive in practice.
  • Janz et al. [8] showed that the ensemble size only scales logarithmically with $T$, an improvement over Qin et al. [20]. However, their regret bound is $\mathcal{O}(d^{5/2}\sqrt{T})$, with an additional $d$ dependence compared to linear Thompson Sampling [2, 3], and a super-linear dependence on the ensemble size $m$. This result is not only loose but also counter-intuitive (the larger the ensemble size, the worse the performance, raising the question of why to use ensemble sampling).

Prior to our work: Hence, if you had asked the same question prior to our work, the answer would have been: there is no theoretical evidence that ensemble sampling can perform on the same level as LinTS in the worst case, as the worst-case regret guarantees [8] were worse than those of LinTS. Furthermore, the previously proposed algorithm's complexity and the super-linear dependence on the ensemble size $m$ in the regret bound made the use of a larger ensemble size impractical.

Our work: Our work provides the first theoretical evidence that, even in the worst case, ensemble sampling can be a viable alternative to LinTS, matching the same regret bound of $\mathcal{O}(d^{3/2}\sqrt{T})$ (with no dependence on $m$ in the regret bound at all) for the first time.

Please note that, as mentioned in the paper, ensemble sampling has already been used in practice and exhibits competitive performance in complex settings [16, 17, 18, 19, 23, 24]. However, the theoretical study of its statistical behavior has been lagging behind, even in the linear bandit setting [15, 20, 8]. As the linear bandit problem serves as a foundational framework for many other problems, an improved analysis of linear ensemble sampling is absolutely necessary, particularly because existing studies have shown clearly inferior results compared to LinTS, either in the regret bounds or in the requirements for ensemble size.

Regarding the "intractable posterior", our statement refers to more "complex problems" (we did not mention LinTS is intractable in the linear model assumption). What is important is that TS in more general and complex problem settings is much more difficult to implement, whereas ensemble sampling is easily generalizable to more complex problems.


[W3-7] Minor Typos

We appreciate your thorough inspection of the manuscript. We will make edits incorporating your feedback.


Answer to Question: Doubling trick?

One can apply the doubling trick by restarting the algorithm with doubling periods, each time with a larger ensemble. The regret bound we obtained should still be valid when the doubling trick is employed, possibly with some larger constant factors, but the main order should remain the same. Of course, as you are well aware, the doubling trick has its own drawbacks, but these apply to the previous algorithm in Janz et al. [8] as well. This is not a particular issue with our algorithm.
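The restart scheme described above can be sketched as follows (our illustration; `make_agent` is a hypothetical factory standing in for Algorithm 1 instantiated with an ensemble sized for the given period, exposing only a `play()` method for one round):

```python
def run_with_doubling(make_agent, total_T):
    """Doubling-trick wrapper: restart the algorithm on periods of length
    1, 2, 4, ..., each restart with a fresh agent (and hence a fresh,
    larger ensemble, e.g. m = O(K log horizon)) built for that period."""
    t, period, restarts = 0, 1, 0
    while t < total_T:
        horizon = min(period, total_T - t)  # last period may be truncated
        agent = make_agent(horizon)         # fresh ensemble for this period
        for _ in range(horizon):
            agent.play()
        t += horizon
        period *= 2                         # doubling periods
        restarts += 1
    return restarts
```

Since each restart discards history, the per-period regret bounds sum to the same order overall, at the cost of larger constant factors.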

Comment

Dear Reviewer QJqu,

To put our contributions in perspective, let's compare the face value of our results with those in the relevant literature:

| Paper | Regret Bound | Ensemble Size | Publication |
| --- | --- | --- | --- |
| Lu and Van Roy [15] | Invalid | Invalid | NeurIPS 2017 |
| Phan et al. [A] | Invalid (Lemma 13) | Invalid | NeurIPS 2019 |
| Qin et al. [20] | $O(\sqrt{dT\log K})$ Bayesian regret | $O(KT)$ | NeurIPS 2022 |
| Janz et al. [8] | $O(d^{5/2}\sqrt{T})$ Frequentist regret | $O(d\log T)$ | - |
| Our work | $O(d^{3/2}\sqrt{T})$ Frequentist regret | $O(K\log T)$ | Currently under review at NeurIPS 2024 |
(Phan et al. [A] contains an extended result on ensemble sampling in their appendix using the results from Lu and Van Roy [15], but since the latter is incorrect, so is the former).

Even when considering just the face value of the results presented in each paper, overlooking our work would discourage further progress in the field of ensemble sampling.

Key Points:

  • Prior to our work, none of the previous studies in ensemble sampling (Lu and Van Roy [15], Phan et al. [A], Qin et al. [20], or Janz et al. [8]) had achieved the $O(d^{3/2}\sqrt{T})$ frequentist regret bound with sublinear $T$ dependence on ensemble size.

  • Simply plugging an ensemble size of $O(K \log T)$ into previous results (e.g., instead of $O(d \log T)$ as in Janz et al. [8]) does not trivially achieve the $O(d^{3/2}\sqrt{T})$ frequentist regret bound. In fact, regardless of how the ensemble size is chosen (whether $O(d \log T)$, $O(K \log T)$, or even $O(KT)$) and what algorithmic tweaks are applied (such as symmetrization in Janz et al. [8]), our $O(d^{3/2}\sqrt{T})$ bound remains the sharpest frequentist regret bound.

  • This is the key contribution of our work: no prior analysis in ensemble sampling had successfully achieved the $O(d^{3/2}\sqrt{T})$ frequentist regret bound (with ensemble size sublinear in $T$). We presented and proved a novel method to reach this bound for the first time.

Even at face value, our results clearly stand out compared to the existing literature. We sincerely and kindly ask you to reconsider these points when evaluating our submission.

Beyond the face value of our primary results, which we believe are already significant, we also introduce a general regret analysis framework (Theorem 2) for linear bandit algorithms. This framework not only generalizes the regret analysis for randomized algorithms like ensemble sampling but also applies to other optimism-based deterministic algorithms. This could be of significant interest beyond ensemble sampling, although it seems to have been overlooked in the reviews.

Considering the significant value of these contributions, we strongly believe our work deserves more recognition than just a "Borderline accept." We are eager to clarify any points further. Please feel free to reach out with any questions.

Sincerely,
Authors


Reference:

  • Phan et al. [A]: Phan, M., Abbasi Yadkori, Y., & Domke, J. (2019). Thompson sampling and approximate inference. Advances in Neural Information Processing Systems, 32.

Comment

Thank you for your nice rebuttal. The authors have sufficiently emphasized the significance of their work and addressed my mathematical concern on Corollary 1. Given the line of (precedent) theoretical works on ensemble sampling, and the technical contributions of this paper, I have raised my score to 7.

  • By the way, I kindly encourage the authors to delete the last column of the penultimate line of the markdown table in their rebuttal text, since pointing out that a certain cited paper, i.e., xxx et al., has been submitted to XXX 2024 does not quite conform to the double-blind principle.
Comment

Thank you very much for recognizing the value of our work! We have modified our comments.

Review (Score: 3)

An analysis of ensemble sampling in the linear setup with closed-form incremental updates and a finite action set.

Strengths

Provides a new analysis of linear ensemble sampling.

Weaknesses

  1. Lack of Practical Implications for Ensemble Sampling in Complex Settings:
  • The paper does not discuss how ensemble sampling can be effectively applied in complex, real-world scenarios. Practical considerations and potential challenges are not addressed, which limits the practical utility of the proposed approach.
  2. Over-Claims:
  • The authors claim to close the gap between theory and practice by providing an improved regret bound for linear ensemble sampling. However, they do not clearly define this gap. The paper lacks simulations or rigorous arguments to support the existence or closure of this gap. For instance, the paper does not present practical scenarios or empirical results that highlight the gap.
  3. Incorrect Claims: The authors incorrectly claim that their work matches the state-of-the-art results for randomized linear bandit algorithms. Specifically:
  • The best-known regret bound for exact Thompson Sampling (TS) in a finite-arm ($K$ actions) setting is $O(d \sqrt{T \log K \log T})$ [1].
  • The paper proves a regret bound for linear ensemble sampling of $O((d \log T)^{3/2} \sqrt{T})$.
  • This bound does not match the state-of-the-art results or even the exact TS bound.
  • Additionally, exact TS is not the state-of-the-art for randomized linear bandit algorithms in terms of regret bound.
  • The authors' claim that their result matches state-of-the-art randomized linear bandit algorithms is incorrect, as their bound does not even match that of exact TS.

[1] Agrawal, S., & Goyal, N. (2013, May). Thompson sampling for contextual bandits with linear payoffs. In International conference on machine learning (pp. 127-135). PMLR.

  4. Mischaracterization of Perturbed History Exploration (PHE):
  • In linear ensemble sampling, $J_t$ is sampled from a uniform distribution.
  • Perturbed history exploration can be viewed as deterministically choosing $T$ different ensemble models, each with an entirely independent set of perturbed noise, one by one. This observation is not new.
  • Perturbed history exploration is not a special case of the linear ensemble sampling algorithm, as one is a deterministic algorithm and the other is a randomized algorithm.
  • Therefore, it is not valid to use the result for LinPHE to claim that "linear ensemble sampling uses $\min(K \log T, T)$ ensembles, allowing the number of arms to be infinite".

Questions

  1. See weakness.

  2. Can you provide more in-depth justification of assumption you made in line 270 for the proof of theorem 1. Intuitively, it seems right by sampling these R.V. in the beginning whose randomness are from algorithm side. But it needs rigiours justification.

  • The random variables $\{W^j\}$ are sampled independently at the beginning.
  • $J_t$ and $Z_t^j$ are independent of the history before step $t$.
  • However, (1) $(X_s, Y_s)$ depend on $J_t$ for $s \ge t$,
  • and (2) $(X_s, Y_s)$ depend on $Z_t^j$ for $s > t$.
  3. Can you provide empirical evidence to support the scaling of the ensemble size? Does it really scale linearly with the size of the action set $K$? What is the scaling factor with the ambient dimension $d$? What theory-practice gap are you closing?

Limitations

Any implications for scalable posterior sampling in complex settings?

Author Response

We appreciate the time and effort you have invested in evaluating our work. However, we believe there are several fundamental misunderstandings in your review that we would like to address.

The theory-practice gap: Variants of ensemble sampling have already been shown to perform effectively in practice, such as in online recommendation [16,24,23] and deep reinforcement learning [17,18,19]. However, the theoretical understanding of ensemble sampling has been lagging behind, even for the linear bandit problem, as recognized by previous theoretical works.

  • Lu and Van Roy [15] attempted to provide a regret analysis of ensemble sampling but admitted to an erroneous analysis that invalidates their results (see [15]).
  • Qin et al. [20] analyzed the Bayesian regret, a weaker notion of regret than the frequentist regret analyzed in this work. More importantly, their result requires that the ensemble size be linear in $T$, which is highly prohibitive in practice.
  • Janz et al. [8] showed that the ensemble size only needs to scale logarithmically with $T$, an improvement over [20] on the requirement for the ensemble size. However, their regret bound is an unsatisfactory $\mathcal{O}(d^{5/2}\sqrt{T})$, with an additional $d$ dependence compared to linear Thompson Sampling [2,3] and a super-linear dependence on the ensemble size $m$ in the regret bound. That is, their result is not only loose but also very counterintuitive (the larger the ensemble size, the worse the performance, which questions the motivation of ensemble sampling).

To summarize, there was no frequentist (worst-case) regret bound matching that of comparable algorithms such as linear TS (or LinPHE) without the number of ensembles scaling linearly with $T$. Despite previous attempts, a significant gap remained. Our work aims to close this gap by demonstrating the first non-vacuous theoretical guarantee for linear ensemble sampling whose regret bound matches that of LinTS, with an ensemble size that is still logarithmic in $T$ and without any performance degradation as the ensemble size grows. We provide this tighter regret bound through a new analysis and with an even simpler algorithm than that of Janz et al. [8].


Correct Claim: It is crucial to acknowledge that both $\tilde{\mathcal{O}}(d^{3/2}\sqrt{T})$ and $\tilde{\mathcal{O}}(d\sqrt{T \log K})$ bounds are well-established in the finite-arm setting [3], and neither is inherently superior.

  • FACT: $\tilde{\mathcal{O}}(d^{3/2}\sqrt{T})$ is a valid best-known upper bound for LinTS [2,3], even in the finite-arm setting.

For instance, when $d=2$ and $K=8$, $\tilde{\mathcal{O}}(d^{3/2}\sqrt{T})$ is clearly smaller than $\tilde{\mathcal{O}}(d\sqrt{T \log K})$ for any $T$.
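To make the comparison concrete, here is a quick numerical sanity check (a sketch only; the natural log is assumed for $\log K$, and the common $\sqrt{T}$ factor and all constants are dropped, as in the discussion above):

```python
import math

def bound_d32(d: float) -> float:
    """K-independent bound: d^{3/2} (the common sqrt(T) factor is dropped)."""
    return d ** 1.5

def bound_dlogk(d: float, k: float) -> float:
    """K-dependent bound: d * sqrt(log K), natural log assumed."""
    return d * math.sqrt(math.log(k))

# d = 2, K = 8: the K-independent bound is the smaller of the two.
assert bound_d32(2) < bound_dlogk(2, 8)

# The same ordering holds for much larger K, where the gap only widens.
assert bound_d32(2) < bound_dlogk(2, 10000)

# Conversely, for large d and small K the K-dependent bound is smaller,
# which is why neither bound is inherently superior.
assert bound_d32(100) > bound_dlogk(100, 8)
```

This illustrates the point that which of the two bounds is tighter depends on the regime of $d$ and $K$.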

Experts in the bandit literature agree that $\mathcal{O}(d^{3/2}\sqrt{T})$ (for the $K$-independent bound; here "$K$-independent" indicates that the regret is independent of $K$, not implying an infinite-arm setting) is the best bound for TS-type (not UCB-type or hybrid-with-UCB) algorithms in linear bandits. Thus, we respectfully disagree with the assertion that our regret bound is worse than that of Thompson Sampling. We sincerely hope that the reviewer and we can agree on this well-accepted fact within the community. Our regret bound for linear ensemble sampling matches that of LinTS for the first time, marking a significant contribution to the field.


Relationship between ensemble sampling and LinPHE: Although previous studies of ensemble sampling have fixed the sampling distribution to be uniform, it turns out that a uniform distribution is not necessary.

  • FACT: The sampling distribution in ensemble sampling does not have to be uniform.

For instance, even if some models' probabilities are increased by a constant factor (and others' decreased), both our analyses and previous work remain valid, achieving the same order of regret bounds.

To our best knowledge, the relationship between perturbed history exploration (PHE) and ensemble sampling has not been rigorously discussed on either side. If you could provide any references revealing the relationship between ensemble sampling and LinPHE as argued in your review, we would appreciate it.

As mentioned earlier, uniform sampling is not a necessary characteristic of ensemble sampling. Therefore, we take a broader perspective of linear ensemble sampling (Algorithm 1) to include any sampling policy, rather than restricting it to a uniform distribution. This generalization reveals a significant connection between LinPHE and ensemble sampling, allowing us to use ensemble sampling analysis techniques to derive a new regret bound for LinPHE.
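For readers who want the algorithmic picture, the following is a minimal, hypothetical sketch of this generalized view (not the paper's exact Algorithm 1; the perturbation scales, regularization, and variable names are illustrative assumptions): each of $m$ models maintains a regularized least-squares estimate on rewards perturbed by its own Gaussian noise, and a sampling policy picks which model acts each round. A uniform policy gives ensemble sampling; picking a fresh model each round with $m = T$ gives LinPHE-style perturbed-history exploration.

```python
import numpy as np

class GeneralizedLinearEnsemble:
    """Sketch of linear ensemble sampling with an arbitrary index-selection policy.

    Each of the m models keeps its own perturbed least-squares solution
        theta_j = (X^T X + lam I)^{-1} X^T (y + z_j),
    where z_j are i.i.d. Gaussian perturbations private to model j.
    """

    def __init__(self, d, m, lam=1.0, noise_sd=1.0, rng=None):
        self.rng = rng or np.random.default_rng(0)
        self.lam, self.noise_sd, self.m = lam, noise_sd, m
        self.V = lam * np.eye(d)      # shared regularized Gram matrix
        self.b = np.zeros((m, d))     # per-model perturbed targets X^T (y + z_j)

    def select_arm(self, arms, j):
        """Act greedily with respect to model j's perturbed estimate."""
        theta_j = np.linalg.solve(self.V, self.b[j])
        return max(range(len(arms)), key=lambda i: arms[i] @ theta_j)

    def update(self, x, y):
        self.V += np.outer(x, x)
        z = self.rng.normal(0.0, self.noise_sd, size=self.m)  # fresh perturbations
        self.b += np.outer(y + z, x)

# Uniform index policy -> ensemble sampling; a fresh index per round with
# m = T would correspond to LinPHE-style perturbed-history exploration.
rng = np.random.default_rng(1)
es = GeneralizedLinearEnsemble(d=3, m=5, rng=rng)
arms = [np.eye(3)[i] for i in range(3)]
for t in range(20):
    j = rng.integers(es.m)               # uniform sampling policy
    a = es.select_arm(arms, j)
    es.update(arms[a], float(arms[a] @ np.array([1.0, 0.5, 0.0])))
```

The design point this sketch illustrates is that only the index-selection line encodes the sampling policy; everything else (shared Gram matrix, per-model perturbed targets) is common to ensemble sampling and LinPHE.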

Most importantly, our Corollary 1 does not "use the result from LinPHE to claim" anything. On the contrary, we apply our analysis framework for linear ensemble sampling to analyze LinPHE, deriving a new regret bound. With the generalized algorithm, Proposition 1 and Corollary 1 together imply that an ensemble size greater than $T$ is superfluous, as a round-robin selection of the ensemble models guarantees the same regret bound as Theorem 1.

This result opens up new possibilities for ensemble sampling to adopt alternate sampling policies, demonstrating that policies other than uniform sampling can also achieve strong guarantees. This expansion should not be viewed as a weakness of our work; rather, it should be recognized as a significant contribution. Even if the reviewer were to restrictively define ensemble sampling to be associated with uniform distribution only (contrary to the evidence above), the fact that we can apply our novel regret analysis framework to derive a new and unified analysis of both LinPHE and ensemble sampling (two "different randomized algorithm classes" as one wishes to believe) would be even more interesting. Regardless, this new result will be of independent interest to the community.

Comment

Dear Authors,

Thank you for your detailed rebuttal. While I appreciate your efforts to clarify your contributions, I still have concerns about some of the claims and comparisons made in your response.

  1. Comparison with Previous Ensemble Sampling (ES) Analysis:

Your justification for the dependency on $K$ and $d$ is not sufficiently rigorous. A proper comparison of complexity bounds must consider all dominant factors across different regimes. It's only valid to claim superiority in specific regimes of these factors.

For instance, consider the following scenarios:

a) For $d = 2$ and $K = 8$ (omitting constants and logarithmic terms):

  • Your analysis: ES with ensemble size 8 achieves a regret bound of $2^{3/2}\sqrt{T}$.
  • ES [Janz et al., 2023]: ES with ensemble size 2 achieves a regret bound of $2^{5/2}\sqrt{T}$.

b) For $d = 2$ and $K = 10000$:

  • Your analysis: ES with ensemble size 10000 achieves a regret bound of $2^{3/2}\sqrt{T}$.
  • ES [Janz et al., 2023]: ES with ensemble size 2 achieves a regret bound of $2^{5/2}\sqrt{T}$.

These comparisons highlight that it's misleading to claim your analysis leads to improved results compared to ES [Janz et al., 2023]. The improvement appears to be regime-dependent, and this nuance should be clearly stated.

  2. Ensemble Size Scaling: regarding the usefulness of the analysis

The implication that the ensemble size needs to scale linearly with $K$ is counter-intuitive and potentially makes the ES algorithm impractical for large action spaces. This requirement significantly limits the usefulness of ES in real-world scenarios with large $K$. Empirical evidence of how the ensemble size scales with $K$ in practice would be valuable to support your theoretical claims and demonstrate real-world applicability. From the author rebuttal PDF, it looks like the empirical scaling with $K$ is of a substantially lower order than $O(K)$. Given this empirical evidence, we might not say that the analysis in this work (leading to $\tilde{O}(K)$ scaling) justifies the usefulness of ES. More importantly, it is misleading to claim that the theory-practice gap is closed.

  3. Comparison with Exact TS:

To claim that your regret order matches exact TS, your analysis should match or improve upon the existing analysis of exact TS across all regimes. A more comprehensive result would be of the form with the same ensemble size configuration:

$\min(\tilde{\mathcal{O}}(d^{3/2}\sqrt{T}), \tilde{\mathcal{O}}(d\sqrt{T \log K}))$

This would demonstrate matching performance across different regimes of $d$ and $K$. Without such a result, the claim of matching exact TS performance is overstated and misleading.

  4. Practical Implications and Theory-Practice Gap:

While you mention practical applications of ES variants in online recommendation and deep reinforcement learning, your current analysis doesn't directly address how these theoretical improvements translate to practical benefits in these complex settings. To substantiate your claim of closing the theory-practice gap, it would be beneficial to provide:

a) Empirical results demonstrating the practical advantages of your approach over the existing analysis. Does your analysis provide a more accurate characterization of practical performance?
b) A clear explanation of how your theoretical improvements address specific challenges in real-world applications.

In conclusion, while your work provides interesting insights into ES, the claims of matching exact TS performance and improving upon previous ES analyses need more nuanced and regime-specific statements. I encourage you to:

  1. Refine your comparisons to accurately reflect the specific conditions under which your results hold and improve upon existing work, for both computation and regret considerations.
  2. Provide matching results between empirical evidence and theoretical claims, especially regarding the scaling of the ensemble size with $K$, in a future revision if you want to claim to close the theory-practice gap.
  3. Clarify how your work concretely addresses the theory-practice gap in ensemble sampling applications.

These additions would significantly strengthen your paper and provide a more comprehensive understanding of your contributions to the field.

Sincerely, Reviewer QBvS

Comment

Answer to Question: Line 270

First, we would like to clarify that line 270 only redefines the $\sigma$-algebra $\mathcal{F}_0^A$ to simplify the analysis; this does not alter the algorithm or the problem setting. Sampling all the perturbation values in advance is just an intuitive explanation of this measure, which need not actually take place in the execution of the algorithm. Furthermore, we have already rigorously justified the equivalence of the algorithm under a more complex modification in Appendix F, to which we refer in case the reader expects a full, rigorous justification. To briefly summarize, the algorithm remains equivalent since the perturbation values follow independent Gaussian distributions conditioned on an appropriately defined history, whether they are sampled at the beginning or later. We believe that the "sampling i.i.d. R.V.s at the beginning" viewpoint is well-known; for instance, see the description of the stack-of-rewards model on page 65 of [13]. Note that we do not include the randomness of $j_t$ in $\mathcal{F}_0^A$, so we find your point (1) irrelevant. Your point (2) also does not pose any problems, since the dependency on future interactions stays the same regardless of when $Z_t^j$ is sampled.

Comment

Motivation for Studying Ensemble Sampling

In the context of this discussion on linear function approximation, "exact Thompson Sampling" (TS) refers specifically to Linear Thompson Sampling (LinTS). Ensemble Sampling (ES) has emerged as a method to approximate TS, aiming to scale up exploration in complex environments where no conjugacy can be utilized by exact TS. The ultimate goal of scalable exploration is twofold: achieving bounded per-step computational complexity and sublinear regret in complex environments.

While there is already extensive research on linear contextual bandits, including comprehensive analyses of TS in both Bayesian and Frequentist settings, I acknowledge that it is valuable to study the theoretical aspects of ES with linear function approximation. It's worth noting that linear Thompson sampling already offers bounded per-step computation and near-optimal regret.

Current Status of ES Theory and Paths for Advancement

Given the motivation behind ES and existing analyses, to further advance the community's understanding and potentially design better algorithms, we should consider:

  1. Regret Analysis: Continue to refine either Bayesian or Frequentist regret bounds for ES.

  2. Ensemble Size Optimization: Investigate how the choice of ensemble size affects both computational complexity and regret bounds. This aspect is crucial for practical implementations and should not be overlooked in theoretical analyses.

  3. Comparative Studies: Conduct rigorous comparisons between ES and other methods (e.g., LinTS) across various regimes of problem parameters (such as dimension $d$ and number of arms $K$).

  4. Computational Complexity: Analyze the per-step computation time of ES in relation to problem parameters and ensemble size.

  5. Empirical Validation: Provide empirical evidence to support theoretical claims, especially regarding the scaling of ensemble size with problem parameters.

  6. Practical Implications: Clearly articulate how theoretical improvements in ES translate to benefits in real-world applications, particularly in complex environments where exact TS is computationally infeasible.

By addressing these aspects comprehensively, future research can provide a more holistic understanding of ES, potentially leading to algorithmic improvements that balance theoretical guarantees with practical applicability.

These relevant points again emphasize the previous concern about ensemble size scaling.

Comment

Dear Reviewer QBvS,

There is another serious issue in your argument (https://openreview.net/forum?id=6SSzMq3WTn&noteId=TYYoKwGxDU). You are quoting Qin et al., 2023's Bayesian regret bound to argue that it is sharper than our frequentist regret bound. This is highly concerning, as Bayesian regret and frequentist regret are two distinct measures and are not directly comparable. This is just simply WRONG.

We believe you understand that Bayesian regret is generally considered a weaker notion than frequentist regret. Our manuscript and rebuttal clearly state that the results in Qin et al., 2023 are in the Bayesian regret setting. Note that throughout our paper and discussions (even in the quotes on LinTS from the literature), the notion of regret we study is always the frequentist regret.

Sadly, this discrepancy suggests either a serious misunderstanding or a deliberate misrepresentation of our work, which is very troubling. Reviewer QBvS, we believe that you have experienced being an author in submitted papers. Many in our community have faced similar challenges, where reviews do not accurately reflect the value of their work. This is particularly problematic for theoretical papers, where significant effort and rigorous results can be easily overlooked due to ignorance or mistakes.

Lastly, we respectfully and sincerely urge you to reconsider your evaluation of our work and ensure that it is judged fairly based on its theoretical contributions and merits.

Comment

Problem-dependent Nature of Thompson Sampling

Thompson Sampling (TS) has gained popularity in the bandit community and beyond due to its simplicity and inherent ability to adapt to various problem setups. This adaptability is evident in both Bayesian and Frequentist analyses of TS.

Bayesian Perspective

Information Ratio and Adaptive Bounds

[Russo and Van Roy, 2016] introduced the concept of the information ratio, which demonstrates that without modifying the TS policy for each specific problem setup, the information ratio automatically adapts to different problem setups. This led to a class of problem-dependent bounds for TS in their work and inspired many follow-up studies, including tight bounds in the finite $K$-arm setting, the infinite-arm setting, and even the changing-action-set setting.

A notable extension of this approach is found in [Neu et al., 2022], which further develops the information-theoretic analysis of TS for contextual bandits in many problem setups.

Key Reference:

  • Russo and Van Roy, 2016. Information Theoretic Analysis of Thompson Sampling
  • Neu, G., et al. (2022). Lifting the Information Ratio: An Information-Theoretic Analysis of Thompson Sampling for Contextual Bandits.
  • Hamidi, N., & Bayati, M. (2023). The elliptical potential lemma for general distributions with an application to linear thompson sampling. Operations Research, 71(4), 1434-1439.

Frequentist Perspective

Linear Thompson Sampling (LinTS) Analysis

[Agrawal & Goyal, 2013] established a pioneering work in the analysis of frequentist regret for LinTS. Their comprehensive study covers both finite arm settings and infinite arm settings (compact action set).

The elegance of their analysis lies in its problem-dependent nature, featuring a term $\min(\sqrt{d}, \sqrt{\log K})$ throughout the proof. This leads to their Theorem 1, which provides adaptive bounds without modifying the LinTS algorithm for each specific problem setup.

Key Reference:

  • Agrawal, S., & Goyal, N. (2013, May). Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning (pp. 127-135). PMLR.

Implications

The problem-dependent nature of TS, as demonstrated in both Bayesian and Frequentist analyses, underscores its versatility and efficiency across various bandit problems. This adaptability contributes significantly to TS's popularity and effectiveness in real-world applications.

These facts emphasize my previous concern about the claim of matching the regret performance of TS.

Comment
  • “we present the following corollary in linear bandits, whose main regret $\tilde{O}(d^{3/2}\sqrt{T})$ is optimal for PS (posterior sampling) algorithms.” Kuang, N. L., Yin, M., Wang, M., Wang, Y. X., & Ma, Y. (2023). Posterior sampling with delayed feedback for reinforcement learning with linear function approximation. Advances in Neural Information Processing Systems, 36, 6782-6824. => Not $\min(\tilde{O}(d^{3/2}\sqrt{T}), \tilde{O}(d\sqrt{T \log K}))$, but only $\tilde{O}(d^{3/2}\sqrt{T})$

  • “We propose data-dependent perturbations … that allow EVILL to match the performance of Thompson-sampling-style parameter-perturbation methods.” “the regret of EVILL enjoys guarantees that parallel those available for Thompson sampling.” In the paper, their regret bound is $\tilde{O}(d^{3/2}\sqrt{T})$. Janz, D., Liu, S., Ayoub, A., & Szepesvári, C. (2024). Exploration via linearly perturbed loss minimisation. In International Conference on Artificial Intelligence and Statistics (pp. 721-729). PMLR. => Not $\min(\tilde{O}(d^{3/2}\sqrt{T}), \tilde{O}(d\sqrt{T \log K}))$, but only $\tilde{O}(d^{3/2}\sqrt{T})$

The list goes on. Note that all of the papers above are NeurIPS, AISTATS, ICLR publications or equivalent. As you can see, to claim that a regret bound matches that of LinTS, matching only one of the two terms (mostly $\tilde{O}(d^{3/2}\sqrt{T})$) is considered sufficient.

Against all these conventions, Reviewer QBvS’s refusal to admit such a convention is not only harsh but also appears to be a deliberate devaluation of our work. Many bandit researchers would agree with this standard, which makes this stance particularly concerning.

In fact, the $\tilde{O}(d^{3/2}\sqrt{T})$ regret bound serves as a gold standard for TS-based and randomized exploration algorithms, such that even Janz et al. [8] say the following:

“the regret scales with $d^{5/2}\sqrt{T}$ up to logarithmic-in-$T$ factors. The latter scaling is slightly worse than that obtained for Thompson sampling (cf. Theorem 17), where the regret scales with $d^{3/2}\sqrt{T}$, again, up to logarithmic-in-$T$ factors.” Janz, D., Litvak, A. E., & Szepesvári, C. (2023). Ensemble sampling for linear bandits: small ensembles suffice. arXiv preprint arXiv:2311.08376.

If $\tilde{O}(d^{3/2}\sqrt{T})$ is not a gold standard, why not compare with even tighter bounds, such as the minimax-optimal regret for linear bandits, $\tilde{O}(d\sqrt{T})$, as in Janz et al. [8]? Reviewer QBvS's logic and request are more than just demanding, becoming unnecessarily adversarial. Based on convention, we strongly believe that our result deserves credit. It is unreasonable to spend energy arguing about this widely accepted consensus. If the reviewer does not admit it, then they can take the face value: our result achieves $\tilde{O}(d^{3/2}\sqrt{T})$ for the first time for ensemble sampling. That is a fact.

We sincerely hope that the reviewer and we can come to some common ground (we really do), although such hope is diminishing.

Comment

On Matching Regret of LinTS

Another very concerning assertion is forced by Reviewer QBvS, one which we truly believe requires the wisdom of the community. Reviewer QBvS argues that in order to claim that our result matches the regret bound of LinTS [2, 3], we need to match both terms in $\min(\tilde{O}(d^{3/2}\sqrt{T}), \tilde{O}(d\sqrt{T \log K}))$ "across different regimes of $d$ and $K$", and states that without this, the claim is "overstated and misleading." Who sets such a rule?

(By the way, what is the "exact TS" that the reviewer keeps referring to anyway? We never used this term in our manuscript. We only want to match the regret of LinTS, particularly $\tilde{O}(d^{3/2}\sqrt{T})$, which was the goal to start with.)

Such an assertion is contrary to what is practiced in the literature. As many of us know, it is conventionally accepted that if you match one of the terms (not both), it is said that you have the same order and matching bound as LinTS.

Let us show some examples in the literature:

  • “this bound is of order $\tilde{O}(d^{3/2}\sqrt{T})$ and it entirely matches the result of Agrawal and Goyal [2012b].” Abeille, M., & Lazaric, A. (2017). Linear Thompson sampling revisited. In International Conference on Artificial Intelligence and Statistics (pp. 176-184). PMLR. => Not $\min(\tilde{O}(d^{3/2}\sqrt{T}), \tilde{O}(d\sqrt{T \log K}))$, but only $\tilde{O}(d^{3/2}\sqrt{T})$

  • “[regret] of GP-TS is $\tilde{O}(d^{3/2}\sqrt{T})$, which recovers the regret bounds of their linear, parametric analogues … Linear Thompson sampling (Agrawal and Goyal, 2013)” Chowdhury, S. R., & Gopalan, A. (2017). On kernelized multi-armed bandits. In International Conference on Machine Learning (pp. 844-853). PMLR. => Not $\min(\tilde{O}(d^{3/2}\sqrt{T}), \tilde{O}(d\sqrt{T \log K}))$, but only $\tilde{O}(d^{3/2}\sqrt{T})$

  • “Theorem 2 establishes $\tilde{O}(d^{3/2}\sqrt{T})$ worst-case regret, which matches the regret bounds of TS methods for linear contextual bandits [Agrawal and Goyal 2013, Abeille et al. 2017] up to logarithmic factor.” Oh, M. H., & Iyengar, G. (2019). Thompson sampling for multinomial logit contextual bandits. Advances in Neural Information Processing Systems, 32. => Not $\min(\tilde{O}(d^{3/2}\sqrt{T}), \tilde{O}(d\sqrt{T \log K}))$, but only $\tilde{O}(d^{3/2}\sqrt{T})$

  • “We note that the regret of Term I ($\tilde{O}(d^{3/2}\sqrt{T})$) has the same bound as that of Abeille et al. (2017)” Moradipari, A., Thrampoulidis, C., & Alizadeh, M. (2020). Stage-wise conservative linear bandits. Advances in Neural Information Processing Systems, 33, 11191-11201. => Not $\min(\tilde{O}(d^{3/2}\sqrt{T}), \tilde{O}(d\sqrt{T \log K}))$, but only $\tilde{O}(d^{3/2}\sqrt{T})$

  • “Agrawal and Goyal [2013b] also extended the analysis of multi-armed Thompson Sampling to the linear contextual setting and proved a regret bound of $\tilde{O}(d^{3/2}\sqrt{T})$ where $d$ is the dimension of the context vectors. In this paper, we develop the first variants of Batch Thompson Sampling that achieve the aforementioned regret bounds.” Karbasi, A., Mirrokni, V., & Shadravan, M. (2021). Parallelizing Thompson sampling. Advances in Neural Information Processing Systems, 34, 10535-10548. => Not $\min(\tilde{O}(d^{3/2}\sqrt{T}), \tilde{O}(d\sqrt{T \log K}))$, but only $\tilde{O}(d^{3/2}\sqrt{T})$

  • “Theorem 3.5 implies the regret of NeuralTS is on the order of $\tilde{O}(\tilde{d}\sqrt{T})$ (here there is an additional $\log K$ factor hidden in $\tilde{O}$, hence it is $\tilde{O}(d\sqrt{T \log K})$). This result matches the state-of-the-art regret bound in Chowdhury & Gopalan (2017); Agrawal & Goyal (2013); Zhou et al. (2019); Kveton et al. (2020).” Zhang, W., Zhou, D., Li, L., & Gu, Q. (2021). Neural Thompson sampling. ICLR 2021. => Not $\min(\tilde{O}(d^{3/2}\sqrt{T}), \tilde{O}(d\sqrt{T \log K}))$, but only $\tilde{O}(d\sqrt{T \log K})$

  • “The regret upper bound is also stipulated to be of the order $\tilde{O}(d^{3/2}(1+\sqrt{\sum_{t=1}^T \sigma_t^2}))$, considering $R$ as a constant. Notably, in scenarios where the variance is constant, our LinNATS algorithm recovers the regret guarantee of $\tilde{O}(d^{3/2}\sqrt{T})$ for linear TS (Agrawal and Goyal, 2013; Abeille and Lazaric, 2017)” Xu, R., Min, Y., & Wang, T. (2023). Noise-adaptive Thompson sampling for linear contextual bandits. Advances in Neural Information Processing Systems, 36. => Not $\min(\tilde{O}(d^{3/2}\sqrt{T}), \tilde{O}(d\sqrt{T \log K}))$, but only $\tilde{O}(d^{3/2}\sqrt{T})$

Comment

Dear Reviewer QBvS,

We appreciate your very quick (in about 2 hours) and enthusiastically prepared responses to our rebuttal. We are grateful for the opportunity to discuss our work with you. We sincerely hope we can communicate with rationality and reason, without unnecessarily hurting others.

We respectfully urge you to reconsider your evaluation of our work. Before accusing us of misleading or mischaracterizing our research, we sincerely ask that you reflect on whether you might be unintentionally (or intentionally, which would be a serious ethical issue) misinterpreting or misrepresenting our hard and proud work.

With all due respect, we fundamentally disagree with your comments on almost every point. However, we understand that there can be disagreements, and we sincerely hope you also understand this. Our goal is to find some common ground through constructive dialogue.

First, let us clarify some fundamental points. We have a few simple yet crucial questions:

  • Q1: Prior to our work, do you agree that none of the previous works on ensemble sampling (Lu and Van Roy [15], Qin et al. [20], or Janz et al. [8]) had achieved $O(d^{3/2}\sqrt{T})$ frequentist regret with an ensemble size sublinear in $T$?

  • The answer to Q1 should be Yes.

When we began this research, our goal was to prove that ensemble sampling could achieve $O(d^{3/2}\sqrt{T})$ with a sublinear dependence on $T$. Achieving this rate has long been considered an open problem. We achieved this goal with a new way of treating ensembles and a novel analysis technique, resulting in the tightest known frequentist regret bound for ensemble sampling.

  • Q2: Do you agree that by simply plugging in an ensemble size of $O(K \log T)$ (instead of $O(d \log T)$ as in Janz et al. [8]), the previous analysis does not trivially achieve $O(d^{3/2}\sqrt{T})$ regret?

  • The answer to Q2 should be Yes.

Since the previous analysis does not allow for a tighter regret, we devised novel techniques to derive a new and improved bound.

  • Q3: Setting aside how the ensemble size is chosen (whether $O(d \log T)$ or $O(K \log T)$) and what algorithmic tweaks are applied, do you agree that $O(d^{3/2}\sqrt{T})$ regret is an improved and currently the sharpest regret in terms of $d$ and $T$ (with no dependence on $K$)?

  • The answer to Q3 should be Yes.

All three questions (Q1-Q3) above are not rhetorical. They are simple binary questions. If your answers to all three questions are "Yes," we can further discuss our paper. Otherwise, we would be very concerned that we may not be on the same page at a fundamental level.

This is the key point of our contribution: prior to our work, no previous work on ensemble sampling had provided a successful analysis achieving $O(d^{3/2}\sqrt{T})$ regret. We presented and proved a new way of doing it.

We feel that there is mischaracterization in your comments in "1. Comparison with Previous Ensemble Sampling (ES) Analysis." We have been very clear about the bound on the ensemble size. Using your logic, one could argue the opposite by comparing different values of $d$ and $K$. However, such a comparison is NOT our main point. Our main goal is simple yet fundamental: whether one can get $O(d^{3/2}\sqrt{T})$ regret with a sublinear (even logarithmic) dependence on $T$ in the ensemble size. This had been an open question, and we aimed to close this well-defined gap! You are imposing your own definition or agenda of a "gap" (which is not well-defined, nor do we have any reason to commit to it) onto our goal. If your intention is not adversarial, please refrain from doing so.

FACT: Our analysis of ensemble sampling with an $O(K \log T)$ ensemble size achieves $O(d^{3/2}\sqrt{T})$ frequentist regret for the first time. Previous approaches, with any choice of ensemble size, did not achieve this. That is a fact.

Our resulting $O(d^{3/2}\sqrt{T})$ regret is smaller than $O(d^{5/2}\sqrt{T})$. If that is not an improved regret, then what is? Can Janz et al. [8] achieve $O(d^{3/2}\sqrt{T})$ regret with any modification to their analysis and algorithm? No! We do not understand your motivation for earnestly defending the result of Janz et al. [8] while adversarially discrediting our contribution by distorting the facts.


On Ensemble Size

We respectfully disagree with your argument here. Even your comment that "The implication that the ensemble size needs to scale linearly is counter-intuitive" is simply false. There is NO implication that the ensemble size “needs to scale linearly” because the ensemble size condition is a sufficient condition, not a necessary condition. We are sure you as an expert understand the difference. What you are asking for is for us to show that what is only proven to be sufficient is somehow empirically manifested as necessary. Your assertion is problematic and what you request of us as empirical evidence is unreasonable.

Comment

ES [Qin et al., 2023] achieves a regret bound of $\sqrt{dT \log K}$ for finite-action setups with $K$ arms, which matches the regret bound of Thompson Sampling in finite-action-set setups [Russo and Van Roy, 2016].

Quote from the remark below Theorem 1 in ES [Qin et al., 2023].

Comparison with the regret bound for TS The regret bound in Theorem 1 consists of two terms. The first term $(a)$ is exactly the regret bound achieved by TS with exact posterior [Russo and Van Roy, 2016]. Since the action set size is $K$, the entropy of the optimal action $\mathbb{H}\left(A_*\right) \leq \log K$, and as discussed in Russo and Van Roy [2016], when the prior is informative, $\mathbb{H}\left(A_*\right) \ll \log K$. Therefore, the first term $(a)$ is order optimal and improves upon the worst-case regret bound achieved by other algorithms, e.g., upper confidence bound type algorithms. On the other hand, the second term (b) is an incremental term and accounts for posterior mismatch. Note that as the ensemble size $M$ approaches infinity, the second term $(b)$ converges to zero and our regret bound for ES reduces to that for TS with exact posterior. Moreover, as long as the ensemble size $M$ reaches $K T / d$, our regret bound for ES has the same order as that for TS in terms of $T$ and $d$ (up to logarithmic factors).

However, it requires that $M$ reach $KT/d$, a requirement too large to be satisfiable in practice and therefore insufficient to justify the need for Ensemble Sampling.
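To make this concrete, here is a small numerical check (our own illustration; the values of $K$, $d$, and $T$ are chosen arbitrarily) of how the sufficient ensemble size $M \geq KT/d$ quoted above grows with the horizon, compared with an ensemble size of order $K\log T$:

```python
import math

K, d = 10, 5  # illustrative values only

# M >= K*T/d (the condition quoted from Qin et al., 2023) grows linearly in T,
# while an O(K log T) ensemble size grows only logarithmically.
for T in [1_000, 10_000, 100_000]:
    qin_size = K * T / d
    log_size = K * math.log(T)
    print(f"T={T:>7}: KT/d = {qin_size:>9.0f},  K log T = {log_size:5.1f}")
```

At $T = 100{,}000$ the first condition already demands an ensemble of $200{,}000$ models, while the logarithmic size stays around $115$, which is the point of the comment above.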

Author Response

Although our main contribution is providing a tighter regret bound for linear ensemble sampling, we have performed experiments as per Reviewer QBvS's request. The results are shown in the attachment.

We strongly believe that our work should be evaluated on its theoretical merit. Our improved regret bound for ensemble sampling matches LinTS for the first time, and our novel analysis should be sufficient for recognition. Furthermore, our regret analysis framework can be generalized to analyze other randomized algorithms such as LinPHE (and derive new regret bounds), which is always a plus, regardless of one's view on LinPHE and ensemble sampling.

Comment

We would like to encourage the reviewers to reconsider the following: Resolving and improving upon a long-standing and critical open theoretical problem, which has seen multiple prior attempts, is of immense value to both the NeurIPS community and the broader bandit research field.

For researchers working on ensemble sampling, achieving a $\tilde{O}(d^{3/2}\sqrt{T})$ frequentist regret bound has been a significant milestone, a holy grail. Our paper introduces a novel approach that successfully achieves this bound for the first time. Our proof techniques are innovative and differ significantly from traditional analyses of TS-based algorithms.

A well-written, rigorous paper that tackles this milestone and achieves the first-ever $\tilde{O}(d^{3/2}\sqrt{T})$ frequentist regret bound represents a substantial contribution to the field. The novel proof techniques and ideas presented in our work push the boundaries of existing knowledge, and we are proud of these contributions.

However, it seems that one of the reviewers (Reviewer QBvS) has taken an adversarial stance with arguments that appear unreasonable and potentially aimed at undermining our work. For instance, Reviewer QBvS's claim that Qin et al. (2023)'s Bayesian regret bound is sharper than our frequentist regret bound indicates a fundamental misunderstanding and unfortunately undermines credibility, as these two types of regret are not directly comparable. The reviewer's dismissal of the significance of achieving a $\tilde{O}(d^{3/2}\sqrt{T})$ frequentist regret bound itself calls into question their capacity and credibility to fairly evaluate our paper.

This, coupled with a series of inconsistent arguments throughout the review process, raises serious concerns about the fairness of our evaluation. Such actions are deeply troubling and threaten the integrity of the reviewing process that NeurIPS aims to uphold.

We respectfully ask the reviewers to consider these points carefully. We hope the review process will be conducted with fairness and integrity, giving due recognition to meaningful contributions.

A paper offering contributions of this magnitude deserves acceptance. We are confident that our work holds significant value for the NeurIPS community. We are also open to making minor revisions to the paper to further clarify its contributions and address any clarification that may have been misunderstood by the reviewers. If you have any questions or comments, please feel free to let us know.

Comment

Key Points from the Reviewer's Perspective

  1. Comprehensive Analysis Requirements

    • Analysis of Ensemble Sampling must consider both computation and regret. The issue of how the ensemble size is chosen cannot be set aside.
    • The problem-dependent nature of Thompson Sampling (LinTS) should be taken into account.
  2. Regret Metrics and Bounds

    • Bayesian regret is widely accepted in the analysis of randomized exploration algorithms.
    • Frequentist and Bayesian regrets are different metrics and should not be directly compared.
    • For finite action problems, a matching bound with LinTS in the finite action setting is necessary. This is already achieved by [Qin et al., 2023] if one sets aside how the ensemble size is chosen.
    • Problem-dependent bounds for Ensemble Sampling would be preferable.
  3. Limitations of the Current Work

    • Matching the infinite-action frequentist regret bound of LinTS in a finite-action setting, while setting aside how the ensemble size is chosen, is not sufficient to be considered a "holy grail" in the line of Ensemble Sampling research.
    • The work does not explain the usefulness of Ensemble Sampling with linear function approximation, particularly because the ensemble size scales linearly with $K$, which strongly contradicts empirical evidence.

Reviewer's Assessment and Recommendations

  1. Holistic Evaluation: The analysis should consider both computational aspects and regret bounds. Setting aside the computation issue (how ensemble size is chosen) is not sufficient for a comprehensive evaluation.

  2. Problem-Dependent Analysis: Encourage the development of problem-dependent bounds for Ensemble Sampling, similar to those existing for Thompson Sampling.

  3. Practical Implications: The theory should aim to explain the effectiveness and usefulness of the existing algorithm, especially when empirical evidence contradicts theoretical predictions about ensemble size scaling.

  4. Advancement of the Field: These assessments are intended to advance the community's understanding, not to be adversarial. The bandit community has already seen numerous bounds in linear function approximation and general function approximation.

  5. Criteria for Significant Contribution: Given the motivation behind Ensemble Sampling and the existing analyses of Ensemble Sampling under linear function approximation, a truly groundbreaking ("holy grail") theoretical work should either predict better algorithm design or provide a significantly better understanding of the effectiveness and usefulness of existing algorithms.

Conclusion

While the achievement of certain regret bounds is noteworthy, a fair assessment must consider the broader context of both theoretical advancements and practical implications, given the motivation behind Ensemble Sampling. The reviewer emphasizes the need for problem-dependent analysis, explanation of empirical observations, and consideration of computational aspects alongside regret bounds. These points are crucial for advancing the field of Ensemble Sampling and ensuring that theoretical work translates into practical improvements in algorithm design and performance, which is the motivation behind this line of research.

Comment

Reviewer QBvS's Incorrect Claim Regarding Regret Bounds

Reviewer QBvS continues to falsely suggest that $\tilde{O}(d^{3/2}\sqrt{T})$ is not a valid finite-arm frequentist regret bound for LinTS.

This is simply false. The more concerning issue is that the reviewer appears to be making this statement intentionally.

Despite our clear explanation and presentation of evidence showing that both $\tilde{O}(d^{3/2}\sqrt{T})$ and $\tilde{O}(d\sqrt{T\log K})$ are valid frequentist regret bounds for LinTS in finite-arm settings, the reviewer persists in making this incorrect claim. We have already provided ample evidence showing that this is widely accepted in the published literature.

Please note that the proof of the $\tilde{O}(d^{3/2}\sqrt{T})$ bound for LinTS in [3] does not require the number of arms in the problem setup to be infinite. The $\tilde{O}(d^{3/2}\sqrt{T})$ frequentist regret bound is valid for both infinite and finite arms. Reviewer QBvS's argument that $\tilde{O}(d^{3/2}\sqrt{T})$ is not a frequentist regret bound for finite arms is mathematically incorrect.


Reviewer QBvS's Incorrect Claim, and then Inconsistent Statements

Previously, Reviewer QBvS incorrectly claimed that the Bayesian regret bound of Qin et al. (2023) is sharper than our frequentist regret bound.

As we pointed out, these two measures are not comparable. Now, the reviewer appears to acknowledge that "Frequentist and Bayesian regrets are different metrics and should not be directly compared." This raises concerns about the consistency of the review. We are not sure whether this means that Reviewer QBvS acknowledges the earlier mistake of inappropriately comparing two incomparable regret measures, because we have seen that Reviewer QBvS cannot agree even on simple facts.


Reviewer QBvS's Mischaracterizing Our Results

Reviewer QBvS attempts to argue that the ensemble size $O(K\log{T})$ in our work is inherently inferior to the ensemble size $O(d\log{T})$ in Janz et al. [8], which is not only misleading but also unfair.

To argue against our result, the reviewer chooses parameter values like $d = 2$ and $K = 10000$. In all fairness, please consider the opposite: $d = 10000$ and $K = 2$. Even with much smaller values, like $d = 50$ and $K = 2$, the results differ significantly.

  • Our analysis: ES with ensemble size $2\log{T}$ achieves a regret bound of $50^{3/2}\sqrt{T} \approx 354\sqrt{T}$
  • Janz et al. (2023): ES with ensemble size $50\log{T}$ achieves a regret bound of $50^{5/2}\sqrt{T} \approx 17678\sqrt{T}$

Right, this scenario seems unfair to Janz et al. (2023), as we are sure everyone can see. But this is exactly what Reviewer QBvS did by showing only one side. So let us be fairer and consider $d = 50$ and $K = 50$, where $K\log{T} = d\log{T}$.

  • Our analysis: ES with ensemble size $50\log{T}$ achieves a regret bound of $50^{3/2}\sqrt{T} \approx 354\sqrt{T}$
  • Janz et al. (2023): ES with ensemble size $50\log{T}$ achieves a regret bound of $50^{5/2}\sqrt{T} \approx 17678\sqrt{T}$

What do you see?

FACT: For any combination of $d$ and $K$, our regret bound is always sharper than that of Janz et al. [8].

When it comes to the sufficient condition on the ensemble size, $O(K\log{T})$ and $O(d\log{T})$ are apples and oranges. Is one theoretically stronger than the other? No.

Please note, our aim in this example is not to show that our analysis requires a smaller ensemble, as that number varies with the problem setting. What we aim to highlight is how Reviewer QBvS selectively showed only one side of the comparison, knowingly distorting the facts and our main focus, which is sharpening the regret bound.
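The arithmetic in the examples above can be checked directly; a minimal sketch (keeping only the leading $d$-dependence of each regret bound and ignoring constants and logarithmic factors):

```python
import math

def our_prefactor(d):
    """Leading d-dependence of the bound in this work: d^{3/2}."""
    return d ** 1.5

def janz_prefactor(d):
    """Leading d-dependence of the bound in Janz et al. [8]: d^{5/2}."""
    return d ** 2.5

# The numbers quoted above: 50^{3/2} ~ 354 and 50^{5/2} ~ 17678.
assert round(our_prefactor(50)) == 354
assert round(janz_prefactor(50)) == 17678

# The two regret bounds differ only in d, so d^{3/2} < d^{5/2} for every
# d > 1, independent of K.
assert all(our_prefactor(d) < janz_prefactor(d) for d in range(2, 1000))

# The ensemble-size conditions K*log T and d*log T, by contrast, compare in
# either direction depending only on the sign of K - d.
T = 10_000
for d, K in [(2, 10_000), (50, 2), (50, 50)]:
    print(f"d={d:>5}, K={K:>5}: K log T = {K * math.log(T):>9.1f},"
          f" d log T = {d * math.log(T):>6.1f}")
```

This is only a sanity check of the numbers in the examples, not a substitute for the formal comparison of the bounds.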


We understand that the review process can sometimes be challenging and that authors may occasionally receive feedback that doesn't fully capture the true value of their work. However, this situation goes beyond mere noise, and we find it deeply concerning. Despite our best efforts to engage in a rational and constructive dialogue with Reviewer QBvS, we have been met with a series of false claims, and it seems these may be influencing other reviewers.

It seems there may be an underlying adversarial approach in Reviewer QBvS's feedback. We do not wish to argue endlessly with Reviewer QBvS, especially when it is unfortunately evident that there is no common understanding. Communicating with Reviewer QBvS has become extremely challenging (not because the questions themselves are challenging, but because our continued efforts to build common ground on simple facts have not succeeded), and we would prefer to cease further communication with Reviewer QBvS.

Comment

Dear Authors, ACs, SACs and PCs,

I appreciate the authors' passion for their work and their desire to have it fairly evaluated. However, I feel it's necessary to clarify some points and address potential misunderstandings:

  1. Theoretical vs. Practical Implications:

    While achieving a $\tilde{O}(d^{3/2}\sqrt{T})$ frequentist regret bound is a noteworthy theoretical result, we must be cautious about claiming that this closes the gap between theory and practice. The ensemble size scaling linearly with $K$ in your analysis still contradicts empirical evidence, as reported in your own simulations. This discrepancy needs to be addressed for a comprehensive understanding of Ensemble Sampling's practical utility.

  2. Specific Regimes of Improvement:

    It's crucial to clearly state the specific regimes where your result shows improvement. As emphasized in my previous comments, a fair comparison should consider various problem settings, including both finite and infinite action scenarios, as well as different relationships between the ambient dimension $d$ and the number of actions $K$. Presenting improvements in specific regimes would provide a more nuanced understanding of your contribution.

    The current result in this work is not sufficient to justify the title "Improved Regret of Linear Ensemble Sampling". The first sentence of the abstract is indeed misleading.

    "In this work, we close the fundamental gap of theory and practice by providing an improved regret bound for linear ensemble sampling."

  3. Problem-Dependent Nature of Thompson Sampling:

    The analysis of Thompson Sampling-type algorithms, including Ensemble Sampling, should carefully consider the problem-dependent nature of the setting, particularly the size of the action set. The regimes where $d \gg K$, $d \approx K$, and $d \ll K$ may lead to different insights and bounds. This good tradition in the analysis of Thompson Sampling algorithms should be carried over to Ensemble Sampling (which serves to approximate Thompson Sampling). A comprehensive analysis addressing these various regimes would strengthen the paper's contribution.

  4. Holistic Evaluation:

    A fair assessment of Ensemble Sampling research should consider both computational aspects and regret bounds. The issue of how ensemble size is chosen is integral to the algorithm's practicality and shouldn't be set aside in the analysis.

  5. Advancement of the Field:

    Given the existing body of work on linear function approximation in bandits, a truly groundbreaking contribution should either predict better algorithm design or provide significantly better understanding of the effectiveness of existing algorithms across various regimes.

  6. Consistency in Evaluation:

    If I were the reviewer of [Janz et al., 2023], I would give the same suggestions and assessment. My critiques are not specific to your work but reflect my consistent standards for evaluating research in this field.

I hope these clarifications help in understanding my perspective. My intention is not to undermine the work but to ensure that we advance our field with rigorous, comprehensive, and practically relevant research. I encourage the authors to address these points in a future revision, which could significantly strengthen the paper's contribution to the bandit community and the broader field.

Sincerely, Reviewer QBvS

Comment

Dear Reviewer QBvS,

We respectfully disagree with the reviewer's claim that "your analysis still contradicts empirical evidence, as reported in your own simulations." Our simulations, conducted during the rebuttal, demonstrate that ensemble sampling performs comparably to, or even outperforms, widely used and provably efficient algorithms such as LinUCB and LinTS when a sufficiently large number of ensembles is used. Our analysis does not contradict empirical evidence.

A significant issue arises from the reviewer's focus on requesting an empirical case in which the ensemble size must scale linearly with $K$, leading to the assertion that "our analysis still contradicts empirical evidence." This claim is logically flawed because, as clearly stated in our paper and reiterated during the discussion, the ensemble size condition of $K\log T$ is a sufficient condition, not a necessary condition. The reviewer seems to conflate sufficient and necessary conditions, persisting in this misunderstanding even after we clarified it during the discussion.

Please think about the following question: why would we even perform and present additional results during the rebuttal if our analysis indeed contradicted empirical evidence? It is disappointing to see non-contradictory results being used to challenge our theoretical contributions, without valid backing (with confusion between the necessary condition and sufficient condition).

Furthermore, our results are presented in terms of frequentist regret, which hence addresses the worst-case setting. The ensemble size condition we provide is indeed a sufficient condition under this worst-case framework. It is important to recognize that asking for empirical evidence to prove the necessity of a condition that is only proven to be sufficient in the worst case is not logically sound.

We also challenge the reviewer's implication that "$d \gg K$, $d \approx K$, and $d \ll K$ may lead to different insights and bounds." As we have clearly stated, our frequentist regret bound of $O(d^{3/2}\sqrt{T})$ holds regardless of the relationship between $d$ and $K$ (whether $d \gg K$, $d \approx K$, or $d \ll K$). This is a factual statement, and it would be misleading to suggest otherwise.

We sincerely and respectfully ask the reviewer to acknowledge the factual errors and inconsistent claims made in the review and throughout the previous comments. We understand that reviewers can sometimes make mistakes or overlook certain contributions, and we are open to constructive feedback. However, doubling down on these mistakes without any acknowledgment of the errors should not be the way forward. Our active discussion would have been much more constructive if the reviewer and we had agreed upon simple facts. We are, of course, willing to make minor edits to the presentation and framing if reasonably suggested. If the reviewer's intention is to help strengthen our paper, we would welcome any suggestions, provided that the feedback is grounded in agreed-upon facts. Please note that we stand firm by our rigorously proven theoretical results.

Final Decision

This paper makes significant progress on ensemble sampling, which is a generic strategy that can be instantiated for any hypothesis class (with minor adjustments). Obtaining an improved frequentist regret bound and showing that the ensemble size is $\tilde{O}(1)$ is quite an advancement over the existing results. This is particularly meaningful since the prior art by Janz et al. uses a variation of ensemble sampling that is, depending on interpretation, closer to Thompson sampling than to ensemble sampling.

Please try to improve the quality of the manuscript by incorporating the reviewers' comments (I also recommend adding the nice comparison table the authors provided during the rebuttal).