Generalized Kernelized Bandits: Self-Normalized Bernstein-Like Dimension-Free Inequality and Regret Bounds
Abstract
Reviews and Discussion
The paper proposes and tackles the RKHS version of the generalized linear bandit (GLB), which unifies both the finite-dimensional GLB and kernelized linear bandits. The main technical novelty is the derivation of a new self-normalized, Bernstein-type concentration bound by combining the "stitching" argument of Howard et al. (2021) with Freedman's inequality. Based on this, the authors propose GKB-UCB, which attains a regret bound of $\widetilde{O}(\widetilde{\gamma}_T \sqrt{T/\kappa_*})$, where $\widetilde{\gamma}_T$ is the maximal information gain and $\kappa_*$ is the inverse of the slope at the optimal arm, reminiscent of and a strict generalization of the best known regret for finite-dimensional GLB.
Strengths and Weaknesses
Strengths
- Well-written
- Technically sound, and the proposed approach to building new concentration seems distinct from prior works.
Weaknesses
- No experimental results. Maybe take some inspiration from [1]. I feel that this would strengthen the paper, especially as it would highlight the efficient implementation of GKB-UCB. I would be interested to see an empirical regret comparison between the non-efficient and efficient versions, and other applicable baselines.
Questions
- The regret bounds, when instantiated to finite-dimensional scenarios, seem to have extra logarithmic factors. What is the reason for this? Is this an inherent limitation of the algorithm, or simply an artifact of the current proof?
- The authors state that for the analysis purpose, the worst-case radius is used. Why is this necessary? I don't recall seeing similar arguments when proving the regret bounds for finite-dimensional GLBs (please correct me if I'm wrong here)
- The data-driven Freedman's inequality, although itself seems novel, seems to resemble prior such bounds: [1, Theorem 1], [2, Lemma 3] with simpler proofs. Can the authors elaborate on why the current form of the data-driven Freedman is necessarily tight for this paper's purpose?
Limitations
yes
Final Justification
All my concerns have been addressed, and considering the novelty of the tackled problem setting and its success, I maintain my score of 5.
There was some extensive discussion between the authors and Reviewer nkzS, which I have fully reviewed and commented on as well. While I do agree with some of the points that nkzS has raised, I do not agree with nkzS's overall downplaying of this paper's contribution, as none of the prior works' notions of effective dimension consider the heteroskedasticity (caused by the GLM) properly. Also, I believe that the authors' rebuttal has sufficiently clarified all the confusions that nkzS has raised.
Formatting Concerns
None
We thank the Reviewer for the feedback and for having recognized that the paper is well-written and technically sound. Below we address the Reviewer's concerns.
Weaknesses
Although we believe that the core contributions of the paper are mainly theoretical and algorithmic, we agree that some numerical simulations comparing the two versions of our algorithm, as well as comparisons with existing ones for KBs and GLBs, would be of interest. We are running these simulations, and we will include them in the final version of the paper.
Questions
- Yes, especially in the case of sub-Gaussian noise, we incur an additional logarithmic factor (see line 340). We believe this is due to the use of Bernstein-like concentration bounds, which are suboptimal when dealing with sub-Gaussian noise. In this case, Hoeffding-like concentration bounds would be more appropriate. However, for our purposes, we need to use Bernstein-like bounds in order to highlight the dependence on the variance, specifically to obtain tight bounds that depend on $\kappa_*$ in the dominating term.
- This is necessary because the confidence radius is instantiated for different functions $f$, specifically the optimistic ones (see, for instance, Equation 105). To avoid a dependence on the (random) sequence of optimistic functions in the regret bound, we chose to use the worst-case version of the radius with respect to the choice of $f$. This phenomenon does not arise in generalized linear bandit algorithms, since the corresponding confidence radius does not explicitly depend on $\theta$ (the counterpart of $f$ in that setting); see, for instance, [A, Section 3.1]. In other words, those algorithms already consider the worst-case confidence radius by design.
- Informally, our inequality can be seen as an extension of those mentioned by the Reviewer, allowing us to achieve a tight dependence on the square root of the cumulative variance. Specifically:
- [1, Theorem 1]: In the first inequality, the dependence is on a free parameter that must be chosen deterministically, and thus independently of the cumulative variance, which is a random quantity. The second inequality instead depends directly on the cumulative variance, rather than its square root (as ours does), which is necessary for achieving tight regret bounds.
- [2, Lemma 3]: This provides an inequality that depends on a parameter which (similarly to the free parameter in [1]) must be chosen deterministically and independently of the data. Our bound can be viewed as an extension that allows this parameter to depend on the cumulative variance, enabled by the stitching technique.
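For context, one standard corollary of Freedman's inequality (a classical fact, stated here only to fix ideas; the notation is ours, not the paper's): for a martingale difference sequence $(X_s)_s$ with $|X_s| \le b$ and any deterministic $V > 0$,

$$\mathbb{P}\left(\exists t \geq 1: \ \sum_{s=1}^{t} X_s \geq \sqrt{2V\log\tfrac{1}{\delta}} + \tfrac{2b}{3}\log\tfrac{1}{\delta} \ \text{ and } \ \sum_{s=1}^{t} \mathbb{E}\left[X_s^2 \mid \mathcal{F}_{s-1}\right] \leq V\right) \leq \delta.$$

The stitching/peeling argument applies such a bound over geometrically spaced values of $V$ and takes a union bound, which is what replaces the deterministic $V$ with (the square root of) the realized cumulative variance, at the price of doubly logarithmic factors.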
References
[A] Abeille, Marc, Louis Faury, and Clément Calauzènes. "Instance-wise minimax-optimal algorithms for logistic bandits." International Conference on Artificial Intelligence and Statistics. PMLR, 2021.
Thank you for your responses. They have addressed all my concerns, and I strongly maintain my score of 5.
I have read through all the discussions, especially with Reviewer nkzS. I have some comments of my own on some of the controversies, much of them as comments/questions to the authors and some as responses to Reviewer nkzS.
Heteroskedasticity
- I didn't get the "explicitly told to the agent" heteroskedastic noise part, but because we are in a parametric setup (at least in the context of the noise distribution), I don't see why this is a problem.
- $\sqrt{T/\kappa_*}$ is precisely of the form of variance-aware linear bandits. Namely, by definition, as the "optimal" variance (at the optimal arm) across the iterations is the same as $1/\kappa_*$, the "analogous" form would be $\sqrt{\sum_{t=1}^T 1/\kappa_*} = \sqrt{T/\kappa_*}$.
- Question to the authors: Is it straightforward for the results in this paper to be extended to time-varying arm sets? In that case, one would obtain a regret bound that scales as $\sqrt{\sum_{t=1}^T 1/\kappa_{*,t}}$, where $\kappa_{*,t}$ is the inverse slope at the optimal arm of round $t$.
Regarding the controversy on the effective dimension, I looked into it myself by checking relevant kernel/neural bandits literature [1,2,3,4,5] as well as a dueling bandit paper [6] and a logistic bandit paper [7]. Here are my opinions:
- Just in case I was wrong, I double-checked, and it seems that the effective dimension is (or should be) non-stochastic, as it takes into account the entire context (arm) sets throughout the horizon, and to my knowledge in contextual bandits, the randomness or adversarialness of the arm sets should be independent from the algorithm's randomness. [4, Sec 4.2] first defines this by taking the worst case over all possible context set sequences; subsequent works [5, Def 3.2] [3, Def 4.3] [1, Def 5.3] don't explicitly take the worst case, but they still construct the (weighted/Hessian-like) design matrix over the entire context set sequence. [1, Def 5.3] even explicitly states "design matrix containing all possible context-arm feature vectors over the rounds".
- [2] is a bit confusing, as it states that "Note that placing an assumption on above is analogous to the assumption on the eigenvalue of the matrix in the work on linear dueling bandits Bengs et al. (2022)," but then [2]'s definition (see their Eqn. (4)) seems to be in line with the above, considering the entire context set sequence. I believe this is a typo(..?) on [2]'s side, as Bengs et al. (2022) [6] have two assumptions on the eigenvalues: one critical (Sec. 3.2.1. (A1)) and one not so critical (Corollary 3.3). I believe that [2] was referring to the critical one, which originates from [7] and is again independent of the algorithm, and is only there to ensure that the context set sufficiently spans the whole space.
- Anyhow, long story short, I agree with nkzS that the authors should be a bit more careful when adding discussions regarding the effective dimension, as I believe that the effective dimensions of prior works do NOT depend on the specific algorithm's randomness. The authors should definitely include the more detailed comparison of their information gain to prior notions discussed during the rebuttal.
Minor stuff
- I believe that the authors have indeed cited Srinivas et al. (2010) as [29] (cited in the Appendix, footnote 10, but still included in the main text's references)
- This is something that I missed as well, but as correctly pointed out by nkzS, the authors should take more care in making the exposition rigorous, especially regarding infinite-dimensional notions such as Fredholm determinants throughout the main text and the proof.
- (line 245) "The fundamental result in the GLB literature is the bound of [9, Theorem 1]:" => should make it clear that the concentration holds only for Bernoulli.
- (should not impact the decision at all) about a week ago, a preprint (https://arxiv.org/abs/2507.20982) came out that provides a nice overview of the missing references as well as a new dimension-free concentration for kernelized logistic regression. For the camera-ready, relevant discussion of this would be nice as well.
P.S. I may be wrong on some stuff, especially regarding the quick literature review. Please feel free to let me know if there is anything that I missed, misinterpreted, or got wrong in my response. Thanks!
[1] https://arxiv.org/abs/2505.02069
[2] https://openreview.net/pdf?id=VELhv9BBfn
[3] https://proceedings.mlr.press/v119/zhou20a/zhou20a.pdf
[4] https://proceedings.mlr.press/v119/yang20h/yang20h.pdf
[5] https://arxiv.org/pdf/2010.00827
[6] https://proceedings.mlr.press/v162/bengs22a/bengs22a.pdf
Thank you for supporting our work and for the constructive feedback!
Heteroskedasticity
Thank you for the comment, especially for having highlighted the relationship between $T/\kappa_*$ and the cumulative variance at the optimal arm. We will add a comment in the manuscript to note this point.
Question for the authors: We believe that, from a technical perspective, the extension to a time-varying arm set does not pose significant challenges. Indeed, the concentration result is not affected by the time-varying decision set, whereas in the regret analysis, it is sufficient to replace $T/\kappa_*$ with $\sum_{t=1}^T 1/\kappa_{*,t}$, as the Reviewer suggested. Optimism works since the current decision will be taken from the current decision set.
Regarding the controversy on the effective dimension
Yes, you are right. The definition of the effective dimension in [4, Sec. 4.2] considers the sum of all the context/decisions in the set over the horizon and is therefore not stochastic. Thank you for pointing out our mistake. Instead, our maximal information gain considers the worst-case sequence of decisions over the horizon. We will include a detailed comparative discussion in the manuscript, summarizing all the considerations raised during the rebuttal regarding the effective dimension.
Minor points
We will fix the citations and describe the rigorous treatment of infinite-dimensional notions (e.g., Fredholm determinants). Regarding the recent preprint https://arxiv.org/abs/2507.20982, we came across it and noticed that it addresses a problem similar to ours concerning the concentration inequality. We will include a comparative discussion in the manuscript.
The authors proposed a new nonparametric form subsuming both generalized linear models and function approximation in reproducing kernel Hilbert spaces. Under this framework of general function approximation, they adapted weighted ridge regression to this bandit setting so as to achieve a high-probability regret bound using optimism in the face of uncertainty. The authors also came up with an efficient approximation of their algorithm.
Strengths and Weaknesses
Strengths
- Most mathematical derivations are clear and easy to follow.
- Theorem 6.2 and its proof align with the intuition of the community of variance-aware online learning.
- The effort in deriving the efficient implementation in Appendix A sets a good role model for the theory community.
Weaknesses on Credit Assignment and Over-claiming
- At the message and technical level, the idea of "kernel + link function" already appears in the bandit literature, e.g., [1]. In particular, [1] achieved nearly the same regret bound as that in this paper using similar (and even simpler) technical components, in which their "effective dimension" is essentially the same as the authors' "maximal information gain". [1] appeared on arXiv about a week before the submission deadline and the authors did not discuss it in the paper.
- Table 1 is made to show that this paper is the only one that can tackle heteroskedastic noise besides the work [2]. However, what the authors actually considered is the "variance-revealing" case, in which the potentially heteroskedastic variance of the picked arm is given to the learner. When the reward function (and hence the variance) is unknown, such an implicit assumption is not entirely reasonable. In contrast, certain papers on linear and generalized linear bandits dating back two years, e.g., [3] and [4], are already able to handle heteroskedastic noise without this type of variance-revealing assumption.
- Both quantities appearing in Assumption 3.1 and Assumption 3.2 should not be treated as "constants". In particular, the norm bound $B$ and the kernel bound are important problem-dependent quantities. Therefore, the "state-of-the-art" statement in the Abstract of this paper is an over-claim, because the dependency on these quantities is not optimal when specialized to generalized linear bandits, given the true SoTA [5], which the authors also mentioned in the conclusion.
- The authors called the standard peeling technique "stitching" and claimed its application to deriving an "empirical" Freedman to be "of independent interest". However, this type of peeling for improving Freedman has been widely used in the variance-aware online learning community. To be concrete, Theorem 9 of [6] and Appendix A of [7] are two examples of such peeling techniques.
Weaknesses on Rigorousness and Correctness
- When deriving equation (68), which is a key part in proving Theorem 6.2, the authors invoked their version of the elliptical potential lemma, Lemma C.6. In the last step of deriving Lemma C.6, the authors claimed that it follows from the proof in [8]. However, if we truly follow the derivation of Lemma 12 of [8], two issues arise:
- Even in the finite-dimensional case, the term on the right-hand side of equation (196) does not really align with the claimed quantity if the dimension is not one.
- To make matters worse, in the infinite-dimensional case, there is in general no well-defined determinant for an operator. Technically speaking, the only known and widely used notion of determinant for an infinite-dimensional real Hilbert space is the Fredholm determinant, which is only applicable for operators of the form $I + A$ where $A$ has a finite trace. See, e.g., https://en.wikipedia.org/wiki/Fredholm_determinant for details. Therefore, even though equation (196) might intuitively make sense for infinite-dimensional Hilbert spaces, it is not mathematically clear what the authors actually did to achieve the bound in equation (196). I suggest the authors first try to write down the concrete definition of the determinant of the necessarily involved operators in deriving inequality (196), instead of citing [8] in a sloppy way.
References
[1] Bae, S., & Lee, D. (2025). Neural Logistic Bandits. arXiv preprint arXiv:2505.02069. (https://arxiv.org/pdf/2505.02069)
[2] Faury, L., Abeille, M., Calauzènes, C., & Fercoq, O. (2020, November). Improved optimistic algorithms for logistic bandits. In International Conference on Machine Learning (pp. 3052-3060). PMLR.
[3] Zhao, H., He, J., Zhou, D., Zhang, T., & Gu, Q. (2023, July). Variance-dependent regret bounds for linear bandits and reinforcement learning: Adaptivity and computational efficiency. In The Thirty Sixth Annual Conference on Learning Theory (pp. 4977-5020). PMLR.
[4] Di, Q., Jin, T., Wu, Y., Zhao, H., Farnoud, F., & Gu, Q. (2023). Variance-aware regret bounds for stochastic contextual dueling bandits. arXiv preprint arXiv:2310.00968.
[5] Lee, J., Yun, S. Y., & Jun, K. S. (2024). A unified confidence sequence for generalized linear models, with applications to bandits. arXiv preprint arXiv:2407.13977.
[6] Zimmert, J., & Lattimore, T. (2022, June). Return of the bias: Almost minimax optimal high probability bounds for adversarial linear bandits. In Conference on Learning Theory (pp. 3285-3312). PMLR.
[7] Li, G., Cai, C., Chen, Y., Wei, Y., & Chi, Y. (2021). Is Q-Learning Minimax Optimal? A Tight Sample Complexity Analysis. arXiv preprint arXiv:2102.06548.
[8] Abeille, M., Faury, L., & Calauzènes, C. (2021, March). Instance-wise minimax-optimal algorithms for logistic bandits. In International Conference on Artificial Intelligence and Statistics (pp. 3691-3699). PMLR.
Questions
The maximal information gain in this paper is essentially the same notion as the "effective dimension" in the neural bandits literature like [1] and corresponds to the dimension $d$ (up to logarithmic factors) in the linear bandits literature, as stated in [2], right?
References
[1] Zhou, D., Li, L., & Gu, Q. (2020, November). Neural contextual bandits with ucb-based exploration. In International Conference on Machine Learning (pp. 11492-11502). PMLR.
[2] Srinivas, N., Krause, A., Kakade, S. M., & Seeger, M. (2009). Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995.
By the way, the missing citation of [2] is really not as expected, because this GP-UCB paper proposes the maximal information gain for the first time in the bandit community.
Limitations
Yes.
Final Justification
I had an in-depth discussion with the authors regarding their contributions. I believe that I have fully understood what the authors try to claim. However, I found that every concern (about their contributions) I raised does hold water, and the best the authors have done is mainly reiterating what I am concerned about by stating "what we intended by ... is ..." or "this is what we intended by ...". Therefore, my conclusion is that the contribution of this submission is hardly novel and Table 1 in this submission contains significant overclaims.
- I am willing to discuss about this submission with the area chair if necessary.
Formatting Concerns
N/A.
We thank the Reviewer for the feedback and for appreciating the clarity of the mathematical derivations and the effort in deriving an efficient implementation. Below, we address the Reviewer's concerns.
Weaknesses on Credit Assignment and Over-claiming
- We thank the Reviewer for reference [1]. First of all, we note that, according to the NeurIPS Call for Papers, [1], being made available online after March 1st, must be considered contemporaneous work. Therefore, the presence of [1] cannot be used as an argument to diminish the novelty of our paper. Nevertheless, we provide below a comparative discussion with our work, which we will include in the final version of the paper:
- Section 3 of [1] provides a concentration bound (Theorem 3.1) similar to our Theorem 6.2. However, their result (i) assumes a finite-dimensional decision set; and (ii) exhibits a worse dependence on the cumulative variance, of logarithmic rather than doubly logarithmic order. Looking at Appendix A of [1], it can be observed that their proof technique is very similar to ours, as both are inspired by Zhou et al. (2022). Thus, [1] is not using simpler techniques, except for avoiding the stitching/peeling argument, which we specifically use to reduce the dependence to doubly logarithmic.
- The effective dimension in Definition 5.3 of [1] is not the same as our maximal information gain $\widetilde{\gamma}_T$. Indeed, Definition 5.3 of [1] does not consider the exact variance term of each round, which is instead upper bounded by its maximum value (the maximum slope of the link function, in our notation). In contrast, in our definition in Section 5, we retain the variance terms within the definition of the information gain, which leads to a tighter quantity (see the numerical sketch after this comparison). Even when considering the worst-case sequence of decisions, leading to the maximal information gain $\widetilde{\gamma}_T$, we still retain the variance factors. Finally, the effective dimension of [1] is a random quantity that depends on the specific realization of the algorithm's execution (i.e., the sequence of decisions). For this reason, it is inappropriate to present a regret bound, such as in Theorem 5.5 of [1], that depends on it. Instead, a maximal version based on the worst-case sequence of decisions should be considered (like our $\widetilde{\gamma}_T$), as is standard in approaches like [2].
- The regret bound of [1, Theorem 7.2] holds under Condition 5.4, which in turn involves several assumptions (5.1 and 5.2) that are not required by our algorithm. Before comparing the regret bounds, a few considerations:
(i) [1, Theorem 7.2] does not show a dependence on the norm of the decisions, since under their Assumption 5.2 it is assumed to be 1. This corresponds, in our notation, to setting the kernel bound to 1;
(ii) [1, Theorem 7.2] depends on a quantity defined in their Definition 6.2 that is related to the bound on the norm of the parameters. Thus, apart from constants (like the number of hidden neurons in the network), it corresponds to our parameter $B$.
Given these considerations, we can effectively compare our regret bound with that of [1] (setting the kernel bound to 1 and assuming $B \geq 1$ for a fair comparison):
$$\text{Our bound:} \qquad \widetilde{O}\left( B \widetilde{\gamma}_T \sqrt{\frac{T}{\kappa_*}} + B^2 \sqrt{\frac{\widetilde{\gamma}_T T }{{\kappa_{*}}}} \right).$$

$$\text{Bound of [1, Theorem 7.2]:} \qquad \widetilde{O}\left(B^2 \widetilde{\gamma}_T \sqrt{\frac{T}{\kappa_{*}}} + B^{2.5} \sqrt{\frac{\widetilde{\gamma}_T T}{\kappa_{*}}} \right).$$
Thus, our bound exhibits a better dependence on $B$.
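As anticipated above, here is a minimal numerical sketch of the difference between retaining the per-round variance terms and bounding them by their maximum; all dimensions, features, and variance values below are illustrative assumptions, not the paper's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, lam = 200, 5, 1.0

Phi = rng.normal(size=(T, d)) / np.sqrt(d)   # rows: feature vectors phi(x_s)
sigma2 = 0.25 / (1.0 + np.arange(T))         # heteroscedastic per-round variances

# Log-det information gain retaining each variance term ...
V_weighted = lam * np.eye(d) + (sigma2[:, None] * Phi).T @ Phi
# ... versus replacing every variance with its maximum value, as in [1, Def 5.3].
V_maxed = lam * np.eye(d) + sigma2.max() * (Phi.T @ Phi)

gain_weighted = 0.5 * np.linalg.slogdet(V_weighted / lam)[1]
gain_maxed = 0.5 * np.linalg.slogdet(V_maxed / lam)[1]
print(gain_weighted <= gain_maxed)           # True: the weighted version is tighter
```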
- Regarding the heteroscedasticity: the goal of our Table 1 is to compare the approaches able to tackle generalized linear bandits or kernelized bandits. We thank the Reviewer for the references; however, we think they do not consider these settings. Indeed, [3] focuses on dueling bandits that differ from our setting in the interaction protocol (although a link function is present as well), whereas [4] focuses on linear bandits. Nevertheless, we will cite them in the related works section to mention further works that focus on the heteroscedastic case.
- As the Reviewer noted, we acknowledged the non-tight dependence on the norm bound $B$ and the kernel bound in the conclusions. In the abstract, we intended to specify the tightness of the dependence only with respect to $T$, $\widetilde{\gamma}_T$, and $\kappa_*$. To avoid any risk of overclaiming, we will rephrase the sentence in the abstract as:
"Our results match, up to multiplicative constants and logarithmic terms, the state-of-the-art bounds for both KBs and GLBs with respect to $T$, $\widetilde{\gamma}_T$, and $\kappa_*$, while showing suboptimal dependence on the norm bound and the kernel bound."
- First of all, we clarify that the term stitching is widely used in the community dealing with confidence sequences [e.g., A, B]. More importantly, we never claimed that our empirical Freedman inequality in Theorem 6.1 is of independent interest. Indeed, we are aware that similar results already exist, as evidenced by the sentence between lines 282 and 283, where we cite a result, [12, Theorem 12], analogous to those provided by the Reviewer (Theorem 9 of [6] and Appendix A of [7]). We will include the references suggested by the Reviewer in the final version. However, we highlight that they compare analogously to [12, Theorem 12] in relation to our inequality. Specifically, as stated between lines 283-284, "our result allows tuning the parameters to tighten the bound, ultimately leading to an improvement of the constants." We emphasize once again that this is the only improvement over the state of the art that we claim for Theorem 6.1, namely, generality in the parameters and better constants compared to Theorem 9 of [6] and Appendix A of [7]. We will further clarify this in the final version. The phrase "independent interest" appears in our manuscript at lines 10 and 346, and in both cases, it unequivocally refers to the self-normalized bound of Theorem 6.2, which is novel.
Weaknesses on Rigorousness and Correctness
Thank you for pointing out the issue. As we mentioned in Footnote 12, we focused on the setting of paper [32], as our derivations are valid under Assumptions 1 and 2 of [32], which are the same as those considered in our paper. As the Reviewer correctly points out, in [32], the determinant used is the Fredholm determinant. We will clarify this in our manuscript. Regarding the specific derivation in line (196), we note that the same argument (in the infinite-dimensional case) is employed in the "Proof Sketch for Theorem 2," which relies on Lemma 5 of [32], a result that holds in infinite-dimensional Hilbert spaces. For this reason, we will replace the citation to [8] with a reference to [32, Lemma 5] and mention that we are working with Fredholm determinants.
References
[1] Bae, S., & Lee, D. (2025). Neural Logistic Bandits. arXiv preprint arXiv:2505.02069.
[2] Faury, L., Abeille, M., Calauzènes, C., & Fercoq, O. (2020, November). Improved optimistic algorithms for logistic bandits. In International Conference on Machine Learning (pp. 3052-3060). PMLR.
[3] Zhao, H., He, J., Zhou, D., Zhang, T., & Gu, Q. (2023, July). Variance-dependent regret bounds for linear bandits and reinforcement learning: Adaptivity and computational efficiency. In The Thirty Sixth Annual Conference on Learning Theory (pp. 4977-5020). PMLR.
[4] Di, Q., Jin, T., Wu, Y., Zhao, H., Farnoud, F., & Gu, Q. (2023). Variance-aware regret bounds for stochastic contextual dueling bandits. arXiv preprint arXiv:2310.00968.
[5] Lee, J., Yun, S. Y., & Jun, K. S. (2024). A unified confidence sequence for generalized linear models, with applications to bandits. arXiv preprint arXiv:2407.13977.
[6] Zimmert, J., & Lattimore, T. (2022, June). Return of the bias: Almost minimax optimal high probability bounds for adversarial linear bandits. In Conference on Learning Theory (pp. 3285-3312). PMLR.
[7] Li, G., Cai, C., Chen, Y., Wei, Y., & Chi, Y. (2021). Is Q-Learning Minimax Optimal? A Tight Sample Complexity Analysis. arXiv preprint arXiv:2102.06548.
[8] Abeille, M., Faury, L., & Calauzènes, C. (2021, March). Instance-wise minimax-optimal algorithms for logistic bandits. In International Conference on Artificial Intelligence and Statistics (pp. 3691-3699). PMLR.
[A] Kuchibhotla, Arun K., and Qinqing Zheng. "Near-Optimal Confidence Sequences for Bounded Random Variables." International Conference on Machine Learning. PMLR, 2021.
[B] Chugg, Ben, Hongjian Wang, and Aaditya Ramdas. "Time-Uniform Confidence Spheres for Means of Random Vectors." Transactions on Machine Learning Research.
[32] Justin A. Whitehouse, Aaditya Ramdas, and Zhiwei S. Wu. On the sublinear regret of GP-UCB. In Advances in Neural Information Processing Systems (NeurIPS), volume 36, pages 35266–35276, 2023.
I have carefully reviewed the rebuttal from the authors, especially regarding the elliptical potential part. Based on the rebuttal, could I conclude that the self-normalized concentration inequality in this submission is dimension-free only if the underlying Hilbert space is of countably infinite dimension?
- I raise this question because, in Table 1 of this submission, some self-normalized bounds in previous works for finite-dimensional linear bandits are classified as "not dimension-free". But if the main concentration inequality in this submission is also not dimension-free in the finite-dimensional setting, the comparison in Table 1 might be unfair.
Thank you for carefully reviewing our rebuttal. Yes, indeed, our self-normalized concentration inequality holds for RKHSs with a countably infinite basis. To clarify, by dimension-free, we mean that the bound does not depend on the dimensionality of the space, that is, it does not depend on the cardinality $d$ of a minimal basis. We emphasize that this holds in both the countably infinite and finite-dimensional cases. In fact, our self-normalized concentration inequality depends only on the norm of the feature vector and not on $d$. In contrast, for example, the bounds in Faury et al. (2020) [9, Theorem 1] and Zhou et al. (2021) [36, Theorem 4.1] include an explicit dependence on $d$. We will clarify this point in the final version.
But in the finite-dimensional setting, no matter whether you are using [32, Lemma 5] or [8, Lemma 12], how could the stuff in the proof be dimension-free?
- If you use [8, Lemma 12] directly in the finite-dimensional setting, there should be a linear dependency on $d$, right?
- If you use [32, Lemma 5], how could it be applicable to the finite-dimensional setting?
To be more concrete, in the finite-dimensional setting, the log-determinant term is of order $d \log t$ and will be linear in $d$.
References
[8] Abeille, M., Faury, L., & Calauzènes, C. (2021, March). Instance-wise minimax-optimal algorithms for logistic bandits. In International Conference on Artificial Intelligence and Statistics (pp. 3691-3699). PMLR.
[32] Justin A. Whitehouse, Aaditya Ramdas, and Zhiwei S. Wu. On the sublinear regret of GP-UCB. In Advances in Neural Information Processing Systems (NeurIPS), volume 36, pages 35266–35276, 2023.
Thank you for your answer; your concern is now clearer to us. We take this opportunity to further clarify:
- When claiming the dimension-free property, we were referring to the explicit dependence on $d$ in the concentration inequality. Our Theorem 6.2 does not exhibit such a dependence.
- The Reviewer is referring, instead, to an implicit dependence on $d$, which appears in the bound of the log-determinant term in Theorem 6.2. Specifically, in the finite-dimensional case, the log-determinant term can be bounded by a quantity of order $d \log t$, which can be derived analogously to Lemma C.7. This bound applies to finite-dimensional feature spaces; but other bounds for this term, for example using the information gain, are also possible and meaningful even in the finite-dimensional case.
- Nevertheless, for generalized linear bandits, it has been shown that dependence on the dimensionality is unavoidable in the worst-case scenario (see, for instance, Theorem 2 of [2]). This is the reason why the bound on this term contains a dependence on $d$ and, in this sense, cannot be dimension-free.
- Finally, we remark that even the seminal inequality of Abbasi-Yadkori et al. (2011) [1], which we also classified as "dimension-free" in Table 1, contains a log-determinant term whose bound depends on $d$ (see Theorem 2 of [1], statements 1 and 2).
We hope this clarifies the issue. We will make sure to include a discussion of this point.
References
[1] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems (NIPS), pages 2312–2320, 2011
[2] Marc Abeille, Louis Faury, and Clément Calauzènes. Instance-wise minimax-optimal algorithms for logistic bandits. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 3691–3699. PMLR, 2021.
Let us say, just for linear logistic bandits with feature dimension $d$: no matter how you redefine the stuff into new quantities like "maximal information gain", both the self-normalized concentration bounds in this submission and [F] scale with $d$, right? But [F] is classified as "not dimension-free" in Table 1 of this submission. Is that my misunderstanding???
References
[F] Faury, L., Abeille, M., Calauzènes, C., & Fercoq, O. (2020, November). Improved optimistic algorithms for logistic bandits. In International Conference on Machine Learning (pp. 3052-3060). PMLR.
First of all, let us remark that the reason why we classified [F] as "non-dimension-free" is the explicit dependence on $d$ in the last addendum of Theorem 1, and not because the log-determinant is bounded by a $d$-dependent quantity. That said, [F] and our inequality, when restricted to logistic bandits, contain precisely the same log-determinant term $\log\det(I + \lambda^{-1}\widetilde{V}_t)$, with $\widetilde{V}_t = \sum_{s=1}^{t-1} \dot{\mu}(\theta^\top x_s)\, x_s x_s^\top$.
Now, the question becomes whether $\log\det(I + \lambda^{-1}\widetilde{V}_t)$ is always linear in $d$, or only in certain cases. Consider the following example, where we assume $\dot{\mu} \equiv 1$ for simplicity (i.e., linear bandits). Suppose $x_s = (1,0,\dots,0)^\top$ for every $s$; we have that:

$$\widetilde{V}_t = \begin{pmatrix} t-1 & 0 & \dots & 0 \\ 0 & 0 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & 0 \end{pmatrix}, \qquad \lambda^{-1} \widetilde{V}_t(\lambda) = I + \lambda^{-1} \widetilde{V}_t = \begin{pmatrix} 1 + \lambda^{-1}(t-1) & 0 & \dots & 0 \\ 0 & 1 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & 1 \end{pmatrix}.$$

As a consequence, $\log\det(I + \lambda^{-1}\widetilde{V}_t) = \log(1 + \lambda^{-1}(t-1))$, which shows no dependence on $d$. Clearly, this is a rather unrealistic case when applying our inequality or that of [F] in a regret minimization algorithm. Still, restricted to the discussion of the properties of the concentration inequality, this provides an example in which our inequality shows no dependence on $d$ at all, whereas that of [F] does, due to the presence of the explicit $d$-dependent addendum.
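For completeness, a quick numerical check of the example above (illustrative values; every $d$ gives the same log-determinant):

```python
import numpy as np

lam, t = 1.0, 1000
for d in (2, 10, 100):
    x = np.zeros(d)
    x[0] = 1.0                                   # x_s = (1, 0, ..., 0) every round
    V = (t - 1) * np.outer(x, x)                 # \widetilde{V}_t with dot_mu = 1
    logdet = np.linalg.slogdet(np.eye(d) + V / lam)[1]
    print(np.isclose(logdet, np.log(1 + (t - 1) / lam)))  # True for every d
```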
Nevertheless, if the Reviewer believes that our definition of dimension-free as lack of explicit dependence on may be misleading, we are open to changing it (including also implicit dependencies on ) in the manuscript to avoid any confusion.
- I believe I have understood what the authors try to claim about the reason why they sometimes give a worst-case bound on an execution-dependent quantity, but in some other cases (in the finite-dimensional setting) are in need of manifesting that the quantity is not always as bad as its $d$-dependent worst case.
- If my understanding is correct, then, let us now focus on the penultimate column of Table 1 of this submission, whose title is "Heterosc."; my concerns are as follows.
- Following the authors' reasoning about what they intended by heteroscedasticity: may I ask which paper on generalized linear bandits (under assumptions similar to this submission) cannot deal with heteroscedastic feedback noise? For example, it seems that the authors' reasoning implies that all papers published so far on logistic bandits can deal with heteroscedasticity.
- According to the explanation by the authors on what they intended by heteroscedasticity, the check marks in the penultimate column of Table 1 are only because the data model of interest is generalized linear models, and the contrast between the cross marks and check marks in this column is not a part of the "new knowledge" introduced by this submission.
- We are happy to have clarified this point.
- Regarding the additional concerns about heteroscedasticity: first of all, Table 1, as explained in the caption, is about the properties of the self-normalized concentration inequalities, not the properties of the regret minimization algorithms.
- Clearly, the papers on generalized linear bandits can deal with heteroscedastic noise. The key point is whether they are able to effectively capture it in the concentration tools they use to derive tight regret bounds. As also explained in [iv, Section 3.1, paragraph "Comparison to prior work"], one could deal with heteroscedasticity even using the inequality of Abbasi-Yadkori et al. (2011) [1], simply by considering the maximum variance/sub-Gaussian proxy. However, this would introduce a dependence on the minimum variance (denoted by $\kappa$ in [iv]), which is known to be suboptimal for generalized linear bandits. For instance, early works on generalized linear bandits can deal with heteroscedasticity, but they apply concentration tools that do not take it into account, resulting in a suboptimal dependence on the minimum variance; see the $\kappa$-dependent terms in [v, Theorem 2] and [vi, Theorem 2]. Instead, [ii] and [iv], as well as our work, display a tight dependence on $\kappa_*$.
- The penultimate column of Table 1 and the corresponding marks address the question: "Is the concentration inequality taking into account the possible heteroscedasticity due to the exponential family noise?" First, we believe this is a meaningful question in its own right, as a property of the concentration inequality. Second, this question also makes sense from the regret minimization perspective: as explained above, nothing prevents applying, for instance, the inequality of Abbasi-Yadkori et al. (2011) [1] while ignoring heteroscedasticity and merely considering the maximum variance/sub-Gaussian proxy, which leads to suboptimal regret bounds. Finally, our self-normalized concentration inequality allows us to answer "yes" to the above question in the setting of generalized kernelized bandits. For this reason, we consider it to represent "new knowledge" that was not previously available in the literature.
References
[v] Li, Lihong, Yu Lu, and Dengyong Zhou. "Provably optimal algorithms for generalized linear contextual bandits." International Conference on Machine Learning. PMLR, 2017.
[vi] Filippi, Sarah, et al. "Parametric bandits: The generalized linear case." Advances in Neural Information Processing Systems 23 (2010).
- I have carefully read the authors' response on what they intend to claim by "heteroscedasticity".
- I have re-evaluated this submission based on all the discussions by far.
- I have submitted my final rating and final justification.
I have read the authors' elaboration on what they intend to claim by "dimension-free" and believe I have understood what they want to claim about it. Regarding the other technical parts of the rebuttal, I have the following viewpoints as a follow-up.
- Based on my understanding, in the rebuttal and subsequent responses, the authors battle with themselves as follows.
- The authors consider it inappropriate to place the execution-dependent quantity termed the "effective dimension" in the standard neural bandits literature [Z, V, 1] in the upper bounds, and claim their worst-case quantity (the maximal information gain) to be more appropriate than that in [1] when comparing these two concurrent works.
- On the other hand, when the authors discuss their "worst-case" quantity, they try to justify that this execution-dependent quantity can be dimension-free when the execution has nice structure.
- Overall, I suggest the authors keep a consistent tone when discussing their work and others. By the way, a minor point is that, if it is indeed the case that the two quantities essentially coincide, it is not very appropriate to claim the introduction of the maximal information gain to be a key difference from that in [1].
- In the authors' rebuttal bullet "Regarding the heteroscedasticity ...", there are several points to note.
- The authors claim that "[3] focuses on dueling bandits", which is not true. [3] focuses on linear bandits.
- Also, as a minor point, at least two papers on linear bandits appear in Table 1 of this submission, but in the rebuttal, the authors state that "the goal of our Table 1 is to compare the approaches able to tackle generalized linear bandits or kernelized bandits".
- Most importantly, the authors claim that their method can tackle heteroscedastic noise, but only when the heteroscedastic variance is explicitly told to the agent. This is considered a significant drawback because such "explicit variance revealing" has already not been needed in the linear and generalized linear settings for at least several years; see, e.g., [3,4].
- I definitely know that "dueling ... is not exactly generalized linear ..." and "kernel ... is not exactly linear ...", but technically speaking, since "explicit variance revealing" can be avoided in linear bandits or bandits with link functions, it should be, mathematically, not needed for at least the finite-dimensional case in this submission.
- Since "heteroscedasticity" is really an application-inspired concept, it is not very clear why the agent could see the heteroscedastic variance by explicitly knowing $\mu$, where we know that the noise variance is exactly $\dot{\mu}(f(x))$.
References
[Z] Zhou, D., Li, L., & Gu, Q. (2020, November). Neural contextual bandits with ucb-based exploration. In International conference on machine learning (pp. 11492-11502). PMLR.
[V] Verma, A., Dai, Z., Lin, X., Jaillet, P., & Low, B. K. H. (2024, January). Neural dueling bandits. In ICML 2024 Workshop: Foundations of Reinforcement Learning and Control--Connections and Perspectives.
[1] Bae, S., & Lee, D. (2025). Neural Logistic Bandits. arXiv preprint arXiv:2505.02069.
[3] Zhao, H., He, J., Zhou, D., Zhang, T., & Gu, Q. (2023, July). Variance-dependent regret bounds for linear bandits and reinforcement learning: Adaptivity and computational efficiency. In The Thirty Sixth Annual Conference on Learning Theory (pp. 4977-5020). PMLR.
[4] Di, Q., Jin, T., Wu, Y., Zhao, H., Farnoud, F., & Gu, Q. (2023). Variance-aware regret bounds for stochastic contextual dueling bandits. arXiv preprint arXiv:2310.00968.
Another follow-up question: if the authors claim that the algorithm can deal with heteroscedastic variance, why does the final regret bound look like a worst-case one? In other words, the regret bound is not something proportional to the realized cumulative variance, e.g., $\sqrt{\sum_{t=1}^T \sigma_t^2}$.
We are happy that we have clarified what we intended by "dimension-free." Regarding the follow-up questions:
- We believe that we kept a consistent tone in discussing the information gain, the maximal information gain, and the log-determinant term.
- Yes, it is more appropriate in our view to have a decision-independent quantity like $\widetilde{\gamma}_T$ in the regret bound, since it is not random and allows us to quantify the (high-probability) regret of the algorithm before executing it, i.e., before seeing the sequence of decisions. This is in line with the linear bandit, generalized linear bandit, and kernelized bandit literature [i, Theorem 4], [ii, Theorem 1], [iii, Theorem 2], [iv, Theorems 2 and 3]. We don't see the point of deviating from such a practice. Since the goal of this work is to combine generalized linear bandits and kernelized bandits, we followed the established practices in these areas to enable a fair comparison.
- We indeed provided an explicit example in which the log-determinant term does not depend on $d$.
- The two points above are consistent. The first one is about the regret bound, and the second one is about the algorithm. Indeed, using the smaller random quantity in the confidence radius limits the exploration of the algorithm without compromising its theoretical guarantees. So, our desideratum is: use the smallest (random) confidence radius in the algorithm, and then present the regret bound as a deterministic expression based on $\widetilde{\gamma}_T$, for the reasons clarified above.
About heteroscedasticity:
- Of course, it is a typo. We misplaced reference [3] with reference [4], as is apparent in the sentence "Indeed, [3] focuses on dueling bandits that differ from our setting in the interaction protocol (although a link function is present as well), whereas [4] focuses on linear bandits."
- Yes, the Reviewer is right. The first two papers in Table 1 refer to linear bandits, since they are the ones on top of which both generalized linear bandits and kernelized bandits build.
- We take the opportunity to clarify the origin of the stochasticity in our work. We remind that we are considering a canonical exponential family noise model (Equation 2), which, for some choices of its parameters, implicitly generates noise random variables with different variances depending on the decision $x$. Because of the properties of the canonical exponential family, the variance of the noise is directly related to its parameters. Given this, some observations are in order:
- The link function $\mu$ is known to the agent, as in all the literature about generalized linear bandits [e.g., ii, iv], and, as a consequence of this, the variance of the noise is revealed to the agent. This is what we intend by heteroscedasticity (a concrete worked instance is given after this list). If the Reviewer thinks that this term is inappropriate and/or misleading, we are open to any suggestion to change it. Still, it remains the common practice in the works on generalized linear bandits.
- The Reviewer suggests that "since 'explicit variance revealing' can be avoided in linear bandits or bandits with link functions, it should be, mathematically, not needed for at least the finite-dimensional case in this submission." First of all, paper [3] considers Bernoulli feedback, which is a specific case of the canonical exponential family. Even assuming, following the Reviewer's claim, that this problem is solved in bandits with link functions (we think that it might be the case for logistic ones or, at least, for ones with Bernoulli observations, but not in general for the canonical exponential family), it remains an open problem in kernelized bandits, as far as we know. Thus, we do not see the point in diverging from the common practice of assuming knowledge of $\mu$ and attempting to relax such an assumption in our work, given that the primary goal is not addressing heteroscedastic generalized linear bandits or heteroscedastic kernelized bandits, but combining generalized linear bandits and kernelized bandits.
- Regarding a possible dependence on the realized cumulative variance: we clarified above what we mean by heteroscedasticity in our setting. It is not our goal to obtain such a result, in line with the generalized linear bandit literature. By the way, the variance-related terms will appear in the maximal information gain $\widetilde{\gamma}_T$.
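As a concrete instance of the point above (a standard fact, stated here only for clarity): for Bernoulli observations with the sigmoidal link $\mu(z) = (1+e^{-z})^{-1}$, the observation at decision $x$ has mean $\mu(f(x))$ and variance

$$\operatorname{Var}[y_x] = \mu(f(x))\big(1 - \mu(f(x))\big) = \dot{\mu}(f(x)),$$

so the noise variance is determined by the (known) link function evaluated at the (unknown) reward function.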
References
[i] Chowdhury, Sayak Ray, and Aditya Gopalan. "On kernelized multi-armed bandits." International Conference on Machine Learning. PMLR, 2017.
[ii] Marc Abeille, Louis Faury, and Clément Calauzènes. Instance-wise minimax-optimal algorithms for logistic bandits. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 3691–3699. PMLR, 2021.
[iii] Whitehouse, Justin, Aaditya Ramdas, and Steven Z. Wu. "On the sublinear regret of GP-UCB." Advances in Neural Information Processing Systems 36 (2023): 35266-35276.
[iv] Louis Faury, Marc Abeille, Clément Calauzènes, and Olivier Fercoq. Improved optimistic algorithms for logistic bandits. In International Conference on Machine Learning (ICML), pages 3052–3060. PMLR, 2020.
In this paper, the regret minimization problem is studied for Generalized Kernelized Bandits (GKBs) to optimize an unknown function that belongs to a Reproducing Kernel Hilbert Space (RKHS), where the noise is generated from an exponential family source. The proposed model is an extension of Kernelized Bandits (KBs) and Generalized Linear Bandits (GLBs). An Upper Confidence Bound (UCB) based algorithm is proposed. As tight regret guarantees cannot be obtained using existing self-normalized concentration inequalities, the authors have proposed a novel Bernstein-like dimension-free inequality and claimed that it is a contribution of independent interest. An upper bound on the regret is obtained as a function of the learning horizon, the maximal information gain, and the magnitude of reward nonlinearity. This bound matches the state-of-the-art bounds for KBs and GLBs up to multiplicative constants and logarithmic terms.
Strengths and Weaknesses
The paper is well-written and technically solid. However, more intuition needs to be provided to understand the impact of the results better. Moreover, in some places, it is difficult to follow due to the deferral of some discussions towards the end and insufficient details regarding a few important concepts. No simulation results are presented.
Questions
- Assumption 3.3 assumes bounded noise, which may not hold in practical scenarios, e.g., Gaussian noise. The discussion following Assumption 3.3 needs to be elaborated further for more insight.
- Lemma 5.1 reveals that there is a dependence on the maximum slope of the inverse link function and no dependence on the minimum slope. Please discuss why the former is the better thing to happen.
- In deriving Theorem 6.2, it is stated that the dependence on the dimension $d$ is removed by replacing it with the norm of the feature map, which is bounded under Assumption 3.1. Why can't we directly assume that $d$ is finite? Would that be a stronger assumption than Assumption 3.1?
- The authors are advised to add a small discussion on the heteroscedastic characteristics of the noise for better readability.
- For algorithmic and analysis purposes, two different versions of the confidence radius are used. Moreover, another worst-case version is also introduced. This seems a little confusing. Please clarify what purpose these different versions serve.
- The proposed algorithm can be compared using simulations with suitably adapted versions of the existing algorithms for KBs and GLBs.
Limitations
Yes
Final Justification
The authors have addressed the concerns raised by me. Regarding the over-claim pointed out by reviewer nkzS, I agree that the contributions of the paper need to be stated more clearly, without any ambiguity. As stated by reviewer xX8K, I also believe the authors have addressed the main concerns during the rebuttal. I am happy to raise my score to 5.
Formatting Concerns
Not applicable
We thank the Reviewer for the feedback and for appreciating the technical solidity of the paper. In the final version, we will make use of the additional page to introduce the discussion earlier and provide more details on the important concepts. Below, we provide answers to the Reviewer's questions.
- We believe that this assumption is quite mild. As we discuss immediately below the assumption, for $\sigma^2$-sub-Gaussian noise (which includes Gaussian noise), we can always set the bound to $\sigma\sqrt{2\log(2T/\delta)}$, which guarantees that it holds uniformly for all $t \in [T]$ with probability at least $1 - \delta$. This follows from a union bound argument:

$$\mathbb{P}\left(\exists t \in [T] : |\epsilon_t| > \sigma\sqrt{2\log(2T/\delta)}\right) \le \sum_{t=1}^{T} \mathbb{P}\left(|\epsilon_t| > \sigma\sqrt{2\log(2T/\delta)}\right) \le T \cdot 2\exp\left(-\frac{2\sigma^2\log(2T/\delta)}{2\sigma^2}\right) = \delta,$$

where the first inequality follows from the union bound, and the second one from Hoeffding's inequality for sub-Gaussian random variables (a numerical sanity check is also sketched after this list). We will include this elaboration in the final version.
- This is desirable because many link functions used in practice, such as the sigmoidal one, have a bounded maximum slope ($1/4$ for the sigmoid and $1$ for the identity function), while their minimum slope can approach zero (as is the case for the sigmoid function). Therefore, a dependence on the inverse of the minimum slope, as found in early works on generalized linear bandits [e.g., A], is undesirable.
- In a kernel space, the dimensionality $d$ of the feature map induced by the kernel can be infinite. Thus, assuming $d$ to be finite would reduce the setting to the already studied generalized linear bandit case. Moreover, even under the assumption of finite $d$, it would not be possible to remove Assumption 3.1, which, in the generalized linear bandit case, reduces to requiring a bounded norm of the parameter [see B, Assumptions 1 and 2].
- The heteroscedasticity of the noise is directly related to the noise model, which is based on a canonical exponential family. Under this model, the variance of the noise is proportional to the slope of the link function, i.e., $\dot{\mu}(f(x))$. We will emphasize this point in the final version.
- We introduce two versions of the confidence radius, which differ in the use of the information gain: the (random, execution-dependent) information gain versus the (deterministic) maximal information gain. Consequently, the former is a random variable and the latter is a deterministic quantity that upper bounds it. Moreover, we consider their worst-case versions over the choice of the function $f$. These quantities satisfy a corresponding chain of inequalities, the worst-case deterministic radius being the largest.
In principle, we could use the largest one from the beginning. However, using the former ones allows us to obtain tighter bounds, which improve the theoretical guarantees and may also lead to practical improvements. We will include a clarification on that in the paper.
- Although we believe that the core contributions of the paper are mainly theoretical and algorithmic, we agree that some numerical simulations comparing the two versions of our algorithm, as well as comparisons with existing ones for KBs and GLBs, would be of interest. We are running these simulations, and we will include them in the final version of the paper.
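The numerical sanity check mentioned in the first bullet, as a minimal sketch (illustrative values only):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, T, delta = 1.0, 1000, 0.05
M = sigma * np.sqrt(2 * np.log(2 * T / delta))   # uniform bound from the union bound

# Fraction of runs in which some |eps_t| exceeds M; at most delta by the argument above.
runs = rng.normal(0.0, sigma, size=(10_000, T))
print((np.abs(runs).max(axis=1) > M).mean())     # empirically well below delta
```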
References
[A] Jun, Kwang-Sung, et al. "Scalable generalized linear bandits: Online computation and hashing." Advances in Neural Information Processing Systems 30 (2017).
[B] Abeille, Marc, Louis Faury, and Clément Calauzènes. "Instance-wise minimax-optimal algorithms for logistic bandits." International Conference on Artificial Intelligence and Statistics. PMLR, 2021.
I have read the responses provided by the authors and the comments by the reviewers. The authors have addressed the concerns raised by me. Regarding the over-claim pointed out by reviewer nkzS, I agree that the contributions of the paper need to be stated more clearly, without any ambiguity. As stated by reviewer xX8K, I also believe the authors have addressed the main concerns during the rebuttal. Numerical simulations conducted by the authors should be included in the final version.
Thank you for the feedback. We are happy to have addressed your concerns. In the final version, we commit to fixing the statements about the contributions to avoid any ambiguity, and we will include the numerical simulation.
As requested by Reviewers 37bp and xX8K, during the rebuttal/discussion period, we conducted a preliminary experiment. Given the short time available, we are only able to provide an initial experiment comparing:
- our GKB-UCB algorithm, in its efficient implementation version described in Appendix A (due to the limited timeframe, we are unable to include a comparison with the version of the algorithm provided in the main paper);
- the IGP-UCB algorithm from Chowdhury, Sayak Ray, and Aditya Gopalan. "On kernelized multi-armed bandits." International Conference on Machine Learning. PMLR, 2017, designed for kernelized bandits;
- the OFULog-r algorithm from Abeille, Marc, Louis Faury, and Clément Calauzènes. "Instance-wise minimax-optimal algorithms for logistic bandits." International Conference on Artificial Intelligence and Statistics. PMLR, 2021, designed for generalized linear bandits.
We considered the setting of generalized kernelized bandits with the following configuration:
- decision space: a compact one-dimensional interval;
- Gaussian kernel: $k(x, y) = \exp\left(-\frac{(x - y)^2}{2\ell^2}\right)$ (hence $k(x, x) = 1$);
- sigmoidal link function: $\mu(z) = (1 + e^{-z})^{-1}$ with Bernoulli observations;
- true function: a function with a global maximum close to 1 in value, to emphasize the effects of the sigmoidal link function, along with a local maximum of similar value.
To run the algorithms, we discretized the decision space using a regular grid of 11 points. We set the same confidence level and regularization (the latter required by IGP-UCB) for all methods, and tested different time horizons: $T \in \{10, 20, 50, 100, 200, 500\}$.
For IGP-UCB, which requires the sub-Gaussian proxy $R$, we set $R = 1/2$, as appropriate for Bernoulli random variables. Since OFULog-r requires a finite-dimensional feature representation of the decisions, we used a Taylor expansion of the Gaussian kernel up to order 10, resulting in the features $\phi_i(x) = e^{-x^2/(2\ell^2)}\, x^i / (\ell^i \sqrt{i!})$, for $i = 0, \dots, 10$.
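For concreteness, a minimal sketch of this environment; the bandwidth $\ell$, the interval $[0, 1]$, and the function `f` below are illustrative placeholders, not the exact values used in our experiment:

```python
import math
import numpy as np

rng = np.random.default_rng(0)
ell = 0.5                                        # kernel bandwidth (placeholder)
X = np.linspace(0.0, 1.0, 11)                    # regular grid of 11 decisions

def k(x, y):                                     # Gaussian kernel, k(x, x) = 1
    return np.exp(-(x - y) ** 2 / (2 * ell ** 2))

def mu(z):                                       # sigmoidal link function
    return 1.0 / (1.0 + np.exp(-z))

def f(x):                                        # placeholder true function
    return np.sin(3.0 * x) + 0.5 * np.sin(9.0 * x)

def pull(x):                                     # Bernoulli observation with mean mu(f(x))
    return rng.binomial(1, mu(f(x)))

def phi(x, order=10):                            # Taylor features of the Gaussian kernel
    return np.array([np.exp(-x ** 2 / (2 * ell ** 2)) * x ** i
                     / (ell ** i * math.sqrt(math.factorial(i)))
                     for i in range(order + 1)])

# Sanity check: the truncated feature inner product approximates the kernel.
print(np.isclose(phi(0.3) @ phi(0.7), k(0.3, 0.7), atol=1e-6))
```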
The cumulative regret for the three algorithms is reported in the table below (as we cannot provide external links in compliance with NeurIPS policy). Each algorithm was run over 5 different seeds, and each cell reports the mean ± standard deviation.
| Method | T = 10 | T = 20 | T = 50 | T = 100 | T = 200 | T = 500 |
|---|---|---|---|---|---|---|
| GKB-UCB | ||||||
| OFULog-r | ||||||
| IGP-UCB |
We observe that GKB-UCB consistently achieves lower cumulative regret on average than the baselines for all tested horizons $T$. Moreover, the regret exhibits a sublinear trend, as expected.
We are committed to integrating additional experiments in the final manuscript, and will also provide the code for the efficient implementation (using cvxopt), along with the code used for the baseline methods.
Thanks for including the experiments! Minor comments (for camera-ready revision mostly):
- OFULog-r (Abeille et al., 2021) is for logistic bandits, not generalized linear bandits
- Can the authors consider the current SOTA logistic bandit algorithm, OFUGLB, due to Lee et al. (2024)? Their algorithm, I believe, is applicable to any generalized linear bandits as well. If the time allows for the camera-ready revision, {linear, logistic, Poisson} bandit experiments would further strengthen the paper.
- Although it is true that the pseudocode in Appendix A is more efficient than Algorithm 1, it still requires solving the MLE at each iteration. I believe a more suitable terminology is "tractable implementation." The authors should carefully include discussions of space/time complexity. Some relevant references: [1,2,3].
[1] https://arxiv.org/abs/2507.11847
Thank you for the feedback and for your support!
- Yes, absolutely. In the rush, we wrote "generalized linear bandits," but we actually ran it with the sigmoid link function, that is, on a logistic bandit.
- Yes, for the camera-ready version, we will run Lee et al. (2024) on the three suggested scenarios: linear, logistic, and Poisson.
- We agree that "tractable" better describes the version of the algorithm reported in Appendix A. We will replace "efficient" with "tractable" in the camera-ready version. We will also compute the time and space complexity of this version, in line with what is done in the referenced works; a sketch of the per-round MLE step is given below.
This paper studies the problem of the generalized kernelized bandit, which combines the benefits of GLM and kernel bandits. It first aims to give a self-normalized concentration of the Bernstein style rather than the Hoeffding style, and then leverages it to give regret bounds. The reviewers are divided. After reading the reviews and the paper carefully myself, I think the current version is not ready for publication, for the following reasons.
- As pointed out by the reviewer, the authors are sloppy in some parts of the proof, especially when handling infinite dimensions, which seems to be the only new thing here. I would recommend that the authors read the proof of Corollary 3.5 of Yasin Abbasi-Yadkori's PhD thesis (of course, the authors also need to cite it).
- The authors seem to overclaim several contributions. I am also not quite comfortable with the claim about Theorem 6.1. A similar-style result appeared in the literature a long time ago; see Lemma 3 in [R1], although it is with log rather than log log (I think the key idea is already there).
In sum, I think the authors need to do a better job when presenting their proof and contributions by carefully reviewing the literature and giving proper credit to prior work.
[R1] Rakhlin, Alexander, Ohad Shamir, and Karthik Sridharan. "Making gradient descent optimal for strongly convex stochastic optimization." arXiv preprint arXiv:1109.5647 (2011).