7.8

/10

Spotlight5 位审稿人

最低4最高5标准差0.4

3.0

置信度

创新性3.4

质量3.2

清晰度2.8

重要性3.2

NeurIPS 2025

Exploration via Feature Perturbation in Contextual Bandits

Seouh-won Yi,Min-hwan Oh

OpenReview PDF

提交: 2025-05-10更新: 2025-10-29

TL;DR

We study the feature-perturbing exploration method applicable to various bandit settings, and prove that our randomized method achieves optimal regret guarantee.

摘要

We propose *feature perturbation*, a simple yet effective exploration strategy for contextual bandits that injects randomness directly into feature inputs, instead of randomizing unknown parameters or adding noise to rewards. Remarkably, this algorithm achieves $\widetilde{\mathcal{O}}(d\sqrt{T})$ worst-case regret bound for generalized linear contextual bandits, while avoiding the $\widetilde{\mathcal{O}}(d^{3/2}\sqrt{T})$ regret typical of existing randomized bandit algorithms. Because our algorithm eschews parameter sampling, it is both computationally efficient and naturally extends to non-parametric or neural network models. We verify these advantages through empirical evaluations, demonstrating that feature perturbation not only surpasses existing methods but also unifies strong practical performance with the near-optimal regret guarantees.

关键词

Generalized Linear BanditsContextual BanditsThompson SamplingFeature Perturbation

评审与讨论

审稿意见

评分: 5置信度: 32025-06-28

The paper studies a contextual bandit algorithm adapted to Gaussian Linear Models (i.e. the density of the reward given context $x$ and parameter of interest $\theta^\star$ has a density exp $(r x^T \theta^\star - g(x^T \theta*^\star) + h(r))$ and a mean being the derivative $\mu$ of the link function $g$ ). The novelty of the proposed method is to add randomness in the features instead of adding it to the reward. The contribution of the paper is to achieve the same regret bound as optimism-based algorithms ( $O(d\sqrt{T}), d$ being the dimension of the context space) while relying on a randomized-like approach and similar practical performance (better in practice). Hence, the paper claims to bridge the gap between theoretical and practical performances of these two kinds of approaches, "By shifting the focus of exploration from parameter space to feature space", line 33.

优缺点分析

I would summarize the strengths of the paper as follows:

the paper is clear and goes straight to the point, which is pleasant.
many experimental setups are provided

However, I believe that the following point represent weaknesses of the paper:

the idea of randomizing the features is nice but it does not outperform the state-of-the-art guarantees for contextual bandits and does not induce major technical challenges, see below.
the theoretical work in the proofs is mostly a combination of pieces of proofs that already exist, typically, almost all the elements in appendix "G - Proof supplement and Lemmas" come from existing results. Once they are given, combining them in the proof of Theorem 1 does not appear extremely challenging.
all together, the contributions of the paper seem quite incremental to me.

问题

The authors suggest that it can happen to work with overparametrized bandits, is it actually the case in practice?

It might be a stupid question but since $\mu$ is Lipschitz, is there a form of equivalence (or maybe inequalities) between randomly perturbing $x_{ti}$ and randomly perturbing the reward $\mu(x_{ti}\hat{\theta}_t)$ ?

局限性

It is more a suggestion but I have the feeling that section 4.2 is very important to understand the method (and interesting in itself): I think that it could be even more detailed.

Part 6 of the main text is interesting (and almost represents the most important part of the paper to me) but I believe that it could be made clearer (for instance, elements from B.2. could be added there).

最终评判理由

The detailed authors` answers to my review raised my concerns and convinced me about the paper. This is why I increased my rating accordingly and I support its acceptance.

格式问题

No.

作者回复

2025-07-31

Thank you for taking the time to review our paper and for your thoughtful and valuable feedback, particularly your recognition of the paper's clarity and the breadth of our experimental setups. We appreciate the opportunity to clarify the significance and novelty of our contributions.

[W1-W3] We would like to emphasize that successfully adapting existing theoretical tools has long been central to many advances in the (generalized) linear bandit literature [2, 3, 13, 14, 20, 24, 25, 27, 30]. For example, Thompson Sampling [4] has inspired reinterpretations (e.g., Abeille et al. [2]) and practical variants such as PHE [19] and RandUCB [30]. While these methods share core analytical components, they were recognized not just for introducing entirely new analytical techniqes, but for effectively adapting established tools to either analyze novel exploration strategies, achieve new bounds, or provide new interpretation.

Our key contribution lies in introducing a fundamentally different form of randomization—feature perturbation—which departs from prior approaches based on parameter or reward perturbation. This new source of randomness enables tighter regret analysis in the GLM bandit setting. Adapting technical tools to control the stochastic dependencies induced by feature perturbations and integrating them with estimation error terms in a weighted Gram matrix framework are highly nontrivial. We sincerely believe that they should not be treated as just a simple combination.

To contextualize, Table 1 summarizes representative GLM bandit algorithms. Italicized methods are randomized but exhibit suboptimal regret rates or fail to capture problem-dependent constants such as $\kappa_*$ . In contrast, all OFU-based methods rely on a shared analytical backbone modified for each algorithm’s specifics.

Algorithm	Regret Upper Bound	Note
GLM-UCB [14]	$\tilde{\mathcal{O}}(d\sqrt{T}/\kappa)$
LinTS [2]	$\tilde{\mathcal{O}}(d^{3/2}\sqrt{T}/\kappa)$
SupCB-GLM [27]	$\tilde{\mathcal{O}}(\sqrt{dT\log K}/\kappa)$	$\sqrt{d\log K}$ dependency under finite arms
Logistic-UCB-2 [13]	$\tilde{\mathcal{O}}(d\sqrt{T}+d^2/\kappa)$
GLM-TSL, GLM-FPL [20]	$\tilde{\mathcal{O}}(d^{3/2}\sqrt{T}/\kappa)$
RandUCB [30]	$\tilde{\mathcal{O}}(d\sqrt{T}/\kappa)$
SupLogistic [a]	$\tilde{\mathcal{O}}(\sqrt{dT\log K}+d^3/\kappa^2)$	$\sqrt{d\log K}$ dependency under finite arms
OFULog, OFULog-r [3]	$\tilde{\mathcal{O}}(d\sqrt{\kappa_\star T}+d^2/\kappa)$	Mitigates $\kappa$ -dependence by constructing tighter confidence intervals
ada-OFU-ECOLog [b]	$\tilde{\mathcal{O}}(d\sqrt{\kappa_* T}+d^2/\kappa)$	Improves computational efficiency
OFULog+ [24]	$\tilde{\mathcal{O}}(d\sqrt{\kappa_* T}+d^2/\kappa)$	Reduces the dependence on $S$ in the confidence bound
OFUGLB [25]	$\tilde{\mathcal{O}}(d\sqrt{\kappa_* T}+d^2/\kappa)$	Extends the MNL setting in [c] to general GLMs
GLM-FP (Ours)	$\mathbf{\tilde{\mathcal{O}}(d\sqrt{\kappa_*T}+d^2/\kappa)}$	*First randomized algorithm to prove $O(d)$ dependence in the leading term benefiting from $\kappa_$**

[a] Improved Confidence Bounds for the Linear Logistic Model and Applications to Linear Bandits, Jun et al., 2021
[b] Jointly Efficient and Optimal Algorithms for Logistic Bandits, Faury et al., 2022

Our algorithm, GLM-FP, is the first randomized method in the GLM setting to provably achieve the optimal $\tilde{\mathcal{O}}(d\sqrt{T})$ regret while adapting to the problem-dependent constant $\kappa_*$ . Existing randomized methods either suffer from worse dimension dependence (e.g., GLM-TSL) or do not capture $\kappa_*$ (e.g., RandUCB). This closes a long-standing gap between the empirical success of randomized approaches and the strong theoretical guarantees previously exclusive to OFU-based algorithms. The key enabler is our novel randomization via feature perturbation instead of parameter or reward perturbation, allowing sharper analysis and tighter regret bounds.

Our analysis confronts the unique challenges posed by feature perturbations, which induce different stochastic dependencies compared to parameter sampling methods like TS. We develop new techniques to control correlations across arms and to ensure a valid regret decomposition separating estimation error ( $Reg_{EST}$ ) and feature perturbation effects ( $Reg_{FP}$ ). Perturbing the input features before the link function with controlled magnitude, along with the use of a weighted Gram matrix to achieve $\kappa_*$ -dependent bounds, is novel both in concept and proof strategy.

Practically, our approach offers flexibility and scalability advantages. TS-based methods often struggle in high-dimensional parameter regimes, such as neural function classes where parameter dimension $p$ far exceeds feature dimension $d$ . Our feature perturbation directly randomizes the low-dimensional input, making it well-suited for such settings. This is supported by empirical results where GLM-FP consistently outperforms OFU-, TS-, and PHE-based baselines across various scenarios.

In summary, our work builds on existing foundations (and so do many seminal works in this field!) but makes a distinct and substantial contribution by introducing a novel randomized exploration scheme with the first optimal regret guarantee for randomized GLM bandits. We believe this represents a meaningful step beyond prior work, both in theory and practice.

[Q1] Our method is effective for overparameterized bandits in practice, both conceptually and empirically.

Conceptually, our Feature Perturbation (FP) approach avoids the key challenge in overparameterized settings. Existing methods that perturb the parameter space often become computationally infeasible as model size grows. Even practical baselines like NeuralTS add noise only to the final rewards rather than perturbing the model itself, effectively functioning as PHE. In contrast, our FP method drives exploration directly in the low-dimensional feature space, making it inherently scalable and stable regardless of model size.

Empirically, this advantage is supported by strong experimental results. In Section 7.2, we demonstrate the practical effectiveness of our approach using neural networks. As shown in Figure 3 (bottom) and Figures 9–10 in the appendix, our DeepFP algorithm consistently outperforms standard neural bandit baselines on several real-world benchmark datasets.

To further validate this, we conducted additional experiments on the MNIST dataset ( $d=784$ ) using four neural network architectures of varying size and type. The models (with $p$ : number of parameters) are:

MLP-100 (M1): A two-layer MLP with hidden dimension 100 ( $p=88{,}701$ )
CNN-Small (M2): A small CNN with two convolutional layers (32 and 64 filters) and a two-layer fully connected head, as described in Appendix H.2 with learning rate 0.001 ( $p=423{,}073$ )
CNN-Large (M3): A deeper CNN with four convolutional layers (64 to 256 filters) and a three-layer fully connected head with dropout ( $p=7{,}455{,}169$ )

The results from 10 independent runs are summarized below. Here, [F] refers to the Fashion-MNIST dataset, and "Time" denotes the runtime for the M3[F] experiment.

Algorithm	M1	M2	M3	M2[F]	M3[F]	Time(sec)
$\varepsilon$ -greedy	$3040\pm21.8$	$\mathbf{2271\pm49.7}$	$1627\pm14.2$	$6133\pm30.9$	$2947\pm14.4$	$1044$
DeepFPL	$2432\pm30.4$	$7174\pm716.6$	$1601\pm65.8$	$8150\pm1559.1$	$2643\pm36.3$	$965$
NeuralUCB	$3047\pm210.6$	$3100\pm78.3$	$8358\pm246.2$	$8820\pm670.5$	$4758\pm132.9$	$4446$
NeuralTS	$3335\pm53.3$	$6086\pm376.1$	$8923\pm60.7$	$13205\pm1107.4$	$3895\pm73.3$	$4387$
DeepFP(ours)	$\mathbf{2426\pm10.6}$	$3713\pm565.2$	$\mathbf{1513\pm38.1}$	$\mathbf{5578\pm33.3}$	$\mathbf{2634\pm24.8}$	$981$

These results confirm the effectiveness of our algorithm in highly overparameterized regimes ( $d \ll p$ ), where feature-space exploration remains both scalable and robust across model sizes.

[Q2] In short, the answer is generally no — perturbing the input $x_{ti}$ and perturbing the reward $\mu(x\_{ti}^\top\hat{\theta}\_t)$ are not equivalent, even if $\mu$ is Lipschitz. While the Lipschitz continuity of $\mu$ does imply the inequality $|\mu(x^\top\hat{\theta})-\mu((x+\zeta)^\top\hat{\theta})|\leq\mathcal{L}\_\mu|\zeta^\top\hat{\theta}|$ , this only gives an upper bound; it does not imply equivalence between the two types of perturbations. In fact, the two approaches introduce fundamentally different exploration behaviors and uncertainty structures.

we note that relying on this inequality effectively linearizes the GLM model, reducing it to an approximate linear bandit problem. As discussed in our paper, such linearization sacrifices the curvature of $\mu$ , and leads to weaker regret bounds—specifically, a $1/\kappa$ dependence in the leading term, which our algorithm avoids. Indeed, existing randomized methods such as RandUCB and PHE fall into this category: they perturb rewards directly or equivalently rely on a linearized analysis. In contrast, our algorithm perturbs the input features, and our analysis preserves the curvature of the link function, which is critical for achieving the tighter, instance-dependent $\kappa_*$ regret bound.

As a result, perturbing $x$ before applying $\mu$ (i.e., $\mu((x+\zeta)^\top\hat{\theta})$ ) is not equivalent to applying $\mu$ first and then perturbing its output. The resulting exploration behavior is fundamentally different, both in distributional structure and in the way uncertainty is introduced. We discuss this nuance in Appendix C, which helps explain our algorithm’s strong guarantees and performance and this difference underpins the novelty and strength of our approach.

[L] We will do our best to make appropriate modification in presentation to improve our presentation in the revision.

2025-08-04

I am thankful to the authors for their clarifying and detailed answers to my review. It raises most of my concerns and I increased my rating accordingly.

2025-08-04

We sincerely appreciate your prompt reply and the time you invested in reviewing our detailed rebuttal. We are delighted that our theoretical contributions were acknowledged, and we thank you for updating your score. We will make the necessary revisions to further enhance the completeness of the paper.

审稿意见

评分: 5置信度: 22025-07-02

This paper proposes feature perturbation, a novel and simple technique that injects noise in the feature space. Contrary to known randomised bandit algorithms, this new algorithm achieves the worst case regret bound for generalised linear bandits. The paper also tries to provide the intuition behind the superior regret of the proposed algorithm compared to TS-like methods. The proposed approach is straightforward to apply, even for deep neural networks, and shows strong empirical performance compared to existing methods.

优缺点分析

Strengths:

The paper is well written and accessible.
The proposed method closes a known gap between OFU based methods and TS-like method for GLMs.
The method is simple and even applies to overparametrised models.
The method shows good empirical performance.

Weaknesses:

In the linear setting, feature perturbation is formally equivalent to TS with a well chosen noise. The paper tries to give intuition why FP has a lower regret than TS in this case but this needs to be clarified more. See questions please.

问题

In the linear case, the algorithms of TS and FP are equivalent with an appropriate choice of noise.

In the discussion, the authors state that " To achieve this (TS regret), a union bound over d coordinates is typically applied, introducing an additional $\sqrt{d}$ factor into the regret analysis."

What do the authors mean here? Is $\sqrt{d}$ an artefact of the proof?
In the linear case, what are the main algorithmic differences between TS and FP? If the algorithm is the same, why the regret is different?

These points should be clarified more in the paper, and the emphasis should be on why we are getting an improved regret.

In addition, a relatively recent paper [1] discusses conditions where TS can achieve a regret with optimal dependence on $d$ . A discussion of these results and comparison with the obtained guarantees for FP should be included.

Clarifying these points will increase my confidence.

[1] When and why randomised exploration works (in linear bandits). Marc Abeille, David Janz, Ciara Pike-Burke

局限性

Yes.

最终评判理由

This is a great paper. It is well-written and solves an interesting problem using a simple yet clever idea. The authors' rebuttal successfully clarified some points and added a discussion of Abeille et al. [1]. I support its acceptance.

格式问题

N/A

作者回复

2025-07-31

Thank you for your thoughtful feedback. We appreciate the opportunity to elaborate on the differences between Feature Perturbation (FP) and Thompson Sampling (TS), particularly in the linear bandit setting, and to clarify the source of the improved regret bound.

We appreciate your insightful question, which gives us a valuable opportunity to clarify a nuanced point. While the marginal distribution of the score for any individual arm is the same in both TS and FP, the algorithms are not equivalent. The key difference lies in the joint distribution across all arms in each step $t$ , which is determined by how the shared randomness couples their scores. This coupling structure is what fundamentally drives the improvement in regret, as we discuss in Section 6 of the paper and further below.

Although both TS and FP generate score functions that follow the same marginal distribution, $\tilde{f}\_t(x\_i)\sim\mathcal{N}(x\_{ti}^\top\hat{\theta}\_t,c\_t^2\|x\_{ti}\|\_{V\_t^{-1}}^2)$ , the mechanisms behind this sampling differ. In TS, the score is obtained via $\tilde{f}\_t(x\_i)=x\_{ti}^\top\tilde{\theta}\_t=x\_{ti}^\top\hat{\theta}\_t+c\_t\cdot x\_{ti}^\top V\_t^{-1/2}\zeta\_t$ , while in FP, it is computed as $\tilde{f}\_t(x\_i)=\tilde{x}\_{ti}^\top\hat{\theta}\_t=x\_{ti}^\top\hat{\theta}\_t+c\_t\cdot z\_t\|x\_{ti}\|\_{V\_t^{-1}}$ . Although both methods result in the same marginal distribution for each arm, the coupling structure induced by shared randomness (i.e., a single $\zeta_t$ used across all arms) leads to different joint distributions over scores ${\tilde{f}_t(x_i)}$ , resulting in algorithmic differences that impact regret.

The Source of the $\sqrt{d}$ Factor: Algorithmic and Analytical Differences The difference in regret stems from how each algorithm's structure affects the concentration bounds required by the analysis. The exploration bonus for each algorithm can be expressed as:

(\text{TS})\quad\quad |x_{ti}^\top(\tilde{\theta}_t-\hat{\theta}_t)|_2&\leq |x_{ti}^\top(\beta_t(\delta')\cdot V_t^{-1/2}\zeta_t)|\leq\beta_t(\delta')\cdot|x_{ti}|_{V_t^{-1}}|V_t^{-1/2}\zeta_t|_{V_t}\\ &=\beta_t(\delta')\cdot|x_{ti}|_{V_t^{-1}}\cdot|\zeta_t|_2,\\ (\text{FP})\quad |(\tilde{x}_{ti}-x_{ti})^\top\hat{\theta}_t|_2&\leq\left|\beta_t(\delta')\cdot|x_{ti}|_{V_t^{-1}}\frac{\hat{\theta}_t^\top\zeta_t}{|\hat{\theta}_t|_2}\right|\\ &\leq\beta_t(\delta')\cdot|x_{ti}|_{V_t^{-1}}\cdot(u^\top\zeta_t),

where $\zeta_t \sim \mathcal{N}(0, I_d)$ , $\beta_t(\delta') = \mathcal{O}(\sqrt{d})$ , and $u = \hat{\theta}_t / \|\hat{\theta}_t\|_2$ . The key difference is that the bonus for TS scales with the norm of a $d$ -dimensional random vector ( $\|\zeta_t\|_2$ ), while the bonus for FP scales with a 1-dimensional random projection ( $|u^\top\zeta_t|$ ).

To guarantee a high-probability bound for the TS bonus, the analysis must control the $\|\zeta_t\|_2$ term with high probability $1-\delta'$ , typically requiring a union bound across $d$ coordinates:

\mathbb{P}\left(|\zeta_t|_2\leq\sqrt{2d\log\frac{2d}{\delta'}}\right) \geq\mathbb{P}\left(\forall1\leq i\leq d,~|(\zeta_t)_i|\leq\sqrt{2\log\frac{2d}{\delta'}}\right) \geq 1-d\mathbb{P}\left(|(\zeta_t)_i|\geq\sqrt{2\log\frac{2d}{\delta'}}\right)

(With $\mathbb{P}((\zeta_t)_i\geq\alpha)\leq2 e^{-\alpha^2/2}=\delta'/d$ by the choice of $\alpha=\sqrt{2\log(2d/\delta')}$ .)

As a result, this introduces an extra $\sqrt{d}$ factor in the effective bonus beyond the $\beta_t(\delta') = \mathcal{O}(\sqrt{d})$ scaling. As discussed by Hamidi and Bayati [16] (and cited in our paper), this additional $\sqrt{d}$ is inherent in the worst-case analysis of standard TS. In contrast, FP only requires controlling a 1-dimensional Gaussian random variable ( $u^\top \zeta_t$ ), avoiding the need for such a union bound. This tighter control leads to a smaller bonus term and ultimately a tighter regret bound. Thus, despite having the same marginal distributions for individual arms, the different coupling structures of TS and FP result in different high-probability regret guarantees.

Comparison with Abeille et al. [1] Thank you for pointing us to this relevant work. We have reviewed the paper by Abeille et al. [1]. Their work shows that the standard TS algorithm can indeed achieve the optimal $\tilde{\mathcal{O}}(d\sqrt{T})$ regret, but this requires the action set to satisfy strong geometric conditions (e.g., being an "absorbing set," strongly convex, and smooth).

The authors of [1] rightly note that their conditions do not hold for the worst-case instances identified by Hamidi and Bayati [16]. While their assumptions may be met in certain synthetic settings (e.g., $\ell_p$ balls), they may not hold for more general, arbitrary action sets encountered in practice.

Our contribution is complementary. We show that the FP algorithm achieves the same optimal $\tilde{\mathcal{O}}(d\sqrt{T})$ regret under much milder assumptions: we require only that the action set is bounded, a standard and broadly applicable condition used in many TS papers (e.g., [2, 4]). Thus, our theoretical guarantees apply to a wider class of problems without relying on specific geometric structure.

Thank you again for highlighting this recent result. We will happily include a discussion of this comparison in the final version of our paper.

2025-08-01

Thank you for your detailed answer. I'm expecting the revised paper to:

Detail the source of the $\sqrt{d}$ factor in the main body, and to take the time to explain the differences like in the answer provided.
Include the discussion of Abeille et al. [1].

Great work. It is a really good paper overall and I'm supporting its acceptance.

For neural networks, similar ideas were explored in practice, only perturbing the latent representation in neural networks instead of the raw features to define policies. For example see:

[2] Sakhi et al. Fast Slate Policy Optimization: Going Beyond Plackett-Luce. TMLR

This can be relevant to motivate the core analysis.

2025-08-01

Thank you very much for your encouraging and constructive comments. We appreciate your suggestions and confirm that we will clarify Section 6 and add a discussion of [1], as recommended. Thank you also for the pointer to [2]—it helps contextualize our theoretical results. We are grateful for your support and look forward to incorporating these improvements in the final version.

审稿意见

评分: 5置信度: 42025-07-03

This paper studies the problem of regret minimization for contextual bandits via feature perturbation. Contrarily to usual sampling method that perturb the parameter space, they show that by directly injecting randomness in the feature space, they can recover a $O(d\sqrt{T})$ bound for generalized linear models, matching the performances of UCB-like algorithms and improving on existing sampling-based algorithms. They provide empirical evidence of their theoretical results as well as empirical evidence that their method can be applied beyond the setting of generalized linear models.

优缺点分析

The paper is very well written and clear. The setting, assumptions and results are well stated and clearly demonstrated both in the proofs and the empirical demonstration. The setting considered is of great importance in practice. Moreover some care is given to explaining how their proposed methods overcomes the limitation of Thompson Sampling to get the extra $\sqrt{d}$ factor. While their method matches the theoretical guarantees of UCB-based methods for linear bandits, they show that their algorithm adapts to the curvature of the space leading to tighter guarantees for Generalized-Linear-Bandits. On the empirical side, one of the main appeal is that their method is directly applicable to very general function classes in a principled and scalable way.

The main limitation of the theoretical results is that they are limited to the Generalized-Linear-Settings. However, this is a very common setting for this kind of theoretical guarantees and their experimental guarantees extend beyond this setting. It would be extremely interesting to see if the methods and algorithm they propose can be extended beyond the settings of bandits (For Decision-Making with Side Observations for instance).

I think that this paper is a valuable contribution for our community.

问题

Do you think there is a way to extend this type of method beyond the setting of bandits(For instance for Reinforcement learning, one could perturb the feature of the state action pair to make a decision)
Do you think this type of method is limited to greedy action selection or could be suitable to more complex sampling methods such as Information Directed Sampling ?
Do you think these type of methods are limited to these generalized linear settings or could be adapted without too much effort to more general parametrized settings (For instance finite action and lipschitzness of the reward function) ?

局限性

yes

最终评判理由

The author have clarified my questions and I maintain my good opinion of the paper.

格式问题

Nothing to report.

作者回复

2025-07-31

Thank you for taking the time to review our paper and for your thoughtful and valuable feedback. We appreciate your positive recognition of our work and the constructive comments you have provided regarding the potential applicability of our methodology across various domains. Below, we address each of your comments and questions in detail:

[Q1] Extending our Feature Perturbation (FP) method to Reinforcement Learning (RL) is a promising and natural direction for future work. Similar to how randomized algorithms such as Thompson Sampling (TS) [A–E] and Perturbed-History Exploration (PHE) [F] have been successfully adapted to RL settings, our FP framework is well-suited for analogous extensions. In particular, we believe that FP can be directly applied in structured settings like Linear MDPs. For model-based algorithms, one could consider perturbing the state-action feature representation $\varphi(s,a)$ used in computing the transition dynamics.

One of the core challenges in RL is that randomness introduced by exploration at an earlier state propagates through the trajectory, affecting future Q-value estimates. Therefore, carefully designing perturbation strategies that remain effective over temporally extended decision-making will be interseting future work. Nevertheless, this approach offers a key potential advantage: compared to injecting noise into high-dimensional neural network parameters (e.g., NoisyNets [G]) or maintaining an ensemble of models [A], perturbing the lower-dimensional input feature space can be significantly more efficient and stable in practice.

[Q2] Our Feature Perturbation (FP) method is not limited to greedy action selection and is naturally compatible with advanced exploration strategies such as Information Directed Sampling (IDS).

In our work, FP generates stochastic estimates of each action’s expected reward by perturbing the input features into ${\tilde{x}_{ti}}$ , followed by a simple argmax selection. This is analogous to TS, which selects actions greedily based on perturbed parameters $\tilde{\theta}_t$ . Both methods induce exploration through randomized estimates.

Importantly, GLM-FP already embeds in a way an information-theoretic principle: the perturbation magnitude is scaled by the elliptical potential, guiding exploration toward actions with higher potential for information gain. This mechanism effectively balances exploration and exploitation.

Given this structure, a formal information-theoretic analysis of FP—similar to analyses developed for TS [H]—is a compelling direction for future work. FP can be seen as implicitly defining a belief over action values, with IDS offering a more formal decision rule that aligns with the principles already present in our design.

[Q3] Extending our method to finite action spaces is relatively straightforward. On the other hand, generalizing to broader function classes—such as those satisfying only Lipschitz continuity—poses more substantial challenges. In particular, our current regret analysis relies on structural properties of (generalized) linear models, such as the use of the weighted Gram matrix to quantify uncertainty and obtain anti-concentration guarantees. These techniques do not easily extend to general function classes without additional assumptions or new analytical tools.

That said, our method itself is not inherently limited to the generalized linear setting. A key advantage of Feature Perturbation (FP) is that it operates in the input feature space rather than the model parameter space, making it highly adaptable and model-agnostic. Exploring how FP can be combined with potential-based analyses or adapted to settings with weaker assumptions (e.g., Lipschitz continuity) is a promising direction for future work.

[A] Randomized Prior Functions for Deep Reinforcement Learning, Osband et al., 2018
[B] (More) Efficient Reinforcement Learning via Posterior Sampling, Osband et al., 2013
[C] Why is Posterior Sampling Better than Optimism for Reinforcement Learning?, Osband et al., 2017
[D] Thompson Sampling for Learning Parameterized Markov Decision Processes, Gopalan & Mannor, 2015
[E] Frequentist Regret Bounds for Randomized Least-Squares Value Iteration, Zanette et al., 2023
[F] Randomized Exploration for Reinforcement Learning with General Value Function Approximation, Ishfaq et al., 2021
[G] Noisy Networks for Exploration, Fortunato et al., 2019
[H] An Information-Theoretic Analysis of Thompson Sampling, Russo & Roy, 2015

2025-08-03

Thanks for the clarifications. I remain a bit curious about whether or not this strategy could allow the learner to sample suboptimal actions that are very informative (Something that TS struggles with and that IDS is capable to do). In any case, that is beyond the scope of the paper and I still think that the work is interesting, well presented and worth sharing with the rest of the Neurips community.

2025-08-04

Thank you for highlighting this important point. While our method doesn’t explicitly aim to sample informative suboptimal actions, we see this as a promising direction to explore in future work. We’re also grateful for your positive evaluation and for your support in advancing this line of research.

审稿意见

评分: 4置信度: 42025-07-03

Authors propose a new randomized algorithm for generalized linear contextual bandits (GLB) based on feature perturbation.
Prior randomized algorithms for linear bandits (LB) and GLB perturb model parameters instead and incur regret that scales as $d^{3/2}$ where $d$ is dimensionality of the feature space. On the other hand, the proposed algorithm's regret scales linearly with $d$ .
By using appropriately scaled gram matrices, the authors are also able to remove unwanted dependency on a non-linearity parameter, similar to other recent deterministic GLB algorithms. However, they claim this has not been achieved by randomized algorithms in the literature.
They provide theoretical analysis of their regret guarantee (outlining the assumptions made) and also demonstrate strong empirical performance in multiple settings.

优缺点分析

Strengths of the paper:

Introduces a new randomized algo paradigm for contextual bandit problems. Since it is based on feature perturbation it is advantageous to Thompson sampling (TS) in many scenarios where computing posterior distributions could be challenging.
Provides optimal regret guarantees that match the state-of-the-art deterministic algos and is a $\sqrt{d}$ factor better than TS algos.
The writing is decent. Intuition in 4.2, proof sketch in 5 and comparison to TS (for LB) in Section 6 are all helpful. Though I think 4.2 and 6 can be made more concise and crisper. Another feedback here would be to make the contributions section more compact and focused mainly on the results. E.g., bullet 3 and 4 can be merged to focus on empirical results.
A reasonable amount of empirical validation is provided covering many known baselines. However, some more recent OFU based algorithms focused on GLB seem to be missing (See below).

Weaknesses:

While the algorithm is a new randomized algorithm for GLB, essentially (in its arm selection policy) it multiplies the exploration bonus of OFU algorithms (such as LinUCB [1], Logistic-UCB-1, Logistic-UCB-2[13]) with a random scalar (also mentioned in the paper in line 256). This in my opinion makes the algorithm and its analysis pretty much like the OFU counterparts. Therefore, it can be argued while the algorithm is randomized, it's not much different from OFU algorithms. This reduces the novelty aspect of the paper (a little) in my opinion.
The main idea is essentially captured in the Linear Bandit problem. To me there does not seem to be any specific new contributions for GLB. Both the confidence bound (wrt scaled gram matrices) and the regret calculation thereof is pretty much what prior works ([13], [3], [A]) have already done to achieve (1/ $\kappa$ independence). Hence, even though this is the first randomized algorithm with regret that is independent of the non-linearity, the paper does not make novel contributions towards that. The authors can please correct me about this, if they think I am mistaken.
In the empirical comparisons, I think the authors should also compare to state-of-the-art logistic bandit algorithms such as ada-OFU-EcoLog [A] and subsequent improvements to it ([B], [C] etc). Empirically these seem to be doing really well in the logistic case compared to the OFU and TS algorithms that are among the baselines in this paper. Also, in the experiments section, the authors should explain the instances a little bit.
The paper does not discuss the limitations of feature perturbation. For eg., in large action spaces, this would amount to scanning the entire action space before selecting the action to play. The TS algorithm on the other hand can be more efficient by using Linear Optimization oracles? There is literature on deterministic algorithms which explores large action spaces using such oracles [D].
In Assumption 1, authors assume both the features and parameters are inside the unit ball. While this is fine for LB, for GLB one would typically assume the parameter space to be contained in an S-ball and the regret should not scale badly with $S$ as $exp(S)$ can be proportional to the non-linearity $1/\kappa$ . I'm assuming that such dependence does not come in the proof. It would be great if the authors can confirm.
The self-concordance assumption in Assumption 2 might not be explicitly needed. See [B].

[A] Jointly Efficient and Optimal Algorithms for Logistic Bandits, L Faury, M Abeille, KS Jun, C Calauzènes, AISTATS 2022.

[B] Generalized Linear Bandits with Limited Adaptivity, A Sawarni, N Das, S Barman, G Sinha, NeurIPS 2024.

[C] Online (Multinomial) Logistic Bandit: Improved Regret and Constant Computation Cost, Yu-Jie Zhang, Masashi Sugiyama, NeurIPS 2023.

[D] Efficient batched algorithm for contextual linear bandits with large action space via soft elimination, O Hanna, L Yang, C Fragouli, NeurIPS 2023.

问题

All my questions are in the Strengths and Weaknesses part.

局限性

Please see strengths and weaknesses

最终评判理由

The paper proposes an interesting new approach to contextual bandits. From the analysis perspective I see less novelty (the point I was trying to discuss in [W1]) and so I am keeping my score., Having said that the algorithm is definitely interesting and can be useful. I have followed up with the authors regarding [W4] in my reply to their rebuttal.

格式问题

Plots in Figure 3 are very difficult to comprehend due to their size. They should be made larger. Some immediate suggestions:

The authors can probably focus on two settings for $d$ (instead of 3).
Results for either linear or logistic can be moved to appendix or text in other sections can be reduced (For e.g. Sections 4.2, 6)

作者回复

2025-07-31

Thank you for taking the time to review our paper and for your thoughtful and valuable feedback. We appreciate your positive recognition of our work and the constructive comments you have provided regarding the organization of the paper and its comparison with related prior work.

We appreciate your suggestion regarding the clarity of our contribution section. We have consolidated the third and fourth bullet points and also plan to revise the ambiguous statements in Sections 4.2 and 6 to make them more concise and precise in conveying their intended messages.

[W1] Our algorithm is not OFU-like in nature; rather, it represents a novel class of randomized exploration based on perturbing the features, rather than the model or rewards. While it may apear to bear a resemblance to OFU in the linear setting (so do many randomized linear bandit algorithms, such as LinTS) due to structural properties of the model, the underlying mechanism and interpretation are fundamentally different.

We appreciate the opportunity to clarify this point. In the (generalized) linear bandit literature, it is indeed common for both OFU-based algorithms [1, 3, 13, 14, 23-27] and randomized methods such as Thompson Sampling to rely on the elliptical potential [2, 4, 19, 20, 30] to guide exploration. This term helps scale uncertainty and supports efficient learning, but its presence does not imply that the underlying algorithm is OFU in spirit.

Our method introduces a new type of randomness by directly perturbing the input features. In the linear case, this structure induces a behavior that coincides with randomized-scalar variant of OFU, but this is a byproduct of the linear model's geometry—not necessarily our design goal. As we discuss in Section 4.2, once the model departs from linearity (e.g., GLM, neural networks), this resemblance breaks down, and the distinct behavior of feature perturbation becomes more evident.

Therefore, apart from elliptical geometry, our method does not use optimism-based action selection as OFU. Instead, it proposes a randomized feature exploration strategy with its own theoretical justification. We hope this clarification helps position our contribution more accurately within the broader landscape of exploration algorithms.

[W2] We are more than happy to elaborate on our key contributions on randomized exploration.

In both linear and generalized linear contextual bandits, there has been a long-standing tradeoff between the tight regret guarantees of optimism-based (OFU) algorithms [1, 3, 13, 14, 23-27] and the favorable empirical performance of randomized algorithms [2, 4, 19, 20, 30].

Our work aims to bridge this longstanding gap. Specifically, we propose feature perturbation, a novel form of randomization that is fundamentally distinct from prior approaches based on parameter or reward perturbation. While the analysis of deterministic algorithms in the GLB setting is well established, extending such techniques to randomized algorithms has remained an open challenge. Our contribution lies in adapting and unifying these analytical tools—particularly those involving wighted Gram matrices—to analyze a new class of randomized algorithms.

In our approach, the regret naturally decomposes into two components:

(1) the parameter estimation error ( $\text{Reg}\_{\text{EST}}$ ), and

(2) the error introduced by randomized perturbations in the feature space ( $\text{Reg}\_{\text{FP}}$ ).

Controlling both simultaneously required new analysis to handle the interaction between stochastic perturbations of context vectors and the evolving model estimates. In particular, bounding the $\text{Reg}_{\text{FP}}$ term involved applying concentration inequalities to perturbed prediction errors (Appendix F.4, Step 2-2), and ensuring that the weighted Gram matrix could effectively control both sources of uncertainty.

Importantly, perturbing the input features before the link function with controlled magnitude, combined with the use of a weighted Gram matrix to achieve $\kappa_*$ -dependent bounds, is novel both in concept and in proof strategy. To the best of our knowledge, no prior randomized algorithm achieves this level of control in the GLB setting.

To be clear, we emphasize that our contribution lies in designing and establishing new type of exploration, and adapting the analyticial tools to a new randomized exploration strategy. As a result we propose the first randomized GLB algorithm that matches the regret guarantees of the best deterministic methods—without inverse dependence on the non-linearity parameter $\kappa$ .

[W3] Due to space constraints, we moved some experimental details to the appendix. In response to your Strength 3 comment, we will revise earlier sections to include a clearer and more detailed experimental description.

In preparing our experiments, We selected the Algorithm from Lee et al. [25] as our main OFU-style GLB baseline, based on their empirical results showing performance comparable to or better than [A] and [B]. That said, we agree that including direct comparisons strengthens the evaluation.

The table below presents results on a fixed-arm logistic bandit setting with $S=4$ , $K=100$ , and $T=3{,}000$ , averaged over 20 random instances. We include RS-GLinCB [B] and ada-OFU-EcoLog [A] as OFU-based baselines, and use OFUGLB-e (a much more efficient variant of OFUGLB) for computational feasibility. Our method shows robust performance across settings, and we believe this comparison further supports the empirical validity of our approach.

Algorithm	Regret( $d=10$ )	Regret( $d=50$ )	Time(sec)
GLM-TSL	$68.95\pm3.61$	$361.75\pm11.33$	$56$
LogPHE	$76.20\pm2.39$	$289.15\pm8.71$	$18$
RandUCBLog	$52.40\pm6.05$	$287.10\pm12.00$	$36$
RS-GLinCB	$1294.05\pm26.32$	$900.55\pm19.15$	$28$
ada-OFU-EcoLog	$506.15\pm14.86$	$756.90\pm18.56$	$232$
OFUGLB-e	$264.45\pm8.94$	$838.30\pm17.19$	$1462$
GLM-FP (ours)	$\mathbf{51.85\pm8.05}$	$\mathbf{171.10\pm8.40}$	$28$

(Time is reported for $d=50$ setting.)

[W4] Thank you for pointing out the computational aspect and for the helpful reference [D]. Indeed, [D] presents an practical solution to the large action space problem by introducing an efficient batched algorithm. This approach significantly reduces computational cost by limiting the frequency of policy updates, which is a meaningful contribution from a systems perspective.

However, as also noted in [D], this computational efficiency comes with a trade-off. Their algorithm achieves a regret of $\tilde{O}(d^{3/2}\sqrt{T})$ , which is suboptimal with respect to the dimension $d$ . This contrasts with our work, where the primary focus is on achieving the theoretically optimal $\tilde{O}(d\sqrt{T})$ regret bound for a randomized algorithm in the standard (non-batched) online setting.

In addition, regarding the point on linear optimization oracles in the linear setting, if such an oracle is available, it can efficiently solve problems of the form $x^\top \theta$ . We would like to emphasize that this structure is also compatible with our approach. Since our method perturbs the features and selects actions based on $(x+\zeta)^\top\hat{\theta}$ , it preserves the linear structure and can similarly benefit from the oracle—effectively optimizing over $x$ with a shifted parameter vector. In this sense, the use of a linear optimization oracle is not specific to Thompson Sampling and could likewise enhance the efficiency of our method in relevant settings.

We appreciate the suggestion and will incorporate this discussion into the final version of the paper.

[W5] We confirm that the dependence of our regret bound on $S$ is standard and polynomial, not exponential.

We assume the unit ball boundness purely for notational simplicity, so as to highlight our main contribution—the feature perturbation mechanism and its improvement in dimensionality dependence. However, this assumption can be relaxed without affecting the nature of our results.

If we instead assume $|\theta^*|\leq S$ , the confidence width $\beta_t$ scales with $\sqrt{S}$ . The polynomial dependence on $S$ arises from the ellipsoidal relaxation used to ensure computational efficiency. This dependence then propagates through the analysis, leading to a regret bound that scales with $\sqrt{S}$ —not with $\exp(S)$ .

[W6] Thank you — you are right. As noted in [B], self-concordance is a sufficient condition for obtaining tight regret bounds that depend on the problem-specific constant $\kappa_*$ . In fact, the instances considered in the GLB setting already satisfy this property. Our assumption of self-concordance was not essential, but rather introduced for convenience to facilitate the definition of $M_\mu$ . We will revise this part by presenting $M_\mu$ as a formal definition instead. We appreciate your suggestion, which helps to clarify and strengthen our presentation.

评论- Thanks

2025-08-06

I am satisfied with the response. Regarding [W4], my main point was that if you are perturbing the features of all actions then in very large action spaces you will end up spending lot of time doing so. The algo in [D] basically discusses approaches to not iterate over the arm set by doing soft elimination using optimization oracles. In your case (even with linear optimization oracles) you will end up spending time updating all feature vectors. In any case, I would encourage the authors to elaborate more on the points we discussed above in their paper.

2025-08-06

We sincerely appreciate your time and patience in reviewing our detailed response. We’re glad to hear that most of your concerns have been addressed, and we thank you for your thoughtful follow-up. As you pointed out, the bonus is arm-dependent, and as we detail in Section 6, Eq. (FP) of our paper, the selection rule for the linear variant is equivalent to solving: $\arg\max\_{x\_{ti}} (x\_{ti}^{\top}\hat{\theta}\_t + c\_t \cdot z\_t ||x\_{ti}||\_{V\_t^{-1}})$ .

Because the bonus term $||x\_{ti}||\_{V\_t^{-1}}$ is specific to each arm, a naive implementation would indeed require iterating through the entire action set. a naive implementation would require iterating over the entire action set. However, this issue is not unique to our method—it arises in many linear bandit algorithms, including [1, 2, 19]. While our main contribution lies in improving statistical efficiency—achieving the optimal $\mathcal{O}(d\sqrt{T})$ regret and closing a long-standing theoretical gap—we acknowledge that we did not sufficiently discuss the per-round computational trade-offs.

We appreciate your feedback and will address this point more thoroughly in the final version of the paper, including a more explicit connection to related work. Thank you again for your valuable input, which will significantly strengthen the paper.

评论- Response

2025-08-06

Thanks for the response. My question is not specifically for the selection rule because I think that can be managed with oracles (probably with ideas similar to [D]). It is about the feature perturbation step (Step 5 in Algorithm 1), which is critical and specific to your algo. I'm not sure how this can be made efficient for large number of arms. However, I agree it is orthogonal to your goal of achieving optimal regret with randomized algorithms.

I had one more question (which is what I really meant in [W1]). Can you share your thoughts on the novelty in regret analysis (with respect to techniques used in analysis of OFU algorithms)? Doesn't the similarity in the selection rules (with the additional random scalar) lead to a very similar analysis to OFU. There is an additional perturbation term as shown in equation at the bottom of page 6 in your paper, but that is quite easy to manage.

On the other hand, the analysis of TS algorithms is quite different due to the parameter sampling [A].

In my opinion, the main point in TS algorithms is not the randomized aspect but the bayesian nature of updates of the parameters. While they have suboptimal guarantees they give a different perspective on exploration which is basically coming from bayesian analysis and is not solely because it is a randomized algo. So, I am a little confused about the claim that there is a gap between OFU like algorithms and randomized algos and that gap is closed. My opinion is that there is a gap between OFU like exploration and TS like exploration. BTW even for TS algos (with some assumptions on arm sets), there are papers that achieve optimal dimension dependence in the regret [B]. Would be happy to hear your thoughts on this topic.

[A] Thompson Sampling for Contextual Bandits with Linear Payoffs, Shipra Agrawal, Navin Goyal Proceedings of the 30th International Conference on Machine Learning, PMLR 28(3):127-135, 2013.

[B] When and why randomised exploration works (in linear bandits), Marc Abeille, David Janz, Ciara Pike-Burke Proceedings of The 36th International Conference on Algorithmic Learning Theory, PMLR 272:4-22, 2025.

评论- Thank you for follow-up discussions (1/2)

2025-08-07

Thank you for your follow-up question and comments. We appreciate the opportunity to clarify our perspective, especially regarding the novelty in our analysis framework.

Perhaps your comment — "the main point in TS algorithms is not the randomized aspect but the Bayesian nature of parameter updates" — holds true in the context of Bayesian regret analysis. However, in the worst-case (frequentist) regret setting, the central analytical challenge does stem from controlling the deviation introduced by randomization.

More precisely, in TS, the term $x_{t*}^\top\theta^* - x_t^\top\tilde{\theta}_t$ in the regret decomposition arises due to randomization. From a Bayesian perspective since both $\theta^*$ and $\tilde{\theta}_t$ are i.i.d. samples and $x_t$ is optimal with respect to $\tilde{\theta}_t$ , the expectation of this term is zero, and randomization does not play a significant role in the regret analysis. (Note that the Bayesian regret analysis is a weaker notion of regret than the worst-case regret that we analyze in our work.)

In contrast, under the frequentist regret framework, this randomization-induced term must be carefully upper bounded. To this end, the notion of stochastic optimism is introduced—namely, the requirement that a randomized algorithm selects optimistic actions with sufficiently high probability to guarantee adequate exploration. Since the regret bound scales inversely with this stochastic optimism probability $p$ , it becomes necessary to balance $p$ against the confidence width. (Achieving this balance typically requires inflating the confidence radius by a factor of $\sqrt{d}$ , which in turn leads to a worst-case regret bound that depends on $d^{3/2}$ .)

According to Hamidi et al., such $d^{3/2}$ -dependent suboptimality is in fact unavoidable under general assumptions on the arm set. Moreover, in the generalized linear setting, existing works often linearize the model to avoid complications from the link function, incurring an additional cost of $(\mathcal{L}_\mu/\kappa)^{3/2}$ and ultimately inheriting the same $d^{3/2}$ suboptimality in their regret bounds.

In contrast, our proposed Feature Perturbation (FP) method addresses these issues. In the generalized linear setting, the optimism term induced by randomization is $\mu(x_{t*}^\top\theta^*)-\mu(\tilde{x}_t^\top\hat{\theta}_t)$ . We prove that, with high probability, the perturbed feature vector $\tilde{x}$ remains within the local confidence ellipsoid $\mathcal{E}_t(x)$ , and that this suffices to guarantee stochastic optimism (sufficient exploration) with constant probability—without incurring an extra $\sqrt{d}$ factor.

Unlike OFU-style algorithms, our analysis demonstrates that applying random perturbations to the feature vectors—even after passing through the nonlinear link function $\mu$ —can still yield a leading regret term that scales proportionally to $\sqrt{\kappa_*}$ , rather than suffering from an inverse dependence on $\kappa$ . This result is established through a tight analysis that directly accounts for the nonlinearity of $\mu$ .

The regret component arising from randomization, denoted as $\text{Reg}\_\text{FP}$ , is tightly bounded through this sequence of arguments. Moreover, when combined with the estimation-related term $\text{Reg}\_\text{EST}$ , we show that the total regret can be upper bounded by a single unified quantity, $\text{Reg}\_\text{max}$ . This formulation enables us to simultaneously control both sources of regret—stemming from randomization and estimation—and ultimately results in a tight overall regret bound, which is another major difference from how OFU-based algorithms are analyzed.

评论- Thank you for follow-up discussions (2/2)

2025-08-07

On “Bridging the Gap” When we referred to a “gap” between OFU and randomized algorithms, we specifically meant the following:

OFU algorithms offer strong regret guarantees (e.g., $\tilde{\mathcal{O}}(d\sqrt{T}/\kappa)$ ), but often exhibit conservative exploration and weak empirical performance.
Randomized algorithms [2, 4, 19, 20, 30] perform better empirically, but incur suboptimal regret bounds—typically with an extra $\sqrt{d}$ or $1/\kappa$ factor.

Our work successfully combines the advantages of both OFU-based and randomized approaches, leading to the first randomized algorithm to achieve theoretically optimal regret guarantees under general assumptions. This marks a significant step forward in bridging the gap between the practical strength of randomized methods and the tight theoretical guarantees previously reserved for deterministic approaches.

Here’s a summary of where our method stands relative to existing work:

Type	Algorithm	Regret Upper Bound	Stochasticity
Deterministic	GLM-UCB [Filippi et al., 2010]	$\tilde{\mathcal{O}}(d\sqrt{T}/\kappa)$	-
	Logistic-UCB-2 [Jun et al., 2017]	$\tilde{\mathcal{O}}(d\sqrt{T} + d^2/\kappa )$	-
	OFUGLB [Lee et al., 2025]	$\tilde{\mathcal{O}}(d\sqrt{\kappa_* T} + d^2/\kappa )$	-
Randomized	LinTS [Abeille et al., 2017]	$\tilde{\mathcal{O}}(d^{3/2}\sqrt{T}/\kappa)$	Parameter ( $\theta$ )
	GLM-TSL [Kveton et al., 2020]	$\tilde{\mathcal{O}}(d^{3/2}\sqrt{T}/\kappa)$	Parameter ( $\theta$ )
	GLM-FPL [Kveton et al., 2020]	$\tilde{\mathcal{O}}(d^{3/2}\sqrt{T}/\kappa)$	Reward ( $r$ )
	RandUCB [Vaswani et al., 2020]	$\tilde{\mathcal{O}}(d\sqrt{T}/\kappa)$	Utility ( $x^\top \theta$ )
	GLM-FP (Ours)	$\tilde{\mathcal{O}}(d\sqrt{\kappa\_* T} + d^2/\kappa)$	Feature vector ( $x$ )

Note that [B] assume "the action set $\mathcal{X}$ to be a closed absorbing subset (a neighbourhood of the origin), and $J_{\mathcal{X}}(\theta)=\max_{x\in\mathcal{X}}\langle x,\theta\rangle$ to be $m$ -strongly convex and $M$ -smooth for some $m,M$ ."

The authors of [B] rightly note that their conditions do not hold for the worst-case instances identified by Hamidi and Bayati [16]. While their assumptions may be met in certain settings (e.g., $\ell_p$ balls), they may not hold for more general, arbitrary bounded action sets. However, we require only that the action set is bounded, a standard and broadly applicable condition. We are happy to inlucde the discussion on this point.

We hope our reponses clarify both the technical and conceptual novelty of our work. Thank you again for your positive evaluation and discussion—we sincerely value the opportunity to interact with you.

审稿意见

评分: 5置信度: 22025-07-03

This paper describes a new bandit learning algorithm by injecting noise directly into the context vectors. Specifically, at each round, the algorithm adds a random gaussian vector sample to every arm's feature vector.
The paper provides a tight regret bound ( $\widetilde{O} (d \sqrt{T})$ ), matching the best-known rates without relying on arm counts or parameter-space sampling.

优缺点分析

Quality (3):

Strength: The proposed method is easy to implement, and new in the specific line of contextual bandit literature. The proposed regret analysis is tight and does not dependent on the number of ars.
Weakness: Overall, computing and inverting the weighted Gram matrix may be very costly, which potentially hinders the scalability.

Clarity (3):

Strengths: The paper is well-written and easy to read. The figures are clearly plotted.
Weaknesses: The authors may consider improving their notations for better readability. For example, using \mathbf for random vectors, \bm for vectors. In addition, adding toy calculation in $d = 2$ may be helpful to illustrate the analysis of the confidence ellipsoid.

Significance (3):

Strengths: It matches the best known bound for GLMs and directly extends to last-layer features in deep nets. This is a non-trivial improvement.
Weaknesses: It is unclear how the principles/insights in this paper can be applied to other domains or data type (e.g., images, language). It would be nice to add more discussions related to that. Also, it may be helpful to add more discussions on how the proposed method and analysis can shed light on other domains.

Originality (3):

Strengths: As the authors mentioned, adding noise to features is relatively new in the domain of contextual bandits.
Weaknesses: Feature perturbation is commonly used in other areas like computer vision. Please consider adding more discussion on such related works.

问题

Can the proposed method and analysis be applied to satisficing bandits and other domains in ML?
What happens if the link function is not self-concordant (e.g., heavy-tailed rewards)?

局限性

The paper briefly discusses its limitations in Appendix D, admits that the proposed analysis may not work for more general cases like non-GLM. It would be nice to discuss some future works beyond the current package.

最终评判理由

The authors have addressed the questions, and I would maintain my score.

格式问题

N/A

作者回复

2025-07-31

Thank you for taking the time to review our paper and for your thoughtful and valuable feedback. We appreciate your positive recognition of our work, as well as your constructive comments on various aspects of the paper and its potential applicability to other domains. We address your comments point by point below.

[W1] The computational challenge of adapting a weighted Gram matrix in the GLB setting is not unique to our randomized approach. Some methods [14, 2, 27, 30] use an unweighted (vanilla) Gram matrix and leverage the Sherman-Morrison formula for computational efficiency. However, such approaches typically suffer in their regret guarantees, incurring losses in the condition number $\kappa$ term, and fail to fully capture the curvature of the link function, which can degrade empirical performance.

Any GLB algorithm [3, 13, 19, 24, 25] that relies on a data-dependent weighted Gram matrix—often the Hessian of the log-likelihood—to achieve tighter theoretical guarantees, including state-of-the-art OFU-based methods, faces the same computational burden. Typically, the GLM parameters $\hat{\theta}_t$ are estimated via iterative methods like IRLS, during which the weighted Gram matrix is naturally computed. While inverting this matrix entails additional computational cost, this cost represents a fundamental trade-off for obtaining more refined, instance-dependent regret bounds. To address a similar challenge, the (ada-)OFU-ECOLog algorithm [K] utilizes the Sherman–Morrison formula by maintaining a Gram matrix of the form $\hat{H}\_t := \sum\_{\tau=1}^{t-1} \dot{\mu}(x\_\tau^\top \hat{\theta}\_{\tau}) x\_\tau x\_\tau^\top$ , which allows for efficient updates. Following this approach, our algorithm can also be implemented efficiently.

We will add a discussion on these computational considerations in the final version, emphasizing that this challenge is common among high-performance GLB algorithms and outlining potential strategies for mitigation.

[W2] Thank you for your thoughtful feedback. I will gladly revise the final version of the papaer as suggested.

[W3] Thank you for the positive feedback on our work and for recognizing it as a non-trivial contribution, and also appreciate the insightful suggestion to discuss the broader applicability of our method, which we agree will strengthen the final version.

The central idea of our approach—exploring in the feature representation space instead of the high-dimensional parameter space—is model-agnostic and naturally extends to a wide range of modern machine learning settings, particularly in deep learning.

Applications to Image and Language Data For high-dimensional inputs like images or text, a common pipeline uses powerful pre-trained feature extractors (e.g., ResNet, ViT [A, B] for images, or BERT [C] for language) to transform raw inputs into compact embeddings. Our FP method can be applied directly to these embedding vectors. In image-based tasks, instead of perturbing raw pixels, we inject controlled noise into the feature vectors prior to decision-making layers, enabling semantically meaningful exploration. In NLP tasks, applying FP to BERT-based sentence embeddings enables exploration over semantically similar but syntactically distinct contexts. As a natural extension, developing domain-adapted algorithms with provable guarantees would be a compelling direction for future research.

Broader Insight for Deep Learning Exploration Our work contributes a simple yet effective mechanism for exploration that avoids the computational burden of modeling uncertainty over millions of parameters [D, E]. Instead, we propose a shift in perspective: rather than asking “How uncertain are my model’s parameters?”, we ask “How sensitive is the model to small changes in its input context?” This input-space exploration offers a practical proxy for model uncertainty and provides a scalable principle that can be broadly useful in deep reinforcement learning and other settings requiring efficient exploration.

We will incorporate this discussion into the final version to better highlight the generality and conceptual contribution of our work.

[W4] Our work fundamentally reinterprets the role of feature perturbation: rather than using it for robustness or regularization during training, as is common in fields like computer vision, we employ it as a principled mechanism for exploration in online decision-making. In traditional applications, feature perturbation is used during training to help models generalize better by making predictions invariant to input noise—this includes techniques such as data augmentation [F,G] and adversarial training [H].

In contrast, our method introduces noise at inference time, not to regularize the model but to induce structured stochasticity in action selection. This allows the learning agent to explore its environment by probing the model’s sensitivity to perturbations in the input space, rather than relying on uncertainty in the parameter space. The goal is not to build a more robust static model, but to enable efficient exploration in an adaptive, sequential learning setting, such as contextual bandits.

We appreciate the reviewer’s suggestion and will add a discussion in the related work section to clarify this distinction and better situate our contribution within the broader literature.

[Q1] For satisficing bandits, our exploration mechanism is directly applicable. The only change would be to replace the reward-maximizing argmax policy with a rule that terminates once an arm's confidence bound reliably exceeds the satisficing threshold.

For other domains like computer vision and language, FP can be applied to the feature embeddings generated by large models. This offers a key advantage over other methods: it provides a practical and scalable exploration strategy that sidesteps the often-intractable problem of exploring in a massive parameter space.

[Q2] Self-concordance is a sufficient condition for obtaining tight regret bounds that depend on the problem-specific constant $\kappa_*$ . It characterizes the geometry of the link function $\mu$ , which is naturally satisfied by the instances considered in the GLB setting. Many recent GLB algorithms [3, 13, 24, 25] focus on such self-concordant link functions, including linear, logistic, and Poisson models. Our assumption of self-concordance was not essential but introduced mainly for convenience, to facilitate the definition of $M_\mu$ used in the proof analysis.

In our analysis, the self-concordance assumption plays a critical role. Earlier works [14, 2, 27, 30, 20] often linearized the problem, which incurred additional regret reverse-proportional on the condition number $\kappa$ . In contrast, we leverage self-concordance to tightly control higher-order terms in the Taylor expansion of the loss, enabling improved regret bounds with favorable $\kappa$ -dependence while simultaneously managing the stochastic perturbations of the feature vector. For example, our analysis uses the Mean Value Theorem to reduce second-order error terms to first-order ones via $M_\mu$ . Combined with the Lipschitz continuity of $\mathcal{L}_\mu$ , this permits the application of elliptical potential lemmas and results in sharper confidence bounds. Without self-concordance, these tools are not applicable, making it difficult to achieve improved regret dependence on $\kappa$ .

Regarding heavy-tailed rewards, this refers to the data distribution, which directly contradicts our assumption of $R$ -subgaussian (light-tailed) noise. Handling heavy-tailed rewards would require a fundamentally different theoretical framework, such as those developed in the robust bandit literature [I, J], which is beyond the scope of this paper. Nevertheless, we believe that analyses addressing heavy-tailed settings could be naturally combined with our Feature Perturbation approach.

[L] Developing theoretically grounded algorithms beyond GLMs—such as those accommodating only Lipschitz-continuous link functions—is an important direction for future work. In parallel, extending the feature perturbation framework to domains such as vision and language, where more structured or semantic noise may better capture domain-specific inductive biases, could further enhance exploration. We appreciate the reviewer’s suggestion and will incorporate a broader discussion of these potential extensions into the final version.

[A] Deep Residual Learning for Image Recognition, He et al., 2016
[B] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, Dosovitskiy et al., 2020
[C] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Devlin et al., 2019
[D] Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning, Gal & Ghahramani, 2016
[E] Deep Exploration via Bootstrapped DQN, Osband et al., 2016
[F] A survey on Image Data Augmentation for Deep Learning, Shorten & Khoshgoftaar, 2019
[G] The Effectiveness of Data Augmentation in Image Classification using Deep Learning, Perez & Wang, 2017
[H] Explaining and Harnessing Adversarial Examples, Goodfellow et al., 2015
[I] Efficient Algorithms for Generalized Linear Bandits with Heavy-tailed Rewards, Xue et al., 2023
[J] Bandits with heavy tail, Bubeck et al., 2012
[K] Jointly Efficient and Optimal Algorithms for Logistic Bandits, Faury et al., 2022

最终决定Accept (spotlight)

2025-09-17

This paper on contextual bandits takes an unusual and interesting approach to solving the standard problem, by adding noise in the context instead of the underlying parameter.

This is interesting and original, and all reviewers were very positive about it.

I am happy to recommend acceptance at NeurIPS 2025 !