7.3

/10

Poster4 位审稿人

最低4最高5标准差0.5

3.8

置信度

创新性2.8

质量3.3

清晰度3.3

重要性2.8

NeurIPS 2025

Infrequent Exploration in Linear Bandits

Harin Lee,Min-hwan Oh

OpenReview PDF

提交: 2025-05-09更新: 2025-10-29

摘要

关键词

linear banditsgreedy selection

评审与讨论

审稿意见

评分: 4置信度: 32025-06-09

This paper provides a generic algorithmic template for infrequent exploration in stochastic linear bandits. It provides a simple meta algorithm that attains near-optimal regret, with a near-optimal regret bound. The algorithm can at most tolerate $w(\frac{\log t}{t})$ frequence exploration, in the asymptotic regime when $d$ is treated as a constant.

优缺点分析

Strengths: the result is novel and technically elegant. It uses a simple meta algorithm by reduction to black-box linear bandit algorithms. The regret bound is also generic and reflects trade-offs between factors.

Weakness: the statement of the main theorem should clarify that $f(t)=w(\log t)$ for any $t$ , otherwise it would lead to a pathological interpretation.

It would be beneficial to have a more detailed, quantitative discussion on the trade-off between regret and exploration, mainly on the effect of $G(T)$ . For example, one question in my mind is that when (how large $T$ grows in $d$ ) frequent exploration is for free, for different choices of $f(t)$ from the table.

问题

See Strengths And Weaknesses.

局限性

Yes.

最终评判理由

I've read the author's response, which addressed my question, and I will maintain my score.

格式问题

作者回复

2025-07-31

We sincerely thank the reviewer for the positive evaluation and thoughtful comments. We have carefully considered each point and provide our detailed responses below.

[W1] (f(t) = w(log t) for all t)

As defined in Section 4.1, we only require the asymptotic growth condition $f(t) = \omega(\log t)$ as $t \rightarrow \infty$ , not for every $t$ . This condition is weaker than requiring the growth condition to hold for all $t$ . We will add "as $t \rightarrow \infty$ " in the final version for clarity. Further, Theorem 3 in Appendix A.4 shows that $f(t) = \omega(\log t)$ is a necessary condition. Given these points, we believe this should be considered a strength rather than a weakness.

[W2] (Trade-off between regret and exploration)

We would like to clarify that in Theorem 1, $G(T)$ is unaffected by the exploration schedule. As long as the number of executions of the base algorithm is super-logarithmic in $t$ , the asymptotic regret of $`INFEX`$ remains the same, regardless of the exploration schedule. Therefore, there is no trade-off between exploration and $G(T)$ , which we believe is one of the most interesting findings of the theorem. $G_{\text{const}}$ , which is the $T$ -independent regret term incurred by the additional greedy selections, is affected by the exploration schedule, and its dependence on some example $f(t)$ is shown in Table 1. A more specific derivation can be found in Appendix A.3.
While infrequent exploration is not free in general in terms of non-asymptotic regret, if the reviewer is curious about the exact closed-form expression of the relationship between $T$ and $d$ where the $T$ -dependent terms dominate the constant terms, it is slightly tricky to state it as it also depends on the performance of the base algorithm. Instead, we provide the total regret when the base algorithm $`Alg`$ is LinUCB or LinTS for the examples of $f(t)$ in Table 1, where we omit the $\mathcal{O}(\cdot)$ symbol for simplicity.

$`Alg`$	$f(t)$	Total regret
$`LinUCB`$	$t / m$	$\frac{1}{\Delta}(\log T + d \log \log T)^2 + (m + \frac{d}{\Delta})\frac{d}{\Delta} \log^2 \frac{d}{\Delta}$
$`LinUCB`$	$t / (\log t)^r$	$\frac{1}{\Delta}(\log T + d \log \log T)^2 + \frac{d^2}{\Delta^2}\log^{2+r} \frac{d}{\Delta}$
$`LinUCB`$	$t^r$ $(r \in (0, 1])$	$\frac{1}{\Delta}(\log T + d \log \log T)^2 + (\frac{d}{\Delta}\log \frac{d}{\Delta})^{\frac{2}{r}}$
$`LinUCB`$	$(\log t)^r$ $(r > 1)$	$\frac{1}{\Delta}(\log T + d \log \log T)^2 + \exp( (\frac{d}{\Delta}\log \frac{d}{\Delta})^{\frac{2}{r}} ) + \Delta \exp ( (\frac{d}{\Delta^2})^{\frac{1}{r-1}} )$
$`LinTS`$	$t / m$	$\frac{1}{\Delta}(d \log \frac{dT}{m})(\log T + d \log \log T)^2 + \frac{d^3}{\Delta^2} \log^3 \frac{d}{\Delta} + \frac{md}{\Delta} \log^2 \frac{d}{\Delta}$
$`LinTS`$	$t / (\log t)^r$	$\frac{1}{\Delta}(d \log dT)(\log T + d \log \log T)^2 + \frac{d^3}{\Delta^2}\log^{3+r} \frac{d}{\Delta}$
$`LinTS`$	$t^r$ $(r \in (0, 1])$	$\frac{1}{\Delta}(d \log dT)(\log T + d \log \log T)^2 + (\frac{d^{\frac{3}{r}}}{\Delta^{\frac{2}{r}}}\log^{\frac{3}{r}} \frac{d}{\Delta})$
$`LinTS`$	$(\log t)^r$ $(r > 1)$	$\frac{1}{\Delta}(d \log (d\log T) )(\log T + d \log \log T)^2 + \exp( \frac{d^{\frac{3}{r}}}{\Delta^{\frac{2}{r}}}\log^{\frac{3}{r}} \frac{d}{\Delta} ) + \Delta \exp ( (\frac{d}{\Delta^2})^{\frac{1}{r-1}} )$

2025-08-05

Thank you for the detailed response. I will maintain my score.

W1: I'm not judging what condition is the tightest here. Instead, my point is that you need to specify the range of applicable $t$ , otherwise it's an incomplete and confusing statement.

W2: I believe it would be beneficial to somehow incorporate (part of) the table into the main paper.

2025-08-07

Once again, we thank the reviewer for the positive evaluation and for engaging in the discussion.

W1: As authors, we aim to ensure that our theorem statements are both complete and clear. To that end, we would greatly appreciate further clarification on your comment, after which we will be more than happy to make the necessary revisions.

We would like to note that we do not require any range regarding $t$ . The condition $f(t) = \omega(\log t)$ is defined by the property of the limit, $\lim_{t \rightarrow \infty} \frac{f(t)}{\log t} = 0$ , and cannot be imposed on a certain range of $t$ . The confusion may have stemmed from the fact that the first finite values of $f(t)$ may change freely, however it is irrelevant of the requirement of the theorem. The change would be reflected on the size of $G_{\text{const}}(\tau, f)$ and Theorem 1 is still valid. The condition is only necessary to guarantee that $G_{\text{const}}$ is not infinity.

If the current statement does not effectively communicate this, we would be grateful for any concrete suggestions the reviewer might have for improving its clarity.

W2: Thank you for the suggestion. We will consider ways to present the contents of the table more effectively within the paper as the current table in the rebuttal does not seem to fit naturally in the paper.

Given the simplicity and practicality of our algorithmic framework, the strength of our theoretical guarantees, the originality of several technical contributions, the clear empirical evidence we provide, and more importantly, their practical implications, we sincerely believe our work represents a well-rounded and meaningful contribution to the literature. We respectfully hope that the significance and potential impact of our results are fully recognized, and we believe the paper merits a stronger recommendation than a borderline accept.

审稿意见

评分: 5置信度: 42025-07-02

This paper studies the problem that lies in the intersection between fully adaptive algorithms that explore at every step and greedy algorithms that do not explore. In particular, the authors are interested in whether infrequent exploration can achieve near-optimal regret while reducing exploration costs and computational burden. To address this, they propose a new general algorithm, INFEX, that allows a base algorithm as an input and decides a schedule for which the algorithm explores or commits. The authors show regret bounds for their algorithm and show that these bounds match with the bounds of fully adaptive algorithms asymptotically. The authors also show numerical experiments that compare the empirical performance of INFEX with fully adaptive and greedy algorithms.

优缺点分析

Strength:

In terms of clarity, the writing is very clear with sections and subsection titles, also the authors stated contributions very clearly.
In terms of quality, the proposed framework is quite general, in the sense that we can plug in a lot of existing algorithms.
The numerical experiments and visualizations are clear, and the authors compare their algorithms with various benchmarks.

Weaknesses:

In terms of significance, the motivation could be discussed further, especially in terms of practical use. The authors mentioned this in their last open question, "Can infrequent exploration methods also demonstrate practical advantages beyond theoretical considerations?", but I'm not sure if the authors fully answered it in this paper.

问题

In Theorem 1, you provide a regret bound that holds "with high probability", which is a bit unusual since high probability statements often occur in best-arm identification instead of regret. Is this a probability 1 statement? If it is with probability 1-delta, where can I see the dependence of delta in the statement of Theorem 1?
Also in Theorem 1, can you describe intuitively what the constants a,b, and c are, and in which scenarios they would be large?
Can you explain intuitively why inefficient exploration for the instance you constructed in the numerical experiment has a benefit? I'm a bit surprised that for this instance, INFEX even beats LinUCB which is fully adaptive. I would imagine infrequent exploration gives benefits mostly computationally, but not necessarily a better regret bound.
Also, it would be good to understand why no exploration, e.g. LinTS, performs so poorly here.
Does INFEX work in settings with budget constraints? I was thinking whether you could frame some objectives for which the number of exploration is limited due to e.g. resources or time, so we have to do infrequent exploration here. Could the authors provide some intuition on whether their algorithm will or will not work in this case, or they are different settings?

局限性

Yes.

最终评判理由

The authors' detailed and clear responses addressed my concerns. I will keep my score.

格式问题

No formatting concerns.

作者回复

2025-07-31

We appreciate the reviewer’s time and effort in evaluating our paper and providing valuable feedback. We sincerely thank the reviewer for the supportive and constructive comments. Below, we address each of the comments and questions in detail.

[W] (Practical use)

We are more than happy to have the opportunity to clarify this point.

First of all, we would like to emphasize that there are many practical scenarios in which the possible amount of exploration is constrained due to logistical reasons such as costs, risks, and ethical considerations, as described in the Introduction. Our main contribution lies in showing that even in such scenarios, one can achieve polylogarithmic regret as long as the exploration budget is super-logarithmic in $t$ , complemented by showing that such a condition is necessary in Theorem 3 in Appendix A.4. Being able to be combined with any base algorithm without additional computational or implementational burden, we confidently claim that our method IS practical.
In addition, the performance of $`INFEX`$ is not only limited to theoretical growth orders, but also has empirical superiority as shown through numerical simulations. The experiments further show that the performance of $`INFEX`$ is not sensitive to the choice of $f(t)$ , showing superior performance in most configurations. Assuming one uses, for example, TS for practical purposes, the experiment shows that it is better to use TS with, say, a few steps intervals mixed with greedy action, rather than running TS at every time step, where a wide range of exploration schedules (from one out of five to one out of one hundred time steps) generally improves performance.

Overall, our proposed method and its theoretical results carry meaningful implication and impact on practical use.

[Q1] ("With high probability")

As mentioned in Section 4.1, "with high probability" means with probability $1 - \delta$ in this paper, and the shown bounds correspond to the case where $\delta = 1 / T$ , (or more precisely, the bounds are invariant up to constant factors when $\delta$ is regarded as a constant or when $\delta$ decreases polynomially in $1/T$ ). $\log \frac{1}{\delta}$ factors can be recovered by replacing some of the $\log T$ factors with $\log \frac{T}{\delta}$ factor. We omitted the dependence on $\delta$ since explicitly writing it out only leads to unnecessarily complicated bounds without providing much additional information; for example, the order of $G(T)$ becomes $\frac{1}{\Delta}(\log \frac{T}{\delta} + d \log \log T + d \log \frac{d}{\Delta})(\log T + d \log \log T + d \log \frac{d}{\Delta})$ instead of $\frac{1}{\Delta}(\log T + d \log \log T + d \log \frac{d}{\Delta})^2$ . Note that $`INFEX`$ does not require the value of $\delta$ for its execution, unless required by the base algorithm, which gives us the freedom to choose $\delta = 1/ T$ . If the expression causes confusion, we will replace the phrase with another one in the final version.

[Q2] (Constants $a, b,$ and $c$ )

The constants $a, b,$ and $c$ quantify the regret performance of the base algorithm, $`Alg`$ . They would be large when $`Alg`$ is an inefficient algorithm. For $`LinUCB`$ , we have $a = 2, b = 1$ , and $c = 2$ (Theorem 5 in Abbasi-Yadkori et al. (2011)). For $`LinTS`$ , we have $a = 3, b = 1$ , and $c = 3$ (Theorem 2).

[Q3] (Empirical superiority)

We conjecture that $`INFEX`$ empirically reduces the regret by avoiding unnecessary exploration. Once the estimator becomes accurate enough to distinguish the optimal arm, a greedy selection would incur less instantaneous regret than exploratory ones. Indeed, LinUCB explores fully adaptively, but is based on the worst-case confidence bounds. Even when the estimator correctly identifies the optimal arm, LinUCB conservatively chooses other arms as long as they have a certain amount of probabilities of being optimal in the worst cases. We believe $`INFEX`$ benefits when such "worst cases" do not occur, allowing it to successfully exploit the optimal arm for most time steps.

[Q4] (Performance of LinTS)

LinTS theoretically requires a $\sqrt{d}$ -factor inflation in its perturbation, which results in a multiplicative $\sqrt{d}$ factor in its minimax regret [Agrawal and Goyal, 2013, Abeille and Lazaric, 2017], and a $d$ factor in its instance dependent regret (Theorem 2), compared to LinUCB. Since larger perturbation in LinTS corresponds to excessive exploration, this additional factor is reflected in the empirical results as well. We believe that $`INFEX`$ significantly improves the performance of LinTS in the numerical simulation since it naturally limits excessive exploration.

[Q5] (Budget constraints)

We thank the reviewer for this question, and we would like to further highlight the significance of our paper.

Yes, our work can be directly applied in such settings. The exploration budget -- or even more generally, the whole exploration schedule -- may be given by external sources, and we do not assume that the agent has the freedom to adjust it. Our algorithm and theoretical results remain perfectly valid even in such cases, where we only require the exploration budget $f(t)$ to be superlogarithmic in $t$ . We also show that this condition is necessary in Theorem 3 in Appendix A.4, showing that if the allowed exploration budget does not exceed the logarithmic barrier, then there exists an instance where no algorithm succeeds. In the budget-constrained setting, this lower bound could be of independent interest.

2025-08-04

I would like to thank the authors for their detailed and clear responses. I will keep my score.

2025-08-07

We sincerely thank the reviewer for the recognition of our work and consistent support. We truly appreciate the time and effort you have devoted to reviewing our paper and engaging in the discussion.

审稿意见

评分: 4置信度: 42025-07-03

This paper proposes INFEX, a general framework for linear contextual bandits that performs infrequent exploration. It aims to bridge the gap between:

"Fully" exploratory algorithms like UCB or Thompson Sampling, which explore continuously but may be ethically problematic in practice; and
greedy algorithms, which completely avoids exploration but require strong assumptions (e.g., persistent feature diversity) to achieve good regret.

Their algo, INFEX, maintains a predefined exploration schedule, running a base exploratory algorithm (e.g., LinUCB) at these rounds, and acting greedily in other rounds. The main result shows that INFEX achieves instance-dependent regret bounds matching standard algorithms, as long as the exploration schedule is superlog.

优缺点分析

Strengths

Clarity and simplicity: The algorithm is intuitive, easy to implement, and compatible with many existing base algorithms.
Clean results: The paper proves that logarithmically infrequent exploration suffices to maintain optimal performance in the linear bandit setting.
Practical motivation: The framework is well-motivated for domains like healthcare or safety-critical systems, where constant exploration may be infeasible.
Good writing: Overall, the paper is easier to read and better organized than many others.

Weaknesses. The authors repeatedly state that previous works require “strong distributional assumptions,” particularly for greedy policies. However, it is not clearly explained what these assumptions are, why they are necessary for greedy algorithms, and why INFEX avoids them. A more explicit comparison (ideally with counterexamples or illustrations) would greatly strengthen the paper.

In particular, what are the “bad cases” that INFEX handles well? It would be illuminating to show specific examples where greedy fails but INFEX succeeds — this would clarify the paper’s practical impact.

问题

see above

局限性

see above

最终评判理由

I will maintain my initial evaluation.

格式问题

none

作者回复

2025-07-31

We sincerely thank the reviewer for the positive evaluation and thoughtful comments. We deeply appreciate your recognition of our contributions.

We are more than happy to elaborate further on this question by the reviewer.

In the linear contextual bandit setting, the "strong distributional assumptions" required for the greedy policy to work well [Kannan et al., 2018, Sivakumar et al., 2020, Bastani et al., 2021, Raghavan et al., 2023, Kim and Oh, 2024] refer to the assumptions that the context distribution of the arms is sufficiently diverse to allow the greedy policy to automatically explore and achieve small regret. While the greedy policy does not explore explicitly, varying the arm set at each time step can lead it to select arms in diverse directions, possibly spanning the full $R^d$ space, and thus gain information. The cited works formalize the conditions under which the greedy policy benefits from such diversity and achieves low regret.

Specifically,

Kannan et al. [2018], Sivakumar et al. [2020], and Raghavan et al. [2023] assume that the context vectors are perturbed by a multivariate Gaussian distribution at each time step, forcing the context distribution to be diverse.
Bastani et al. [2021] directly assume covariate diversity (Assumption 3 in their paper), assuming a positive minimum eigenvalue of the covariate matrix of the selected feature vector, which guarantees linear increase of the minimum eigenvalue of the empirical covariate matrix.
Kim and Oh. [2024] propose a more general condition on the context distribution under which the greedy policy achieves polylogarithmic regret.

However, all of these assumptions rely on the arm set being randomly changed over time. None of these analyses apply when the arm set is fixed across time steps. In fact, it is not the matter of the analysis as there are counterexamples with fixed arm sets where the greedy policy fails (Example 1 in Jedor et al. (2021)) and insufficient exploration fails (Theorem 3 in Appendix A.4). Therefore, the "bad cases" that $`INFEX`$ handles well are those with fixed arm sets, achieving polylogarithmic regret even in such settings. We will add more details about this point in the final version of the paper.

2025-08-07

We thank the reviewer for the positive evaluation and for engaging in the discussion

We respectfully hope that the significance and potential impact of our results are fully recognized, and we believe the paper merits a stronger recommendation than a borderline accept.

2025-08-06

Thank you for the response. I'd like to stick to my score.

This paper is cute. But I don't usually give 5 for theory papers, unless there is fundamental novelty.

审稿意见

评分: 5置信度: 42025-07-05

This paper studies the theoretical regret guarantees of infrequent exploration in linear bandits. Specifically, this work proposes INFEX, which interleaves greedy actions and the use of an adaptive exploratory policy, and establishes that INFEX achieves a polylogarithmic instance-dependent regret bound under the condition that number of exploration steps is at least $\omega(\log t)$ . As a result, this work fills the gap between the fully exploratory methods and the purely greedy method in linear bandits. Moreover, the paper complements the regret analysis with numerical simulations, showing that INFEX can attain strong empirical regrets while achieving less computation time compared to the two base linear bandit algorithms LinUCB and LinTS.

优缺点分析

Strengths:

The proposed INFEX is a generic and conceptually simple infrequent exploration method that nicely bridges the gap between the various existing adaptive exploratory bandit algorithms and the purely greedy policy. This offers interesting new insights into the linear stochastic bandit problems.
The proposed INFEX achieves an instance-dependent regret bound for any linear bandit algorithm with polylogarithmic regret (under some mild exploration condition). The results accordingly gives a new instance-dependent regret bound for LinTS, which can be of independent interest to the bandit community.
One empirical benefit of using INFEX is that under some base bandit algorithms (such as LinTS and LinUCB), the infrequent exploration mechanism can reduce the overall computation time required while still achieving similar scaling of regret.
Overall, the writing is very clear, and the proof sketch nicely highlights the important steps of the regret analysis, which makes the paper easier to read.

Weaknesses

While I can appreciate the contribution regarding the algorithm and the established regret bounds, I do find INFEX somewhat not that easy to use in practice. Specifically, as shown in Theorem 1 and Table 1, the choice of exploration schedule can significantly affect the $G_{const}$ , which depends on multiple problem-dependent and algorithm-dependent parameters. These parameters are typically unknown and hence make it difficult to optimize the trade-off between $G_{const}$ and the other regret terms. Therefore, in practice, it is hard to choose a proper exploration schedule and the resulting $f(t)$ a priori. One quite possible scenario is that infrequent exploration can incur a much higher regret by using infrequent exploration despite that the computation time can be somewhat reduced. Without addressing the above issues with the choice of exploration schedule, I find the algorithm not as promising as claimed despite that theoretically this is an interesting contribution.
While I can see that this work mainly focuses on instance-dependent regret bounds, I wonder if the infrequent exploration can possibly hurt the minimax regret bound. For example, suppose we use LinUCB and consider the degenerate case where each arm $i$ is a one-hot vector with the $i$ -th element being one and $\theta$ is $K$ -dimensional with the $i$ -th entry being the mean reward of arm $i$ , then intuitively, if the resulting exploration budget is only as small as $f(t)=\log^m t$ with $m>1$ (which still satisfies the condition in Theorem 1), then the minimax regret can be linear in $T$ . With that said, I believe that this work can be strengthened if a discussion on the resulting minimax regrets can be provided as well.
Another concern is that this paper assumes that the arm set is finite and fixed, but the more common setting in the (generalized) linear bandit literature is that the arm set can be arbitrary (e.g., OFUL [Abbasi-Yadkori et al., 2011] and LinTS [Abeille and Lazaric, 2017]) or time-varying (e.g., LinUCB [Li et al., 2010] and UCB-GLM [Li et al., 2017]). It is not immediately clear to me whether the regret bounds in Theorems 1-2 can be directly extended to the more standard linear bandit settings.

问题

This paper assumes that the arm set is finite and fixed, but the more common setting in the (generalized) linear bandit literature is that the arm set can be arbitrary or time-varying. Can the authors comment on whether these assumptions are indeed needed or the results can be readily extended to the more standard linear bandit settings?
While I can see that this work mainly focuses on instance-dependent regret bounds, I wonder if the infrequent exploration can possibly hurt the minimax regret bound. Can the authors comment on this?
From the numerical results in Figure 1, it appears that INFEX typically has higher standard deviation in regret as the exploration budget gets smaller. This can be an undesired property in practice as the regrets in different trials can vary quite significantly.

局限性

The work does not describe any potential limitations in the main paper.

最终评判理由

This paper studies the theoretical regret guarantees of infrequent exploration in linear bandits. The contributions are mainly two-fold: (1) This work proposes INFEX, which interleaves greedy actions and the use of an adaptive exploratory policy (2) This work establishes that INFEX achieves a polylogarithmic instance-dependent regret bound under the condition that number of exploration steps is at least $\omega(\log t)$ .

In my initial review, the main concerns are: (1) INFEX can be not that easily applicable in practice due to the need for choosing a proper exploration schedule. (2) While INFEX still achieves polylogarithmic instance-dependent regret bound, the infrequent exploration can possibly hurt the minimax regret bound, especially when the exploration budget is small (e.g., polylogarithmic in $T$ ). (3) This paper assumes that the arm set is finite and fixed, and this differs from the (generalized) linear bandit literature where the arm set can be arbitrary or time-varying.

Through the multi-round discussions during rebuttal, the authors addressed the concerns (1) and (3) and also offered helpful remarks on the minimax regret bound. Overall I am happy to see this paper accepted if these improvements and discussions can be incorporated into the final version.

格式问题

N/A

作者回复

2025-07-31

We appreciate the reviewer’s time and effort in evaluating our paper and providing valuable feedback. We are grateful for the recognition of our contributions and the detailed comments. Below, we address each of the comments and questions in detail.

[W1] (On the choice of exploration schedule)

We are more than happy to have the opportunity to clarify this point.

First of all, we would like to clarify that our main focus is not specifying the ``proper'' exploration schedule or frequency per se. As stated in the Introduction, there are many practical scenarios in which the possible amount of exploration is already constrained due to logistical reasons such as costs, risks, and ethical considerations. In such cases, one does not even have the freedom to choose $f(t)$ . Our main contribution lies in showing that even in such scenarios, one can achieve polylogarithmic regret as long as the exploration budget is super-logarithmic in $t$ , complemented by showing that such a condition is necessary in Theorem 3 in Appendix A.4.

On the other hand, there may be cases where there is no constraint on the amount of exploration, and it is true that our paper does not suggest speicific exploration schedule in this case. Yet, theoretical bounds are still greatly meaningful in that they provide the principled ground on how often the exploration algorithm should be executed in order to achieve polylogarithmic regret. In addition, the numerical experiments show that the performance of $`INFEX`$ is not sensitive to the choice of $f(t)$ . The simulated exploration schedule significantly varies from at least 100 total exploratory time steps to at most 2000 total exploratory time steps, and most configurations show superior performance compared to their base algorithms and the pure greedy policy. Being able to be combined with any base algorithm without additional computational or implementational burden, we confidently claim that our method IS easy to use in practice. For example, it is better to use TS with, say, a few steps intervals mixed with greedy action (rather than running TS at every time step), and the exact specification of $f(t)$ is not even needed to achieve a favorable practical performance. Overall, our proposed method and its theoretical results carry meaningful implications and impact on practical use.

[W2/Q2] (Minimax regret)

Providing the worst-case minimax regret bound for $`INFEX`$ would indeed be an interesting problem, however it is beyond the scope of our paper, and we believe that not providing it should not be considered as a weakness. The main reason we focus on instance-dependent bounds is to show how the regret changes depending on the exploration schedule for a fixed instance. Furthermore, minimax bounds might be too conservative for practical purposes, as they only focus on the worst-case regret and guarantee nothing more about the other instances which are more likely to be encountered in practice. As to the question of whether infrequent exploration can hurt the minimax bound, although we cannot provide theoretical guarantees or counterexamples at the moment, we agree that overly insufficient exploration is likely to lead to higher worst-case minimax bounds. However, we speculate that it will be sublinear in $T$ for some (or all) exploration schedules considered in this paper given that the instance-dependent bounds are polylogarithmic.

[W3/Q1] (Fixed and finite arm set)

We are glad to have the opportunity to discuss this point.

We first clarify that Theorem 2 only requires the positivity of the minimum gap $\Delta$ and can be applied to arbitrary, time-varying arm sets. Finiteness is only assumed to naturally induce the definition of $\Delta$ , so as long as $\Delta$ is well-defined and positive, the finiteness of the arm set is not required. This applies to Theorem 1 as well.

As for the requirement of a fixed arm set in Theorem 1, there are some possible generalizations. First, our analysis only requires the optimal arm to be fixed across time, rather than the entire arm set. The conditions for Theorem 1 are exactly the same as those for Theorem 5 in Abbasi-Yadkori et al. [2011] (although not explicitly mentioned, the proof requires -- and only requires -- the optimal arm to be fixed). In addition, we note that, counterintuitively, the greedy policy performs worse with fixed arm sets than with time-varying arm sets. Under certain diversity assumptions on the context distribution, the varying arm sets automatically offer exploration even if the agent acts greedily [Kannan et al., 2018, Sivakumar et al., 2020, Bastani et al., 2021, Raghavan et al., 2023, Kim and Oh, 2024], allowing the pure greedy policy to achieve small regret. In this case, $`INFEX`$ naturally inherits the guarantees provided by the cited works in their respective settings. On the other hand, there are examples with fixed arm sets where the greedy policy fails (Example 1 in Jedor et al. [2021]) and insufficient exploration fails (Theorem 3 in Appendix A.4). Therefore, it should be surprising that $`INFEX`$ achieves polylogarithmic regret even when the arm set is fixed, despite performing greedy selections at most of the time steps.

Yet, we admit that the setting where $`INFEX`$ is guaranteed to work well is not fully general, and generalizing Theorem 1 even further seems challenging for now. Please refer to Appendix E for more discussion on the possibility of extending the analysis to the time-varying arm sets.

[Q3] (Higher standard deviation)

Since the greedy policy has a high standard deviation, we believe it is natural that more greedy selections lead to a higher standard deviation. However, our simulations demonstrate that even when the greedy policy is deployed at most of the time steps, for instance, 99% of the time steps, the standard deviations of the regret are significantly smaller than the one by the purely greedy policy, being roughly 50% instead of 99%, which should be considered as a strength of $`INFEX`$ . While we agree that high standard deviation may be undesired in practice, we do not find the observed values to be critically high in the experiment. Additionally, we hope that having a higher standard deviation does not obscure the fact that $`INFEX`$ achieves small average regret, which is highly desired in practice.

2025-08-06

Thank you to the authors for the detailed response and answering my questions.

I have some follow-up comments:

W1: I agree that if the exploration budget is very tight, then there is no room for choosing $f(t)$ . However, most scenarios would be that we have some range of possible $f(t)$ s and need to configure a proper schedule among these candidates for the algorithm. In this case, it is not immediately clear how to choose a proper $f(t)$ .

W2: It can be helpful to add a remark on the minimax regret as this appears to be an interesting open problem.

W3: Thanks for the detailed explanation. Can the authors also comment on where the assumption about the fixed optimal arm is needed?

2025-08-07

Thank you for your response. We are more than happy to address your additional questions.

W1: As noted in our initial rebuttal, our goal is not to identify a single optimal schedule but rather to demonstrate that a broad class of predetermined schedules can achieve polylogarithmic regret. Together with Theorem 3, we propose a necessary and sufficient condition on the exploration schedule for $`INFEX`$ to achieve polylogarithmic regret, meaning that we verify the class of all possible schedules that allow polylogarithmic regret. We view this generality and flexibility rather as a strength of our contribution, though we understand the reviewer’s concern that the range of options may seem overwhelming in practice. But we are happy to answer this question as well.

If one is seeking a suggestion for practical scenarios (as the reviewer appears to allude to), we find that periodic exploration $f(t) = \lfloor t / m \rfloor$ and logarithmically increasing epoch length $f(t) = \Theta(t / \log t)$ are very sensible choices -- they only increase $G_{\text{const}}$ by a constant or logarithmic factor, as noted in the Discussion of Theorem 1. We presented Corollaries 1 and 2, and the experiments using the periodic exploration schedule, because we believed this schedule to be very practical! In fact, our results are the first to show that one need not run algorithms like LinTS or LinUCB on every round—running them periodically suffices and can even perform better in many practical scenarios.

As noted earlier in the rebuttal, the algorithm demonstrates strong performance across a wide range of periods, and it is not necessary to fine-tune the period precisely. We would emphasize these facts more in the revision.

W2: Sure, we appreciate your suggestion. We will add a remark noting the open problem of achieving the minimax regret.

W3: To address this comment, we cite the relevant part from Appendix E:

...our current analysis relies on the property that estimation errors of $x^{* \top} \theta^*$ diminish when the optimal arm is selected frequently. This property becomes less straightforward to guarantee when the optimal arm itself is random or time-varying. Notably, pointwise guarantees for linear regression with random design require additional distributional assumptions [Hsu et al., 2012], suggesting that bounding the estimation error of a random optimal arm without assumptions may be infeasible.

In short, we require a guarantee that the estimation error on the optimal arm decreases as the agent selects it more frequently. When the optimal arm is random, this property becomes more difficult to ensure, requiring additional assumptions even in the simpler linear regression setting (non-bandit setting).

Once again, we thank the reviewer for the positive evaluation and for engaging in the discussion.

2025-08-09

Thank you for the detailed response that address my questions. I do not have any further inquiry. I look forward to seeing these improvements and discussions incorporated into the final version.

最终决定Accept (poster)

2025-09-17

The paper studies the problem of infrequent exploration in linear bandits. Specifically, the authors analyze whether infrequent exploration can achieve near-optimal regret while reducing both exploration costs and computational burden. They propose a new general algorithm, INFEX, which takes a base algorithm as input and determines a schedule for when the algorithm should explore versus commit. The authors derive regret upper bounds for INFEX, showing that these match the bounds obtained by fully adaptive algorithms. Numerical experiments further support the theoretical findings.

The paper’s contributions were unanimously acknowledged, and reviewers particularly appreciated the generality and elegance of the results. A few interesting points were raised during the rebuttal phase that should be incorporated into the camera-ready version of the paper—most notably, the discussions with reviewers 386q and yhSE.