PaperHub
7.8 / 10
Poster · 5 reviewers
Ratings: 5, 5, 5, 5, 4 (min 4, max 5, std 0.4)
Confidence: 3.2
Novelty: 2.8 · Quality: 3.0 · Clarity: 2.8 · Significance: 2.8
NeurIPS 2025

Non-Stationary Lipschitz Bandits

OpenReview · PDF
Submitted: 2025-04-10 · Updated: 2025-10-29
TL;DR

We study the problem of non-stationary Lipschitz bandits and achieve minimax optimal rate without knowledge of the non-stationarity.

Abstract

We study the problem of non-stationary Lipschitz bandits, where the number of actions is infinite and the reward function, satisfying a Lipschitz assumption, can change arbitrarily over time. We design an algorithm that adaptively tracks the recently introduced notion of significant shifts, defined by large deviations of the cumulative reward function. To detect such reward changes, our algorithm leverages a hierarchical discretization of the action space. Without requiring any prior knowledge of the non-stationarity, our algorithm achieves a minimax-optimal dynamic regret bound of $\mathcal{\widetilde{O}}(\tilde{L}^{1/3}T^{2/3})$, where $\tilde{L}$ is the number of significant shifts and $T$ the horizon. This result provides the first optimal guarantee in this setting.
Keywords
non-stationary bandits, Lipschitz bandits, tracking most significant shifts

Reviews and Discussion

Review (Rating: 5)

This paper studies non-stationary Lipschitz bandits and proves the first minimax optimal dynamic regret bounds in terms of a new non-stationary metric, "significant shifts", which follows a similar line of results in the non-stationary K-armed bandits.

Strengths and Weaknesses

Strengths:

  1. The first minimax optimal regret bound for non-stationary Lipschitz bandits with both proven lower and upper bounds.
  2. New techniques involving multi-depth binning strategy for Lipschitz bandits.

Weaknesses:

  1. Lack of experimental validation of the MDBE algorithm.
  2. I think Section 3 is quite heavy in notation and can be better served with a diagram/picture and example to illustrate the process of choosing a depth/arm.
  3. Much of the analysis seems structured similarly to the previous works on learning significant shifts. In particular, the work of Suk and Kpotufe [2023] also uses a multi-depth adaptive discretization idea to address the problem of learning the right discretization level, although it is for a finite-armed problem. The authors could better highlight what are the key differences or technical challenges addressed over the Lipschitz contextual bandit setting.

Questions

Questions/Suggestions:

  1. Why is there a log(T) factor in Definition 1? Is this because of concentration bounds or does it optimize the log dependence in the regret bound?
  2. In the pseudocode of Section B, why reset the master set at each block instead of carrying over information about eliminated regions from previous block to current block?
  3. Again, I believe a diagram or illustrative example would be helpful to explain the bin selection mechanism and replay scheduling.
  4. The dependence on the Lipschitz constant in the regret bound and other generalizations such as Hölder-smooth bandits are missing; some discussion of these extensions would make the paper more thorough.
  5. What is the computational complexity of the algorithm and how does it compare to other algorithms for the non-stationary setting?

Typos:

  1. "almost" in Line 979
  2. Line 596: "discretion"

Limitations

Yes

Final Justification

Authors adequately addressed all my concerns.

Formatting Issues

N/A

Author Response

We thank the reviewer for the careful analysis.

Differences with Suk and Kpotufe (2023)

While it is true that Suk and Kpotufe (2023) also operate in blocks and discretize the context space accordingly, their estimation strategy does not involve multi-scale replays. Once the context is discretized and observed, their analysis focuses on estimating the mean rewards of a fixed and finite set of $K$ actions, using replays of these same actions. Our algorithm must face three fundamental challenges: (i) scheduling (potentially simultaneous) replays at different depths, (ii) leveraging the information collected across different scales, and (iii) tracking the regret contribution from all scales simultaneously. In our regret analysis, each action $x_t \in [0,1]$ belongs to several bins across discretization levels, and we explicitly quantify the regret incurred at each scale. Moreover, to decide which bin to activate at a given depth and time step, we introduce a principled strategy for (a) scheduling replays at specific scales and (b) selecting the appropriate bin at a given active depth through a carefully designed sampling algorithm.

We will incorporate this discussion at the beginning of the proof sketch in Section 5 to clarify the novelty and technical depth of our theoretical contribution.
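To make the multi-scale structure concrete, here is a minimal illustration (ours, not the paper's pseudocode) of how a single action belongs to one bin at every depth of the dyadic discretization:

```python
def bin_index(x: float, depth: int) -> int:
    """Index of the dyadic bin of width 2^{-depth} containing x in [0, 1]
    (a minimal illustration of the hierarchical discretization)."""
    return min(int(x * 2 ** depth), 2 ** depth - 1)

def bins_containing(x: float, max_depth: int):
    """The chain of bins containing x, one per depth: each action belongs to
    one bin at every discretization level simultaneously."""
    return {d: bin_index(x, d) for d in range(1, max_depth + 1)}

print(bins_containing(0.3, 4))  # {1: 0, 2: 1, 3: 2, 4: 4}
```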

$\log(T)$ in the definition of significant shift

While it is true that previous works do not include a logarithmic factor in the definition of a significant shift (e.g., Suk and Kpotufe 2022), we observe that incorporating a $\log(T)$ term has no impact on the regret analysis, while leading to a strictly sharper metric: the number of significant shifts can only decrease with the added log factor.

On restarts at each block

Restarting estimates at the beginning of each block is primarily a theoretical simplification that facilitates a clean application of the doubling trick (see line 625 of our supplementary material). This strategy simplifies the regret decomposition and analysis, at the expense of only a constant numerical factor in the regret bound.

Presentation

We thank the reviewer for pointing out this presentation issue. If the paper is accepted, we will use the additional allowed page in the supplementary material to include visual illustrations of the dyadic tree structure in the main text to improve clarity.

Extension to smooth bandits (same answer as scsD and kQLa)

Our algorithm and analysis can easily be adapted beyond $1$-Lipschitz reward functions and still enjoy the minimax optimal rate. For the sake of space, we give a proof sketch of this generalization (similarly to Appendix G for the multi-dimensional setting).

Our method generalizes to $(\kappa, \beta)$-Hölder mean rewards $(\mu_t)_t$ satisfying:

$$\forall x,x'\in[0,1],\quad |\mu_t(x)-\mu_t(x')| \leq \kappa|x - x'|^\beta.$$

In the stationary setting with horizon $\tau$, the minimax regret is $\tau^{\frac{1+\beta}{1+2\beta}} \kappa^{\frac{1}{1+2\beta}}$, achieved by discretizing into $K_\text{opt}(\tau,\kappa,\beta) = \tau^{\frac{1}{1+2\beta}}\kappa^{\frac{2}{1+2\beta}}$ bins (Kleinberg, 2004). This motivates defining significant regret over $[s_1,s_2]$ as:

$$\sum_{t=s_1}^{s_2} \delta_t(x) \geq \log(T)\,(s_2 - s_1)^{\frac{1+\beta}{1+2\beta}} \kappa^{\frac{1}{1+2\beta}}.$$

We consider blocks of size $\tau_{l,m+1}-\tau_{l,m}=2^{m(1+2\beta)}/\kappa^2$, yielding $2^m$ bins and an expected regret of $2^{m(1+\beta)}$ over this block if the rewards are stable enough. Replays at depth $d$ last $\ell(d) = 2^{d(1+2\beta)}/\kappa^2$ and use $2^d$ bins. The bias for estimating the cumulative reward gap is $4(s_2 - s_1)\kappa/2^{d\beta}$. This motivates the new eviction rule:

$$\max_{B'}\sum_{t=s_1}^{s_2} \hat\delta_t(B',B) > c_0\log(T)\sqrt{(s_2-s_1)2^d} + 4\kappa(s_2 - s_1)/2^{d\beta},$$

where the constant $c_0\approx 20$ is exactly the same as in the $1$-Lipschitz setting.

A replay at depth $d$ starting at round $s$ is scheduled with probability:

$$p_{s,d} = \frac{1}{\kappa} \cdot \frac{2^{d(1+2\beta)/2}}{\sqrt{s - \tau_{l,m}}} \quad (d < m).$$

Each block includes $2^{(1+2\beta)(m-d)/2}$ replays at depth $d$, each incurring regret $2^{((1+2\beta)(m-d) + 2d(1+\beta))/2}$. Summing over all depths $d < m$ gives a total replay regret of $\mathcal{O}(2^{m(1+\beta)})$, matching the optimal bound per block.

The expected replay interval at depth $d$ is (up to numerical constants that do not depend on $\kappa$ and $\beta$) $2^{(1+2\beta)(m+d)/2}$, while shifts of size $2^{-d\beta}$ become detectable within $2^{m(1+\beta)}$ rounds. Each episode of length $t_{l+1}-t_l$ has at most $M_l = \log_2((t_{l+1}-t_l)\kappa)/(1+2\beta)$ blocks, and the regret per episode is upper bounded as

$$\sum_{m=1}^{M_l} 2^{m(1+\beta)} \leq (t_{l+1}-t_l)^{\frac{1+\beta}{1+2\beta}} \kappa^{\frac{1}{1+2\beta}}.$$

Thus, the total dynamic regret of `MBDE` is upper bounded as:

$$\mathbb{E}[R_T] \leq \tilde{\mathcal{O}}\left(T^{\frac{1+\beta}{1+2\beta}} \tilde{L}_T^{\frac{\beta}{1+2\beta}} \kappa^{\frac{1}{1+2\beta}} \right),$$

where $\tilde{\mathcal{O}}(\cdot)$ only hides poly-logarithmic factors and numerical constants that do not depend on $\kappa$ or $\beta$.

With the exact same arguments as in our lower bound proof (Appendix F), we show that this rate is indeed minimax-optimal with respect to $T$, $\tilde{L}_T$, $\kappa$, and $\beta$.
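For instance, specializing the bound above to $\beta = 1$ and $\kappa = 1$ recovers the $1$-Lipschitz rate:

```latex
% Setting beta = 1 and kappa = 1: (1+beta)/(1+2beta) = 2/3 and beta/(1+2beta) = 1/3, so
\mathbb{E}[R_T] \;\leq\; \tilde{\mathcal{O}}\!\left(T^{2/3}\,\tilde{L}_T^{1/3}\right),
% which matches the minimax-optimal rate stated in the abstract.
```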

Numerical results (same answer as VifP, kQLa and 9z8n)

We implemented our algorithm and evaluated its performance. Because of the NeurIPS policy, we cannot share code at this stage but will release it upon acceptance.

We benchmark against two baselines: `BinningUCB (no restart)`, which uses $K(T)\propto T^{1/3}$ bins and never resets, and `BinningUCB (oracle)`, which knows the true change points $\tau_i$ and resets its estimates at the end of each phase. Within each phase, it uses the optimal per-phase discretization $K_i \propto (\tau_{i+1}-\tau_i)^{1/3}$.

We simulate a $1$-Lipschitz, piecewise-linear reward on $[0,1]$ with a peak shifting from $x=0.3$ to $x=0.7$ every $10^5$ rounds over horizon $T=10^6$, inducing $\tilde{L}_T = 10$ significant shifts. Regret is averaged over $100$ runs (rounded to the nearest integer).

| Step (×10³) | `BUCB (no restart)` | `BUCB (oracle)` | `MBDE` (ours) |
|---|---|---|---|
| 10 | 2137 ± 10 | 1551 ± 30 | 1566 ± 108 |
| 50 | 10580 ± 393 | 2794 ± 147 | 8556 ± 349 |
| 100 | 21007 ± 890 | 5596 ± 396 | 17111 ± 667 |

Our method adapts to non-stationarity via replays, significantly outperforming `BUCB (no restart)`.
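For reference, a minimal sketch of the simulated environment (the peak height and the Gaussian noise level below are illustrative assumptions, not the exact values used):

```python
import numpy as np

rng = np.random.default_rng(0)

T = 10**6            # horizon
PHASE = 10**5        # the peak location switches every 10^5 rounds
PEAKS = (0.3, 0.7)   # alternating peak locations

def mean_reward(x: float, t: int) -> float:
    """1-Lipschitz, piecewise-linear reward whose peak alternates between
    x = 0.3 and x = 0.7 every PHASE rounds (peak height 0.9 is an assumption)."""
    peak = PEAKS[(t // PHASE) % 2]
    return max(0.0, 0.9 - abs(x - peak))  # slope 1 => 1-Lipschitz

def pull(x: float, t: int) -> float:
    """Noisy observation of the mean reward (Gaussian noise is an assumption)."""
    return mean_reward(x, t) + rng.normal(scale=0.1)
```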

Computational complexity (same answer as VifP, kQLa and 9z8n)

Our algorithm is feasible for small-scale problems, as shown in experiments, but we acknowledge that its time and memory complexity limits its scalability. We detail the worst-case computational cost below.

In a block of length $\tau_{l,m+1}-\tau_{l,m}=8^m$, the algorithm maintains $\sum_{d=1}^m 2^d = \mathcal{O}(2^m)$ bins. Estimating bin means costs $\mathcal{O}(2^m)$ per round. Additionally, it performs the statistical test $(\star)$ over all pairs of bins $(B_1, B_2)$ at each depth $d$, with $\mathcal{O}(4^d)$ pairs, totaling $\mathcal{O}(4^m)$ across depths. For each pair, it checks all intervals of the form $[s_1, t] \subseteq [\tau_{l,m}, t]$, leading to $\mathcal{O}(8^m)$ possibilities. Therefore, the total per-round worst-case time complexity is $\mathcal{O}(2^m + 4^m \cdot 8^m) = \mathcal{O}(32^m)$, and the per-round worst-case memory complexity is $\mathcal{O}(8^m)$.

Since $m \leq \log(T)$, we conclude that `MBDE` has an overall worst-case time complexity of $\mathcal{O}(T^6)$ and memory of $\mathcal{O}(T^4)$: both polynomial in $T$. For small $m$ (e.g., $m \leq 5$), the runtime is manageable for some problem instances. Designing adaptive, efficient algorithms with optimal regret remains an important open direction, even in $K$-armed settings (Gerogiannis et al., 2024).
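Schematically, the per-round cost comes from a nested scan of this form (a sketch with placeholder estimators `delta_hat` and `threshold`, not our actual implementation):

```python
def eviction_scan(t, block_start, m, delta_hat, threshold):
    """Schematic of the per-round eviction scan within one block.

    delta_hat(d, b1, b2, s1, t) and threshold(d, s1, t) are placeholders for the
    importance-weighted gap estimate and the test threshold; the loop structure
    is what produces the O(4^m * 8^m) per-round worst-case cost."""
    evicted = set()
    for d in range(1, m + 1):                        # depths: 2^d bins each
        for b1 in range(2 ** d):                     # candidate bin to evict
            for b2 in range(2 ** d):                 # comparison bin: O(4^d) pairs
                for s1 in range(block_start, t + 1): # O(8^m) interval starts
                    if delta_hat(d, b1, b2, s1, t) > threshold(d, s1, t):
                        evicted.add((d, b1))
                        break
    return evicted
```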

We will add a paragraph discussing the computational complexity in the main text.

References

  • Kleinberg. "Nearly tight bounds for the continuum-armed bandit problem." NIPS 2004.
  • Suk and Kpotufe. "Tracking most significant shifts in nonparametric contextual bandits." NeurIPS 2023.
  • Gerogiannis, Huang, and Veeravalli. "Is Prior-Free Black-Box Non-Stationary Reinforcement Learning Feasible?" AISTATS 2025.
Comment

Thank you for your response. All my concerns are adequately addressed and so I raise my score to accept.

Review (Rating: 5)

This paper studies the problem of non-stationary Lipschitz bandits, which generalizes both non-stationary bandits and Lipschitz bandits by allowing the reward function to evolve over time while satisfying a Lipschitz condition. The key contribution is MDBE (Multi-Depth Bin Elimination), a novel algorithm that adaptively adjusts the discretization level. The paper establishes the first optimal dynamic regret guarantees for non-stationary bandits with continuous action spaces under Lipschitz continuity.

Strengths and Weaknesses

Strengths:

This paper is very well-written, with a clear motivation for studying non-stationary Lipschitz bandits as a natural generalization of both non-stationary and Lipschitz bandits with potential real-world applications. The proposed algorithm, MDBE, is novel and thoughtfully designed to address the key challenges of adaptivity and continuous action spaces under unknown non-stationarity. The theoretical results are strong. Moreover, the paper does a good job positioning its contribution relative to prior work, especially in how it extends the notion of significant shifts and adapts replay-based elimination techniques to the continuous setting.

Weaknesses:

The paper introduces a three-way trade-off in the introduction, but it's unclear what the three ways are.

Some technical terms, such as “dyadic tree”, may be unfamiliar to a broader audience and could benefit from a more intuitive or visual explanation.

Notations in some key places, such as d in Proposition 1 or δ in Definition 4, could be clarified within the theorem statement to improve readability.

While the paper is theoretically focused, a small simulation or experiment could help validate the regret bounds and provide intuition about the algorithm’s behavior in practice.

Questions

What is the potential for broader applicability to other non-stationary settings of the adaptive discretization strategy? For example, do non-stationary smooth bandits have similar challenges in discretization? If yes, would the strategy proposed in this work be applicable to that problem?

You clearly summarize how your algorithmic contributions relate to prior work. However, the novelty of the proof techniques is less explicitly highlighted. A brief discussion of what is technically new in the analysis would help clarify the theoretical contribution.

Limitations

Yes.

Final Justification

The authors have addressed most of my questions during rebuttal, including numerical results for validation, so I raised the score to a clear accept.

Formatting Issues

No major formatting issues.

Author Response

We thank the reviewer for the feedback.

Three-way tradeoff (same answer as scsD)

The three-way tradeoff refers to the balance our algorithm must strike between: (i) exploration, via replays to detect changes; (ii) exploitation, by choosing bins with the highest estimated rewards; and (iii) the discretization level, that is, choosing the appropriate discretization depth at each time step. This third dimension arises specifically from our setting of Lipschitz mean reward functions that change over time. We will clarify this in Section 1.1.

Presentation and readability

We thank the reviewer for pointing out the flaws in our presentation. We will add figures to enhance comprehension, using the additional allowed page, if our paper is accepted.

Numerical results (same answer as VifP, 9z8n and GmBd)

We implemented our algorithm and evaluated its performance. Because of the NeurIPS policy, we cannot share code at this stage but will release it upon acceptance.

We benchmark against two baselines: `BinningUCB (no restart)`, which uses $K(T)\propto T^{1/3}$ bins and never resets, and `BinningUCB (oracle)`, which knows the true change points $\tau_i$ and resets its estimates at the end of each phase. Within each phase, it uses the optimal per-phase discretization $K_i \propto (\tau_{i+1}-\tau_i)^{1/3}$.

We simulate a $1$-Lipschitz, piecewise-linear reward on $[0,1]$ with a peak shifting from $x=0.3$ to $x=0.7$ every $10^5$ rounds over horizon $T=10^6$, inducing $\tilde{L}_T = 10$ significant shifts. Regret is averaged over $100$ runs (rounded to the nearest integer).

| Step (×10³) | `BUCB (no restart)` | `BUCB (oracle)` | `MBDE` (ours) |
|---|---|---|---|
| 10 | 2137 ± 10 | 1551 ± 30 | 1566 ± 108 |
| 50 | 10580 ± 393 | 2794 ± 147 | 8556 ± 349 |
| 100 | 21007 ± 890 | 5596 ± 396 | 17111 ± 667 |

Our method adapts to non-stationarity via replays, significantly outperforming `BUCB (no restart)`.
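For completeness, a minimal sketch of the oracle baseline (the UCB bonus and the `pull(x, t)` reward oracle are illustrative placeholders, not our exact implementation):

```python
import numpy as np

def binning_ucb_oracle(change_points, T, pull, c=1.0):
    """Sketch of the oracle baseline: it resets at the known change points tau_i
    and uses the optimal per-phase discretization K_i ~ (tau_{i+1} - tau_i)^(1/3).
    The UCB bonus c * sqrt(log(T) / n) is an illustrative choice."""
    starts = [0] + list(change_points)
    ends = list(change_points) + [T]
    for start, end in zip(starts, ends):
        K = max(1, round((end - start) ** (1 / 3)))   # per-phase number of bins
        centers = (np.arange(K) + 0.5) / K            # play the center of each bin
        counts, sums = np.zeros(K), np.zeros(K)
        for t in range(start, end):
            n = np.maximum(counts, 1)
            ucb = np.where(counts > 0, sums / n + c * np.sqrt(np.log(T) / n), np.inf)
            b = int(np.argmax(ucb))                   # unvisited bins are tried first
            counts[b] += 1
            sums[b] += pull(centers[b], t)            # pull(x, t): reward oracle
```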

Computational complexity (same answer as VifP, 9z8n and GmBd)

Our algorithm is feasible for small-scale problems, as shown in experiments, but we acknowledge that its time and memory complexity limits its scalability. We detail the worst-case computational cost below.

In a block of length $\tau_{l,m+1}-\tau_{l,m}=8^m$, the algorithm maintains $\sum_{d=1}^m 2^d = \mathcal{O}(2^m)$ bins. Estimating bin means costs $\mathcal{O}(2^m)$ per round. Additionally, it performs the statistical test $(\star)$ over all pairs of bins $(B_1, B_2)$ at each depth $d$, with $\mathcal{O}(4^d)$ pairs, totaling $\mathcal{O}(4^m)$ across depths. For each pair, it checks all intervals of the form $[s_1, t] \subseteq [\tau_{l,m}, t]$, leading to $\mathcal{O}(8^m)$ possibilities. Therefore, the total per-round worst-case time complexity is $\mathcal{O}(2^m + 4^m \cdot 8^m) = \mathcal{O}(32^m)$, and the per-round worst-case memory complexity is $\mathcal{O}(8^m)$.

Since $m \leq \log(T)$, we conclude that `MBDE` has an overall worst-case time complexity of $\mathcal{O}(T^6)$ and memory of $\mathcal{O}(T^4)$: both polynomial in $T$. For small $m$ (e.g., $m \leq 5$), the runtime is manageable for some problem instances. Designing adaptive, efficient algorithms with optimal regret remains an important open direction, even in $K$-armed settings (Gerogiannis et al., 2024).

We will add a paragraph discussing the computational complexity in the main text.

Extension to smooth bandits (same answer as scsD and GmBd)

Our algorithm and analysis can easily be adapted beyond $1$-Lipschitz reward functions and still enjoy the minimax optimal rate. For the sake of space, we give a proof sketch of this generalization (similarly to Appendix G for the multi-dimensional setting).

Our method generalizes to $(\kappa, \beta)$-Hölder mean rewards $(\mu_t)_t$ satisfying:

$$\forall x,x'\in[0,1],\quad |\mu_t(x)-\mu_t(x')| \leq \kappa|x - x'|^\beta.$$

In the stationary setting with horizon $\tau$, the minimax regret is $\tau^{\frac{1+\beta}{1+2\beta}} \kappa^{\frac{1}{1+2\beta}}$, achieved by discretizing into $K_\text{opt}(\tau,\kappa,\beta) = \tau^{\frac{1}{1+2\beta}}\kappa^{\frac{2}{1+2\beta}}$ bins (Kleinberg, 2004). This motivates defining significant regret over $[s_1,s_2]$ as:

$$\sum_{t=s_1}^{s_2} \delta_t(x) \geq \log(T)\,(s_2 - s_1)^{\frac{1+\beta}{1+2\beta}} \kappa^{\frac{1}{1+2\beta}}.$$

We consider blocks of size $\tau_{l,m+1}-\tau_{l,m}=2^{m(1+2\beta)}/\kappa^2$, yielding $2^m$ bins and an expected regret of $2^{m(1+\beta)}$ over this block if the rewards are stable enough. Replays at depth $d$ last $\ell(d) = 2^{d(1+2\beta)}/\kappa^2$ and use $2^d$ bins. The bias for estimating the cumulative reward gap is $4(s_2 - s_1)\kappa/2^{d\beta}$. This motivates the new eviction rule:

$$\max_{B'}\sum_{t=s_1}^{s_2} \hat\delta_t(B',B) > c_0\log(T)\sqrt{(s_2-s_1)2^d} + 4\kappa(s_2 - s_1)/2^{d\beta},$$

where the constant $c_0\approx 20$ is exactly the same as in the $1$-Lipschitz setting.

A replay at depth $d$ starting at round $s$ is scheduled with probability:

$$p_{s,d} = \frac{1}{\kappa} \cdot \frac{2^{d(1+2\beta)/2}}{\sqrt{s - \tau_{l,m}}} \quad (d < m).$$

Each block includes $2^{(1+2\beta)(m-d)/2}$ replays at depth $d$, each incurring regret $2^{((1+2\beta)(m-d) + 2d(1+\beta))/2}$. Summing over all depths $d < m$ gives a total replay regret of $\mathcal{O}(2^{m(1+\beta)})$, matching the optimal bound per block.

The expected replay interval at depth $d$ is (up to numerical constants that do not depend on $\kappa$ and $\beta$) $2^{(1+2\beta)(m+d)/2}$, while shifts of size $2^{-d\beta}$ become detectable within $2^{m(1+\beta)}$ rounds. Each episode of length $t_{l+1}-t_l$ has at most $M_l = \log_2((t_{l+1}-t_l)\kappa)/(1+2\beta)$ blocks, and the regret per episode is upper bounded as

$$\sum_{m=1}^{M_l} 2^{m(1+\beta)} \leq (t_{l+1}-t_l)^{\frac{1+\beta}{1+2\beta}} \kappa^{\frac{1}{1+2\beta}}.$$

Thus, the total dynamic regret of `MBDE` is upper bounded as:

$$\mathbb{E}[R_T] \leq \tilde{\mathcal{O}}\left(T^{\frac{1+\beta}{1+2\beta}} \tilde{L}_T^{\frac{\beta}{1+2\beta}} \kappa^{\frac{1}{1+2\beta}} \right),$$

where $\tilde{\mathcal{O}}(\cdot)$ only hides poly-logarithmic factors and numerical constants that do not depend on $\kappa$ or $\beta$.

With the exact same arguments as in our lower bound proof (Appendix F), we show that this rate is indeed minimax-optimal with respect to $T$, $\tilde{L}_T$, $\kappa$, and $\beta$.
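For concreteness, a minimal sketch of the replay scheduler implied by $p_{s,d}$ (clipping the probability to $[0,1]$ is an implementation detail we assume):

```python
import numpy as np

rng = np.random.default_rng(0)

def replay_prob(s, block_start, d, kappa, beta):
    """p_{s,d} = (1/kappa) * 2^{d(1+2*beta)/2} / sqrt(s - block_start),
    clipped to [0, 1] (the clipping is an assumption for this sketch)."""
    if s <= block_start:
        return 0.0
    p = (1.0 / kappa) * 2 ** (d * (1 + 2 * beta) / 2) / np.sqrt(s - block_start)
    return min(1.0, p)

def schedule_replays(s, block_start, m, kappa, beta):
    """Independently decide, for each depth d < m, whether a replay starts at round s."""
    return [d for d in range(1, m)
            if rng.random() < replay_prob(s, block_start, d, kappa, beta)]
```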

Technical novelty (same answer as VifP)

The main challenge in our setting lies in estimating a continuous, non-stationary mean reward function without prior knowledge of the magnitude of distributional shifts—or, equivalently, the discretization level and time scale at which such shifts become detectable. Coarse discretizations are essential to rapidly detect large shifts, while finer discretizations are required to capture smaller yet statistically significant changes. This motivates a multi-scale replay framework in which each scale contributes adaptively based on the magnitude of the underlying shift.

To address this challenge, we propose a hierarchical replay mechanism spanning multiple scales, where each scale corresponds to a different discretization of the action space $[0,1]$. In contrast, prior works that track significant shifts operate in the $K$-armed setting, where the action space remains fixed and finite even when contexts are continuous. Their replay mechanisms and regret analyses therefore rely on a fixed set of arms, without the need to estimate or track reward changes across multiple scales.

Our algorithm must face three fundamental challenges: (i) scheduling (potentially simultaneous) replays at different depths, (ii) leveraging the information collected across different scales, and (iii) tracking the regret contribution from all scales simultaneously. In our regret analysis, each action $x_t \in [0,1]$ belongs to several bins across discretization levels, and we explicitly quantify the regret incurred at each scale. Moreover, to decide which bin to activate at a given depth and time step, we introduce a principled strategy for (a) scheduling replays at specific scales and (b) selecting the appropriate bin at a given active depth through a carefully designed sampling algorithm.

We will incorporate this discussion at the beginning of the proof sketch in Section 5 to clarify the novelty and technical depth of our theoretical contribution.

References

  • Kleinberg. "Nearly tight bounds for the continuum-armed bandit problem." NIPS 2004.
Comment

Thanks for the clarification and the newly added numerical results. It would be helpful to include a brief takeaway or interpretation of the simulation—what should we learn from the comparison of the numbers? Additionally, I think reporting the slope of the log-log plot (though I understand the plot cannot be shown here due to the policy) could provide further support for your regret bound.

Comment

Thanks for the reply and the helpful suggestions!

The reported results show the cumulative dynamic regret, averaged over 100 independent runs (we show mean ± 95% confidence interval). The main takeaway is that our algorithm, `MBDE`, can be implemented (at least on toy numerical experiments) and successfully adapts to non-stationarity. This is evidenced by its significantly lower cumulative regret compared to `BinningUCB (no restart)`, which does not reset its estimates and thus struggles in changing Lipschitz bandit environments.

In contrast, `BinningUCB (oracle)` represents an idealized baseline with perfect oracle knowledge of when to reset estimates. While our method does not match this oracle performance, it shows clear improvements over the non-adaptive baseline and offers a principled approach to handling non-stationarity without requiring this oracle information.

We agree that reporting the log-log plot would provide further insight on the regret scaling; we will include this in the revised version of the paper if accepted.

Comment

Thanks for your response. I'm happy to raise the score to a clear accept.

Review (Rating: 5)

This manuscript introduces the first algorithm with optimal regret guarantees for continuum-armed bandits under time-varying, Lipschitz-continuous reward functions. The key idea is to focus on significant shifts, which meaningfully affect regret. The proposed algorithm, Multi-Depth Bin Elimination (MDBE), adaptively explores a dyadic partition of the action space and uses randomized replays to detect shifts and prune suboptimal regions. Without knowing the number or location of changes, MDBE achieves minimax-optimal regret. However, there are several aspects that should be addressed. My main concern lies in the comparison with prior work and the justification for adopting the new metric. The detailed comments are listed below.

Strengths and Weaknesses

Strengths:

  1. Addresses a previously unstudied intersection of settings: The paper considers non-stationary bandits with a continuous (Lipschitz) action space, a combination that had not been analyzed in depth before. While both components—non-stationarity and Lipschitz structure—are individually well studied, their interaction introduces new technical challenges that are not addressed by existing work.
  2. Technically non-trivial adaptation of known methods: Although one might imagine combining known ideas—such as time-scale decomposition (as in Wang or Besbes–Zeevi) with Kleinberg-style discretization—this paper shows that a naïve hybrid is insufficient. The core difficulty lies in adaptively detecting global, cumulative changes over a continuous space without knowing when or how these changes occur. The algorithm's hierarchical, replay-based design and multi-scale gap estimation address this challenge in a subtle and well-justified way.
  3. Theoretical guarantees match minimax lower bounds: The MDBE algorithm achieves a regret bound of $\widetilde{O}(\tilde{L}_T^{1/3} T^{2/3})$, matching the minimax lower bound up to logarithmic factors. Crucially, this is done without requiring prior knowledge of the number or locations of significant shifts. The framework is also robust to small or inconsequential fluctuations, making the bound potentially sharper than those based on classical variation metrics.

Weaknesses:

Summary of the Paper: "Non-Stationary Lipschitz Bandits"

This paper investigates the problem of non-stationary bandits with a continuous action space and Lipschitz reward functions that can change arbitrarily over time. Unlike traditional non-stationary bandit approaches, which often assume a discrete action set and require prior knowledge of the non-stationarity level, this work addresses the more challenging setting of an infinite arm space under Lipschitz continuity without any such assumptions.

The authors propose a novel algorithm, MDBE (Multi-Depth Bin Elimination), which adaptively tracks a refined notion of non-stationarity called significant shifts—large aggregate changes in regret that are statistically meaningful. MDBE uses a hierarchical discretization of the action space and a replay mechanism to detect these significant shifts at different resolutions efficiently.

The main result is that MDBE achieves minimax-optimal dynamic regret of order $\widetilde{O}(\tilde{L}^{1/3} T^{2/3})$, where $\tilde{L}$ is the number of significant shifts and $T$ is the time horizon. This regret bound matches known lower bounds and does not rely on any prior knowledge of the number or timing of shifts. The paper provides theoretical analysis, a detailed algorithmic design, and high-probability guarantees.

Strengths:

  1. Novelty and Scope: The paper pioneers the study of non-stationary bandits in a continuous action setting under Lipschitz continuity—a previously unexplored combination—and introduces a theoretically grounded notion of significant shifts to this context.
  2. Strong Theoretical Guarantees: The algorithm achieves minimax-optimal regret bounds without knowing the degree of non-stationarity in advance. This is a substantial theoretical contribution that strengthens the relevance of the proposed method.
  3. Sophisticated Algorithm Design: MDBE incorporates a multi-resolution replay mechanism and hierarchical bin elimination, reflecting a careful treatment of exploration vs. exploitation in a non-stationary continuous setting. The sampling and bin eviction strategies are technically sound and rigorously justified.

Weaknesses:

  1. Lack of Empirical Validation: The paper is entirely theoretical and does not provide any simulations or experiments to demonstrate how the algorithm performs in practice. This limits its immediate impact for practitioners.
  2. Complexity of the Algorithm: MDBE's design is intricate, involving nested partitions, probabilistic replays, and importance-weighted estimators. While theoretically elegant, the practicality and implementation complexity may hinder adoption.
  3. Computational Concerns Unaddressed: The authors acknowledge that computational efficiency is not considered. Given the algorithm's multi-scale structure, it is unclear whether MDBE is computationally feasible for large-scale or real-time applications.

Questions

  1. In the related work part, when the authors compare [Suk and Kim 2025], it is mentioned that [Suk and Kim 2025] assumes no regularity across arms and is thus not applicable to this manuscript. This statement seems too vague to me. I suggest the authors put more effort into elaborating the key differences between this paper and prior works.
  2. The notation may be confusing. The variable $x$ in this manuscript denotes the arm/action. However, in the bandit literature, $x$ usually denotes the context and $a$ denotes the arm/action. I would suggest the authors make some effort to change it, though this will not affect my scoring.
  3. In Definition 2, it is confusing whether $[s_1, s_2] \subseteq [\tau_{i-1}, \tau_i]$ depends on $x$ or exists for all $x$.
  4. At the same time, I would like to ask the authors to compare the difference in the definition of significant shift with [Suk and Kim 2025], which is not cited in Definition 2.
  5. In the meantime, I also find
  6. For Section 2.2, though the comparison makes sense mathematically, I would also like to see a concrete example showing that using the $\tilde{L}_T$ metric is statistically and significantly more efficient. This would strengthen the motivation for adopting $\tilde{L}_T$ and also show that the bounds in Section 4 can be tighter than those based on other metrics.
  7. In the proof of Proposition 3, the notation $f_{t-1}$ should also indicate the condition $x_t \in B$. Thus, $f_{t-1,B}$ is a better representation.

Limitations

Adequate.

Final Justification

After reading the discussion and the rebuttal, I decided to update my score.

Formatting Issues

None

Author Response

We thank the reviewer for the feedback.

Numerical results (same answer as VifP, kQLa and GmBd)

We implemented our algorithm and evaluated its performance. Because of the NeurIPS policy, we cannot share code at this stage but will release it upon acceptance.

We benchmark against two baselines: `BinningUCB (no restart)`, which uses $K(T)\propto T^{1/3}$ bins and never resets, and `BinningUCB (oracle)`, which knows the true change points $\tau_i$ and resets its estimates at the end of each phase. Within each phase, it uses the optimal per-phase discretization $K_i \propto (\tau_{i+1}-\tau_i)^{1/3}$.

We simulate a $1$-Lipschitz, piecewise-linear reward on $[0,1]$ with a peak shifting from $x=0.3$ to $x=0.7$ every $10^5$ rounds over horizon $T=10^6$, inducing $\tilde{L}_T = 10$ significant shifts. Regret is averaged over $100$ runs (rounded to the nearest integer).

| Step (×10³) | `BUCB (no restart)` | `BUCB (oracle)` | `MBDE` (ours) |
|---|---|---|---|
| 10 | 2137 ± 10 | 1551 ± 30 | 1566 ± 108 |
| 50 | 10580 ± 393 | 2794 ± 147 | 8556 ± 349 |
| 100 | 21007 ± 890 | 5596 ± 396 | 17111 ± 667 |

Our method adapts to non-stationarity via replays, significantly outperforming `BUCB (no restart)`.

Computational complexity (same answer as VifP, kQLa and GmBd)

Our algorithm is feasible for small-scale problems, as shown in experiments, but we acknowledge that its time and memory complexity limits its scalability. We detail the worst-case computational cost below.

In a block of length $\tau_{l,m+1}-\tau_{l,m}=8^m$, the algorithm maintains $\sum_{d=1}^m 2^d = \mathcal{O}(2^m)$ bins. Estimating bin means costs $\mathcal{O}(2^m)$ per round. Additionally, it performs the statistical test $(\star)$ over all pairs of bins $(B_1, B_2)$ at each depth $d$, with $\mathcal{O}(4^d)$ pairs, totaling $\mathcal{O}(4^m)$ across depths. For each pair, it checks all intervals of the form $[s_1, t] \subseteq [\tau_{l,m}, t]$, leading to $\mathcal{O}(8^m)$ possibilities. Therefore, the total per-round worst-case time complexity is $\mathcal{O}(2^m + 4^m \cdot 8^m) = \mathcal{O}(32^m)$, and the per-round worst-case memory complexity is $\mathcal{O}(8^m)$.

Since $m \leq \log(T)$, we conclude that `MBDE` has an overall worst-case time complexity of $\mathcal{O}(T^6)$ and memory of $\mathcal{O}(T^4)$: both polynomial in $T$. For small $m$ (e.g., $m \leq 5$), the runtime is manageable for some problem instances. Designing adaptive, efficient algorithms with optimal regret remains an important open direction, even in $K$-armed settings (Gerogiannis et al., 2024).

We will add a paragraph discussing the computational complexity in the main text.

Differences with Suk and Kim (2025)

Suk and Kim (2025) study the non-stationary version of infinite-armed bandits with a reservoir distribution: at each round $t$, the learner selects an action $a_t \in [0,1]$ and observes a noisy reward $Y_t(a_t)$ with mean $\mu_t(a_t)$, where $\mu_t(a_t)$ itself is drawn from a time-varying distribution $\nu_t(a_t)$. Crucially, this reservoir distribution $\nu_t(a_t)$ evolves over time depending on the learner's past actions. Their primary assumption concerns the proportion of arms with low expected rewards, but they do not impose any smoothness or regularity conditions on the reward function $\mu_t(\cdot)$. In other words, similar actions do not necessarily yield similar expected rewards, so there is no structure to exploit.

This lack of smoothness means that exploration strategies cannot benefit from generalizing across nearby arms, and thus do not employ discretization of the action space. Instead, their algorithms rely on subsampling techniques tailored to handle the complexity of the reservoir distributions. Due to these differences, their setting and methods relate more closely to rotting bandits literature, which deals with non-stationary rewards that evolve over time in ways independent of spatial smoothness.

We will make it clear in our related work subsection.

Notations for actions

While we agree that in $K$-armed contextual bandit settings it is standard to denote the context by $x$ and the action by $a$, this convention does not necessarily carry over to the Lipschitz bandits literature. In fact, denoting actions by $x$ is common in this literature, e.g., in Bubeck et al. (2011) and Kleinberg et al. (2019).

Clarifications on the definition of significant shift

  • The interval $[s_1, s_2[$ depends on $x$: each arm may incur significant regret over a different interval. A significant shift is said to occur when all arms have experienced such significant regret over their respective intervals.
  • This notion of significant regret is indeed derived from previous work by Suk and coauthors: across all settings (including ours), it refers to an interval over which the regret of an arm exceeds the minimax rate for that setting. In our case, this threshold is $\tilde{O}(t^{1/3})$. Consequently, the definition of a significant shift remains consistent across all settings that adopt this significant shift framework.

Concrete benefit of the significant shift metric (same answer as VifP)

Consider a recommendation system with a continuous pool of content (e.g., movies), indexed by $x \in [0,1]$, where nearby values of $x$ correspond to similar content types. Suppose there exist two regions of high and comparable user preference, centered around $x_1 = 0.3$ and $x_2 = 0.7$. Imagine a scenario where preferences near $x_1$ remain stable over time (e.g., a timeless classic), while preferences near $x_2$ undergo very frequent changes (e.g., a trending topic that evolves daily).

In this case, an algorithm that consistently recommends content near $x_1$ would incur little to no regret, even though the underlying reward function changes frequently in other regions. From a global perspective, the number of changes or the total variation in mean reward could be as large as $L_T = V_T = \mathcal{O}(T)$. An algorithm that relies solely on such metrics would unnecessarily restart its estimates too frequently, as it would overestimate the effective difficulty of the problem. This leads to both theoretical suboptimality and practical inefficiency. However, under our framework, such changes do not constitute a significant shift. This example shows a key strength of our metric: it captures only those changes that are statistically consequential to learning, rather than indiscriminately counting all shifts in the environment.

We will include this example at the end of Section 2.2 to further clarify the motivation for our definition.
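A toy instantiation of this example (our own illustrative construction):

```python
import numpy as np

def mean_reward(x: float, t: int) -> float:
    """Toy 1-Lipschitz reward for the recommendation example: a stable peak
    near x1 = 0.3 and a region near x2 = 0.7 whose height fluctuates at every
    round t (the specific peak heights are illustrative assumptions)."""
    stable = max(0.0, 0.8 - abs(x - 0.3))             # "timeless classic": never changes
    trending_height = 0.7 + 0.05 * np.sin(t)          # changes at every single round
    trending = max(0.0, trending_height - abs(x - 0.7))
    return max(stable, trending)

# The function changes at every round, so L_T and V_T grow linearly with T,
# yet always playing x = 0.3 incurs no regret: none of these fluctuations
# constitutes a significant shift.
```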

Notation $f_{t-1, B}$

True, thank you for pointing this!

References

  • Bubeck, Munos, Stoltz, and Szepesvári. "X-Armed Bandits." JMLR 2011.
  • Kleinberg, Slivkins, and Upfal. "Bandits and experts in metric spaces." Journal of the ACM, 2019.
  • Suk and Kim. "Tracking most significant shifts in infinite-armed bandits." ICML 2025.
Comment

Thank you for the responses. I maintain my positive view of the paper.

Review (Rating: 5)

This paper studies a fundamental variant of the multi-armed bandit problem in which the mean reward function is Lipschitz in space ($[0,1]$) and non-stationary in time. The paper considers the notion of significant changes — a refined measure of temporal non-stationarity that generalizes earlier notions like total variation or number of abrupt changes. The authors develop an algorithm that adaptively discretizes the arm space. They achieve a minimax optimal dynamic regret of $\widetilde{L}^{1/3} T^{2/3}$, where $\widetilde{L}$ is the number of significant shifts.

Strengths and Weaknesses

Strengths

  • Fundamental problem: The setting of spatially Lipschitz and temporally non-stationary bandits is both natural and under-explored.
  • Novel technical contributions: A discretization strategy based on dyadic trees to efficiently monitor changes across scales; elimination and replay mechanisms tailored to the continuous setting.
  • Theoretical guarantees: The $\widetilde{L}^{1/3} T^{2/3}$ regret is minimax optimal (in $T$ and $\tilde L$, at least).

Weakness: The results are stated only for Lipschitz constant $L=1$. The work would be more complete if the authors could incorporate $L$ and compare the dependence with stationary Lipschitz bandits.

minor comments:

  • The phrase "locally stationary" phases (line 52) is unclear. Please provide a more precise definition or intuition.
  • The “three-way” trade-off mentioned under forced exploration (Section 3.2) appears to be more of a two-way trade-off: frequency of restarts and discretization granularity. Please clarify the third dimension.

Questions

Do the results hold in higher dimensions?

Limitations

see above

Final Justification

I will maintain my initial evaluation.

Formatting Issues

none

Author Response

We thank the reviewer for the feedback.

Extension to other Lipschitz constants (same answer as kQLa and GmBd)

Our algorithm and analysis can easily be adapted beyond $1$-Lipschitz reward functions and still enjoy the minimax optimal rate. For the sake of space, we give a proof sketch of this generalization (similarly to Appendix G for the multi-dimensional setting).

Our method generalizes to $(\kappa, \beta)$-Hölder mean rewards $(\mu_t)_t$ satisfying:

$$\forall x,x'\in[0,1],\quad |\mu_t(x)-\mu_t(x')| \leq \kappa|x - x'|^\beta.$$

In the stationary setting with horizon $\tau$, the minimax regret is $\tau^{\frac{1+\beta}{1+2\beta}} \kappa^{\frac{1}{1+2\beta}}$, achieved by discretizing into $K_\text{opt}(\tau,\kappa,\beta) = \tau^{\frac{1}{1+2\beta}}\kappa^{\frac{2}{1+2\beta}}$ bins (Kleinberg, 2004). This motivates defining significant regret over $[s_1,s_2]$ as:

$$\sum_{t=s_1}^{s_2} \delta_t(x) \geq \log(T)\,(s_2 - s_1)^{\frac{1+\beta}{1+2\beta}} \kappa^{\frac{1}{1+2\beta}}.$$

We consider blocks of size $\tau_{l,m+1}-\tau_{l,m}=2^{m(1+2\beta)}/\kappa^2$, yielding $2^m$ bins and an expected regret of $2^{m(1+\beta)}$ over this block if the rewards are stable enough. Replays at depth $d$ last $\ell(d) = 2^{d(1+2\beta)}/\kappa^2$ and use $2^d$ bins. The bias for estimating the cumulative reward gap is $4(s_2 - s_1)\kappa/2^{d\beta}$. This motivates the new eviction rule:

$$\max_{B'}\sum_{t=s_1}^{s_2} \hat\delta_t(B',B) > c_0\log(T)\sqrt{(s_2-s_1)2^d} + 4\kappa(s_2 - s_1)/2^{d\beta},$$

where the constant $c_0\approx 20$ is exactly the same as in the $1$-Lipschitz setting.

A replay at depth $d$ starting at round $s$ is scheduled with probability:

$$p_{s,d} = \frac{1}{\kappa} \cdot \frac{2^{d(1+2\beta)/2}}{\sqrt{s - \tau_{l,m}}} \quad (d < m).$$

Each block includes $2^{(1+2\beta)(m-d)/2}$ replays at depth $d$, each incurring regret $2^{((1+2\beta)(m-d) + 2d(1+\beta))/2}$. Summing over all depths $d < m$ gives a total replay regret of $\mathcal{O}(2^{m(1+\beta)})$, matching the optimal bound per block.

The expected replay interval at depth $d$ is (up to numerical constants that do not depend on $\kappa$ and $\beta$) $2^{(1+2\beta)(m+d)/2}$, while shifts of size $2^{-d\beta}$ become detectable within $2^{m(1+\beta)}$ rounds. Each episode of length $t_{l+1}-t_l$ has at most $M_l = \log_2((t_{l+1}-t_l)\kappa)/(1+2\beta)$ blocks, and the regret per episode is upper bounded as

$$\sum_{m=1}^{M_l} 2^{m(1+\beta)} \leq (t_{l+1}-t_l)^{\frac{1+\beta}{1+2\beta}} \kappa^{\frac{1}{1+2\beta}}.$$

Thus, the total dynamic regret of `MBDE` is upper bounded as:

$$\mathbb{E}[R_T] \leq \tilde{\mathcal{O}}\left(T^{\frac{1+\beta}{1+2\beta}} \tilde{L}_T^{\frac{\beta}{1+2\beta}} \kappa^{\frac{1}{1+2\beta}} \right),$$

where $\tilde{\mathcal{O}}(\cdot)$ only hides poly-logarithmic factors and numerical constants that do not depend on $\kappa$ or $\beta$.

With the exact same arguments as in our lower bound proof (Appendix F), we show that this rate is indeed minimax-optimal with respect to $T$, $\tilde{L}_T$, $\kappa$, and $\beta$.
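For concreteness, a minimal sketch of the block and replay lengths implied by these formulas (the function names are ours):

```python
def block_length(m: int, kappa: float, beta: float) -> float:
    """Length of the m-th block in the Hölder variant: 2^{m(1+2*beta)} / kappa^2."""
    return 2 ** (m * (1 + 2 * beta)) / kappa ** 2

def replay_length(d: int, kappa: float, beta: float) -> float:
    """Duration of a replay at depth d: 2^{d(1+2*beta)} / kappa^2, played on 2^d bins."""
    return 2 ** (d * (1 + 2 * beta)) / kappa ** 2

# Sanity check: beta = 1, kappa = 1 recovers the 1-Lipschitz schedule
# (blocks of length 8^m with 2^m bins, replays of length 8^d with 2^d bins).
assert block_length(3, kappa=1.0, beta=1.0) == 8 ** 3
```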

Locally stationary

By locally stationary phases, we mean time intervals over which the reward function does not change. We thank the reviewer for pointing this out, and we will clarify it in the final version.

Three-way tradeoff (same answer as kQLa)

The three-way tradeoff refers to the balance our algorithm must strike between: (i) exploration, via replays to detect changes; (ii) exploitation, by choosing bins with the highest estimated rewards; and (iii) the discretization level, that is, choosing the appropriate discretization depth at each time step. This third dimension arises specifically from our setting of Lipschitz mean reward functions that change over time. We will clarify this in Section 1.1.

Extension to multi-dimension

Yes, we showed how to adapt our algorithm and analysis to a $p$-dimensional action space in Appendix G.

References

  • Kleinberg. "Nearly tight bounds for the continuum-armed bandit problem." NIPS 2004.
Comment

Thank you for the thoughtful response. I will maintain my score.

Review (Rating: 4)

This paper considers a fundamental online learning problem of non-stationary Lipschitz bandits. In particular, the notion of non-stationarity is new, different from the previously studied ones $L_T$ and $V_T$, both of which could be very large even if the best arm is unchanged. This new notion, called significant shifts, was introduced recently in other bandit problems, and the authors propose a Multi-Depth Bin Elimination algorithm to achieve optimal regret for Lipschitz bandits under significant shifts.

Strengths and Weaknesses

Strengths

  • This paper tackles a theoretically fundamental problem in bandits with infinite continuum actions and a new notion of non-stationarity.
  • The results are strong, with optimal rates.

Weaknesses

  • The presentation of the algorithm is not very clear: there are two algorithms for separate submodules, but no overall algorithm is presented in an algorithm environment.
  • The intuition given for the proof of the theorem also seems incomplete. It would be nice to have a more high-level proof framework connecting the dots before going into the details. A schematic illustration would also be very helpful. With the current presentation, it is hard to confidently evaluate the correctness of the proof.
  • It appears that the main algorithm design is a combination of the binning/elimination of seminal works and the replay mechanism of the recent work Suk and Kpotufe [2022] (the same authors who also proposed significant shifts). As a result, it is not very evident how much of the theoretical contribution is conditioned on these two.

Minor

  • A few typos such as [ and ] in line 127, line 182.

Questions

  • The difference of significant shifts is still quite abstract. Can the authors provide some more concrete examples to help understanding? And it is more ideal if such example has strong practical motivation/relevance.
  • Can the authors provide a more exact characterization of $c_0$ as mentioned in the footnote? How does it affect the algorithm performance?
  • Can the authors elaborate more on the unique challenges presented in the analysis of this problem, in particular, compared to previous work by Suk and coauthors?
  • Despite no numerical implementation, can the authors at least analyze the computation complexity? This would give a better sense of the feasibility of proposed approach.

Limitations

  • No numerical experiments as acknowledged by the authors that this is pure theoretical work.

Final Justification

I thank the authors’ response in their technical novelty and soundness and maintain my positive recommendation.

Formatting Issues

N/A

Author Response

We thank the reviewer for the comment.

Presentation

We appreciate the feedback regarding the presentation and will improve it by including an illustrative figure of the sampling scheme (Algorithm 2) in the main text, using the space provided in the additional page if our paper is accepted.

Concrete benefits of the significant shift metric (same answer as 9z8n)

Consider a recommendation system with a continuous pool of content (e.g., movies), indexed by $x \in [0,1]$, where nearby values of $x$ correspond to similar content types. Suppose there exist two regions of high and comparable user preference, centered around $x_1 = 0.3$ and $x_2 = 0.7$. Imagine a scenario where preferences near $x_1$ remain stable over time (e.g., a timeless classic), while preferences near $x_2$ undergo very frequent changes (e.g., a trending topic that evolves daily).

In this case, an algorithm that consistently recommends content near $x_1$ would incur little to no regret, even though the underlying reward function changes frequently in other regions. From a global perspective, the number of changes or the total variation in mean reward could be as large as $L_T = V_T = \mathcal{O}(T)$. An algorithm that relies solely on such metrics would unnecessarily restart its estimates too frequently, as it would overestimate the effective difficulty of the problem. This leads to both theoretical suboptimality and practical inefficiency. However, under our framework, such changes do not constitute a significant shift. This example shows a key strength of our metric: it captures only those changes that are statistically consequential to learning, rather than indiscriminately counting all shifts in the environment.

We will include this example at the end of Section 2.2 to further clarify the motivation for our definition.

Constant $c_0$

The constant $c_0$ is given in our theoretical analysis (see line 568 in the supplementary material): $c_0 = 3 + 7(e - 1)\sqrt{2} \approx 20$. This constant is derived from the confidence bounds of Proposition 2 and ensures that the test procedure $(\star)$ used to trigger bin evictions is theoretically valid. However, this theoretically justified choice leads to conservative behavior in practice: bins are evicted only when the statistical evidence is strong. To balance theory with empirical performance, we use a smaller constant in our numerical experiments (see the numerical results paragraph in this rebuttal, where we specifically choose $c_0 = 1$), which leads to faster elimination and better regret in practice.
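For reference, a one-line numerical check of this value:

```python
import math

c0 = 3 + 7 * (math.e - 1) * math.sqrt(2)
print(round(c0, 2))  # 20.01, i.e. c0 ≈ 20
```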

Unique challenges in our analysis (same answer as kQLa)

The main challenge in our setting lies in estimating a continuous, non-stationary mean reward function without prior knowledge of the magnitude of distributional shifts—or, equivalently, the discretization level and time scale at which such shifts become detectable. Coarse discretizations are essential to rapidly detect large shifts, while finer discretizations are required to capture smaller yet statistically significant changes. This motivates a multi-scale replay framework in which each scale contributes adaptively based on the magnitude of the underlying shift.

To address this challenge, we propose a hierarchical replay mechanism spanning multiple scales, where each scale corresponds to a different discretization of the action space $[0,1]$. In contrast, prior works that track significant shifts operate in the $K$-armed setting, where the action space remains fixed and finite even when contexts are continuous. Their replay mechanisms and regret analyses therefore rely on a fixed set of arms, without the need to estimate or track reward changes across multiple scales.

Our algorithm must face three fundamental challenges: (i) scheduling (potentially simultaneous) replays at different depths, (ii) leveraging the information collected across different scales, and (iii) tracking the regret contribution from all scales simultaneously. In our regret analysis, each action $x_t \in [0,1]$ belongs to several bins across discretization levels, and we explicitly quantify the regret incurred at each scale. Moreover, to decide which bin to activate at a given depth and time step, we introduce a principled strategy for (a) scheduling replays at specific scales and (b) selecting the appropriate bin at a given active depth through a carefully designed sampling algorithm.

We will incorporate this discussion at the beginning of the proof sketch in Section 5 to clarify the novelty and technical depth of our theoretical contribution.

Numerical results (same answer as kQLa, 9z8n and GmBd)

We implemented our algorithm and evaluated its performance. Because of the NeurIPS policy, we cannot share code at this stage but will release it upon acceptance.

We benchmark against two baselines: `BinningUCB (no restart)`, which uses $K(T)\propto T^{1/3}$ bins and never resets, and `BinningUCB (oracle)`, which knows the true change points $\tau_i$ and resets its estimates at the end of each phase. Within each phase, it uses the optimal per-phase discretization $K_i \propto (\tau_{i+1}-\tau_i)^{1/3}$.

We simulate a $1$-Lipschitz, piecewise-linear reward on $[0,1]$ with a peak shifting from $x=0.3$ to $x=0.7$ every $10^5$ rounds over horizon $T=10^6$, inducing $\tilde{L}_T = 10$ significant shifts. Regret is averaged over $100$ runs (rounded to the nearest integer).

| Step (×10³) | `BUCB (no restart)` | `BUCB (oracle)` | `MBDE` (ours) |
|---|---|---|---|
| 10 | 2137 ± 10 | 1551 ± 30 | 1566 ± 108 |
| 50 | 10580 ± 393 | 2794 ± 147 | 8556 ± 349 |
| 100 | 21007 ± 890 | 5596 ± 396 | 17111 ± 667 |

Our method adapts to non-stationarity via replays, significantly outperforming `BUCB (no restart)`.

Computational complexity (same answer as kQLa, 9z8n and GmBd)

Our algorithm is feasible for small-scale problems, as shown in experiments, but we acknowledge that its time and memory complexity limits its scalability. We detail the worst-case computational cost below.

In a block of length $\tau_{l,m+1}-\tau_{l,m}=8^m$, the algorithm maintains $\sum_{d=1}^m 2^d = \mathcal{O}(2^m)$ bins. Estimating bin means costs $\mathcal{O}(2^m)$ per round. Additionally, it performs the statistical test $(\star)$ over all pairs of bins $(B_1, B_2)$ at each depth $d$, with $\mathcal{O}(4^d)$ pairs, totaling $\mathcal{O}(4^m)$ across depths. For each pair, it checks all intervals of the form $[s_1, t] \subseteq [\tau_{l,m}, t]$, leading to $\mathcal{O}(8^m)$ possibilities. Therefore, the total per-round worst-case time complexity is $\mathcal{O}(2^m + 4^m \cdot 8^m) = \mathcal{O}(32^m)$, and the per-round worst-case memory complexity is $\mathcal{O}(8^m)$.

Since $m \leq \log(T)$, we conclude that `MBDE` has an overall worst-case time complexity of $\mathcal{O}(T^6)$ and memory of $\mathcal{O}(T^4)$: both polynomial in $T$. For small $m$ (e.g., $m \leq 5$), the runtime is manageable for some problem instances. Designing adaptive, efficient algorithms with optimal regret remains an important open direction, even in $K$-armed settings (Gerogiannis et al., 2024).

We will add a paragraph discussing the computational complexity in the main text.

References

  • Wei and Luo. "Non-stationary reinforcement learning without prior knowledge: An optimal black-box approach." COLT 2021.
  • Suk and Kpotufe. "Tracking most significant shifts in nonparametric contextual bandits." NeurIPS 2023.
  • Gerogiannis, Huang, and Veeravalli. "Is Prior-Free Black-Box Non-Stationary Reinforcement Learning Feasible?" AISTATS 2025.
Comment

Thank you for the thoughtful response. The response helped to address the concerns that I have and I recommend the authors incorporating some of them in the revision. I maintain my positive score.

Final Decision

I concur with the reviewers’ positive view of the submission and think the paper makes a valuable addition. The work combines the well-studied problems of continuous-armed bandits and of non-stationary (finite-action) multi-armed bandits in a nice way to form a new problem and proposes an algorithm for the problem that achieves minimax optimal regret without knowledge of the non-stationarity. The technical contributions are solid and well situated in the literature. Some clarity and presentation issues were raised, as well as the lack of broader empirical validation, but the authors have addressed these in the rebuttal with clarifications and a small experiment. I trust these will be incorporated in the camera ready.