PaperHub
5.5/10
Poster · 4 reviewers
Min 2 · Max 4 · Std. dev. 1.0
Scores: 2, 2, 4, 4
ICML 2025

Optimal and Practical Batched Linear Bandit Algorithm

OpenReview · PDF
Submitted: 2025-01-24 · Updated: 2025-08-11

Abstract

We study the linear bandit problem under limited adaptivity, known as the batched linear bandit. While existing approaches can achieve near-optimal regret in theory, they are often computationally prohibitive or underperform in practice. We propose BLAE, a novel batched algorithm that integrates arm elimination with regularized G-optimal design, achieving the minimax optimal regret (up to logarithmic factors in $T$) in both large-$K$ and small-$K$ regimes for the first time, while using only $O(\log\log T)$ batches. Our analysis introduces new techniques for batch-wise optimal design and refined concentration bounds. Crucially, BLAE demonstrates low computational overhead and strong empirical performance, outperforming state-of-the-art methods in extensive numerical evaluations. Thus, BLAE is the first algorithm to combine provable minimax-optimality in all regimes and practical superiority in batched linear bandits.
Keywords
linear bandit · batched bandit · exploration-exploitation

Reviews and Discussion

Review (Rating: 2)

The paper introduces BLAE, a new algorithm for batched linear bandit problems. The central challenge in batched learning is balancing exploration and exploitation within discrete batches, as frequent updates may not be feasible in real-world applications. BLAE combines arm elimination and regularized G-optimal design to achieve near-optimal regret bounds while maintaining practical computational complexity.
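The batch structure described above can be made concrete with a minimal sketch (an editorial simplification, not the authors' code): uniform exploration over the active arms stands in for the regularized G-optimal design, the estimator is fit from the current batch's data only, and the elimination radius is heuristic.

```python
import numpy as np

def run_batched_elimination(arms, theta, T, lam=1.0, seed=0):
    """Toy batched elimination loop in the spirit of BLAE (not the paper's
    exact algorithm): per batch, explore the active arms, fit a regularized
    least-squares estimate from this batch's data only, then drop arms whose
    estimated reward falls a confidence radius below the best estimate."""
    rng = np.random.default_rng(seed)
    d = arms.shape[1]
    active = np.arange(len(arms))
    t, ell = 0, 0
    while t < T and len(active) > 1:
        ell += 1
        # batch length ~ T^(1 - 2^-ell); the exponent approaches 1 so fast
        # that only O(log log T) batches fit into the horizon
        n = min(T - t, int(np.ceil(T ** (1.0 - 2.0 ** -ell))))
        pulls = rng.choice(active, size=n)   # uniform stand-in for the optimal design
        X = arms[pulls]
        y = X @ theta + 0.1 * rng.standard_normal(n)
        # batch-wise regularized least squares: it depends only on this
        # batch's noise, which is what makes estimates independent across batches
        theta_hat = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
        est = arms[active] @ theta_hat
        radius = np.sqrt(d * np.log(T) / n)  # heuristic confidence radius
        active = active[est >= est.max() - 2 * radius]
        t += n
    return active, ell
```

For instance, with three orthogonal arms and `theta = (0, 0, 1)` at `T = 20000`, the two clearly suboptimal arms are typically eliminated within the first batch.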

Questions for Authors

  1. The paper highlights that traditional regularized least squares methods rely on all past time steps for parameter updates, but early noise can affect future arm selection. To mitigate this, BLAE uses a batch-wise approach with regularization to ensure parameter updates within each batch are independent. My concern is the feasibility of this approach—relying on a single batch may discard valuable information from earlier batches. Is there theoretical or experimental support for this batch-wise update method, and how does the algorithm balance using only one batch of data while potentially losing useful information from earlier steps?

  2. Does BLAE sometimes eliminate the optimal arm too early? Reporting how often the best arm is mistakenly eliminated would provide insights into the method’s robustness.

  3. How does BLAE scale computationally for large-K settings?

Claims and Evidence

Yes, most claims made in the submission are supported. However,

  1. The authors argue that prior methods require excessive computation, but no formal computational complexity analysis is presented.

  2. Theoretical bounds support the claim, but there is no empirical comparison between "batch-wise estimation" vs. "using all past data".

Methods and Evaluation Criteria

The proposed methods and evaluation criteria (e.g., regret, batch complexity) are well-chosen for the problem of batched linear bandit learning.

However,

  1. The paper does not discuss how regularization parameters or elimination thresholds are selected.
  2. No comparison with fully adaptive (non-batched) methods like LinUCB. While non-batched methods are not direct competitors, it would be useful to show how much performance is sacrificed by batching.
  3. No real-world dataset tested.

Theoretical Claims

The theoretical claims in the paper are largely well-supported by the proofs provided.

  1. The proof assumes batch-wise least squares estimation is independent across batches due to regularization. However, this assumption lacks explicit justification, and a more rigorous bound on the impact of past information loss on regret would be helpful.
  2. The batch complexity proof seems correct under typical conditions, but it would benefit from a more explicit discussion of elimination efficiency in large-K cases.

Experimental Design and Analysis

  1. The paper does not provide evidence of BLAE's performance on real-world datasets with more complex dynamics and noise. This matters especially because the authors claim that other batched linear bandit algorithms perform poorly in practical, large-scale applications.
  2. No comparison with fully adaptive methods (e.g., LinUCB).
  3. Batch complexity is explicitly measured; but no runtime/memory usage comparisons.
  4. Needs an ablation study on batch size, regularization, and elimination strategy.

Supplementary Material

The supplementary material consists of code, but I did not thoroughly check it.

Relation to Prior Work

The problem of linear bandits has been widely studied in the literature. The BLAE algorithm builds on this foundational work by considering the batched learning scenario, where updates are made periodically (rather than after each round). This shift to a batched setting allows for more efficient exploration and the possibility of reducing computational overhead while still maintaining a low regret bound.

Missing Important References

N/A

Other Strengths and Weaknesses

Other Strengths:

  1. The comparison with E4 and RS-OFUL shows that BLAE consistently achieves lower regret and fewer updates per batch.

  2. Develop batch-wise optimal design strategies that efficiently allocate exploration efforts and can be generalized to both small-K and large-K settings.

  3. Introduces a novel concentration bound accommodating both batching and regularization.

Other Weaknesses:

  1. I believe this paper makes an incremental contribution. The combination of G-optimal design and arm elimination is a classic approach, and even the use of regularized G-optimal design is not particularly novel. Achieving $O(\log\log T)$ batch complexity in batched learning is also quite a standard result. It would be helpful if the authors could highlight the difficulties in the proof, particularly the challenges associated with unifying the large-K and small-K cases.
  2. The computational cost of solving G-optimal design in batched settings is not analyzed.

Other Comments or Suggestions

  1. In small-K cases, eliminating suboptimal arms is straightforward. However, in large-K cases, elimination could lead to misidentification of the best arm if not enough data is collected. The paper could further clarify why earlier methods, e.g., (Ren et al., 2024), failed to achieve this bound in the large-K setting.
  2. A brief discussion of why batch-wise estimation does not significantly degrade performance (or an empirical comparison with full-history least squares) would be useful.
  3. While the theoretical proof is rigorous, an intuitive explanation of why the batch complexity reduces from $O(\log T)$ to $O(\log\log T)$ would help readers unfamiliar with batched bandit theory.
Author Response

Thank you for your time. However, with all due respect, we are very concerned by the current assessment. We sincerely hope to establish some common ground with our response.


On Computational Cost

To the best of our knowledge, no prior work on batched linear bandits has provided a theoretical complexity analysis, as exact complexity comparisons are challenging due to the various optimization subroutines in each algorithm.

For fair comparisons of computational cost, we provided empirical runtime experiments in Appendix E and conducted additional extensive evaluations (Table 2-4 in https://tinyurl.com/3m6f4v5n). Our experiments show that our method significantly improves computational efficiency compared to existing optimal algorithms.


Batch-wise vs. Fully Sequential

Obviously, even optimal batched algorithms will underperform fully sequential ones due to limited information in batched settings. Thus, direct comparison is inherently unfair, as the problem settings are fundamentally different.

Nevertheless, per your request, we conducted additional experiments comparing batched algorithms with the fully sequential LinUCB. Results (Figure 6 in https://tinyurl.com/3m6f4v5n) clearly show that the regret difference between BLAE and LinUCB is the smallest among existing batched algorithms, with competing batched methods exhibiting significantly larger performance gaps.


Regularization Parameters and Elimination Thresholds

The reviewer's comment is incorrect. The regularization parameter $\lambda_\ell$ is explicitly stated in Theorem 1 as $O(1)$ (set to $\lambda_\ell = 1$ in experiments). The elimination threshold $\varepsilon_\ell$ is explicitly stated in Theorem 1 and precisely defined in Lemma 2.


Real-world Datasets?

Regret evaluation in the bandit literature typically relies on simulations, since offline datasets may not contain feedback for all counterfactual actions that an algorithm can take. Thus, the absence of real-world datasets is typically not considered a weakness, particularly for a theoretical work.

Nevertheless, we conducted extra experiments based on the real-world dataset MovieLens. The results (Figure 7 in https://tinyurl.com/3m6f4v5n) confirm that BLAE consistently outperforms existing methods, validating its practical effectiveness.


Assume Batch-wise Independence?

No, our proof does not assume "batch-wise least squares estimation independence across batches," irrespective of regularization. Independence is achieved (rather than assumed) solely through our batch-wise design.

There appears to be a clear misunderstanding. This is serious because it may present challenges to the reviewer in evaluating our main results. We sincerely would like to ensure we are on the same page.


Elimination efficiency? We are unaware of any standard notion of "elimination efficiency." Could the reviewer rigorously define what is meant by this term?


Ablation? Altering the batch size is unnecessary, as the batch schedule is provably optimal in batch complexity. Additional evaluations varying the regularization $\lambda_\ell$ are provided (Figure 8 in https://tinyurl.com/3m6f4v5n).


Novelty

We strongly rebut the "incremental contribution" comment. The use of G-optimal design or arm elimination in itself should not diminish novelty; by that standard, many impactful papers using optimal design would have been rejected.

To highlight a few of our contributions briefly: our refined analysis (Lemma 3) significantly improves exploration strategies beyond standard G-optimal design, and we derived a new bound for it. Existing concentration inequalities in arm elimination inherently depend on $K$, making them suboptimal in large-$K$ scenarios. In contrast, our novel analytical framework utilizes efficient coverings of the unit sphere, providing sharper, more general regret bounds. Crucially, our proposed elimination strategy attains a unified optimal regret bound of $O(d\sqrt{T} \wedge \sqrt{dT\log K})$ using only $O(\log\log T)$ batches. Please refer to our paper for more contributions.
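For readers, the two regimes inside this unified bound separate by elementary algebra (a clarifying note, not taken from the rebuttal):

```latex
d\sqrt{T} \wedge \sqrt{dT\log K} =
\begin{cases}
\sqrt{dT\log K} & \text{if } \log K \le d \quad (\text{small-}K \text{ regime}),\\
d\sqrt{T} & \text{if } \log K \ge d \quad (\text{large-}K \text{ regime}),
\end{cases}
```

since $d\sqrt{T} \le \sqrt{dT\log K}$ iff $d^2 T \le dT\log K$ iff $d \le \log K$. In particular, the large-$K$ branch is $K$-independent.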


G-optimal Design

A key advantage of our algorithm is that we compute the G-optimal design only $O(\log\log T)$ times across $T$ rounds. Thus, given that computing an optimal design requires at most $O(Kd^3)$, the average per-round computational cost due to G-optimal design in our algorithm is merely $O\!\left(\frac{Kd^3 \log\log T}{T}\right)$, which tends to zero as $T$ increases. This highlights a crucial benefit of our batched update.
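For concreteness, the standard Frank-Wolfe (Fedorov-Wynn) scheme for unregularized G-optimal design can be sketched as below; by the Kiefer-Wolfowitz equivalence it drives $\max_a \|a\|^2_{V(w)^{-1}}$ down to $d$, and each iteration costs one $d \times d$ inverse plus $K$ quadratic forms. This is a generic textbook sketch, not the paper's regularized variant.

```python
import numpy as np

def g_optimal_design(arms, n_iters=200):
    """Frank-Wolfe iterations for the G-/D-optimal design over a finite arm
    set (rows of `arms` must span R^d). Returns a probability vector w over
    arms; at optimality max_a a^T V(w)^{-1} a = d (Kiefer-Wolfowitz)."""
    K, d = arms.shape
    w = np.full(K, 1.0 / K)
    for _ in range(n_iters):
        V = arms.T @ (w[:, None] * arms)     # V(w) = sum_i w_i a_i a_i^T
        g = np.einsum('ij,jk,ik->i', arms, np.linalg.inv(V), arms)
        i = np.argmax(g)                     # most under-explored arm
        step = (g[i] / d - 1.0) / (g[i] - 1.0)  # exact line-search step size
        w *= 1.0 - step
        w[i] += step
    return w
```

For example, for arms $e_1$, $e_2$, $(0.5, 0.5)$ in $\mathbb{R}^2$, the weights concentrate on $e_1$ and $e_2$ and the maximum quadratic form approaches $d = 2$.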


Suboptimal in Large-$K$ Cases

There might be a misunderstanding. Existing methods being suboptimal in the large-$K$ case is not a matter of "misidentification of the best arm." Rather, these methods always incur $K$-dependent regret, even for large $K$, whereas ours does not (see Table 1).


We hope our responses clarified the key points and respectfully ask that the evaluation reflect our contributions.

Review (Rating: 2)

This paper studies the batched linear bandit problem. The authors propose a new algorithm for this problem and show that this achieves near optimal regret. Compared to other algorithms which also achieve near optimal regret in this problem, the authors claim that their proposed algorithm is more computationally efficient and performs better in practice. To illustrate this, experimental results in simulated environments are provided.

Questions for Authors

  1. Fairness of experiments in terms of choice of hyperparameters of prior work
  2. Choice of hyperparameters – are the same ones used in theory and practice?
  3. Importance of contribution
    

Claims and Evidence

  • From section 3, it is not clear if the batches are given to us or if they are something the learner can pick.
  • The choice of epsilon seems very important but is only given in Theorem 1 in terms of its order. For the experiments, it seems a different value of epsilon is chosen, and it is not clear whether this choice satisfies the conditions of the theorem. Therefore, it is not clear if the exact same approach is optimal in theory and practice.

Methods and Evaluation Criteria

  • It would be good to see results for larger d – it seems that maximum d=10 is actually quite small.
  • It would also have been good to show a run time comparison of the main algorithms used in the experiments.
  • The run time experiments in Table 2 of the appendix are interesting, but it would be good to make them prominent in the main paper.

Theoretical Claims

The proof of Lemma 1 seems fairly standard (unless I am missing something). Discussion & references of similar proofs would be appreciated.

There were several steps in the proofs I did not understand:

  • It is not clear where the number of pairs comes from in line 715.
  • In line 891, why can we pick $\pi$ this way?
  • Where does the assumption $T > \Omega(K^5)$ come from? (line 1071)
  • Why is there a second bound in line 1099? I think this is to get the different results, but the structure of the proof could make it clearer. I also did not understand lines 912, 933-944, 1050, and 1077. More justification should be provided. I think the proofs are probably correct (and I do not necessarily need detailed explanations of all these steps in the rebuttal; I just thought it might be helpful to point out the parts which were difficult to follow).

I don’t think the proofs of Lemmas 4,5,11,13 are necessary to include as they are standard results that have been proven before in the literature. References to existing proofs are fine to include instead.

Experimental Design and Analysis

  • It would be good to see results for different choices of epsilon. In particular, how does choosing epsilon according to Theorem 1 alter the results?
  • It seems there are hyperparameters of other algorithms as well, but it is not clear how these have been chosen. Was the same procedure used for BLAE as well as the other algorithms? It seems that for E^4 the parameters were configured according to theory, which does not seem to be a fair comparison.

Supplementary Material

I looked over most of it.

Relation to Prior Work

There is a good discussion of related work, although the analysis should be related to similar analyses in related work.

Missing Important References

N/A

Other Strengths and Weaknesses

  • One concern is the size of the contribution. I think the introduction does not do a sufficient job explaining why the work in this paper is solving an important question.
  • Another concern is the difference in hyperparameters for theory & practice. The authors claim that their algorithm is effective for both, but seemingly requires different tuning for each (unless I have missed something).

Other Comments or Suggestions

It would be helpful to indicate in Table 1 whether each algorithm works in the 'large' or 'small' K regime. Since one contribution of this paper is an algorithm which works in both regimes, it would be good to be able to easily understand which regimes existing works cover.

Author Response

We appreciate your time and review. However, with all due respect, there appear to be some misunderstandings, which we sincerely hope to clarify in our response.


Batched Learning Setting

In the batched linear bandit setting, the learner chooses both the number of batches and when to update parameters. This is the standard setup in the literature (Esfandiari et al. (2021), Ruan et al. (2021), Hanna et al. (2023), Ren et al. (2024), Zhang et al. (2025)). This is the key aspect of the batched linear bandit problem because one wishes to design an algorithm that can update parameters as infrequently as possible while still achieving provably optimal regret.


Consistency of Parameter Choice in Theory and Experiments

We confirm that the choice of $\varepsilon_\ell$ is consistent across theory and experiments. Its order is defined in Theorem 1 and rigorously maintained throughout the paper, including experiments. The precise definition of $\varepsilon_\ell$ is given in Lemma 2, which depends on a parameter $\varepsilon$ (distinct from $\varepsilon_\ell$). $\varepsilon$ is an arbitrary constant in $(0,1)$, chosen as 0.5 in all experiments, which is theoretically valid. Our algorithm BLAE employs identical parameter choices in both theory and practice, ensuring optimality in both domains. There is no "difference in hyperparameters for theory & practice."


Experimental Results for Larger Dimension $d$

We conducted additional experiments with larger $d$ values. Results appear in Figure 4 of Sec. 5 in https://tinyurl.com/3m6f4v5n. These demonstrate that the regret difference increases with $d$.


Runtime Comparison of the Main Algorithms

We conducted additional runtime experiments with larger arm counts $K$. Results in Tables 2-4 of Sec. 3 in https://tinyurl.com/3m6f4v5n demonstrate that BLAE is the fastest among optimal algorithms, with runtime comparable to suboptimal algorithms even in very large-$K$ settings.


Lemma 1: Our Lemma 1 adaptation explicitly incorporates the regularization term to achieve our specific bound, which is crucial to our analysis. While sharing structural similarities with standard results, its subtle differences merit attention. We will include the relevant previous result related to Lemma 1 in the revision (e.g., Chapter 20 in Lattimore & Szepesvári (2020)).


Effect of Different Choices of Epsilon on Results

As mentioned, the elimination threshold $\varepsilon_\ell$ is defined based on the parameter $\varepsilon$ (distinct from $\varepsilon_\ell$). In all experiments, we followed the theoretical guideline for both $\varepsilon_\ell$ and $\varepsilon$.

Since theory allows $\varepsilon \in (0,1)$ to be chosen arbitrarily, we tested sensitivity to different $\varepsilon$ values. Results in Figure 5 in https://tinyurl.com/3m6f4v5n show BLAE remains robust across various $\varepsilon$ values. Note again that in all of our experiments the parameter choices satisfy the conditions of the theoretical results.


Clarification on Hyperparameter Selection for Other Algorithms

All experiments were conducted fairly. For $E^4$ (Ren et al., 2024), we used their exact public implementation. All baseline algorithms used the theoretically suggested hyperparameter values from previous literature, while BLAE's hyperparameters followed our theoretical results, ensuring fair and transparent comparisons.


Significance of Contributions

To understand our contributions, it's essential to recognize the limitations in existing batched linear bandit literature:

  • Ren et al. (2024): Optimal regret only in the small-$K$ regime, and poor practical performance (see Figure 1 of our paper).
  • Abbasi-Yadkori et al. (2011) and Esfandiari et al. (2021): Higher batch complexity, making them theoretically suboptimal.
  • Ruan et al. (2021), Hanna et al. (2023), and Zhang et al. (2025): Impractical due to excessive runtimes (see Table 2 of our paper).

The key motivation for batched linear bandits is computational efficiency without sacrificing regret or batch complexity. In contrast to previous methods, BLAE is the first algorithm that simultaneously achieves provable optimality in both small-$K$ and large-$K$ regimes, while maintaining practical efficiency in both regret and batch complexity.
Therefore, our contribution addresses a critical gap in existing literature, aligning with the fundamental motivation of batched linear bandits.


Clarifying Small-$K$ and Large-$K$

It is NOT a matter of whether existing algorithms work or not. The existing algorithms "work" in both small-$K$ and large-$K$ regimes, but their optimality was proven only in one of the regimes, never in both. In contrast, BLAE uniquely achieves optimality simultaneously in both regimes.


We hope our responses clarified the key points and respectfully ask that the evaluation reflect our contributions.

Review (Rating: 4)

In this paper, the authors study the linear bandit problem under limited adaptivity, known as the batched linear bandit. They propose a novel batched algorithm that integrates arm elimination with regularized G-optimal design, achieving the minimax optimal regret in both large-K and small-K regimes for the first time, while using only $O(\log\log T)$ batches. Finally, the algorithm demonstrates low computational overhead and strong empirical performance, outperforming state-of-the-art methods in extensive numerical evaluations.

Questions for Authors

NA

Claims and Evidence

Yes.

Methods and Evaluation Criteria

Yes.

Theoretical Claims

The proof sketch part is convincing, the proof looks correct to me.

Experimental Design and Analysis

The experimental result is good in general, while I have several concerns.

  1. All the choices of $K$ are not very large. I am wondering what the result would be (especially the runtime) if $K$ is chosen to be very large or infinite (continuous action space).

  2. I am not very convinced by the experimental result for $E^4$. In Ren et al., a sub-linear regret upper bound is derived. However, in the experiments in this submission, it seems to have linear regret. Is there any contradiction? Would tuning the failure probability improve the result?

Supplementary Material

No.

Relation to Prior Work

This paper studies the batched linear bandits. The algorithm in this paper is the first to match the lower bounds for both small K and large K at the same time.

Missing Important References

NA.

Other Strengths and Weaknesses

Strengths:

  1. Matching the two lower bounds at the same time is a very nice result.

  2. The algorithm is simple and computationally efficient in experiments.

  3. The presentation is clear in general.

Weakness:

  1. Is there any theoretical analysis of the computational complexity? How does it scale if $K$ is very large or even infinite?

  2. Please refer to the points mentioned in the experimental result part.

Other Comments or Suggestions

NA

Author Response

We sincerely thank you for reviewing our paper and providing valuable feedback. We appreciate your recognition of our work and your constructive comments. Below, we address each of your comments and questions in detail:


Performance in the Large-$K$ Regime

We thank the reviewer for raising this important point. We conducted additional experiments with much larger $K$ values, with detailed results available in Tables 2-4 and Figure 2 of Section 3 in the link.

As shown in Tables 2-4, BLAE is the fastest among optimal algorithms, with runtime comparable to that of suboptimal algorithms. Moreover, Figure 2 demonstrates that BLAE consistently achieves significantly lower regret and maintains stable performance even as $K$ becomes very large, confirming the scalability of our approach in large, finite action spaces.

Regarding infinite (continuous) action spaces, we clarify that our theoretical guarantees (and many others) and algorithmic design assume a finite action set. Thus, extending the current method to a continuous action domain is outside the scope of this paper.

To our knowledge, prior studies have not provided experimental evaluations for very large values of $K$ (e.g., $K = 1000, 2000, 5000$). Thus, we strongly believe that our extensive experiments sufficiently demonstrate that our algorithm offers superior performance and robustness, positioning it as the most effective solution for large-$K$ regimes. We would be more than happy to perform further experiments if needed.


Experimental Performance of $E^4$

We thank the reviewer for noting the discrepancy in $E^4$'s results. While Ren et al. (2024) theoretically established a sub-linear regret bound, our experiments exhibited (near-)linear regret behavior.
We emphasize that we used the exact implementation officially provided by Ren et al. (2024).
We identified two primary reasons for this empirical performance discrepancy:

First, $E^4$ occasionally eliminates the optimal arm prematurely. Table 1 of Section 2 in the link shows how frequently false elimination of the optimal arm occurs over 20 independent experimental runs.

Second, the batch size in $E^4$ is highly sensitive to the hyperparameter $\gamma$. In particular, the third batch size scales as $O((\log T)^{1+\gamma})$. When choosing a relatively large value of $\gamma$ (e.g., $\gamma = 10$, as in the Ren et al. (2024) setup), the third batch often exhausts the entire horizon, leaving practically no time to effectively exploit information learned in the third batch.
Consequently, the algorithm relies only on information from the first two, relatively short batches. If suboptimal arm identification occurs during these early batches (as frequently observed), this leads directly to linear regret. In contrast, BLAE's $\ell$-th batch size has order $O\left(T^{\frac{2^\ell - 1}{2^\ell}}\right)$, avoiding this issue.
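The contrast between the two schedules is easy to check numerically. The sketch below (an editorial illustration; the constants are not either paper's exact choices) computes batch lengths of order $T^{1-2^{-\ell}}$ and compares against the $(\log T)^{1+\gamma}$ figure for $\gamma = 10$.

```python
import math

def blae_style_batch_lengths(T):
    """Batch lengths proportional to T^(1 - 2^-l), truncated at the horizon.
    The exponent approaches 1 doubly-exponentially fast, so only about
    log2(log T) batches are needed to cover all T rounds."""
    lengths, t, l = [], 0, 0
    while t < T:
        l += 1
        n = min(T - t, math.ceil(T ** (1.0 - 2.0 ** -l)))
        lengths.append(n)
        t += n
    return lengths

T = 10**6
print(blae_style_batch_lengths(T))  # 5 batches suffice for T = 1e6
# By contrast, a third batch of size (log T)^(1 + gamma) with gamma = 10
# already exceeds the whole horizon:
print(math.log(T) ** 11 > T)        # True
```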

We conducted additional experiments by carefully tuning the hyperparameter $\gamma$ for $E^4$ from 10 down to 0.99, to its advantage. The results, presented in Figure 3 of Section 4 in the link, confirm that careful tuning of $\gamma$ leads to sub-linear regret behavior. However, we also observe that reducing $\gamma$ for the sake of regret performance significantly increases the number of batches.
Hence, there is a clear tradeoff in the control of $\gamma$ in $E^4$.

Moreover, even after tuning $\gamma$ to improve regret, $E^4$ still incurs noticeably worse regret than BLAE (Figure 3 in the link), while the tuning also increases its batch complexity. Despite this careful tuning, $E^4$'s overall numerical performance in both regret and batch complexity remains clearly inferior to BLAE's.


Computational Complexity Analysis and Scalability

We thank the reviewer for this insightful question. To the best of our knowledge, no prior work on batched linear bandits has provided a theoretical complexity analysis, as exact complexity comparisons are challenging due to the various optimization subroutines in each algorithm. Therefore, any derived theoretical complexity bounds would likely be overly crude and unsuitable for comparing true computational performance in practice.

To evaluate computational cost, we provided empirical runtime experiments in Appendix E and conducted additional extensive evaluations (see Tables 2-4 of Section 3 in the link).

Our empirical results address scalability when $K$ is very large. Even at large $K$, our experiments show that our method significantly improves computational efficiency compared to existing optimal algorithms.

Review (Rating: 4)

The paper tackles the well-studied batched linear bandits problem and provides tight regret bounds in both the small and large number-of-arms (K) regimes, closing the gap between the upper and lower bounds. Experiments show that it performs well on simulated datasets compared to existing algorithms.

Overall: The paper is clearly laid out with both the theoretical contributions and the experiments showcasing the benefits. In particular, care has been taken to highlight the issues with existing approaches and how the current approach solves them.

Questions for Authors

(1) Are there experiments showing that Ren et al. is competitive with the BLAE algorithm in the small-K setting? The explanation that the best arms are getting eliminated (Ren et al.) could be further explored to show whether this happens more often as a function of K. (2) What are the other settings where the refined analysis could be of interest, as alluded to in the paper?

Claims and Evidence

Yes

Methods and Evaluation Criteria

Yes

Theoretical Claims

Yes, Lemmas 1, 2, 3 and Theorem 1.

Experimental Design and Analysis

Yes, section 7.1

Supplementary Material

I skimmed over the material

Relation to Prior Work

Summarized in Table 1, pretty comprehensive.

Missing Important References

Not that I am aware of

Other Strengths and Weaknesses

Pros:

(i) The provided regret bounds are tight and match the established lower bounds for the problem. Both the number of communication rounds, $O(\log\log T)$, and the regret bound, $O(d\sqrt{T} \wedge \sqrt{dT\log K})$, are optimal. (ii) The main technical challenge is the refined analysis of the batched least squares regression problem and the lack of independence across batches. (iii) The experiments show that the regret is much better for the BLAE algorithm, not only in terms of the mean but also the variance, in sharp contrast with E4. This contrast is further explained by E4's elimination of the optimal arm.

Cons:

(a) It would have been interesting to show the results in the small-K regime, and that it is only in the large-K regime that the prior approach is suboptimal.

Other Comments or Suggestions

NA

Author Response

We sincerely thank you for taking the time to review our paper and for providing thoughtful and valuable feedback. We greatly appreciate your recognition of our work and your constructive comments. Below, we address each of your comments and questions in detail:


Experimental Results in the Small-$K$ Regime

We thank the reviewer for this suggestion. We conducted experiments in the small-$K$ regime with $K = 2, 4, 8, 16$; the detailed results are in Figure 1 of Section 1 in the link.

As shown in Figure 1, BLAE consistently outperforms existing methods in the small-$K$ setting. While $E^4$ (Ren et al., 2024) performs competitively for $K=2$, this trivial case is straightforward for any algorithm. Apart from this trivial case, our results confirm that BLAE remains superior across all small-$K$ regimes considered.


False Elimination of Optimal Arms in $E^4$ (Ren et al., 2024) as a Function of $K$

We appreciate the reviewer's insightful question. To investigate whether the frequency of mistakenly eliminating the optimal arm in $E^4$ (Ren et al., 2024) increases with $K$, we conducted experiments across a range of $K$ values, from small to large regimes. Each experiment was repeated 20 times, and we tracked the rate at which the optimal arm was falsely eliminated. Detailed results can be found in Table 1 of Section 2 in the link.

The data in Table 1 clearly indicate an increasing trend in the elimination rate as $K$ grows. This trend helps explain why $E^4$ (Ren et al., 2024) performs worse than other batched linear bandit algorithms, particularly in large-$K$ regimes.


Potential Applications of Refined Analysis in Other Settings

We appreciate the reviewer raising this valuable point. Our refined analysis indeed applies naturally to broader settings, including the pure exploration scenario. In pure exploration, the primary goal is to accurately identify optimal arms within a given budget, without explicitly focusing on cumulative regret. Our approach, which combines a regularized G-optimal design with an arm elimination strategy, efficiently discards suboptimal arms with high probability, significantly improving the accuracy and efficiency of exploration.

Furthermore, since best arm identification is a widely-studied and applied setting—where exactly one optimal arm must be correctly identified—our refined analytical methods directly apply here as well. Specifically, our algorithm’s ability to consistently maintain the optimal arm with high probability demonstrates its practical utility for reliable best-arm selection.

Thus, our theoretical refinements meaningfully contribute both to general pure exploration and specifically to the best arm identification problem.

Reviewer Comment

Thanks to the authors for the detailed rebuttal comments on my questions and those of the other reviewers. I will keep my score for now based on all the responses.

Final Decision

This paper studies the batched linear bandits problem and proposes the BLAE algorithm to solve it. The key design of BLAE is arm elimination with regularized G-optimal design, and the algorithm achieves the minimax optimal regret in both large-$K$ and small-$K$ settings. Although some reviewers still have concerns about the experiments, I find this submission technically solid and interesting. Therefore, I recommend weak acceptance. Finally, the authors are encouraged to write professionally and tone down their communication with reviewers in the future.