Optimal and Practical Batched Linear Bandit Algorithm
Summary
Reviews and Discussion
The paper introduces BLAE, a new algorithm for batched linear bandit problems. The central challenge in batched learning is balancing exploration and exploitation within discrete batches, as frequent updates may not be feasible in real-world applications. BLAE combines arm elimination and regularized G-optimal design to achieve near-optimal regret bounds while maintaining practical computational complexity.
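For readers unfamiliar with the approach, the following is a minimal, self-contained sketch of batched arm elimination driven by a regularized G-optimal design, in the spirit of, but not identical to, BLAE. The Frank-Wolfe design routine, batch sizes, and elimination thresholds here are illustrative choices, not the paper's exact specification.

```python
import numpy as np

def g_optimal_design(arms, n_iter=300, reg=1e-9):
    """Frank-Wolfe (Kiefer-Wolfowitz) approximation of the G-optimal design:
    a distribution pi over arms minimizing max_a ||a||^2 in the V(pi)^-1 norm,
    where V(pi) = sum_a pi_a * a a^T (tiny ridge added for invertibility)."""
    K, d = arms.shape
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        V = arms.T @ (arms * pi[:, None]) + reg * np.eye(d)
        lev = np.einsum("ij,jk,ik->i", arms, np.linalg.inv(V), arms)
        j = int(np.argmax(lev))
        if lev[j] <= d + 1e-9:  # Kiefer-Wolfowitz: optimum reached when max = d
            break
        step = (lev[j] / d - 1.0) / (lev[j] - 1.0)
        pi = (1.0 - step) * pi
        pi[j] += step
    return pi

def batched_elimination(arms, theta, num_batches=4, base=100, lam=1e-6,
                        noise=0.0, seed=0):
    """Illustrative batched elimination loop: in each batch, pull arms per the
    G-optimal design over the ACTIVE set, fit regularized least squares on this
    batch's data only, then drop arms far below the empirical leader."""
    rng = np.random.default_rng(seed)
    d = arms.shape[1]
    active = np.arange(len(arms))
    for k in range(num_batches):
        A = arms[active]
        counts = np.maximum(1, np.round(g_optimal_design(A) * base * 2**k).astype(int))
        X = np.repeat(A, counts, axis=0)
        y = X @ theta + noise * rng.standard_normal(len(X))
        theta_hat = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
        est = A @ theta_hat
        active = active[est >= est.max() - 2.0 ** -(k + 1)]  # shrinking threshold
        if len(active) == 1:
            break
    return active

angles = np.array([0.0, 0.3, 0.6, 0.9, 1.2])
unit_arms = np.column_stack([np.cos(angles), np.sin(angles)])
survivors = batched_elimination(unit_arms, theta=np.array([1.0, 0.0]))
print(survivors)  # the optimal arm (index 0) is among the survivors
```

The per-batch estimator intentionally uses only that batch's data, which is the design choice the authors defend in their rebuttal below.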
Questions for Authors
- The paper highlights that traditional regularized least squares methods rely on all past time steps for parameter updates, but early noise can affect future arm selection. To mitigate this, BLAE uses a batch-wise approach with regularization to ensure parameter updates within each batch are independent. My concern is the feasibility of this approach: relying on a single batch may discard valuable information from earlier batches. Is there theoretical or experimental support for this batch-wise update method, and how does the algorithm balance using only one batch of data against potentially losing useful information from earlier steps?
- Does BLAE sometimes eliminate the optimal arm too early? Reporting how often the best arm is mistakenly eliminated would provide insight into the method's robustness.
- How does BLAE scale computationally in large-K settings?
Claims and Evidence
Yes, most claims made in the submission are supported. However,
- The authors argue that prior methods require excessive computation, but no formal computational complexity analysis is presented.
- Theoretical bounds support the claim, but there is no empirical comparison between "batch-wise estimation" and "using all past data".
Methods and Evaluation Criteria
The proposed methods and evaluation criteria (e.g., regret, batch complexity) are well-chosen for the problem of batched linear bandit learning.
However,
- The paper does not discuss how regularization parameters or elimination thresholds are selected.
- No comparison with fully adaptive (non-batched) methods like LinUCB. While non-batched methods are not direct competitors, it would be useful to show how much performance is sacrificed by batching.
- No real-world datasets are tested.
Theoretical Claims
The theoretical claims in the paper are largely well-supported by the proofs provided.
- The proof assumes batch-wise least squares estimation is independent across batches due to regularization. However, this assumption lacks explicit justification, and a more rigorous bound on the impact of past information loss on regret would be helpful.
- The batch complexity proof seems correct under typical conditions, but it would benefit from a more explicit discussion of elimination efficiency in large-K cases.
Experimental Design and Analysis
- The paper does not provide extensive evidence of BLAE's performance on real-world datasets with more complex dynamics and noise, even though the authors note that other batched linear bandit algorithms perform poorly in practical applications, especially in large-scale scenarios.
- No comparison with fully adaptive methods (e.g., LinUCB).
- Batch complexity is explicitly measured, but there are no runtime or memory-usage comparisons.
- An ablation study on batch size, regularization, and the elimination strategy is needed.
Supplementary Material
The supplementary material consists of code, but I did not check it thoroughly.
Relation to Prior Work
The problem of linear bandits has been widely studied in the literature. The BLAE algorithm builds on this foundational work by considering the batched learning scenario, where updates are made periodically (rather than after each round). This shift to a batched setting allows for more efficient exploration and the possibility of reducing computational overhead while still maintaining a low regret bound.
Missing Important References
N/A
Other Strengths and Weaknesses
Other Strengths:
- The comparison with E4 and RS-OFUL shows that BLAE consistently achieves lower regret and fewer updates per batch.
- Develops batch-wise optimal design strategies that efficiently allocate exploration effort and generalize to both small-K and large-K settings.
- Introduces a novel concentration bound accommodating both batching and regularization.
Other Weaknesses:
- I believe this paper makes an incremental contribution. The combination of G-optimal design and arm elimination is a classic approach, and even the use of regularized G-optimal design is not particularly novel. Achieving batch complexity in batched learning is also quite a standard result. It would be helpful if the authors could highlight the difficulties in the proof, particularly the challenges associated with unifying the large-K and small-K cases.
- The computational cost of solving G-optimal design in batched settings is not analyzed.
Other Comments or Suggestions
- In small-K cases, eliminating suboptimal arms is straightforward. However, in large-K cases, elimination could lead to misidentification of the best arm if not enough data is collected. The paper could further clarify why earlier methods, e.g., (Ren et al., 2024), failed to achieve this bound in the large-K setting.
- A brief discussion of why batch-wise estimation does not significantly degrade performance (or an empirical comparison with full-history least squares) would be useful.
- While the theoretical proof is rigorous, an intuitive explanation of why the batch complexity can be reduced to O(log log T) would help readers unfamiliar with batched bandit theory.
Thank you for your time. However, with all due respect, we are very concerned by the current assessment. We sincerely hope to establish some common grounds with our response.
On Computational Cost
To the best of our knowledge, no prior work on batched linear bandits has provided a theoretical complexity analysis, since exact complexity comparisons are challenging due to the various optimization subroutines in each algorithm.
For fair comparisons of computational cost, we provided empirical runtime experiments in Appendix E and conducted additional extensive evaluations (Tables 2-4 in https://tinyurl.com/3m6f4v5n). Our experiments show that our method significantly improves computational efficiency compared to existing optimal algorithms.
Batch-wise vs. Fully Sequential
Naturally, even optimal batched algorithms will underperform fully sequential ones due to the limited information available in batched settings. A direct comparison is thus inherently unfair, as the problem settings are fundamentally different.
Nevertheless, per your request, we conducted additional experiments comparing batched algorithms with the fully sequential LinUCB. Results (Figure 6 in https://tinyurl.com/3m6f4v5n) clearly show that the regret difference between BLAE and LinUCB is the smallest among existing batched algorithms, with competing batched methods exhibiting significantly larger performance gaps.
Regularization Parameters and Elimination Thresholds
The reviewer's comment is incorrect. The regularization parameter is explicitly specified in Theorem 1 (and the experiments use the value prescribed there). The elimination threshold is explicitly stated in Theorem 1 and precisely defined in Lemma 2.
Real-world Datasets?
Regret evaluation in bandit literature typically relies on simulations since offline datasets may not contain feedback for all counterfactual actions that an algorithm can take. Thus, absence of real-world datasets is typically not considered a weakness, particularly for a theoretical work.
Nevertheless, we conducted extra experiments based on real-world dataset MovieLens. The results (Figure 7 in https://tinyurl.com/3m6f4v5n) confirm that BLAE consistently outperforms existing methods, validating its practical effectiveness.
Assume Batch-wise Independence?
No, our proof does not assume "batch-wise least squares estimation independence across batches," irrespective of regularization. Independence is achieved (rather than assumed) solely through our batch-wise design.
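To make this point concrete, here is a minimal sketch (hypothetical names, not the paper's code) showing that a batch-wise regularized least squares estimate is, by construction, a function of the current batch's data only, so no corruption of an earlier batch can influence it:

```python
import numpy as np

def batch_theta_hat(X, y, lam=1.0):
    """Regularized least squares on ONE batch: the estimate depends only on
    this batch's (X, y), so estimates across batches are independent by
    construction when each batch's design is committed before it starts."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(1)
theta = np.array([0.8, -0.5, 0.3])
X1 = rng.standard_normal((200, 3)); y1 = X1 @ theta + 0.1 * rng.standard_normal(200)
X2 = rng.standard_normal((200, 3)); y2 = X2 @ theta + 0.1 * rng.standard_normal(200)
h2 = batch_theta_hat(X2, y2)
y1 = y1 + 100.0                      # arbitrarily corrupt batch 1's rewards...
h2_after = batch_theta_hat(X2, y2)   # ...batch 2's estimate cannot change
```

The independence claim here is a structural property of the estimator's inputs, which is exactly the distinction the response draws between an assumed property and one achieved by design.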
There appears to be a clear misunderstanding. This is serious because it may present challenges to the reviewer in evaluating our main results. We sincerely would like to ensure we are on the same page.
Elimination efficiency? We are unaware of any standard notion of "elimination efficiency." Could the reviewer rigorously define what is meant by this term?
Ablation? Altering the batch size is unnecessary, as our batch schedule is proven to achieve the optimal batch complexity. Additional evaluations varying the regularization are provided (Figure 8 in https://tinyurl.com/3m6f4v5n).
Novelty
We strongly rebut the "incremental contribution" comment. The use of G-optimal design or arm elimination should not in itself diminish novelty; by that standard, many impactful papers using optimal design would have been rejected.
To briefly highlight a few of our contributions: our refined analysis (Lemma 3) significantly improves exploration strategies beyond standard G-optimal design, and we derived a new bound for this. Existing concentration inequalities in arm elimination inherently depend on K, making them suboptimal in large-K scenarios. In contrast, our novel analytical framework utilizes efficient coverings of the unit sphere, providing sharper, more general regret bounds. Crucially, our proposed elimination strategy attains a unified optimal regret bound using only O(log log T) batches. Please refer to our paper for more contributions.
G-optimal Design
A key advantage of our algorithm is that we compute the G-optimal design only O(log log T) times across T rounds. Thus, since each G-optimal design computation has a bounded cost, the average per-round computational overhead due to G-optimal design is only an O(log log T / T) fraction of that cost, which tends to zero as T increases. This highlights a crucial benefit of our batched updates.
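The batch count behind this amortization can be illustrated with the standard doubling-exponent batch schedule from the batched bandit literature (an illustrative schedule; the paper's exact batch lengths may differ):

```python
import math

def batch_lengths(T):
    """Doubling-exponent schedule: batch k has length about T**(1 - 2**-k),
    so the horizon T is exhausted after roughly log2(log2(T)) batches."""
    lengths, total, k = [], 0, 1
    while total < T:
        L = min(int(T ** (1 - 2.0 ** -k)) + 1, T - total)
        lengths.append(L)
        total += L
        k += 1
    return lengths

for T in (10**3, 10**6, 10**9):
    # number of batches grows like log2(log2(T)), printed for comparison
    print(T, len(batch_lengths(T)), round(math.log2(math.log2(T)), 1))
```

Because the design is recomputed once per batch, the number of design computations grows only doubly-logarithmically in the horizon, which is the source of the vanishing per-round overhead claimed above.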
Suboptimal in Large-K Cases
There might be a misunderstanding. Existing methods being suboptimal in the large-K case is not about "misidentification of the best arm" for large K. Rather, these methods always yield K-dependent regret even for large K, whereas ours does not (see Table 1).
We hope our responses clarified the key points and respectfully ask that the evaluation reflect our contributions.
This paper studies the batched linear bandit problem. The authors propose a new algorithm for this problem and show that this achieves near optimal regret. Compared to other algorithms which also achieve near optimal regret in this problem, the authors claim that their proposed algorithm is more computationally efficient and performs better in practice. To illustrate this, experimental results in simulated environments are provided.
Questions for Authors
- Fairness of experiments in terms of choice of hyperparameters of prior work
- Choice of hyperparameters – are the same ones used in theory and practice?
- Importance of contribution
Claims and Evidence
- From section 3, it is not clear if the batches are given to us or if they are something the learner can pick.
- The choice of epsilon seems very important but is only given in Theorem 1 in terms of its order. For the experiments, it seems a different value of epsilon is chosen, and it is not clear if this choice satisfies the conditions of the theorem. Therefore, it is not clear if the exact same approach is optimal in both theory and practice.
Methods and Evaluation Criteria
- It would be good to see results for larger d; a maximum of d=10 seems quite small.
- It would also have been good to show a run time comparison of the main algorithms used in the experiments.
- The runtime experiments in Table 2 of the appendix are interesting, but it would be good to make this obvious in the main paper.
Theoretical Claims
The proof of Lemma 1 seems fairly standard (unless I am missing something). Discussion & references of similar proofs would be appreciated.
There were several steps in the proofs I did not understand:
- It is not clear where the number of pairs comes from in line 715.
- In line 891, why can we pick pi this way?
- Where does the assumption T > Omega(K^5) come from? (line 1071)
- Why is there a second bound in line 1099? I think this is to get the different results, but the structure of the proof could make it clearer. I also did not understand lines 912, 933-944, 1050, and 1077; more justification should be provided. I think the proofs are probably correct (and I do not necessarily need detailed explanations of all these steps in the rebuttal; I just thought it might be helpful to point out the parts that were difficult to follow).
I don't think the proofs of Lemmas 4, 5, 11, and 13 need to be included, as they are standard results that have been proven before in the literature. References to existing proofs would suffice.
Experimental Design and Analysis
- It would be good to see results for different choices of epsilon. In particular, how does choosing epsilon according to Theorem 1 alter the results?
- It seems there are hyperparameters of other algorithms as well, but it is not clear how these have been chosen. Was the same procedure used for BLAE as well as the other algorithms? It seems that for E^4 the parameters were configured according to theory, which does not seem to be a fair comparison.
Supplementary Material
I looked over most of it.
Relation to Prior Work
There is a good discussion of related work, although the analysis should be related to similar analyses in related work.
Missing Important References
N/A
Other Strengths and Weaknesses
- One concern is the size of the contribution. I think the introduction does not do a sufficient job explaining why the work in this paper is solving an important question.
- Another concern is the difference in hyperparameters for theory & practice. The authors claim that their algorithm is effective for both, but seemingly requires different tuning for each (unless I have missed something).
Other Comments or Suggestions
It would be helpful to indicate in Table 1 whether each algorithm works in the 'large' or 'small' K regime. Since one contribution of this paper is an algorithm that works in both regimes, it would be good to be able to easily understand which regimes existing works cover.
We appreciate your time and review. However, with all due respect, there appear to be some misunderstandings, which we sincerely hope to clarify in our response.
Batched Learning Setting
In the batched linear bandit setting, the learner chooses both the number of batches and when to update parameters. This is the standard setup in the literature (Esfandiari et al. (2021), Ruan et al. (2021), Hanna et al. (2023), Ren et al. (2024), Zhang et al. (2025)). This is the key aspect of the batched linear bandit problem because one wishes to design an algorithm that can update parameters as infrequently as possible while still achieving provably optimal regret.
Consistency of Parameter Choice in Theory and Experiments
We confirm that the choice of epsilon is consistent across theory and experiments. Its order is defined in Theorem 1 and rigorously maintained throughout the paper, including the experiments. The precise definition of epsilon is given in Lemma 2 and depends on a separate parameter (distinct from epsilon), which is an arbitrary constant in (0,1) (chosen as 0.5 in all experiments, which is theoretically valid). Our algorithm BLAE employs identical parameter choices in both theory and practice, ensuring optimality in both domains. There is no "difference in hyperparameters for theory & practice."
Experimental Results for Larger Dimension
We conducted additional experiments with larger values of d. Results appear in Figure 4 of Sec. 5 in https://tinyurl.com/3m6f4v5n. These demonstrate that the regret difference increases with d.
Runtime Comparison of the Main Algorithms
We conducted additional runtime experiments with larger arm counts K. Results in Tables 2-4 of Sec. 3 in https://tinyurl.com/3m6f4v5n demonstrate that BLAE is the fastest among optimal algorithms, with runtime comparable to suboptimal algorithms even in very large-K settings.
Lemma 1: Our Lemma 1 adaptation explicitly incorporates the regularization term to achieve our specific bound, which is crucial to our analysis. While sharing structural similarities with standard results, its subtle differences merit attention. We will include the relevant previous result related to Lemma 1 in the revision (e.g., Chapter 20 in Lattimore & Szepesvári (2020)).
Effect of Different Choices of Epsilon on Results
As mentioned, the elimination threshold epsilon is defined based on a separate parameter (distinct from epsilon itself). In all experiments, we followed the theoretical guidelines for both.
Since the theory allows this parameter to be chosen arbitrarily in (0,1), we tested sensitivity to different values. Results in Figure 5 in https://tinyurl.com/3m6f4v5n show that BLAE remains robust across various values. Note again that in all of our experiments the parameter choices satisfy the conditions of the theoretical results.
Clarification on Hyperparameter Selection for Other Algorithms
All experiments were conducted fairly. For (Ren et al., 2024), we used their exact public implementation. All baseline algorithms used theoretically suggested hyperparameter values from previous literature, while BLAE’s hyperparameters followed our theoretical results, ensuring fair and transparent comparisons.
Significance of Contributions
To understand our contributions, it's essential to recognize the limitations in existing batched linear bandit literature:
- Ren et al. (2024): Optimal regret only in the small-K regime, and poor practical performance (see Figure 1 of our paper).
- Abbasi-Yadkori et al. (2011) and Esfandiari et al. (2021): Higher batch complexity, making them theoretically suboptimal.
- Ruan et al. (2021), Hanna et al. (2023), and Zhang et al. (2025): Impractical due to excessive runtimes (see Table 2 of our paper).
The key motivation for batched linear bandits is computational efficiency without sacrificing regret or batch complexity. In contrast to previous methods, BLAE is the first algorithm that simultaneously achieves provable optimality in both small-K and large-K regimes, while maintaining practical efficiency in both regret and batch complexity.
Therefore, our contribution addresses a critical gap in existing literature, aligning with the fundamental motivation of batched linear bandits.
Clarifying Small-K and Large-K
It is NOT a matter of whether existing algorithms work or not. The existing algorithms "work" in both small-K and large-K regimes, but their optimality was proven only in one of the regimes, never in both. In contrast, BLAE uniquely achieves optimality in both regimes simultaneously.
We hope our responses clarified the key points and respectfully ask that the evaluation reflect our contributions.
In this paper, the authors study the linear bandit problem under limited adaptivity, known as the batched linear bandit. They propose a novel batched algorithm that integrates arm elimination with regularized G-optimal design, achieving the minimax optimal regret in both large-K and small-K regimes for the first time, while using only O(log log T) batches. Finally, the algorithm demonstrates low computational overhead and strong empirical performance, outperforming state-of-the-art methods in extensive numerical evaluations.
Questions for Authors
NA
Claims and Evidence
Yes.
Methods and Evaluation Criteria
Yes.
Theoretical Claims
The proof sketch is convincing, and the proofs look correct to me.
Experimental Design and Analysis
The experimental results are good in general, but I have several concerns.
- All the choices of K are not very large. I wonder what the results (especially runtime) would be if K were very large or infinite (continuous action space).
- I am not fully convinced by the experimental results for the algorithm of Ren et al. (2024). In Ren et al., a sub-linear regret upper bound is derived; however, in the experiments in this submission, it seems to have linear regret. Is there any contradiction? Would tuning the failure probability improve the result?
Supplementary Material
No.
Relation to Prior Work
This paper studies batched linear bandits. The algorithm in this paper is the first to match the lower bounds for both small K and large K at the same time.
Missing Important References
NA.
Other Strengths and Weaknesses
Strengths:
- Matching the two lower bounds at the same time is a very nice result.
- The algorithm is simple and computationally efficient in experiments.
- The presentation is clear in general.
Weaknesses:
- Is there any theoretical analysis of the computational complexity? How does it scale if K is very large or even infinite?
- Please refer to the points mentioned in the experimental results part.
Other Comments or Suggestions
NA
We sincerely thank you for reviewing our paper and providing valuable feedback. We appreciate your recognition of our work and your constructive comments. Below, we address each of your comments and questions in detail:
Performance in the Large-K Regime
We thank the reviewer for raising this important point. We conducted additional experiments with much larger K values, with detailed results available in Tables 2-4 and Figure 2 of Section 3 in the link.
As shown in Tables 2-4, BLAE is the fastest among optimal algorithms, with runtime comparable to that of suboptimal algorithms. Moreover, Figure 2 demonstrates that BLAE consistently achieves significantly lower regret and maintains stable performance even as K becomes very large, confirming the scalability of our approach in large, finite action spaces.
Regarding infinite (continuous) action spaces, we clarify that our theoretical guarantees (and many others) and algorithmic design assume a finite action set. Thus, extending the current method to a continuous action domain is outside the scope of this paper.
To our knowledge, prior studies have not provided experimental evaluations for very large values of K. Thus, we strongly believe that our extensive experiments sufficiently demonstrate that our algorithm offers superior performance and robustness, positioning it as the most effective solution for large-K regimes. We would be more than happy to perform further experiments if needed.
Experimental Performance of (Ren et al., 2024)
We thank the reviewer for noting the discrepancy in the results of (Ren et al., 2024). While Ren et al. (2024) theoretically established a sub-linear regret bound, our experiments exhibited (near-)linear regret behavior.
We emphasize that we used the exact implementation officially provided by Ren et al. (2024).
We identified two primary reasons for this empirical performance discrepancy:
First, the algorithm of (Ren et al., 2024) occasionally eliminates the optimal arm prematurely. Table 1 of Section 2 in the link shows how frequently false elimination of the optimal arm occurs over 20 independent experimental runs.
Second, the batch sizes in (Ren et al., 2024) are highly sensitive to a constant hyperparameter; in particular, the third batch size scales with this constant. When a relatively large value is chosen (as in the Ren et al. (2024) setup), the third batch often exhausts the entire horizon, leaving practically no time to effectively exploit information learned in the third batch.
Consequently, the algorithm relies only on information from the first two, relatively short batches. If a suboptimal arm is identified during these early batches (as frequently observed), this leads directly to linear regret. In contrast, BLAE's batch-size schedule avoids this issue.
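The horizon-exhaustion effect can be illustrated with a generic doubling-exponent schedule carrying a multiplicative constant c (an illustration only, not the exact schedule of Ren et al. (2024)):

```python
def batches_with_constant(T, c):
    """Doubling-exponent schedule scaled by a constant c; each batch is clipped
    to the remaining horizon. A large c lets an early batch swallow the rest."""
    lengths, total, k = [], 0, 1
    while total < T:
        L = min(int(c * T ** (1 - 2.0 ** -k)) + 1, T - total)
        lengths.append(L)
        total += L
        k += 1
    return lengths

big_c = batches_with_constant(10**5, 10.0)
small_c = batches_with_constant(10**5, 0.99)
print(len(big_c), big_c)      # few batches: an early batch is clipped to the remaining horizon
print(len(small_c), small_c)  # more batches, leaving room to exploit what was learned
```

With a large constant, the clipped batch leaves no later rounds in which to exploit what was learned in it, which is the mechanism described above.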
We conducted additional experiments carefully tuning this hyperparameter for (Ren et al., 2024) from 10 down to 0.99, to their advantage. The results, presented in Figure 3 of Section 4 in the link, confirm that careful tuning leads to sub-linear regret behavior. However, we also observe that reducing the constant for the sake of regret performance significantly increases the number of batches.
Hence, there is a clear tradeoff in the control of this hyperparameter in (Ren et al., 2024).
Moreover, even after tuning for regret performance, (Ren et al., 2024) still yields noticeably worse regret than BLAE (Figure 3 in the link), while the tuning increases batch complexity. Overall, despite careful tuning, (Ren et al., 2024) shows clearly inferior practical performance to BLAE in both regret and batch complexity.
Computational Complexity Analysis and Scalability
We thank the reviewer for this insightful question. To the best of our knowledge, no prior work on batched linear bandits has provided a theoretical complexity analysis, since exact complexity comparisons are challenging due to the various optimization subroutines in each algorithm. Therefore, any derived theoretical complexity bounds would likely be overly crude and may not reflect true computational performance in practice.
To evaluate computational cost, we provided empirical runtime experiments in Appendix E and conducted additional extensive evaluations (see Table 2-4 of Section 3 in link).
Our empirical results address scalability when K is very large. Even at large K, our experiments show that our method significantly improves computational efficiency compared to existing optimal algorithms.
The paper tackles the well-studied batched linear bandits problem and provided tight regret bounds in the small and large arms (K) closing the gap in the upper and lower bounds. Experiments show that they perform well on simulated datasets over the existing algorithms.
Overall: The paper is clearly laid out with both the theoretical contributions and the experiments showcasing the benefits. In particular, care has been taken to highlight the issues with existing approaches and how the current approach solves them.
Questions for Authors
(1) Are there experiments showing that Ren et al. is competitive with the BLAE algorithm in the small-K setting? The explanation that the best arms are getting eliminated (Ren et al.) could be further explored to show whether this happens more often as a function of K. (2) What are the other settings where the refined analysis could be of interest, as alluded to in the paper?
Claims and Evidence
Yes
Methods and Evaluation Criteria
Yes
Theoretical Claims
Yes, Lemmas 1, 2, 3 and Theorem 1
Experimental Design and Analysis
Yes, section 7.1
Supplementary Material
I skimmed over the material
Relation to Prior Work
Summarized in Table 1, pretty comprehensive.
Missing Important References
Not that I am aware of
Other Strengths and Weaknesses
Pros:
(i) The provided regret bounds are tight and match the established lower bounds for the problem: both the number of communication rounds, O(log log T), and the regret bounds, O(d sqrt(T)) and O(sqrt(dT log K)), are optimal. (ii) The main technical difficulty addressed is the refined analysis of the batched least squares regression problem and the lack of independence across batches. (iii) The experiments show that BLAE's regret is much better not only in terms of the mean but also the variance, in sharp contrast with E4; this is further explained by E4's elimination of the optimal arm.
Cons:
(a) It would have been interesting to show results in the small-K regime as well, and whether the baseline is suboptimal only in the large-K regime.
Other Comments or Suggestions
NA
We sincerely thank you for taking the time to review our paper and for providing thoughtful and valuable feedback. We greatly appreciate your recognition of our work and your constructive comments. Below, we address each of your comments and questions in detail:
Experimental Results in the Small-K Regime
We thank the reviewer for this suggestion. We conducted experiments in the small-K regime, and the detailed results are in Figure 1 of Section 1 in this link.
As shown in Figure 1, BLAE consistently outperforms existing methods in the small-K setting. While (Ren et al., 2024) performs competitively in the smallest, trivial case, that case is straightforward for any algorithm. Therefore, apart from this trivial case, our results confirm that BLAE remains superior across all small-K regimes considered.
False Elimination of Optimal Arms in (Ren et al., 2024) as a Function of K
We appreciate the reviewer's insightful question. To investigate whether the frequency of mistakenly eliminating the optimal arm in (Ren et al., 2024) increases with K, we conducted experiments across a range of K values, from small to large regimes. Each experiment was repeated 20 times, and we tracked the rate at which the optimal arm was falsely eliminated. Detailed results can be found in Table 1 of Section 2 in this link.
The data in Table 1 clearly indicate an increasing trend in the elimination rate as K grows. This trend helps explain why (Ren et al., 2024) performs worse than other batched linear bandit algorithms, particularly in large-K regimes.
Potential Applications of Refined Analysis in Other Settings
We appreciate the reviewer raising this valuable point. Our refined analysis indeed applies naturally to broader settings, including the pure exploration scenario. In pure exploration, the primary goal is to accurately identify optimal arms within a given budget, without explicitly focusing on cumulative regret. Our approach, which combines a regularized G-optimal design with an arm elimination strategy, efficiently discards suboptimal arms with high probability, significantly improving the accuracy and efficiency of exploration.
Furthermore, since best arm identification is a widely-studied and applied setting—where exactly one optimal arm must be correctly identified—our refined analytical methods directly apply here as well. Specifically, our algorithm’s ability to consistently maintain the optimal arm with high probability demonstrates its practical utility for reliable best-arm selection.
Thus, our theoretical refinements meaningfully contribute both to general pure exploration and specifically to the best arm identification problem.
Thanks to the authors for the detailed rebuttal comments to mine and the other reviewers questions. I will keep my score for now based on all the responses.
This paper studies the batched linear bandits problem and proposes the BLAE algorithm to solve it. The key design of BLAE is arm elimination with regularized G-optimal design, and the algorithm achieves the minimax optimal regret in both large-K and small-K settings. Although some reviewers still have concerns about the experiments, I find this submission technically solid and interesting. Therefore, I recommend weak acceptance. Finally, the authors are encouraged to write professionally and tone down their communication with reviewers in the future.