Optimal Estimation of the Best Mean in Multi-Armed Bandits
Abstract
We propose an algorithm for estimating the best mean reward in a multi-armed bandit with asymptotically optimal, instance-adaptive sample complexity.
Reviews and Discussion
This paper studies how to estimate, with ε accuracy and with probability at least 1 − δ, the expected reward of the best arm in a multi-armed bandit problem. The authors propose a new algorithm, EllipsoidEst, which employs a two-phase sampling strategy and adaptively solves a non-convex optimization problem to determine optimal sampling proportions. They prove that the algorithm is asymptotically optimal and show that it outperforms previous methods in experiments.
Strengths and Weaknesses
Strengths:
- The paper studies a novel bandit problem—estimating the best mean instead of identifying the best arm—under a fixed-confidence setting.
- The proposed algorithm, EllipsoidEst, is simple and asymptotically optimal.
- The paper is generally well-written and the theoretical analysis is rigorous and complete.
Weaknesses:
- The experimental evaluation is limited: it focuses on simple Bernoulli settings with relatively small numbers of arms, and lacks comparisons on more diverse reward distributions or larger-scale settings.
- The estimator output by the algorithm may have a bias.
- The algorithm handles the non-convex optimization, but this step may become harder to scale to larger or more complex problem settings.
Questions
- The output of the algorithm may exhibit upward bias due to the stopping rule. Is there a practical correction or debiasing step that could be incorporated?
- In large-scale settings (e.g., hundreds of arms), how would the computational cost of solving the non-convex optimization scale? Are there practical heuristics or approximations that could make it scalable?
- In the fixed-confidence BAI setting, many algorithms stop when the best arm is identified with high probability. One can then use the empirical mean of the selected arm as an estimate of the best mean. Since the arm is correct with probability at least 1 − δ and δ is typically small, this estimate is already likely to be accurate, except in rare cases where the instance is easy and the algorithm stops early without enough samples. Therefore, the motivation for studying best-mean estimation as a separate problem is not entirely clear to me.
- The proposed algorithm relies on solving a non-convex optimization problem to compute the sampling ratios r, but no closed-form solution is provided. It is unclear how the optimization is solved in practice, and whether this step introduces significant computational overhead—especially when the number of arms K is large. Although the algorithm may reduce sample complexity, it could become difficult to apply in practice due to the runtime cost of solving this optimization at every round. In addition, since different solvers may yield slightly different results, it is unclear how sensitive the algorithm's performance is to the accuracy or stability of the optimization step.
- The experiments are limited to Bernoulli bandits and relatively small-scale settings. It would be helpful to include results on other reward distributions (e.g., Gaussian or sub-Gaussian) and larger numbers of arms to better demonstrate the algorithm's generality and scalability. Additionally, the algorithm involves several parameters, but the paper does not include any sensitivity analysis to show how performance varies with these choices.
Limitations
- The non-convex optimization involved in computing sampling ratios, while tractable in small cases, may present challenges in large-scale problems with many arms.
- The returned estimator may have an upward bias due to the nature of the stopping criterion, which could affect applications requiring unbiased estimates.
Final Justification
Thank you for your response. I have no further questions. I have raised my score.
Formatting Issues
No.
Thank you very much for your detailed and constructive comments. We respond to your questions one by one.
[Q1] bias
Our goal is not to produce an unbiased point estimate, but rather to construct an interval that contains the true best mean with high probability. As discussed in Section 7.3, the midpoint of the returned interval is biased and should not be interpreted as a point estimate with minimal bias. We do not view this as a weakness, but rather as a fundamental property of any PAC estimate of the best mean.
[Q2] computational cost
We have reduced the non-convex optimization problem to more tractable subproblems. Specifically, it involves (i) performing bisection to find the root of two monotonic one-dimensional functions (see Lines 129–131), and (ii) solving a one-dimensional non-convex optimization via grid search over a narrow interval (see Lines 143–145). Both one-dimensional functions can be evaluated in time linear in K, the number of arms. Furthermore, the bisection is required only during Phase 1 (at each round), while the grid search is performed just once at the start of Phase 2. This structure makes the optimization computationally scalable even in large-scale settings.
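For concreteness, here is a minimal sketch of the two numerical primitives referred to above. The placeholder functions below merely stand in for monotone or narrow-interval one-dimensional functions whose evaluation costs O(K); they are not the actual functions from the paper.

```python
import numpy as np

def bisect_root(f, lo, hi, tol=1e-9, max_iter=100):
    """Find a root of a monotone 1-D function f on [lo, hi],
    assuming f(lo) and f(hi) have opposite signs."""
    f_lo = f(lo)
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if f_lo * f_mid <= 0:
            hi = mid                  # root lies in [lo, mid]
        else:
            lo, f_lo = mid, f_mid     # root lies in [mid, hi]
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

def grid_search_min(g, lo, hi, n_grid=1000):
    """Minimize a (possibly non-convex) 1-D function g on [lo, hi] by grid search."""
    grid = np.linspace(lo, hi, n_grid)
    vals = np.array([g(u) for u in grid])
    return grid[int(np.argmin(vals))]

# Toy usage with a placeholder monotone function whose evaluation is linear in K.
mu_hat = np.array([0.3, 0.5, 0.45, 0.2])
f = lambda u: np.sum(np.maximum(u - mu_hat, 0.0) ** 2) - 0.5
root = bisect_root(f, lo=mu_hat.max(), hi=mu_hat.max() + 1.0)
```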
In Appendix B.7, we also empirically study the computational cost of our algorithm. Figures 12–13 show that the running time of our algorithm is negligible and comparable to that of the baselines.
Regarding "more complex problem settings" that you raised in Weakness 3, it is true that this kind of reduction to simple problem might be applicable for more complex problems. However, we believe that the current problem of best mean estimation itself has its importance as discussed in the introduction.
[Q3] BAI for best mean estimation
In our experiments, we evaluate a representative BAI algorithm, UGapEC, with a natural modification to handle the "rare cases", specifically by collecting additional samples after stopping in order to ensure a PAC guarantee on the estimated best mean. However, as our results demonstrate, this approach of combining BAI with post hoc sampling is typically suboptimal compared to a method that is designed explicitly for best-mean estimation. This gap highlights the value of studying best-mean estimation as a distinct problem, rather than treating it as a simple extension of BAI.
[Q4] non-convex optimization problem
As explained in our response to [Q2], we reduce the non-convex optimization to tractable subproblems: (i) bisection for solving one-dimensional monotonic equations, and (ii) grid search over a narrow one-dimensional interval. While we acknowledge that runtime may vary depending on implementation details, both bisection and grid search are simple and robust methods. Even with a straightforward implementation, the computational overhead is negligible in practice, as demonstrated in our empirical analysis in Appendix B.7. These results suggest that the optimization step does not hinder the algorithm’s practical applicability, even with a large number of arms.
[Q5] diverse experiments
We agree that broader evaluations are important. In Appendix B, we provide results on more diverse settings, including Gaussian reward distributions (see Appendix B.6). Note that we experiment with up to arms in some settings. That said, our experiments indicate that the proposed algorithm outperforms baselines for . Since performance tends to degrade for larger , we believe that extending experiments to much larger would offer limited additional practical insight.
Regarding hyperparameters, our algorithm requires one tunable parameter, , and two instance-dependent parameters, and . In Appendix B.1, we show that is the optimal choice in practice. We further study the sensitivity to and in Appendix B.2, demonstrating that the algorithm is reasonably robust to their misspecification.
Thank you for your detailed response. Regarding Question 3, I would like to clarify my concern. I am curious about the motivation for studying best-mean estimation as a separate problem. Even if the modified BAI algorithm may not be the optimal solution, it appears to adequately address the problem in most practical scenarios. Given this, I would like to understand the significance of studying this problem independently. Could you please elaborate on the theoretical or practical reasons for treating it as a distinct problem?
Thank you very much for your thoughtful follow-up and for engaging in the discussion.
We appreciate your observation that a modified best-arm identification (BAI) algorithm may often suffice in practice. However, we believe there are both theoretical and practical reasons to study best-mean estimation (BME) as a distinct problem.
Theoretical reasons
We emphasize two key points that distinguish BME from BAI:
- BAI is not sufficient for BME: As you noted, in some "rare cases", additional samples may be needed to estimate the best mean with the PAC guarantee. While such cases may seem rare, they are more frequent than they might appear, as our experiments demonstrate.
- BAI is not necessary for BME: In many cases, particularly when multiple means are clustered around the best mean, identifying the (ε-)best arm is unnecessarily difficult. BAI algorithms often oversample in these cases, trying to distinguish arms with nearly identical means; this is also the case for ε-BAI, where we need to confirm that no arm is better than the identified arm by more than ε. Even in these cases, under the framework of BME, one can exploit empirical means from multiple arms to reliably estimate the best mean, without identifying the best arm, by exploiting the confidence ellipsoid.
Empirical reasons
To support this theoretical argument, we have conducted experiments comparing sample complexity across different settings. We compare UGapEc-BME (the total sample complexity of modified UGapEc), UGapEc-BAI (the sample complexity in the BAI phase of UGapEc-BME), and our EllipsoidEst, in two settings: Uniform (settings of Figure 2(b)) and Clustered (settings of Figure 7(a)).
Uniform
| #Arms (K) | UGapEc-BME | UGapEc-BAI | EllipsoidEst |
|---|---|---|---|
| 2 | 1633.2 | 284.2 | 580.5 |
| 4 | 2765.8 | 1260.8 | 911.0 |
| 8 | 5921.3 | 4261.3 | 1982.2 |
| 16 | 13434.7 | 11618.7 | 5717.5 |
| 32 | 27939.0 | 25967.0 | 19068.3 |
| 64 | 54505.2 | 52377.2 | 69815.4 |
In these settings, the "rare cases" arise frequently, and the modified UGapEc must take many additional samples after identifying the best arm.
Clustered
| #Arms (K) | UGapEc-BME | UGapEc-BAI | EllipsoidEst |
|---|---|---|---|
| 2 | 31132.2 | 29783.2 | 850.0 |
| 4 | 61628.9 | 60123.9 | 2160.0 |
| 8 | 125056.9 | 123396.9 | 6488.0 |
| 16 | 249217.6 | 247401.6 | 22016.0 |
| 32 | 509288.7 | 507316.7 | 80864.0 |
| 64 | 1021372.8 | 1019244.8 | 312256.0 |
Here, the modified UGapEc spends most of the time on identifying the best arm. In fact, the BAI phase of the modified UGapEc already requires more samples than EllipsoidEst. This illustrates how BAI can be significantly harder than BME, leading to unnecessary sample complexity.
Thank you for your response. I have no further questions. I have raised my score.
The paper proposes an algorithm that returns an ε-accurate estimate of the largest mean (by returning a sufficiently small interval that contains the best mean) in a bandit model with probability at least 1 − δ. It also gives an asymptotic lower bound on the sample complexity of any such PAC algorithm, which is matched by their algorithm. The lower bound can be written as the solution to a non-convex optimization problem, and a (somewhat) efficient computation of the resulting optimal allocation is proposed. Interestingly, the proposed algorithm calls this "optimal solver" only once and therefore remains efficient.
Strengths and Weaknesses
Strengths
Lower-bound-inspired algorithms are ubiquitous in the growing literature on pure exploration in bandits. This paper nevertheless makes an important step forward by tackling a problem in which the set of correct answers (all possible estimates within epsilon of the best mean) is infinite. The new lower bound is definitely non-trivial, as is the analysis of the proposed algorithm. I also appreciated the experimental section.
Weaknesses
The lower bound is cool, but (1) I would have presented it differently, and (2) more importantly, there is currently a mistake in its proof.
(1) Section 4 starts with an optimization problem (P1) and a discussion about it without connecting it directly to the lower bound. I would start with Theorem 2 and define f(\mu) therein as a minimum over r and U satisfying the constraints (2) and (3). Then, after that, you can discuss its computation. It is not clear to me whether we even need (P1) with a fully unconstrained variable, actually.
(2) In the proof of the lower bound, some alternative models are chosen. However, these models depend on δ. Lemma 6, on the other hand, is an asymptotic result that is only valid for fixed instances, so you cannot apply it to get Equation (50). Lemma 6 is based on asymptotic arguments (the law of large numbers); maybe you could instead use a non-asymptotic result relating the probability of an event under two different models, such as the change-of-distribution lemma used in [5]. In any case, I think a precise proof sketch of the lower bound is needed in the main text. Also, the lower bound should hold only for Gaussian distributions (even if the algorithms work for sub-Gaussian ones); in the proof of Lemma 6, when you use the change of distribution, it really needs to be your (Gaussian) likelihood.
On the upper bound side, I found it a bit hard to get Corollary 1 from Theorem 5. The finite-time upper bound in Theorem 5 seems to feature other terms that depend on , in particular the last one with . Where does the parameter come from in ? Even if it is nice to have an explicit upper bound instead of only an asymptotic result, I am not sure Theorem 5 is very insightful for the reader.
Questions
- Can you fix the lower bound?
- I wonder whether the proposed regularization is actually essential. Your correctness analysis hinges on the concentration inequality of [1], which indeed requires regularization to concentrate all arms simultaneously. But you could use other results that do not require regularization and would yield a instead of , see, e.g., https://jmlr.org/papers/volume22/18-798/18-798.pdf
- Did you experiment with the naive baseline described in l.171-174? E.g., use LUCB as a sampling rule (which doesn't depend on ) and stop whenever the confidence interval of the empirical best arm is of size at most . I suspect this would already be better than the naive two-stage baseline used in the experiments (and perhaps we can get a simple sample complexity bound and compare it to the optimal one, at least numerically).
Minor:
- the statement of Theorem 2 is weird. Why are you keeping the parameter and not letting it go to zero which removes it?
Limitations
Yes.
Final Justification
I raised my score in light of the correction in the lower bound.
Formatting Issues
No issues
Thank you very much for your detailed and constructive comments. In the following, we respond to your questions and weaknesses, addressing them in this revised order.
[Q1] lower bound
Thank you for your careful reading. Indeed, there is an issue in the current proof of Theorem 2 (Appendix A.2.2) due to the dependence on δ. While it would be possible to correct the proof by deriving a non-asymptotic version of Lemma 6 as suggested, it can also be easily fixed by the following simple modification.
Now we consider a sequence satisfying . As pointed out, satisfying , (41) and (42) depends on . Still, and become bounded sequences and therefore there exists a subsequence such that and for some such that . Since is finitely bounded independently of , we can immediately see from (41) and (42) that there exists such that, for all ,
By using the finiteness of again, there also exists independent of such that
which satisfies .
From this result, the discussion after (42) becomes valid by replacing with fixed parameters and considering the subsequence , which derives the contradiction as in the original argument.
[Q2] regularization
Thank you for the insightful suggestion. We agree that the regularization is not essential for the correctness analysis. While our current analysis relies on the concentration inequality from [1], alternative techniques, such as the one in the paper you suggested, can indeed yield similar guarantees without regularization.
That said, the referenced result does not seem to be directly applicable to general sub-Gaussian distributions. Nonetheless, adopting such techniques for specific exponential families is a promising direction, and we will incorporate this discussion in the revised version.
[Q3] naive baseline
Thank you for the suggestion. We conducted the following ablation studies to assess the impact of using a naive baseline as described:
(A) We fixed the sampling strategy to UCB (i.e., the algorithm never enters Phase 2). We use UCB instead of LUCB, since best arm identification is not our goal.
(B) We changed the stopping condition to a naive one: stop when .
Results:
(A) While the two-phase sampling strategy is crucial for our theoretical guarantees, its empirical impact is limited in the settings considered in our experiments. We observed some cases where the two-phase and one-phase strategies behave differently, but in the scenarios we tested, both strategies yielded very similar stopping times ().
(B) In contrast, the choice of stopping condition significantly affects performance, especially when arm means are clustered. The table below compares the stopping times () under EllipsoidEst (with our proposed stopping rule with the confidence ellipsoid) versus the naive stopping condition, using the same setup as in Fig. 6(b) (Clustered) with K=16:
|   | EllipsoidEst | Naive Stopping |
|---|---|---|
| 1 | 21200 | 61024 |
| 2 | 22016 | 63073 |
| 3 | 22816 | 65120 |
| 4 | 23632 | 67153 |
| 5 | 24432 | 69185 |
| 6 | 25232 | 71217 |
These results show that the naive stopping rule can result in significantly larger sample complexity in practice. We will include this comparison in the revised version.
[Q4] Theorem 2
Thank you very much for pointing this out. Since can be made arbitrarily small, we can indeed remove it. We will revise the paper accordingly.
[W1] P1
Thank you for the suggestion. We will revise the paper based on your suggestions. In particular, it is not essential to define the optimization including outside and we will clarify the equivalence in an early place.
[W2] lower bound
See our response to [Q1]
[W3] Theorem 5
It appears in the third line of (19), which has the same order as . It may be true that the non-asymptotic bound is not so informative in our case, and we will state the current Corollary 1 as the main theorem and move the current Theorem 5 to the appendix.
On [Q3] it is unclear to me why using UCB instead of LUCB in the baseline I suggested is going to make things easier. I am curious whether you confirmed this intuition by running simulations.
Thank you very much for engaging in the discussion.
We have chosen UCB in our previous response because it is the sampling strategy used in our EllipsoidEst. However, we have conducted an additional experiment to confirm that switching to LUCB requires more samples. Specifically, we use the sampling strategy of the LUCB algorithm proposed by Kalyanakrishnan et al. (ICML 2012) for selecting the best arms, setting in our case. Here are the results for the setting of Fig. 6(b) (Clustered) with K=16 (the one used in our previous response):
|   | UCB | LUCB |
|---|---|---|
| 1 | 21200 | 35836 |
| 2 | 22016 | 37216 |
| 3 | 22816 | 38566 |
| 4 | 23632 | 39916 |
| 5 | 24432 | 41266 |
| 6 | 25232 | 42616 |
In this particular case, the sample complexity of UCB (i.e., EllipsoidEst not entering into Phase 2) matches that of EllipsoidEst, but this is not always the case in other settings.
In this paper the authors study the problem of estimating the mean value of the best arm in an unstructured multi-armed bandit problem. They assume the rewards are sub-Gaussian, and provide a lower bound that any PAC estimation algorithm must satisfy. They then provide an algorithm based on a UCB principle, and conclude with experiments comparing the performance of their technique to Successive Elimination and UGapEc.
Strengths and Weaknesses
Strengths
- The authors provide ample numerical evidence, also in the appendix, comparing Bernoulli and Gaussian rewards. It could probably also benefit from a comparison to a very simple exploration technique (just uniform sampling?).
- Apart from some questions below, the main results for the algorithm seem sound, and the UCB approach is rather interesting.
- Pure exploration for estimation problems in bandit/MDPs is relatively new, and there are only a few works studying this problem.
Weaknesses
- Overall there are some presentation issues.
- The presentation of the lower bound is overall a bit unclear (derivation and presentation). To a general reader, it may not be clear why P1 should be true and why we should obtain this type of optimization problem.
- Also, the presentation after (P2) is unclear: what is the strategy and end goal? Why care about this specific case of ? This becomes even more confusing when , etc. get introduced.
- In Theorem 2 an asymptotic lower bound is presented. However, it is unclear what the two alternative models are, and what "substantial modifications" are required to deal with "infinitely many possible answers", as the authors write in lines 154-155. In the appendix no guide is provided to the proof strategy.
- The confidence ellipsoid argument in lines 177-182 is unclear: it is not explained why it should hold, or why we need it. Similarly, the proof of Thm. 4 is not well explained.
- Missing connections with practitioners: It is my belief that practitioners will hardly use this method to evaluate the performance of a real system that may have safety constraints. However, the goal of this type of theoretical work should be to provide insights on the key ideas that could potentially be used in a more practical setting.
- Technical:
- It is not immediately clear what the connection is with the multiple-correct-answer paper (Degenne and Koolen 2019). The correct answers there are a property of the model . For example, their paper is important in applications where we are interested in estimating something like , where maps the model to some output that is not necessarily a singleton (e.g., the set of optimal arms). However, in this case we are interested only in estimating a single scalar value.
- The model assumes sub-Gaussian distributions; however, the lower bound seems to be written for Gaussian distributions (see also the log-likelihood ratio in Lemma 6). However, there may be sub-Gaussian reward distributions that are easier to learn than Gaussian ones. As of now, the lower bound seems more to be an upper bound.
- The algorithm seems to rely on a known upper bound on , and it is not clear what happens if we do not know this constant.
- It is not clear what the main technical advancements are. The algorithm uses a UCB-based approach which seems novel, but apart from that it is not clear what the main technical advancements are compared to prior work.
- Proof of Lemma 6: why can the relative log-likelihood ratio be factorized in this way, as if the rewards are Gaussian distributed? Don't you just assume that the rewards are sub-Gaussian?
- It is not clear why the method does not seem to perform well with a large number of arms (see Fig. 7).
- Missing relevant prior work: while the authors discuss some related work, there is ample literature on the pure exploration problem. For example, Lemma 6 resembles classical changes of measure originally employed by Lai, e.g., in "Adaptive treatment allocation and the multi-armed bandit problem", 1987. More importantly, there is also some prior work on policy evaluation in RL that uses BAI techniques (Russo and Pacchiano, "Adaptive Exploration for Multi-Reward Multi-Policy Evaluation", 2025). This paper is also related to the first comment above. How does your technique compare and what are the differences?
Questions
- line 118: is the sampling rate or number of samples?
- line 121: the role of U is unclear to the reader, as is why it should make P1 non-convex, and also why fixing its value makes it convex.
- line 162: what does it mean " It also employs a two-phase sampling strategy to minimize the need for solving non-convex optimization problems"?
- Knowing an upper bound , should this quantity appear in the lower bound, since we have additional information/structure about the problem? If not, why should it not appear?
- What are the key challenges in implementing this algorithm?
Limitations
Authors only discuss limitations in the checklist, saying that the algorithm does not perform well with a large number of arms.
Final Justification
The rebuttal directly addressed core technical ambiguities; however, a number of issues still need to be addressed:
- The lower bound is proved for Gaussians while the algorithm assumes sub-Gaussian rewards. This needs to be addressed by either assuming only Gaussian rewards or extending the lower bound discussion.
- The stopping rule does not use tight bounds, as noted by reviewer Cj7t.
- Scalability to a large number of arms is limited.
We also discussed with the authors a connection to the "multiple correct answers" paper (Degenne and Koolen, 2019). They argue that Degenne and Koolen's analysis cannot be extended because it assumes a finite set of correct answers, whereas here the set is infinite. Does the infinitude arise because any value in an ε-ball is acceptable? Are all points in this ball equivalent in terms of information? These questions seem unanswered in the paper. My intuition is that any point in the ball is probably equivalent, and thus, even though the set is infinite, it can be viewed as a single point from a topological perspective; therefore it may not be the same as the multiple-correct-answers setting. Furthermore, this issue does not seem to appear in the policy-evaluation paper by Russo and Pacchiano (2025), which treats MDPs and targets single/multi-policy evaluation for a finite/convex set of rewards. This requires further investigation.
Formatting Issues
I did not notice any major issue.
Thank you very much for your detailed and constructive comments. In the following, we respond to your questions and weaknesses, addressing them in this revised order.
[Q1]
is a sampling rate, although the sum over is not necessarily 1. corresponds to the number of samples.
[Q2] role of
As discussed around Section 2, U is an upper bound on the best mean. P2 is a standard linear program and thus convex. In contrast, P1 includes higher-order terms in the variables, which introduce nonconvexity in the joint variables (r, U). Once U is fixed, these terms become linear in r. For additional discussion on the nonconvexity of P1, see our response to [Q2] of Reviewer Cj7t.
[Q3] meaning of a sentence
It means that we use a two-phase strategy that computes the nonconvex optimization only once, at the time of the phase shift, instead of computing it at every round like standard Track-and-Stop algorithms. This significantly reduces the computational burden associated with nonconvex optimization. For further details on the two-phase strategy, see our response to [Q2] from Reviewer SxLQ.
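As a purely schematic illustration of this two-phase control flow, here is a toy sketch with made-up sampling, allocation, and stopping rules; it is not EllipsoidEst, and the placeholder rules carry no guarantees.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_two_phase(means, sigma=1.0, eps=0.05, delta=0.05, phase1_rounds=200):
    """Toy two-phase loop: Phase 1 samples with a UCB index, the allocation is
    computed ONCE at the phase shift, and Phase 2 tracks it until stopping.
    All rules below are simplistic stand-ins, not the paper's actual ones."""
    K = len(means)
    counts = np.ones(K)
    sums = np.array([rng.normal(m, sigma) for m in means])  # one initial pull per arm
    t = K

    def pull(i):
        nonlocal t
        t += 1
        counts[i] += 1
        sums[i] += rng.normal(means[i], sigma)

    # Phase 1: UCB sampling (placeholder phase-shift rule: a fixed budget).
    while t < phase1_rounds:
        ucb = sums / counts + sigma * np.sqrt(2.0 * np.log(t) / counts)
        pull(int(np.argmax(ucb)))

    # Single allocation computation at the phase shift (placeholder formula).
    mu_hat = sums / counts
    weights = 1.0 / (mu_hat.max() - mu_hat + eps) ** 2
    allocation = weights / weights.sum()

    # Phase 2: track the allocation until a placeholder width-based stopping rule fires.
    while sigma * np.sqrt(2.0 * np.log(1.0 / delta) / counts[int(np.argmax(mu_hat))]) > eps:
        pull(int(np.argmin(counts - t * allocation)))
        mu_hat = sums / counts

    return mu_hat.max(), t

print(toy_two_phase(means=[0.5, 0.45, 0.3, 0.1]))
```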
[Q4] role of in the lower bound
It is true that the lower bound should change if we know , though it is standard in the literature after [1] not to consider this gap. Please also see our response to [W8] on the confidence region.
[Q5] implementation challenges
There are no major difficulties in the actual implementation of the algorithm, besides the fact that it relies on knowledge of instance-dependent constants ( and ), which may not always be readily available in practice.
Instead, the main challenges lie in its design, which enables computing the optimal allocation only once. This requires carefully determining the timing of the phase shift so that the estimates are sufficiently accurate, while avoiding over-exploration of any arm. A further challenge is that the estimates at the time of the phase shift might be inaccurate with a small but nonzero probability. To ensure a bound on the expected number of samples (rather than a high-probability bound), the algorithm is designed so that it still terminates within a reasonable number of rounds even when the computed allocation is suboptimal.
[W1] presentation of the lower bound
Each constraint in P1 corresponds to ensuring that the interval contains the true best mean. Specifically:
- (2) ensures that each is below (large is needed if the gap is small).
- (3) ensures that at least one is above , confirming that the interval contains a valid candidate for the best mean.
Given these constraints, we need to choose to minimize the required number of samples, which is captured by the objective function.
[W2] presentation after P2
The difficulty in our problem lies in the fact that optimizing the total number of samples involves solving a nonconvex optimization problem over . In contrast, once is fixed, the remaining optimization is explicitly solved. Given this situation, to extract the nonconvex and difficult feature, we introduced as the optimal value given (which is nonconvex in ) and as its minimizer, respectively. Since this is a nonconvex optimization, specifying upper and lower bounds of the optimal solution is beneficial for implementation and analysis, for which quantities like and are introduced. We will clarify these points in the revision.
[W3] lower bound
The alternative models are specified in (43) and (45) in the proof of Theorem 2. Constructing these alternatives requires carefully choosing values of and , depending on the algorithm that violates the lower bound by using the discussion in Lemma 7. Such constructions are more straightforward for finitely many possible answers. For the difficulties associated with infinitely many possible answers, see our response to [Q3] from Reviewer Cj7t.
[W4] confidence ellipsoid and Theorem 4
The argument using the confidence ellipsoid is intended to justify that, if there exists a subset of arms whose confidence region (whether rectangular or elliptical) is contained in the box defined by the interval, the true best mean is guaranteed to lie within this interval. Theorem 4 formalizes this argument in a higher-dimensional setting by showing that the confidence ellipsoid allows us to conclude that the true best mean lies within the desired interval.
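For intuition, here is an illustrative containment check with hypothetical numbers (a sketch of the geometric fact used above, not the paper's actual stopping statistic): an ellipsoid {m : (m − c)ᵀ A (m − c) ≤ β} lies inside the box [L, U]^d exactly when, for every coordinate i, c_i ± sqrt(β (A⁻¹)_ii) stays within [L, U]; if the true mean vector lies in such an ellipsoid, then in particular its largest coordinate lies in [L, U].

```python
import numpy as np

def ellipsoid_in_box(center, A, beta, lo, hi):
    """True iff the ellipsoid {m : (m - center)^T A (m - center) <= beta}
    is contained in the box [lo, hi]^d. The half-extent of the ellipsoid
    along coordinate i equals sqrt(beta * (A^{-1})_{ii})."""
    half_extent = np.sqrt(beta * np.diag(np.linalg.inv(A)))
    return bool(np.all(center - half_extent >= lo) and np.all(center + half_extent <= hi))

# Hypothetical numbers: with independent arms, A is diagonal (e.g., entries
# proportional to the number of pulls of each arm).
mu_hat = np.array([0.52, 0.49, 0.50])
A = np.diag([10000.0, 9500.0, 9800.0])
print(ellipsoid_in_box(mu_hat, A, beta=5.0, lo=0.45, hi=0.56))  # -> True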
[W5] connection with practitioners
While our work is primarily theoretical, it does offer a practical insight: to reliably estimate the best mean, one should construct a confidence ellipsoid around the empirical means. This approach is intuitively appealing, and we provide rigorous justification for its use.
[W6] connection with the multiple-correct answer paper
While the multiple-correct-answer paper allows a non-scalar output, it requires that the set of correct answers be finite. This finiteness is a key component in their analysis, for instance in the lower bound derivation, where they bound the probability that the algorithm outputs a particular answer within a given number of rounds. In contrast, our setting focuses on estimating a single scalar quantity (the best mean), but from an uncountable set. Due to this difference, the method and results for multiple correct answers cannot be directly applied to our setting.
[W7] lower bound is for Gaussian
Thank you for pointing this out. Our lower bound is indeed derived under the Gaussian assumption, primarily for clarity and tractability. The lower bound can be generalized to broader sub-Gaussian models by replacing the L2 distance (which arises from the Gaussian KL divergence) with the corresponding KL divergence for the given distribution class. We will clarify this point in the revised version.
We also agree that in some cases, sub-Gaussian distributions may lead to easier learning problems than the Gaussian case, depending on the available structure or additional information.
[W8] knowledge of the upper bound
We acknowledge that relying on a known upper bound is a limitation of our algorithm. However, this assumption is standard in the linear bandit literature with sub-Gaussian noise after the seminal paper by Abbasi-Yadkori et al. [1]. In some settings (e.g., Bernoulli rewards) it is trivially known, but this is not always the case.
When it is unknown, the specific form of the confidence bound from Theorem 2 in [1] (which we use) becomes inapplicable. However, Theorem 1 in [1] still provides a valid anytime confidence bound that does not require this knowledge, although the resulting confidence region becomes a bit more complicated. We will add this clarification in the revision.
[W9] technical advancements
Our main technical contributions lie in the sampling strategy and the stopping rule, both of which differ from prior approaches in meaningful ways.
Sampling Strategy: We extend the UCB approach using a two-phase strategy, where the optimal allocation (which is nonconvex) is computed only once, at the phase transition. This avoids solving a non-convex problem at every round, which is required in traditional Track-and-Stop. The technical challenge here is to guarantee that the estimates are sufficiently accurate at the time of the phase shift, while the phase shift must occur early enough to avoid over-exploring suboptimal arms.
Furthermore, unlike the sticky Track-and-Stop algorithm in the multiple-correct answer paper, our setting requires reasoning over infinitely many candidate answers. This introduces additional difficulty, as the set of relevant alternatives (i.e., those that could invalidate the current answer) dynamically changes with the estimates, which further complicates the analysis.
Stopping Rule: We introduce the use of a confidence ellipsoid for best mean estimation in an unstructured bandit setting. While such ellipsoids have been used in linear bandits, their application to identifying the best mean among independent arms is novel.
[W10] proof of Lemma 6
Thank you very much for pointing this out. In Section 4, we derive the lower bound under the Gaussian assumption, which should have been stated more clearly. For general sub-Gaussian rewards, the L2 distance should be replaced with the corresponding KL divergence.
[W11] performance under many arms
We agree that the performance degradation with larger numbers of arms is worth further investigation. However, this phenomenon is not unique to our setting and has also been widely observed in the literature on best arm identification [6]. Algorithms that achieve asymptotically optimal sample complexity often underperform in practice when there are many arms. For instance, Approximate Best Arm in [6] is asymptotically optimal but resorts to Naive Elimination, an algorithm with suboptimal asymptotic complexity, when the number of arms is small or after enough suboptimal arms have been eliminated.
[W12] Lai 1987
The change-of-measure technique used in Lemma 6 indeed follows the classical approach developed by Lai (1987), and we will make this connection explicit with proper citation. The key difference lies in how the technique is applied: in our work, it is adapted to handle the setting with infinitely many correct answers, which introduces new technical challenges, particularly in the proof of Theorem 2.
[W13] Russo and Pacchiano 2025
The paper by Russo and Pacchiano (2025) focuses on evaluating the value of all policies, rather than estimating the value of the best policy. In the context of multi-armed bandits (i.e., reinforcement learning without states), their objective corresponds to estimating the means of all arms, while our work focuses on estimating the best mean with high confidence.
I thank the authors for their response. I have some additional questions:
- Regarding [W1], in Eq. 2 should it be ?
- What if the solution is not unique?
- Regarding , since it is a parameter of the algorithm, how would the lower bound change?
- Regarding the connection with the multiple-correct-answer paper [2], I just wanted to specify that you use [2] to motivate your approach in your introduction. However, to me it seems completely different, as I wrote in my weakness section. I also believe there is some similarity between what you do and Russo et al. 2025, where they propose a PAC approach for policy evaluation. You state that their work "focuses on evaluating the value of all policies, rather than estimating the value of the best policy." However, your statement does not seem entirely correct. While they do not estimate the value of the best policy, at the same time they are not estimating the value of all policies. Rather, they are estimating the value for a target policy, or a finite number of target policies. Therefore, it would be ideal to discuss the differences.
- In the proof of Thm. 2, what are the main technical challenges you are referring to in your response to W12?
- does not seem to be a larger number of arms though. Is there any recent work with similar issues? Furthermore, in [6] they seem to use 300,000 arms.
Eq. 2
Thank you very much for pointing out this typo. It should be .
When optimal solutions are not unique
When there are multiple optimal solutions in (P2), we pick the one of the form written in Lemma 1 (ii), which is the definition of .
For the case with multiple (current) optimal arms, we fix the tie-breaking rule arbitrarily to determine as declared in Line 106.
The optimal solution of (P1) can also be non-unique. This point is indeed the difficult part of our analysis, and Lemma 9 is constructed so that any choice among these solutions leads to the optimal performance.
lower bound with the knowledge of
When we construct the lower bound under the explicit knowledge of , we need to first change the domain of in the characterizing optimization (P1) into and then drop the condition in (2) for the case of since it becomes unnecessary to confirm . This does not change the optimal value and solution if and the lower bound would not change in most practical instances.
connection with the multiple-correct answer paper [2]
We would appreciate it if you could kindly explain what was unclear in our response [W6] regarding [2].
To put it slightly differently, a similarity between our problem and the problem of [2] is that the solution is a function of the parameter in both problems. A difference is that the solution space is finite in the problem of [2], while it is infinite in our problem.
differences with Russo and Pacchiano (2025)
As you noted, "all policies" should have been stated "all of the target policies". In any case, while the problem studied by Russo and Pacchiano (2025) is different from ours, the two problems have a common structure in that the characterizing optimization is nonconvex.
Russo and Pacchiano (2025) deal with this nonconvexity by convex relaxation as shown in their Theorem 4.6. Their Theorem 5.2 then shows that this relaxation loses optimality at most by a factor of 4.
On the other hand, we solve our nonconvex optimization problem (P1) by decomposing it into convex subproblems (P2) and reducing (P1) into a one-dimensional optimization problem within a narrow interval, which we solve via grid search with an arbitrary accuracy, as we explain in L143-145.
We will discuss this point in the revision.
technical challenges in the proof of Theorem 2
Change-of-measure has been widely used since Lai (1987). In the K-armed bandit problem (Lai 1987), the answer is mutually exclusive (e.g., arm 1 is best or arm 2 is best), and there is only one correct answer for a given model. Whereas in multiple answers [2], there can be overlapping answers (e.g., best arm is in {1,2}, best arm is in {2,3,4}), and in our case, there can even be infinitely many answers (e.g., best mean is in with ). These differences lead to the technical challenges, as we responded to [Q3] by Reviewer Cj7t.
Here, we summarize our response to [Q3] of Cj7t. In standard change-of-measure arguments, one typically relates the probabilities of events such as or , which are straightforward to define when the set of correct answers is finite. However, it becomes nontrivial to define such events when the set of answers is infinite, and we construct appropriate events (around the bottom of p.15) using structural properties of the underlying optimization problem, as developed in Lemma 7 and the first part of the proof of Theorem 2. For example, unlike [2], where the finiteness of the answer set allows the summation $\sum_{i \in i^*(\mu)}$ (Appendix B.5), we instead introduce the events $\mathcal{E}^+$ and $\mathcal{E}^-$ to decompose the continuum of possible answers into a finite number of sets.
A large number of arms
Our main contribution is primarily theoretical: we establish the first asymptotically optimal algorithm for best mean estimation. Improving its performance in practical settings, particularly those involving a large number of arms, is an interesting direction for future work, as discussed at the end of our conclusion.
That said, there are practical applications where estimating the best mean among arms is relevant, as outlined in the first three paragraphs of our introduction. Moreover, the number of arms that EllipsoidEst can handle depends on the specific instance. We conducted an additional experiment in the setting of [6] and confirmed that EllipsoidEst estimates the best mean with 13,997 samples for , 1,257,487 samples for , and 127,477,071 samples for . While these numbers are far smaller than , where best arm identification followed by additional sampling for best mean estimation would be more effective, EllipsoidEst remains applicable for moderate values of to .
I thank the authors for their response. I will reconsider my evaluation at the end of the discussion phase to also factor in the other reviews.
The authors consider the task of best-mean estimation in a pure-exploration multi-armed bandit setting. Specifically, their objective is to identify an estimate of the best arm's mean within a given threshold using as few samples as possible. First they provide an instance-specific lower bound that is asymptotic. Then, they propose an algorithm whose sample complexity matches that of the lower bound asymptotically. The authors further complement their results with numerical experiments.
Strengths and Weaknesses
Strength.
- The authors study a new problem in bandits, corresponding to identifying an epsilon-close estimate of the best mean, which is an interesting question.
- The authors provide instance dependent lower bounds which requires adapting the classical change-of-measure argument from the setting of best-arm identification to their new setting.
- The authors also provide an algorithm whose sample complexity matches the lower bound asymptotically, i.e. .
- Even though the paper is of theoretical nature, the authors provide experimental results to complement their theoretical findings.
Weaknesses.
- The analysis and design of track-and-stop for this new problem
- I think some relevant papers are missing in the literature review. The authors did not compare with the paper of Garivier and Kaufmann (see [00]), which I believe already treats this problem.
- The authors only provide an asymptotic lower bound, i.e., one valid only in the regime δ → 0.
- The proposed algorithm combines previously known ideas, notably similar stopping rules were considered for instance in [01].
- The optimality of the proposed algorithm is only asymptotic. What about the non-asymptotic regime?
[00] A. Garivier and E. Kaufmann. Non-Asymptotic Sequential Tests for Overlapping Hypotheses Applied to Near Optimal Arm Identification in Bandit Models, Sequential Analysis
[01] P. Menard, et al. Fast active learning for pure exploration in reinforcement learning, ICML 2021
Questions
- The notation is not rigorous. What are the limits in the sum? I think you want to say .
- The authors say that their lower bound instance dependent constants involves a non-convex optimization problem which does not arise in classical bandits setting. This is a bit confusing to me because in the classical best arm identification setting the optimization problem involved in the lower bound is also nonconvex (e.g. see [20]). Can the authors clarify this point?
- The authors claim that the change-of-measure argument requires substantial changes because they deal with infinitely many answers. I think this is over-claiming the difficulty of the problem. Can the authors be more precise about why this is a challenging step in their analysis?
- The authors rely on the concentration inequality of Lemma 3 to derive their stopping rule, which relies on the self-normalized martingale bound given in [1]. Why not use the bound given in [02], which offers a tighter dependence on rather than ? This can yield faster stopping.
[02] E. Kaufmann and W. M. Koolen, Mixture Martingales Revisited with Applications to Sequential Tests and Confidence Intervals, JMLR 2021.
Limitations
See weaknesses and questions
Final Justification
Overall, I find this work on the borderline. I would like to maintain my score. The rebuttal did clarify a few things, but I would encourage the authors to make some of their claims rigorous:
- Regarding the structure of the set of optimal allocations, the authors argue that this is a challenging part of their algorithm design. In the rebuttal, they gave a numerical example as to why this is the case, which to be honest would be better visualised by a plot, but I believe a proof would also be much better. The manuscript doesn't really talk about this part, which I find a bit strange if it is such a challenge. This point was raised because I was trying to understand the significance and novelty of their algorithm design.
- Although minor, the authors should also make sure they use stopping rules with improved thresholds. For specific distributions, like Gaussians, this is possible. The authors argued that their focus is sub-Gaussian distributions. I find this argument really weak, especially since their lower bounds are only for Gaussians.
- The authors should also clarify that their lower bounds are for the Gaussian case, which they did not highlight in the manuscript.
- The authors should also tone down their claims about the lower bound in view of the discussion during the rebuttal. I find the result to rely heavily on existing ideas around the change of measure for deriving lower bounds, not that this diminishes their contribution. For example, non-convexity of the optimisation problem is standard in bandit settings.
Formatting Issues
None noticed
Thank you very much for your detailed and constructive comments. In the following, we respond to your questions and weaknesses, addressing them in this revised order.
[Q1] Notation
Thank you for pointing this out. We agree and will revise the notation to for clarity and rigor.
[Q2] Non-convexity
We agree that, in classical best arm identification, the lower bound can also be formulated as a nonconvex optimization. However, an important distinction lies in the availability of a convex reformulation.
In classical best arm identification, the lower bound (e.g., Proposition 16 in [02] and (1) in [20]) takes the form:
$$
\max_{w \in \Sigma} \inf_{\lambda \in \mathrm{Alt}(\mu)} \sum_{a \in [K]} w_a d(\mu_a, \lambda_a),
$$
which is indeed a nonconvex problem if we first solve the inner minimization. However, this can be equivalently rewritten as a linear program with infinitely many constraints:
$$
\max_{x \in \mathbb{R},\, w \in \Sigma_K} x \quad \mbox{s.t.} \quad x - \sum_{a \in [K]} w_a d(\mu_a, \lambda_a) \le 0, \;\forall \lambda \in \mathrm{Alt}(\mu),
$$
where the objective and the constraints are linear (and hence convex) in $(x,w)$. While the constraint set indexed by $\lambda\in\mathrm{Alt}(\mu)$ is nonconvex, the resulting optimization problem enjoys desirable properties such as convexity (and often uniqueness) of the optimal solution set, and continuity of the optimal allocation with respect to $\mu$. These properties are crucial for algorithms like Track-and-Stop, which rely on the property that the allocation gradually converges to the optimal one.
In contrast, our problem does not admit such a convex reformulation, and the lower bound remains inherently nonconvex. This leads to challenges such as possible discontinuities in the optimal allocation and the lack of structure typically exploited in classical settings. This is the key distinction we intended to highlight.
[Q3] change-of-measure argument
In standard change-of-measure arguments, one typically relates the probabilities of events such as $\Pr[(\mbox{best arm is } i)]$ or $\Pr[(\mbox{best arm is in some set})]$, which are straightforward to define when the set of correct answers is finite. In contrast, in our setting, the number of possible answers (i.e., the estimated best mean) is uncountably infinite, making it nontrivial to define the event whose probability should be bounded. To address this, we construct appropriate events (see around the bottom of p.15) using structural properties of the underlying optimization problem, as developed in Lemma 7 and the first part of the proof of Theorem 2. For example, unlike [2] (Degenne and Koolen, 2019), where the finiteness of the answer set allows the summation $\sum_{i \in i^*(\mu)}$ (Appendix B.5), we instead introduce the events $\mathcal{E}^+$ and $\mathcal{E}^-$ to decompose the continuum of possible answers into a finite number of sets.
That said, our main technical contributions lie in the upper bound analysis, and we will tone down the emphasis on this point in the revision if it still looks like over-claiming.
[Q4] self-normalized martingale bound
Thank you for the insightful suggestion. While our current analysis relies on the concentration inequality from [1], alternative techniques, such as the one in the paper you suggested, can indeed yield similar guarantees without regularization. That said, the referenced result does not seem to be directly applicable to general sub-Gaussian distributions. Nonetheless, adopting such techniques for specific exponential families is a promising direction, and we will incorporate this discussion in the revised version.
[W0] analysis and design
We believe this point is intended as a strength of the paper, rather than a weakness.
[W1] Garivier and Kaufmann
The referenced paper [00] addresses the $\varepsilon$-best arm identification problem, similar to works such as [3, 4, 8, 10, 11] that we discuss in the paper. In contrast, our work focuses on estimating the best mean with a PAC guarantee, rather than identifying an $\varepsilon$-best arm. In particular, [00] computes the empirical mean of an $\varepsilon$-best arm, but this empirical mean is not guaranteed to satisfy the PAC criterion. Our setting and guarantees are therefore fundamentally different from those in [00].
[W2] asymptotic lower bound
We agree that our lower bound is established in the asymptotic regime as $\delta\to 0$. Deriving tighter or non-asymptotic lower bounds is an interesting direction for future work.
That said, we note that even in the "finite answer" setting, the lower bound in [2] (Degenne and Koolen, 2019) is also formulated for the asymptotic regime, highlighting the inherent difficulty of obtaining sharp finite-$\delta$ lower bounds in these problems. In particular, for problems with multiple correct answers, we cannot directly use the high-level form of [00] due to the unavailability of an event that holds with probability $>1-\delta$ in one model and with probability $<\delta$ in alternative models.
[W3] similar stopping rules
The construction of our confidence region is fundamentally different from that in [01]. Our key contribution regarding the stopping rule is the use of a confidence ellipsoid framework, which has been previously employed in linear bandits but not in the setting considered by [01]. In contrast, [01] uses independent confidence intervals, which is a natural and standard approach when rewards from different arms are independent, especially in the context of best arm identification. If there are specific aspects of similarity you are concerned about, we would appreciate further clarification to ensure we address any overlooked connections.
[W4] non-asymptotic optimality
We acknowledge that the optimality of the proposed algorithm is currently established only in the asymptotic regime as $\delta\to 0$. Extending these results to non-asymptotic regimes is an important and challenging direction for future work, as also noted in our response to [W2].
I thank the authors for their answers. I do need further clarifications. Please find below my questions and comments.
Regarding the non-convexity of the optimization problem leading to the lower bound: the classical idea is to partition the continuum set of alternative parameters into a finite collection of subsets on which the innermost problem is indeed convex, as done for instance in [20].
$$
\sup_{\omega \in \Sigma} \min_{\lambda \in \mathrm{Alt}(\mu)}\sum_{i=1}^K \omega_i \mathrm{kl}(\mu_i, \lambda_i) = \sup_{\omega \in \Sigma} \min_{j \neq i_\star(\mu)}\inf_{\lambda: \lambda_j > \lambda_{i_\star(\mu)}} \sum_{i=1}^K \omega_i \mathrm{kl}(\mu_i, \lambda_i)
$$
You mention: "... This leads to challenges such as possible discontinuities in the optimal allocation and the lack of structure typically exploited in classical settings ... "
- Indeed, in general for structured problems the optimal allocation is not necessarily unique, as is the case in linear bandits, but ideas like C-tracking can still work, as observed in [W3]. Is the set of optimal allocations in your problem not unique? What properties does this set have?
- As for the continuity of the set of optimal allocations with respect to $\mu$, one typically uses the notion of hemi-continuity, as done in [20]. Could you clarify what the continuity properties of the set of optimal allocations are?
[W3] Y. Jedra, A. Proutiere, Optimal best-arm identification in Linear Bandits, NeurIPS 2020.
Regarding the weakness concerning analysis and design. Track-and-Stop as a design principle is well studied and established. In other words, I am wondering what novelty you provide in terms of design principles. You said that your main technical novelty is in the upper bound analysis. It would be useful to clarify what the key challenges are that differentiate your track-and-stop algorithm, either from a design perspective or from an analysis perspective. I would appreciate it if you could point me to the specific parts of your proofs that you think are novel.
Regarding the stopping rule. Note that a confidence interval with dependence yields much faster stopping in practice. If one assumes that the distribution is Gaussian, why is it that your setting does not allow using such concentration bounds?
My concern about the stopping rule when I mentioned the work [W2] has to do with the novelty of your stopping rule. [W2] builds on using upper and lower confidence bounds. Your stopping rule seems to use a similar design mechanism. In best arm identification, and generally speaking in hypothesis testing, one often relies on a generalized likelihood ratio test for tighter bounds. Can you elaborate on how your stopping rule is related to such tests?
Minor comments:
In Subsection 4.2.9, I believe the title of the subsection should be "Proof of Theorem 2".
Regarding non-convexity.
In [W3], the optimal allocation may not be unique, but the set of optimal allocations is convex. This implies that, if we are already near some optimal allocation and we obtain another optimal allocation that may be far from the current one, then any allocation on the line between them is also optimal thanks to convexity. As a result, tracking any optimal allocation can guarantee convergence to some optimal allocation even if the targeted (non-unique) optimal allocation oscillates over rounds. In contrast, in our problem, the optimal allocation set may be nonconvex or disconnected.
Regarding the weakness concerning analysis and design.
As a result of this nonconvexity or discontinuity in our problem, a mixture of optimal allocations may no longer be optimal, which is the reason that Track-and-Stop is not appropriate for our problem. This difficulty also applies to the case of [2] (we guess that the [20] you wrote means [2]), and we used a continuity property very similar to hemi-continuity (corresponding to our Lemma 9). Still, since [2] heavily relies on the fact that the number of answers is finite (and needs to compute the optimal allocation at every round), we need additional algorithmic advancements to address continuously many answers in an efficient way. Please also see our reply to [W9] of Reviewer nDLd on this point.
The main novelty thus lies in the design of our EllipsoidEst algorithm. In particular,
- A nonconvex optimization problem is reduced to a tractable one-dimensional optimization over a narrow interval.
- The algorithm's two-phase structure allows the one-dimensional optimization to be solved only once, at the beginning of Phase 2.
These design choices are constructed carefully so that the upper-bound analysis goes through. While each individual step of the analysis may not be particularly novel, we believe the overall argument is far from trivial and that formally establishing the upper bound requires substantial effort. (A minimal sketch of the one-dimensional reduction is given below.)
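To make the first point concrete, here is a minimal sketch of how such a one-dimensional reduction could be solved in practice. This is not the exact routine in the paper; `p2_value`, `u_lo`, `u_hi`, and `n_grid` are placeholder names for the (P2) objective and the endpoints of the narrow interval. Thanks to the two-phase structure, such a routine would only need to run once, at the beginning of Phase 2.

```python
# Sketch only: a fine grid scan over a narrow interval [u_lo, u_hi] locates the
# minimizer of the one-dimensional problem and also reveals multiple local
# minima when they exist. `p2_value(u)` is a hypothetical callable returning
# the objective value of (P2) at the candidate best mean u.
import numpy as np

def solve_one_dimensional(p2_value, u_lo, u_hi, n_grid=2000):
    us = np.linspace(u_lo, u_hi, n_grid)
    vals = np.array([p2_value(u) for u in us])
    # interior grid points whose value is below both neighbors are local minima
    mask = (vals[1:-1] < vals[:-2]) & (vals[1:-1] < vals[2:])
    local_minima = list(zip(us[1:-1][mask], vals[1:-1][mask]))
    best = int(np.argmin(vals))
    return us[best], vals[best], local_minima
```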
The most difficult part of the analysis of the algorithmic behavior is arguably Lemma 12, aside from the analysis related to the nonconvex optimization. Our algorithm mostly just pulls the arm with the maximum UCB, and the estimated optimal allocation is enforced only by imposing a minimum number of pulls of the estimated best arm in Phase 2 (Line 5 of Algorithm 1). While this contributes to computational efficiency, it makes the joint behavior of the confidence ellipsoid and the numbers of arm pulls quite complicated. In the proof of Lemma 12, we characterize the properties they satisfy at every round by considering optimization problems (used only in the analysis) in (100) and (105), through which the algorithm is guaranteed to stop with the upper bound of the best mean close to the optimal one.
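As a rough illustration of this sampling mechanism (an assumed simplification, not the exact Line 5 of Algorithm 1; in particular, the way the minimum number of pulls is prescribed is our placeholder), the Phase-2 arm choice can be pictured as follows. Here `ucb`, `pulls`, `b_hat`, and `min_frac` are hypothetical names for the per-arm upper confidence bounds, the pull counts, the estimated best arm, and the minimum share of pulls enforced for it.

```python
import numpy as np

def choose_arm_phase2(ucb, pulls, b_hat, min_frac):
    """Sketch of the Phase-2 rule: pull the max-UCB arm, but enforce a minimum
    number of pulls of the estimated best arm b_hat."""
    if pulls[b_hat] < min_frac * pulls.sum():
        return b_hat                 # enforce the minimum pulls of the estimated best arm
    return int(np.argmax(ucb))       # otherwise, pull the arm with the maximum UCB
```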
Regarding the stopping rule.
We chose the confidence ellipsoid for sub-Gaussian rewards to maximize generality. As pointed out, our stopping rule is closely related to the generalized likelihood ratio (GLR) test used in works such as Degenne and Koolen (2019), and the form of the confidence ellipsoid becomes almost the same as the one based on the GLR test for the Gaussian model, except for the scale of the threshold and the presence of the regularization.
Still, since our argument related to the nonconvex optimization is already very involved even for the Gaussian divergence (i.e., the squared gap between the parameters), it is highly nontrivial to obtain results for general models such as exponential families. Given this, we believe it is reasonable, as a first step on our BME problem, to consider the setting where the confidence region is expressed as an ellipsoid rather than in the general form given by the GLR test. Among such settings, sub-Gaussian distributions with the confidence ellipsoid of [1] constitute one of the most standard nonparametric models, and we believe this model is important to study in its own right, rather than, e.g., general exponential families, even though, as pointed out, the confidence region becomes somewhat larger than the ones tailored to each family.
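For concreteness, the following sketch shows how upper and lower confidence bounds on the best mean can be read off a Gaussian-style confidence ellipsoid of the assumed form Σ_a N_a (μ_a − μ̂_a)² ≤ 2σ²β. This is a hedged reconstruction, not the paper's exact stopping rule: the regularization mentioned above is omitted, the threshold β is left abstract, and the stop-when-width-is-below-2ε criterion is an assumption on our part.

```python
# Hedged reconstruction (assumed form; the paper's exact rule may differ):
# confidence ellipsoid E = { mu : sum_a pulls_a * (mu_a - muhat_a)^2 <= budget },
# with budget = 2 * sigma^2 * beta.  The largest best mean over E is attained by
# raising a single arm; the smallest is the least level U to which all empirical
# means can be pushed down within the budget, found by bisection.
import numpy as np

def best_mean_bounds(muhat, pulls, budget):
    # Upper bound: spend the whole budget on a single arm.
    ucb = np.max(muhat + np.sqrt(budget / pulls))
    # Lower bound: smallest U with sum_a pulls_a * max(muhat_a - U, 0)^2 <= budget.
    lo = np.min(muhat) - np.sqrt(budget / np.min(pulls))
    hi = np.max(muhat)
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        cost = np.sum(pulls * np.maximum(muhat - mid, 0.0) ** 2)
        lo, hi = (mid, hi) if cost > budget else (lo, mid)
    return hi, ucb

def should_stop(muhat, pulls, budget, eps):
    lcb, ucb = best_mean_bounds(muhat, pulls, budget)
    return ucb - lcb <= 2.0 * eps    # if true, output the midpoint (lcb + ucb) / 2
```

For Gaussian rewards and a GLR-type threshold β, such a check essentially coincides with a GLR-based stopping rule, up to the scale and regularization differences mentioned above.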
Minor comments.
Yes, this subsection is on the proof of Theorem 2. Thank you very much.
Thank you for your additional clarifications. I think I will keep my score for now, which I believe is already high enough, and reassess my evaluation once the discussion phase ends, also factoring in the other reviews. I would still like to point out a few comments about your rebuttal.
- Please note that you have not yet provided me with a rigorous answer about why the set of optimal allocations may be non-unique or disconnected. Saying that this may be the case is not sufficient. I have also not found a rigorous proof of this in your paper (correct me if I am wrong). You need to add, at the very least, a construction of an example where the optimal allocation is non-unique and disconnected.
- About the stopping rule, arguing that you wanted to focus on the sub-Gaussian case is not very convincing. Actually, your lower bounds are only derived for Gaussian distributions, and by the way, this should be made precise in your manuscript. If your focus is only the general sub-Gaussian case, then why are your lower bounds only for Gaussian distributions? This kind of inconsistency makes your argument very weak. Note that, at the same time, the reference [02] I mentioned also provides improved concentration bounds for the Gaussian case (see their Theorem 9). Being able to provide very strong results at least for the Gaussian case would be a valuable strength for your work. Therefore, I think this remains a weakness, namely that your stopping rule does not exhibit a tight threshold, given that the state of the art in other bandit problems already has one, especially when this actually matters a lot in practice.
Thank you for your further comments.
Please note that you have not yet provided me with a rigorous answer about why the set of optimal allocations may be non-unique or disconnected. Saying that this may be the case is not sufficient. I have also not found a rigorous proof of this in your paper (correct me if I am wrong). You need to add, at the very least, a construction of an example where the optimal allocation is non-unique and disconnected.
There are numerical examples where the solution of (P1) has multiple disconnected local minima. For example, one such instance arises for a particular choice of the arm means and problem parameters. In that instance, the objective has two local minima, around U = 1.16324 and U = 1.1746. For instances around this one, the optimal solution changes discontinuously between these two values of U (please also see the table below).
| U | Optimal value of (P2) |
|---|---|
| 1.16 | 14419.9 |
| 1.16324 | 13937.7 |
| 1.16882 | 14055.8 |
| 1.1746 | 13937.6 |
| 1.17998 | 14251.5 |
This phenomenon simply comes from the fact that the feasible region is nonconvex. The Hessian of the left-hand side (LHS) of (2) with respect to the optimization variables has a strictly positive eigenvalue, which means that the LHS of (2) is not negative-semidefinite and thus the feasible region is nonconvex. In general, convexity rarely appears for cubic terms of this kind, whereas the corresponding term was linear (and thus convex) when only the allocation ratios were the variables, as in pure exploration with a single correct answer.
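As a small self-contained check of this point, consider a term of the assumed form r(μ − U)², i.e., a product of an allocation variable and a squared mean difference; this is our illustrative placeholder and may differ from the exact expression in (2). Its Hessian is indefinite, as the following snippet verifies.

```python
import sympy as sp

r, mu, U = sp.symbols('r mu U', real=True)
term = r * (mu - U) ** 2                  # assumed cubic-type term, for illustration only
H = sp.hessian(term, (r, U))              # [[0, -2*(mu-U)], [-2*(mu-U), 2*r]]
print(H.eigenvals())                      # r +/- sqrt(r**2 + 4*(mu-U)**2)
```

One eigenvalue is always nonnegative and the other nonpositive (both strictly so whenever μ ≠ U), so such a term is neither convex nor concave, and a constraint built from it generally defines a nonconvex feasible region.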
In any case, since we have already constructed a more efficient algorithm that computes the optimal allocation only once, we believe that placing too much emphasis on the absence of a formal proof that the standard approach is inapplicable would not be constructive.
About the stopping rule, arguing that you wanted to focus on the sub-Gaussian case is not very convincing. Actually, your lower bounds are only derived for Gaussian distributions, and by the way, this should be made precise in your manuscript.
Indeed, this statement was missing, and we will also write the bound in a general form by replacing the L2 distance with the corresponding KL divergence.
If your focus is only the general sub-Gaussian case, then why are your lower bounds only for Gaussian distributions?
Even if we constructed an information-theoretic lower bound for sub-Gaussian models, designing an algorithm that achieves such a bound would be highly non-trivial from the viewpoint of both optimization and statistics. The class of sub-Gaussian distributions belongs to the nonparametric models with centered moment conditions, which are known to be very difficult even for simple regret minimization [Baudry+2023]. On the other hand, it is very standard to pursue the Gaussian lower bound for sub-Gaussian distributions, as in [1, 8].
[Baudry+2023] Baudry et al., "A General Recipe for the Analysis of Randomized Multi-Armed Bandit Algorithms", arXiv, 2023. https://arxiv.org/pdf/2303.06058
Note that, at the same time, the reference [02] I mentioned also provides improved concentration bounds for the Gaussian case (see their Theorem 9). Being able to provide very strong results at least for the Gaussian case would be a valuable strength for your work. Therefore, I think this remains a weakness, namely that your stopping rule does not exhibit a tight threshold, given that the state of the art in other bandit problems already has one, especially when this actually matters a lot in practice.
We agree that [02] is very useful for exponential families (thank you very much for suggesting it), and we will definitely add a discussion based on experiments, at least for the Gaussian case, with the stopping rule from [02]. Still, while a tighter stopping rule indeed matters in practice, the tighter bound in [02] comes at the cost of restricting the rewards to follow an exponential family exactly, which does not always seem to be the case in practice. Of course, the stopping rule of [02] would likely work well in practice for distributions that deviate only slightly from Gaussian, but its informality does not seem very different from using the sub-Gaussian bound with a heuristically narrowed confidence region, which is also a typical heuristic, as in [5].
The paper addresses the problem of estimating the mean reward of the best arm in a multi-armed bandit (MAB) setting, which is distinct from simply identifying the best arm. The authors establish an instance-dependent lower bound on the sample complexity for this problem and argue that a straightforward application of standard MAB algorithms like Track-and-Stop cannot achieve sample optimality. They introduce a new algorithm that combines a confidence-ellipsoid-based stopping condition with a two-phase sampling strategy, a design specifically tailored to handle the challenges of non-convexity. The proposed algorithm is simple, nearly free of hyperparameters, and achieves the instance-dependent, asymptotically optimal sample complexity.
All reviewers acknowledged that studying the estimation of the best mean (as opposed to its identification) is an interesting problem in the pure-exploration bandit literature. The paper's main strength is its rigorous theoretical contribution, including the derivation of a non-trivial lower bound and an algorithm proven to match it asymptotically. The handling of a problem with an infinite set of possible correct answers (any value within ε of the true best mean) was seen as significant.
The authors provided thorough and convincing rebuttals, addressing the technical concerns (like the proof error), clarifying ambiguities, and defending their contributions.