Instance-Optimal Pure Exploration for Linear Bandits on Continuous Arms
We propose a novel algorithm for a pure exploration problem with linear bandit feedback on a continuous arm set, with instance-dependent optimality guarantees.
Abstract
Reviews and Discussion
This paper studies ε-BAI for Bayesian linear bandits with Gaussian noise and a Gaussian prior on compact, continuous arm sets. The performance metric being minimized is the posterior probability of failing to identify an ε-optimal arm, conditioned on the history of the observations. In particular, the authors consider the instance-dependent asymptotic rate of convergence of this conditional posterior probability. This setting is challenging due to the infinite dimensionality of the space of policies (i.e., continuous distributions on the set of arms) and the non-smoothness of the objective. To alleviate these challenges, the authors re-parametrize the optimization problem with respect to the design matrix induced by the sampling policy. As a consequence, they need to extract a policy from the optimized design matrix, which is done with projection and reconstruction along the lines of the approximate Caratheodory problem. This requires solving quadratic and fractional quadratic optimization problems on the arm set. The authors provide an upper and lower bound on the conditional posterior probability of error. Moreover, they numerically compare their algorithm with other benchmarks on synthetic instances.
## Update after rebuttal
Including the discussions in the revised version will improve the paper, hence I raised my score to weak accept.
Questions for the Authors
Several questions have been asked in the previous sections.
Claims and Evidence
I. Tractable algorithm.
While PMCA aims at solving more tractable optimization problems, the discussion on the tractability of PMCA is not fully convincing based on Sections 5.2 and 5.3. To the best of my understanding, PMCA has both a large computational and space cost. In particular, it would be valuable to have a more detailed discussion on the following points.
Quadratic Objective. What is the per-call computational and space cost of the quadratic objective used in Algorithm 2? More importantly, the authors mention that “some problems can be solved in a polynomial time, and some problems are NP-hard”. Is the quadratic objective solvable in polynomial time for the problems considered in this paper, or is it NP-hard? Even if the computational cost is polynomial, this oracle is called a superlinear number of times at each iteration, so it seems to be a major computational bottleneck.
Quadratic Fractional Objective. What is the per-iteration computational and space cost of the quadratic fractional objective (called at each time step)? What is the convergence rate, and how does it impact the choice of the “break” condition for this optimization procedure? In particular, Appendix G.1 seems to suggest that at most 30 iterations are used, and refers to a quantity not defined in Section 5.3.
Algorithm 2. The memory requirement also seems to be superlinear in time. Is it possible to maintain a sparser approximation of the policy? If the discussion is not possible for general sets of arms, then it would be enlightening to illustrate it in some special cases such as (1) the unit sphere or (2) a spherical cap as considered in Section 7.
Based on the small scale of the experiments and the lack of reported computational costs, I am wondering whether PMCA can truly be considered a tractable and practical algorithm.
II. Instance-Dependent Optimality.
While the authors claim instance-dependent optimality, their actual statement doesn’t reflect what is commonly referred to as “Instance-Dependent Optimality”. Therefore, it would be better to be more precise and discuss the difference with what is usually referred to as asymptotic optimality.
Theorem 6.2 is a statement on the conditional posterior probability of misidentification, hence it doesn’t account for the randomness of the observations themselves.
Theorem 6.2 assumes knowledge of the unknown characteristic time in order to match the optimal rate. This is not a practical assumption, hence the theorem doesn't hold for practical implementations.
III. Recommendation Rule.
The authors state their result for a given stream of recommendations. However, to the best of my understanding, they have theoretical guarantees (Section 6.3) and experiments (Section 7) only for the greedy recommendation rule. The paper would gain in clarity by being specialized to the greedy recommendation rule. The greedy recommendation rule is by far the least computationally expensive method, even though it is not asymptotically optimal on all instances.
Section 4.3 hints at a more sophisticated recommendation rule, i.e., the (instantaneous) furthest answer of Jourdan & Degenne (2022). However, this choice is far from tractable. Even with the approximation algorithms from PMCA, it would require an additional outer optimization loop for a non-convex optimization problem. On top of the intractability of this procedure, it will probably be challenging to show that Equation (7) holds for this choice.
IV. Stopping Rule.
To the best of my understanding, based on paragraph “Stopping Rule” in Section 3 and Section 4.2, the authors do not provide a clear and convincing discussion as regards the stopping rule. In particular, the authors should explicitly define what is the stopping rule that they are referring to, compare it to the literature on stopping rule, and explain whether this stopping rule has some theoretical guarantees.
In particular, while the authors only control the posterior probability conditioned on the history of observations, proving (ε,δ)-PAC guarantees also requires controlling the randomness of the observations themselves, i.e., without the conditioning on the history. Therefore, comparing their upper bound to δ will not be enough to obtain an (ε,δ)-PAC stopping rule, both theoretically and empirically.
The upper bound proposed in Lemma 4.2 does not seem to be truly non-asymptotic. It is based on a discretization of the space (Lemma D.1) and the different terms do not seem to be explicit.
Methods and Evaluation Criteria
See “Experimental Designs Or Analyses” section for details on the empirical evaluation.
Theoretical Claims
Lemma 4.1 seems unusual for the literature due to its dependency on t on both sides of the equation. In particular, while both the sampling and the recommendation rules are not explicitly given and only satisfy some general conditions, the upper and lower bounds are matching. This seems too good to be true. Could the authors provide more discussion?
Lower bound from (2). While it could be independent of the sampling policy as in Theorem 2.1 in Li et al. (2024) and Theorem 1 in Russo (2016), the analysis should most likely require more assumptions on the recommendation rule at each time. For example, a minimum requirement should be that the recommendation is ε-optimal for large t, which is satisfied by the greedy recommendation rule. This property seems to be used in Equation (15) in Appendix F.2. Some statements should also be detailed more in the proof, as they don't seem trivial: (Lines 756-757) “convergence of t in LHS of (18) is uniform [...] we can exchange lim and sup”.
Upper bound from (3). It seems very surprising that it holds for a general sampling policy without assumptions on its asymptotic behavior. Intuitively, one should at least require convergence of the sampling policy. While this assumption is not made explicit in Lemma 4.1, it seems to be used in Appendix F.2: (Line 715) “we assume [...] converges,”.
Experimental Designs or Analyses
The empirical metric of performance used in the experiments appears to be unusual, and details on its practical implementation are lacking. Instead of considering the empirical proportion of errors as a function of time, the authors consider the conditional posterior probability of misidentification. Since it is hard to compute, they consider as a proxy the upper bound from Lemma 4.2, yet dropping some of its terms. Due to the optimization over the continuous set of arms, it is not clear how this proxy is actually computed numerically. Are the authors using a discretization of the space, e.g., the one from Lemma D.1? Even for a discrete set, it is not clear how each term is numerically computed. Is it based on the Gaussian cumulative distribution function discussed at the end of Section 4.2?
While promising, the experimental results provided in Section 7 and Appendix G seem to be preliminary. A more extensive empirical evaluation would be appreciated. In particular, the empirical claims would be strengthened by having:
- More than one hundred rounds for each run, and more than ten runs for each instance.
- Plots of the empirical proportion of error as a function of time.
- Plots of the actual computational cost (in terms of CPU time) for the proposed algorithms to highlight that it is tractable.
- Bayesian experiments in order to actually match the Bayesian setting considered in this paper.
- Number of samples needed for the approximated upper bound (from Lemma 4.2) on the conditional posterior probability of error to fall below the target confidence level δ.
- Ablation studies to better understand the impact of the hyper-parameters of the algorithm. In particular, the experiments use a fixed value, yet the theory requires it to depend on the unknown characteristic time (Theorem 6.2 and Proposition 6.3).
Supplementary Material
I checked Appendices A, B, C, D, E, F.1, F.2, F.3, and G in detail. I didn't check all the details in Appendices F.4-F.7.
Relation to Existing Literature
The authors make an incorrect statement about the literature: “Jedra & Proutiere (2020) [...] showed that an optimal sampling policy is the round-robin manner using the orthonormal basis.” Jedra & Proutiere (2020) show that round-robin sampling yields an algorithm whose sample complexity is order-wise optimal. Order-wise means that it has the same scaling as the optimal characteristic time; however, the lower and upper bounds do not match asymptotically. Therefore, round-robin is not an optimal sampling policy, only one that is “not too bad”.
Essential References Not Discussed
To the best of my knowledge, most relevant literature is discussed. Nevertheless, more details could be added in order to compare the obtained results for a continuous set of arms with the ones known for discrete sets of arms.
Other Strengths and Weaknesses
I. Lack of clarity.
In general, the paper would gain in clarity by considering a specific recommendation rule instead of studying a general formula under some assumptions. This obfuscates the discussion on the main challenges of the continuous set of arms. It would be valuable to clearly state the theoretical challenges in proving Theorem 6.2. Is it a direct extension of the analysis for discrete sets? Moreover, the paper would benefit from more examples to illustrate the bounds and the algorithms, e.g., the special case of the unit sphere. Currently, it is difficult to parse everything in all its generality.
II. Confusing notation for the conditional posterior probability of misidentification.
To the best of my understanding, the only randomness in the conditional posterior probability of misidentification is with respect to the unknown parameter, since the history of observations is given. The dependency on the history is not made explicit and seems hidden in the notation, which should depend on the observations and not on quantities fixed at initialization.
Other Comments or Suggestions
Equation (16) in Appendix F.2 appears to be missing a square in the first term of the right-hand side.
We sincerely appreciate the reviewer's thorough and insightful review and apologize for any statements that may have caused confusion.
Tractability of the algorithm
First, we note that our method is tractable in terms of the number of oracle calls. We do not specify algorithms (optimization oracles) for the quadratic and fractional quadratic objectives.
Computation Oracles
Please refer to the response to Reviewer 4oes (we note that the quantity in question is defined in Line 317 (right)).
Space complexity
In every iteration of Alg. 1, we approximate our sampling policy using Alg. 2. The choice of the support size is motivated by the convergence rate of the Frank-Wolfe method (Lemma F.4). Under certain assumptions, the Frank-Wolfe method exhibits a faster convergence rate [Garber & Hazan, 2015], and the cardinality of the support of the policy can be significantly reduced. However, this typically requires the set to be strongly convex, and we are unaware of any arm sets that satisfy this assumption.
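For readers unfamiliar with the connection, the following minimal sketch (our own illustration, not the paper's Algorithm 2) shows the approximate-Carathéodory mechanism referenced above: plain Frank-Wolfe on the squared distance to a target point builds a sparse convex combination, growing the support by at most one atom per iteration at the classical O(1/k) rate.

```python
import numpy as np

def fw_approx_caratheodory(vertices, target, n_iters=50):
    """Approximate `target` (assumed to lie in the convex hull of `vertices`)
    by a sparse convex combination, via vanilla Frank-Wolfe on
    f(x) = ||x - target||^2.  Each iteration adds at most one atom to the
    support, and plain Frank-Wolfe gives an O(1/k) suboptimality rate."""
    V = np.asarray(vertices, dtype=float)
    x = V[0].copy()
    weights = {0: 1.0}                      # convex-combination weights
    for k in range(1, n_iters + 1):
        grad = 2.0 * (x - target)
        i = int(np.argmin(V @ grad))        # linear minimization oracle
        gamma = 2.0 / (k + 2)               # standard open-loop step size
        x = (1.0 - gamma) * x + gamma * V[i]
        weights = {j: (1.0 - gamma) * w for j, w in weights.items()}
        weights[i] = weights.get(i, 0.0) + gamma
    return x, weights
```

With a finite vertex set the linear minimization oracle is a single argmin; on a continuous arm set it would be replaced by the quadratic oracle discussed in the paper.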
Instance-Dependent Optimality
We agree that our method is asymptotically optimal. However, as discussed in the introduction, our analysis is instance-dependent, unlike existing methods in the continuous arm setting (such as MVR). Theorem 6.2 assumes knowledge of the unknown characteristic time; however, as we remarked after Prop 6.2, we provided a convergence analysis covering the case without this knowledge in Prop D.2.
Recommendation Rule
We agree that our main result, Thm. 6.2, relies on assumption (7) regarding the recommendation rule, and we provide sufficient conditions for (7) only for the greedy recommendation rule. To improve clarity, in the introduction of the revised version of this paper, we will discuss this.
PAC with respect to the unconditional probability
As clarified in the paper, we consider the ε-BAI problem under the Bayesian reward model setting. Proving that the method is (ε,δ)-PAC with respect to the unconditional probability is not a primary focus of this study. We believe there are several ways to theoretically validate a pure exploration algorithm. For instance, under a Bayesian setting, Russo (2016) also provided an analysis similar to this study's (an asymptotic bound on the posterior error probability), yet it is considered an important work in this field.
Moreover, the core ideas of PMCA can be applied to existing algorithms (such as LεBAI (Jourdan & Degenne, 2022)), since they do not specify the optimization method for the sampling distribution, and it would be possible to prove that the extended algorithm is PAC.
Stopping Rule
For any explicitly computable upper bound of the conditional probability of misidentification and any confidence level δ, we refer to the stopping rule given by the stopping time defined as the first round at which the upper bound falls below δ. We can make the statement of Lemma D.1 more explicit: using the notation of part 2 of Lemma D.1, the constants can be taken explicitly via the tail probability of the chi-squared distribution, which yields an explicit upper bound. As we discussed in Line 240, the required optimization can be solved using the fractional quadratic objective.
Lemma 4.1
Both sides of Lemma 4.1 are defined asymptotically, therefore they do not depend on t. In addition, unlike (Russo, 2016), both sides depend on the asymptotic behavior of a general recommendation rule and (the mean of) a general sampling rule rather than the optimal one (our analysis can also be regarded as a generalization of Eq. (1) of Glynn & Juneja (2004)). In Eq. (15) in the Appendix, we use a standard fact about the normal distribution that holds unconditionally. Moreover, in the proof (around Line 715), we only assume that a sub-sequence converges; we do not assume convergence in the statement of Lemma 4.1. We hope this resolves your concern, and if you have further questions, we would be happy to answer them.
Experiments
We appreciate your suggestions. We refer to the response to Reviewer huXi for the CPU time of each method. We have conducted an ablation study of the regularization parameter for the second problem instance in Fig. 2, where MVR is sub-optimal. The table shows the mean (standard deviation) of our evaluation metric. We found that while the original choice outperforms the baselines, the optimal choice appears to be around 1e-1.
|  | 0.0 | 1e-2 | 1e-1 |
|---|---|---|---|
| mean (std) | -63.6 (1.4) | -64.1 (3.7) | -69.4 (2.3) |
References
- Garber & Hazan, Faster rates for the Frank-Wolfe method over strongly-convex sets, ICML 2015
I thank the authors for their thorough and detailed answers, as well as the additional experiments. At the time being, I am inclined to increase my score to weak accept.
Instance-Dependent Optimality. Given that Theorem 6.2 is obtained by using Proposition 6.3, could the authors discuss the challenges encountered when using Proposition D.2 to generalize Thm 6.2 when the regularization parameter is set to zero? Do the authors believe that a finer analysis could show the convergence in Proposition D.2 for all sequences, i.e., replacing liminf by lim, or is it truly a limitation?
Miscellaneous. Could the authors explicitly state how the reported upper bound is computed in their experiments?
We sincerely appreciate the reviewer's further feedback.
Analysis for Proving Theorem 6.2 Under Zero Regularization
Thank you for bringing this problem up. Based on your suggestion, we reconsidered this problem, and we believe a simple modification of our analysis can prove Theorem 6.2 even with zero regularization. If we can confirm this rigorously, we will update our manuscript in the revision. Specifically, the revised statement of Proposition 6.3 would be as follows:
Proposition 6.3 (revised version). Let the sampling rule be that of PMCA with zero regularization; then the same convergence guarantee holds.
We introduced the regularizer to ensure that the sequence of sampling policies converges. However, as we discussed, this condition is not necessary. In the proof of Theorem 6.2, the condition is not needed until Line 1159, and we can prove the above proposition by the inequality at Line 1157. Then, Lemma 4.1 implies the main result (Theorem 6.2).
Proposition D.2
Do the authors believe that a finer analysis could show the convergence in Proposition D.2 for all sequence.
No, we believe that the condition on the step sizes is too general to derive such a result. We provide more details in the following and explain the difficulty in deriving Theorem 6.2 from Proposition D.2. In Proposition D.2, we consider a general condition on the step sizes (i.e., non-summable but square-summable, as in [Ruszczynski, A. P., 2006, Theorem 7.2]). This leads to a weak statement on the convergence result, namely a liminf-type convergence; we refer to this as Eq. (a). To prove our main result (Theorem 6.2), we need at least the corresponding lim-type convergence; we refer to this as Eq. (b). Equations (a) and (b) differ precisely in the liminf versus the lim. If the underlying sequence is convergent, Eq. (a) implies Eq. (b), but in general we suppose the conclusion of Proposition D.2 is too weak to prove Theorem 6.2. The value of Proposition D.2, apart from the condition on the regularizer, is that we have a convergence result under a general condition on the step sizes. We will clarify this in the appendix of the revised paper.
Miscellaneous
To compute the upper bound in the experiments, we used the formula introduced at Line 234, which follows from Eq. (15) and the monotonicity of the Gaussian cumulative distribution function. Assume that the recommendation at each round is ε-optimal (the greedy recommendation rule satisfies this condition). Then we can compute the bound by calling the oracle for the quadratic fractional objective. We will clarify this in the revision.
Thank you again for your valuable feedback to improve our manuscript.
This paper investigates a pure exploration problem with linear bandit feedback on continuous arm sets, aiming to identify an ε-optimal arm with high probability. Previous approaches for continuous arm sets have employed instance-independent methods, due to technical challenges such as the infinite dimensionality of the space of probability measures and the non-smoothness of the objective function. This paper designs a novel and tractable algorithm that addresses these challenges by using a reparametrization of the sampling distribution and projected subgradient descent. However, this approach introduces new challenges related to the projection and reconstruction of the distribution from the reparametrization. This paper addresses these by focusing on the connection to the approximate Caratheodory problem. Compared to the original optimization problem on the infinite-dimensional space, the proposed method is tractable, requiring only the solution of quadratic and fractional quadratic problems on the arm set. This paper also provides empirical results.
Questions for the Authors
Please see the weaknesses above.
Claims and Evidence
The claims made in this paper are supported by clear and convincing evidence.
Methods and Evaluation Criteria
The proposed methods and evaluation criteria make sense.
Theoretical Claims
The theoretical results look reasonable, but I didn’t go through every proof.
Experimental Designs or Analyses
The experiments look reasonable.
Supplementary Material
I didn’t read the supplementary material.
Relation to Existing Literature
This paper is relevant to the literature.
Essential References Not Discussed
None
Other Strengths and Weaknesses
Strengths:
- This paper proposes a tractable algorithm, PMCA (Posterior error Minimization for a general Continuous Arm set), based on the projected subgradient descent (PSGD) method.
- This paper provides upper and lower bounds to show that the proposed algorithm is asymptotically optimal.
- The theoretical contribution of this paper is solid and interesting to the linear bandit community.
- Empirical results are provided to demonstrate the effectiveness of the proposed algorithm.
Weaknesses:
- The readability of this paper should be improved. This paper is dense and hard to follow.
- This paper provides asymptotically optimal results. The practical significance of asymptotic optimality seems limited, because the number of times we sample an option is finite, and in real-world applications we care more about the optimality of the final output option. Can the asymptotic results in this paper provide any insight into finite-time results/optimality?
- More discussion on the challenges of generalization from the finite-arm set to the continuous arm set is needed.
- Can you give more concrete examples for the computational oracles used in the proposed algorithm? For example, in what subproblems such computational oracles exist, and what are the corresponding concrete computational oracles (algorithms)?
Other Comments or Suggestions
Please see the weaknesses above.
We sincerely appreciate the reviewer's thorough and insightful review. We will revise our manuscript based on your suggestions.
Readability
We appreciate the reviewer's suggestion. In a revision, we will improve the readability of the paper (e.g., by making the detailed statements in Prop 6.5 concise and referring to the Appendix for a more precise statement).
Asymptotic optimality
Lemma 4.2 provides a non-asymptotic upper bound for the posterior probability of misidentification that asymptotically matches the actual conditional probability. However, a more refined analysis guaranteeing non-asymptotic optimality remains an important avenue for future research. We will discuss this limitation in the Conclusion section of the revised paper.
More discussion on the challenges of generalization from the finite-armed setting
While it would be difficult to prove that generalizing from the finite-armed setting is impossible for any existing algorithm, the non-smooth optimization objective over the space of probability measures is a fundamental problem that any asymptotically optimal algorithm must address. To the best of our knowledge, even in the finite-armed setting, the efficient solution of this problem has not been thoroughly discussed. We can also view our contribution as an efficient algorithm for the finite-armed setting, particularly when dealing with a large number of arms.
Computational Oracles
Quadratic Objective
For example, if the arm set is defined by an interval constraint on a quadratic function (this includes the case of the unit sphere), the quadratic objective can be solved in polynomial time by Lagrangian relaxation (Park & Boyd, 2017). The relaxed problem is a semidefinite program (a convex problem). If we use an interior-point (barrier) method, the number of outer-loop iterations needed to obtain an ε-optimal solution grows logarithmically in 1/ε, and the inner loop converges linearly. The per-iteration computational and space complexities are polynomial in the dimension (Boyd & Vandenberghe, 2004, Chapter 11.8.3).
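As a concrete illustration of the unit-sphere special case, the quadratic subproblem can also be solved by a classical eigendecomposition-plus-bisection route (the trust-region-subproblem characterization) rather than the SDP relaxation described above. The sketch below is our own illustration, not the paper's implementation, and assumes the generic "easy case" where b has a component on the smallest eigenspace of A.

```python
import numpy as np

def min_quadratic_on_sphere(A, b, iters=200):
    """Globally minimize x^T A x + b^T x over the unit sphere ||x|| = 1.
    KKT conditions give (A - lam*I) x = -b/2 for a multiplier lam below the
    smallest eigenvalue of A; with A = Q diag(d) Q^T this reduces to a 1-D
    bisection on ||x(lam)|| = 1.  Assumes the generic "easy case": b is not
    orthogonal to the eigenspace of the smallest eigenvalue."""
    d, Q = np.linalg.eigh(A)                 # d sorted ascending
    c = Q.T @ b / 2.0
    norm_sq = lambda lam: np.sum((c / (d - lam)) ** 2)
    lo = d[0] - 1.0
    while norm_sq(lo) > 1.0:                 # push lo down until ||x(lo)|| <= 1
        lo = d[0] - 2.0 * (d[0] - lo)
    hi = d[0] - 1e-14                        # ||x(lam)|| blows up near d[0]
    for _ in range(iters):                   # norm_sq is increasing on (lo, d[0])
        mid = 0.5 * (lo + hi)
        if norm_sq(mid) < 1.0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    x = Q @ (-c / (d - lam))
    return x / np.linalg.norm(x)
```

This is one concrete sense in which the unit-sphere instance is polynomial-time solvable; the hard case (b orthogonal to the smallest eigenspace) needs an extra step not shown here.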
In general, the problem can be NP-hard. We note that MVR (Vakili et al., 2021) faces the same issue; similar to the MVR implementation (PosteriorStandardDeviation) in the BoTorch library, we use a non-linear solver (such as L-BFGS-B) with a randomly selected initial point in our experiments.
Quadratic Fractional Objective
We want to maximize a ratio of quadratic functions over the arm set. As we briefly discussed in Sec. 5.2, if we use Dinkelbach's algorithm, we call the quadratic objective oracle once in each iteration. The sequence of ratio values produced by the iteration converges to the optimal value, and its convergence rate is superlinear (the error goes to zero faster than any geometric sequence) (Schaible, 1976). One can then recover a solution of the original problem from the maximizer of the last parametric subproblem. Due to the superlinear convergence rate, we use a limited number of iterations (i.e., 30) in our experiments.
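A generic sketch of the Dinkelbach iteration described above (our own illustration, not the paper's code; `solve_param` is a placeholder for the quadratic oracle — here any exact maximizer of the parametric subproblem works):

```python
import numpy as np

def dinkelbach(numer, denom, solve_param, lam0=0.0, tol=1e-9, max_iter=30):
    """Dinkelbach iteration for max_x numer(x)/denom(x) with denom > 0.
    `solve_param(lam)` must return an argmax of numer(x) - lam*denom(x);
    in the paper's setting this is one call to the quadratic oracle.
    The parameter sequence converges superlinearly to the optimal ratio,
    which is why a small iteration cap (e.g., 30) suffices."""
    lam = lam0
    for _ in range(max_iter):
        x = solve_param(lam)                # parametric subproblem
        gap = numer(x) - lam * denom(x)     # zero exactly at the optimum
        if abs(gap) < tol:
            break
        lam = numer(x) / denom(x)           # next Dinkelbach parameter
    return x, lam
```

For instance, with `solve_param` implemented as a 1-D grid search, the iteration recovers the maximizer of (x+2)/(x²+1) in a handful of oracle calls.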
References
- Boyd & Vandenberghe, Convex Optimization, Cambridge University Press, 2004
This paper investigates the problem of best arm identification (BAI) for Bayesian linear bandits, where the action set is assumed to be continuous. While existing investigations, e.g., [Jedra et. al.] establish optimal algorithms when the set of arms is finite, BAI for linear bandits under infinite actions is hitherto unexplored. This paper takes a step towards solving this problem, specifically, making the following key contributions:
- It introduces an instance-dependent measure (which I will call problem complexity in the rest of the review), a counterpart of its finite-armed setting, and shows that the asymptotic posterior probability of error for any algorithm decays exponentially at a rate proportional to the problem complexity, and no larger (converse result).
- It devises an algorithm which achieves this asymptotic error rate
There are some novel observations, which I would also like to highlight.
- The paper introduces a reparameterization that converts the infinite-dimensional problem (in the probability space) into an equivalent, finite-dimensional problem in the matrix space.
- It introduces a projected gradient descent-based algorithm for BAI (which has been used in prior works), and the projection step is tackled using a novel connection with the approximate Caratheodory theorem.
Overall, the paper takes a step towards solving BAI in the linear bandit setting with continuous arm sets.
Questions for the Authors
I have the following questions for the authors:
- The authors mention that "..the posterior probability of misidentification is known to the learner...". How does the learner know it?
- What is the computational complexity of the reparameterization, since it involves an expectation operator?
- It seems that the expression of the sub-gradient in (6) is its gradient. How do the authors deal with the non-differentiability at points due to the inner supremum, say, in (4)?
Claims and Evidence
Yes, I think that the claims and evidences are coherent and sufficient.
Methods and Evaluation Criteria
Yes, the experiment settings make sense.
Theoretical Claims
I did not have time to check the correctness of the proofs. I would be happy to take a look at any specific part, if any issue is raised.
Experimental Designs or Analyses
This is a theoretical paper; having said that, the experimental settings suffice to bolster theoretical claims.
Supplementary Material
I did not review the supplemental.
Relation to Existing Literature
- BAI is an important problem in the bandit literature, with various practical applications including A/B testing, clinical trials, recommender systems, etc. Investigating the continuous-armed setting enhances our theoretical understanding of BAI.
- Prior works establish optimal (fixed-confidence and fixed-budget) algorithms for linear bandits. Examples include [Jedra et al.], [Vakili et al.]. Most of these investigations consider a finite set of arms.
- This paper positions itself as the one which extends BAI to continuous arm sets, which has not been sufficiently investigated.
Essential References Not Discussed
N/A
Other Strengths and Weaknesses
I have already listed the strengths of this paper. A weakness of the paper is that the writing can be improved. Here are instances of typographical / grammatical inconsistencies:
- Line 95: It should be
- Lines 161-162: that estimates an -optimal arm ...
- Lines 220-222: the term is given as (an upper bound) the ...
- Lines 272-274: similarly should be Similarly
Other Comments or Suggestions
Answered above.
We sincerely appreciate the reviewer's thorough and insightful review. We will revise our manuscript based on your suggestions.
Posterior probability of misidentification
As we briefly discussed in Line 133 (right), the posterior probability is known to the learner, i.e., it is a measurable variable with respect to the observation history, because it is defined using the conditional probability given that history. We note that conditioned on the history, the reward function is identified with the unknown parameter, and the distribution of this parameter is known due to the Bayesian setting (we provide it explicitly in Line 152 (left)). Therefore, we can essentially compute the posterior probability of misidentification. We hope this resolves your concern. We will clarify the statements in Line 133 in the revision of this paper.
Computational Complexity
Since in each round we add points to the support of the sampling distribution, the computational complexity of the reparametrization is polynomial in the number of rounds. We appreciate your suggestion and will add a discussion on this in Section 5.3.
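For concreteness, the reparametrized object is the design matrix induced by a discrete sampling policy; a minimal sketch of computing it from a weighted support (our own illustration, with hypothetical names):

```python
import numpy as np

def design_matrix(support, weights):
    """Design matrix Sigma(w) = sum_i w_i x_i x_i^T induced by a discrete
    sampling policy with support points x_i in R^d and weights w_i.
    With t support points this costs O(t d^2) time and O(d^2) memory."""
    X = np.asarray(support, dtype=float)
    w = np.asarray(weights, dtype=float)
    return (X * w[:, None]).T @ X
```

The expectation over the policy thus reduces to a weighted sum over the (finite, growing) support, which is what makes the cost polynomial in the number of rounds.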
Subgradient
The expression defined in Eq. (6) is a subgradient of the function defined in Eq. (4). We used the known fact that if a function is defined as the supremum of convex functions, then a subgradient of it at a point is given by a gradient of any function attaining the supremum at that point (Hiriart-Urruty & Lemarechal, 2004, Chapter D, Lemma 4.4.1). We appreciate the reviewer's suggestion and will briefly explain this in the main paper.
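The cited fact can be sketched in a few lines (our own illustration, not the paper's code): to obtain a subgradient of a pointwise supremum, evaluate the pieces, pick one attaining the maximum, and return its gradient.

```python
import numpy as np

def sup_subgradient(w, funcs, grads):
    """Subgradient of f(w) = max_i f_i(w) for convex pieces f_i: evaluate
    every piece at w, pick an index attaining the maximum, and return the
    gradient of that active piece (a valid subgradient of the supremum)."""
    vals = np.array([f(w) for f in funcs])
    i = int(np.argmax(vals))
    return grads[i](w), i
```

At a non-differentiable point several pieces are active, and any of their gradients is a valid subgradient; projected subgradient descent only needs one.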
This work studies the problem of pure exploration for linear bandits particularly with continuous arms. The paper begins by establishing a lower bound on the asymptotic posterior probability of misidentification, and then proposes a tractable algorithm, called PMCA, for minimizing the error probability. The authors provide theoretical analysis showing that the suggested algorithm finds the optimal sampling rule that achieves the asymptotic optimality for any given recommendation rule. The numerical experiments show that the algorithm outperforms MVR and the uniform sampling strategy, confirming the theoretical finding.
Questions for the Authors
None.
Claims and Evidence
I agree that the suggested algorithm PMCA significantly reduces the computational burden required to optimize the sampling distribution.
However, the term “instance-dependent optimality” sounds considerably misleading. Algorithm 1 and Theorem 6.2 treat the recommendation rule as something given exogenously. Theorem 6.2 states that the sampling rule of PMCA achieves the asymptotic optimality for any given recommendation rule satisfying some regularity condition, without arguing the optimality of the recommendation rule.
I believe that, conventionally, a pure exploration algorithm corresponds to a combination of a sampling rule and a recommendation rule. It does not make sense to me that an algorithm can be said to be instance-optimal without having the recommendation rule specified.
I guess that the authors are aware that it is technically challenging to guarantee that both the sampling rule and the recommendation rule safely converge to their optimal combination. In this work, some assumptions are made on one component in order to guarantee convergence of the other component (e.g., Assumption 6.1 for Theorem 6.2 vs. the assumption for Proposition 6.4). It was not discussed whether these assumptions can be satisfied simultaneously. Jedra & Proutiere (2020) introduced the notion of forced exploration in order to break this mutual dependence. I believe the authors should explicitly discuss this gap.
Methods and Evaluation Criteria
The proposed algorithm PMCA looks completely sensible.
Theoretical Claims
I carefully read the statements provided in the main body, but not the proofs in the appendix. I gently believe that the proofs are rigorous.
Experimental Designs or Analyses
- I believe the authors should report the computational time of the algorithms, since computational tractability is one of the main contributions of this paper.
- I would like to see the performance/speed of the algorithm under a discretization of the arm set.
- To highlight the practical importance of PMCA, I recommend that the authors find/test an application with many, densely distributed arms. Possibly, LLMs can help in this direction (e.g., an arm corresponds to a text's embedding).
Supplementary Material
No supplementary materials to review.
Relation to Broader Literature
I am not sure how the key contributions can extend beyond the linear bandit literature.
Missing Essential References
None.
Other Strengths and Weaknesses
Although I am skeptical about the claim “near optimal”, I do appreciate the ideas behind PMCA -- the use of the PSGD method and the reparameterization trick. As suggested in Experimental Designs or Analyses, I believe this algorithm has great potential for situations with very many arms.
Other Comments or Suggestions
- In Proposition 6.3, a term was used without any definition.
- Please see Experimental Designs or Analyses for my suggestions on experiments.
We sincerely appreciate the reviewer's thorough and insightful review. We will revise our manuscript based on your suggestions.
Instance-dependent optimality and recommendation rule
We agree that an asymptotically optimal algorithm requires both an optimal sampling rule and an optimal recommendation rule; our work mainly focuses on the optimization problem for the sampling rule, which we believe is the most challenging aspect of the continuous arm setting. More precisely, our main result, Theorem 6.2, relies on assumption (7) on the recommendation rule, and we show that the greedy recommendation rule satisfies (7) under some assumptions. To improve clarity, we will explain this point explicitly and in more detail in the introduction (or conclusion section) of the revised paper.
Assumption 6.1 and the assumption for Proposition 6.4
Our method satisfies the condition for Proposition 6.4 because we consider a mixture distribution at Line 15 in Algorithm 1. Thus, we can focus on Assumption 6.1, and we believe we have already discussed the validity of this assumption in Sections 6.1 and 6.3. We will clarify this in the revision and appreciate the reviewer's suggestion.
Experiments
We sincerely appreciate the reviewer's suggestions on the experiments. In the following table, we report the CPU time of each algorithm on the leftmost instance in Figure 2. As for a method that discretizes the arm set, its efficiency can become arbitrarily poor due to the discretization; however, demonstrating this inefficiency would require different (higher-dimensional) problem instances.
In the following table, each number represents the time in seconds for running one experiment, reported as the mean (std) over 10 repetitions. (w/ eval_time) includes the CPU time for computing the evaluation metric, and (w/o eval_time) excludes it.
The table shows that our method took about 1.5 seconds while MVR took 0.2 seconds. This is natural, since the instance-independent baselines (Uniform and MVR) do not solve an optimization objective over the space of probability measures. The experimental results show that our method runs in a reasonable time.
| Uniform (w/ eval_time) | MVR (w/ eval_time) | Ours (w/ eval_time) | Uniform (w/o eval_time) | MVR (w/o eval_time) | Ours (w/o eval_time) |
|---|---|---|---|---|---|
| 1.2e+00 (1.9e-02) | 1.3e+00 (2.0e-02) | 2.6e+00 (4.3e-02) | 8.6e-03 (6.7e-05) | 1.9e-01 (6.4e-04) | 1.5e+00 (2.3e-02) |
Practical applications
As you mentioned, recent papers consider applications of bandits to LLMs [Li, 2025; Nguyen et al., 2024; Jinnai & Ariu, 2024]. For instance, Jinnai & Ariu applied Correlated Sequential Halving to text generation tasks (such as machine translation, text summarization, and image captioning), where the objective is to efficiently find a text sequence with the best utility. By regarding an embedded sequence as an arm, this can be formulated as a BAI problem with a linear reward model. Since the set of possible sequences is exponentially large, our algorithm is a potential fit for this problem.
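As a toy illustration of the reduction just described (embedded sequences as arms of a linear reward model), the following hypothetical sketch fits a Gaussian posterior over a synthetic set of embedding arms and recommends greedily from the posterior mean; all data, dimensions, and variances are made up for illustration and do not come from the paper or any real LLM:

```python
import numpy as np

# Stand-in data: "embedded sequences" are random vectors (illustrative only).
rng = np.random.default_rng(1)
d, n_arms = 8, 100
arms = rng.normal(size=(n_arms, d))        # candidate sequence embeddings
theta_true = rng.normal(size=d)            # unknown utility parameter
sigma2, prior_var = 0.25, 1.0              # noise and Gaussian-prior variances

# Observe noisy linear rewards for a few sampled arms.
idx = rng.integers(0, n_arms, size=50)
X = arms[idx]
y = X @ theta_true + rng.normal(scale=np.sqrt(sigma2), size=len(idx))

# Gaussian posterior over theta: precision A, mean theta_hat.
A = X.T @ X / sigma2 + np.eye(d) / prior_var
theta_hat = np.linalg.solve(A, X.T @ y / sigma2)

# Greedy recommendation: the arm maximizing the posterior-mean utility.
best = int(np.argmax(arms @ theta_hat))
```

The point of the sketch is only that, once sequences are embedded, both the posterior update and the greedy recommendation are standard linear-Gaussian computations, regardless of how large the underlying sequence space is.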
More generally, any linear bandit algorithm can be extended to a Bayesian optimization algorithm via random Fourier features (or quadrature Fourier features) [Mutný & Krause, 2019]. The continuous arm setting naturally arises in Bayesian optimization and has numerous applications, including material design and hyperparameter optimization. As discussed in the conclusion section, directly extending our result to the Bayesian optimization setting (without such a reduction) is an important direction for future work.
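The random-Fourier-feature reduction mentioned above can be sketched as follows; this is a generic illustration of Rahimi–Recht-style features for an RBF kernel, not code from the paper, and all sizes and lengthscales are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_feat = 2, 64          # input dimension, number of Fourier features
lengthscale = 0.5

# Spectral samples for the RBF kernel k(x, y) = exp(-||x - y||^2 / (2 l^2)).
W = rng.normal(scale=1.0 / lengthscale, size=(d_feat, d_in))
b = rng.uniform(0.0, 2.0 * np.pi, size=d_feat)

def phi(x):
    """Random Fourier feature map: phi(x) @ phi(y) approximates k(x, y)."""
    return np.sqrt(2.0 / d_feat) * np.cos(W @ x + b)

# A continuous arm x becomes the linear-bandit arm phi(x), and the unknown
# objective satisfies f(x) ~ phi(x) @ theta for some theta in R^{d_feat},
# so a linear bandit algorithm can be run on the feature vectors.
x, y = np.zeros(d_in), np.full(d_in, 0.1)
approx = phi(x) @ phi(y)
exact = np.exp(-np.sum((x - y) ** 2) / (2.0 * lengthscale**2))
# approx and exact agree up to the Monte Carlo feature-approximation error
```

The approximation error decays like O(1/sqrt(d_feat)), which is what makes the reduction practical: a kernelized objective becomes an (approximately) linear one in a moderate-dimensional feature space.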
References
- Jinnai, Y., & Ariu, K. (2024). Hyperparameter-Free Approach for Faster Minimum Bayes Risk Decoding. In ACL (Findings).
- Li, Y. (2025). LLM Bandit: Cost-Efficient LLM Generation via Preference-Conditioned Dynamic Routing. arXiv preprint arXiv:2502.02743.
- Mutný, M., & Krause, A. (2019). Efficient High Dimensional Bayesian Optimization with Additivity and Quadrature Fourier Features. Advances in Neural Information Processing Systems, 31, 9005-9016.
- Nguyen, Q. H., et al. (2024). MetaLLM: A High-performant and Cost-efficient Dynamic Framework for Wrapping LLMs. arXiv preprint arXiv:2407.10834.
Response to other comments
In Proposition 6.3, the term was used without any definition.
We thank the reviewer for pointing this out. The term in Proposition 6.3 is indeed a typo, and we will correct it in the revision.
The paper investigates the problem of identifying an ε-optimal arm in linear bandits with a continuous action space. The analysis is conducted within a Bayesian framework, assuming a Gaussian prior over the unknown parameter. The objective is to minimize the posterior probability of error.
Due to the continuous nature of the action space, instance-specific analysis poses significant challenges. To address this, the authors employ a reparametrization of the sampling distribution combined with projected subgradient descent. They introduce PMCA, an algorithm that, for a given recommendation rule, provably achieves asymptotic optimality.
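To make the reparametrization idea concrete, here is a minimal, hypothetical sketch of projected subgradient ascent over trace-one PSD design matrices; the objective used below (the smallest eigenvalue, an E-optimal-style design) is a stand-in, not PMCA's actual objective, and all step sizes are arbitrary:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0.0)

def project_psd_trace_one(V):
    """Project a symmetric matrix onto {V PSD, tr(V) = 1} via its eigenvalues."""
    lam, U = np.linalg.eigh((V + V.T) / 2)
    return U @ np.diag(project_simplex(lam)) @ U.T

def psgd(supergrad, V0, steps=200, lr0=0.1):
    """Projected subgradient ascent with a decreasing step size."""
    V = V0.copy()
    for t in range(steps):
        V = project_psd_trace_one(V + lr0 / np.sqrt(t + 1) * supergrad(V))
    return V

def supergrad_lambda_min(V):
    """A supergradient of the concave map V -> lambda_min(V) is u u^T,
    where u is a unit eigenvector for the smallest eigenvalue."""
    _, U = np.linalg.eigh(V)
    u = U[:, 0]
    return np.outer(u, u)

# Maximizing lambda_min over {V PSD, tr(V) = 1} drives V toward I/d.
V = psgd(supergrad_lambda_min, np.diag([0.8, 0.1, 0.1]))
```

The sketch only shows why the design-matrix parametrization is attractive: the projection step reduces to one eigendecomposition plus a simplex projection. Mapping the optimized matrix back to a sampling distribution on the arms (via approximate Carathéodory, as in the paper) is a separate step not shown here.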
The contributions are substantial, as acknowledged by all reviewers. However, there are several areas for improvement, as outlined in the reviews and the authors’ rebuttal. In particular:
- The authors should provide a more detailed discussion of the computational and memory complexities of PMCA, and report these metrics in the numerical experiments.
- The use of the term instance-optimality may be misleading, as pointed out by Reviewers oQBX and huXi. This notion should be clarified.
- The presentation could benefit from simplification and improved clarity. Reviewers have suggested, for instance, focusing on the greedy recommendation rule to enhance readability.
- The computational oracles used by the algorithm should be exemplified and discussed in greater detail.
- A clear and convincing discussion of the stopping rule is needed to complete the methodological picture.